Sentiment Analysis¶

Why this matters¶

Sentiment analysis turns text into a prediction about attitude: positive, negative, or sometimes neutral. Businesses use it for reviews and customer feedback. Social scientists use it for opinions and public discourse. Product teams use it to summarize complaints and praise.

The notebook builds a classic machine-learning sentiment pipeline on IMDb movie reviews:

raw review text -> cleaned text -> numerical features -> classifier -> sentiment prediction

This lesson focuses on that pipeline. The details matter because text is not naturally numerical, and machine-learning models need numbers.

Mental model¶

A sentiment classifier does not "read" a review like a human. In the classic approach, it learns statistical associations between words and labels.

Example:

"excellent, moving, unforgettable" -> likely positive
"boring, awful, waste" -> likely negative

The model sees many labeled examples, converts each review into a feature vector, and learns which patterns tend to indicate positive or negative sentiment.

Core ideas¶

Sentiment analysis is supervised text classification when labels are available.
Text must be converted into numerical features before a traditional ML model can use it.
Bag-of-words represents documents by token counts, ignoring word order.
TF-IDF downweights words that appear in many documents and are therefore less discriminative.
Text preprocessing removes or normalizes noise such as HTML, punctuation, casing, and sometimes stop words.
Tokenization splits text into units such as words.
Stemming maps related word forms to a shared root, but can also distort words.
Pipelines combine vectorization and classification into one trainable workflow.
Grid search compares preprocessing and model hyperparameters.
Out-of-core learning trains incrementally when the dataset is too large to fit comfortably in memory.
Topic modeling is related, but it is unsupervised and tries to discover themes, not sentiment labels.

Walkthrough¶

The IMDb task¶

The dataset contains 50,000 movie reviews:

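positive reviews: IMDb rating greater than 6 (7 or more stars out of 10)
negative reviews: IMDb rating less than 5 (4 or fewer stars out of 10)

Reviews with middling ratings of 5 or 6 are not part of the dataset.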

The goal is to train a model that predicts whether a new review is positive or negative.

The notebook first assembles individual text files into a CSV with two columns:

review, sentiment

where sentiment is 1 for positive and 0 for negative.
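The assembly step can be sketched as follows. This is a minimal sketch rather than the notebook's own code: the function name build_review_csv is made up for illustration, and it assumes the extracted archive uses the aclImdb layout of train/ and test/ folders, each with pos/ and neg/ subfolders of .txt files. It does not shuffle the rows, which a real workflow may want to do before splitting.

```python
import csv
from pathlib import Path


def build_review_csv(base_dir, out_path):
    """Collect aclImdb-style review files into one CSV (hypothetical helper)."""
    labels = {"pos": 1, "neg": 0}  # folder name -> sentiment label
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["review", "sentiment"])
        for split in ("train", "test"):
            for folder, label in labels.items():
                review_dir = Path(base_dir) / split / folder
                # Sort the file names so the row order is reproducible.
                for path in sorted(review_dir.glob("*.txt")):
                    writer.writerow([path.read_text(encoding="utf-8"), label])
                    rows_written += 1
    return rows_written
```

Calling build_review_csv("Data/aclImdb", "Data/movie_data.csv") would then produce the two-column file described above, assuming those paths match your local layout.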

Why text needs vectorization¶

Models such as logistic regression cannot directly use raw text.

This is not a valid model input:

"This movie was surprisingly good."

The text must become numbers:

[0, 2, 0, 1, 0, 0, 1, ...]

Each position in the vector corresponds to a token in the vocabulary.

Bag-of-words¶

Bag-of-words has two steps:

Build a vocabulary of unique tokens.
Count how often each token appears in each document.

For these documents:

The sun is shining
The weather is sweet

the vocabulary might be:

is, shining, sun, sweet, the, weather

Each document becomes a count vector over that vocabulary.
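The two steps can be reproduced by hand in a few lines. This is a plain-Python sketch of the idea, not scikit-learn's CountVectorizer; the lowercase-and-split tokenizer is an assumption made to keep the example short.

```python
from collections import Counter

docs = ["The sun is shining", "The weather is sweet"]


def tokenize(text):
    # Simplest possible tokenizer: lowercase, then split on whitespace.
    return text.lower().split()


# Step 1: build a vocabulary of unique tokens, sorted for a stable column order.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})


# Step 2: count how often each vocabulary token appears in a document.
def count_vector(text):
    counts = Counter(tokenize(text))
    return [counts.get(token, 0) for token in vocabulary]


vectors = [count_vector(doc) for doc in docs]
print(vocabulary)  # ['is', 'shining', 'sun', 'sweet', 'the', 'weather']
print(vectors)     # [[1, 1, 1, 0, 1, 0], [1, 0, 0, 1, 1, 1]]
```

Both vectors have the same length as the vocabulary, and nothing in them records which word came first.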

The trade-off is simple:

bag-of-words is easy and effective
it ignores word order

So:

not good

can be hard to distinguish from:

good

unless the model uses n-grams or other features.
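A small sketch shows why n-grams help here. The ngrams function below is a hand-written illustration, not a library call; it plays roughly the role that a setting such as ngram_range=(1, 2) plays in scikit-learn's vectorizers.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous token sequences of length n_min..n_max."""
    features = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features


# With unigrams only, "not good" still contains the feature "good".
print(ngrams("not good".split(), 1, 1))  # ['not', 'good']

# With bigrams added, "not good" gains a feature that "good" alone never has.
print(ngrams("not good".split(), 1, 2))  # ['not', 'good', 'not good']
print(ngrams("good".split(), 1, 2))      # ['good']
```

The cost is a larger vocabulary, because every distinct pair of adjacent words becomes its own feature.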

TF-IDF¶

Raw word counts can overvalue common words. Words such as is, the, and movie may occur often without helping much.

TF-IDF adjusts word counts using this idea:

A word is more useful when it appears often in this document but not in almost every document.

So a word like excellent may carry more signal than the, even if the appears more often.

The practical scikit-learn shortcut is TfidfVectorizer, which combines token counting and TF-IDF weighting.
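To see the weighting at work, here is a hand computation on three toy documents. It is a sketch under stated assumptions: the documents are invented, the inverse document frequency uses the smoothed form ln((1 + n) / (1 + df)) + 1 that scikit-learn documents as its default, and the length normalization that TfidfVectorizer also applies by default is left out to keep the numbers easy to check.

```python
import math
from collections import Counter

docs = [
    "the movie was excellent",
    "the movie was boring",
    "the plot was a problem",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each token appear?
df = Counter(token for tokens in tokenized for token in set(tokens))


def idf(token):
    # Smoothed inverse document frequency (assumed variant, see lead-in).
    return math.log((1 + n_docs) / (1 + df[token])) + 1


def tfidf(tokens):
    tf = Counter(tokens)  # raw term counts within one document
    return {token: count * idf(token) for token, count in tf.items()}


weights = tfidf(tokenized[0])
print({token: round(w, 3) for token, w in weights.items()})
# {'the': 1.0, 'movie': 1.288, 'was': 1.0, 'excellent': 1.693}
```

All four words occur once in the first document, yet excellent gets the largest weight because it appears in only one of the three documents, while the and was appear in all of them.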

Cleaning text¶

IMDb reviews contain HTML markup, punctuation, capitalization, and emoticons.

A simple cleaner can:

remove HTML tags
lowercase text
remove many non-word characters
preserve emoticons because they can carry sentiment

Example:

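import re


def preprocessor(text):
    # Drop HTML tags such as <br />.
    text = re.sub(r"<[^>]*>", "", text)
    # Collect emoticons like :) ;-( =D before punctuation is stripped.
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # Lowercase, then collapse every run of non-word characters into one space.
    text = re.sub(r"\W+", " ", text.lower())
    # Re-append the emoticons, without the optional "-" nose.
    text = text + " " + " ".join(emoticons).replace("-", "")
    return text.strip()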

This is a teaching cleaner, not a universal NLP cleaner. In real projects, preprocessing choices should match the data and task.

Tokenization, stemming, and stop words¶

Tokenization splits text into tokens:

def tokenizer(text):
    return text.split()

Stemming reduces related words:

running, runs -> run
runners -> runner (with the Porter stemmer, for example)

This can reduce vocabulary size, but it can also remove useful nuance or produce unnatural stems.
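The sketch below makes both effects visible. It is a deliberately crude suffix stripper written for this illustration, not the Porter algorithm and not the stemmer used in the notebook; the suffix list and the minimum stem length of three characters are arbitrary assumptions.

```python
SUFFIXES = ("ing", "ed", "s")  # toy rule set, checked in this order


def crude_stem(token):
    """Strip one common suffix; a toy stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return token


tokens = ["runners", "running", "runs", "thus", "loved"]
print([crude_stem(t) for t in tokens])
# ['runner', 'run', 'run', 'thu', 'lov']
```

The related forms running and runs collapse to run, which shrinks the vocabulary, while thus and loved turn into stems that are not real words.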

Stop-word removal drops very common words such as is, and, and the. It can help with raw counts, but TF-IDF already downweights frequent words, so removing stop words is not always a win.
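A short sketch shows one way stop-word removal can hurt. Both stop lists below are tiny, hand-picked sets invented for this example; they are not the lists shipped with any library.

```python
BASIC_STOP_WORDS = {"a", "and", "is", "the", "was"}
# A more aggressive list that also treats negations as stop words.
AGGRESSIVE_STOP_WORDS = BASIC_STOP_WORDS | {"not", "no"}


def remove_stop_words(tokens, stop_words):
    return [token for token in tokens if token not in stop_words]


tokens = "the movie was not good".split()
print(remove_stop_words(tokens, BASIC_STOP_WORDS))       # ['movie', 'not', 'good']
print(remove_stop_words(tokens, AGGRESSIVE_STOP_WORDS))  # ['movie', 'good']
```

With the aggressive list, a negative review ends up with the same tokens as a positive one, so it is worth checking what a stop list contains before applying it to sentiment data.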

The notebook's grid search compares some of these choices instead of assuming one is always best.

Train a classifier¶

The main classifier is logistic regression. Despite the name, logistic regression is a classification model.

The pipeline is:

TfidfVectorizer -> LogisticRegression

That means:

Convert text reviews into TF-IDF feature vectors.
Train logistic regression to predict sentiment labels.

The notebook reports roughly 90% accuracy with the tuned model on IMDb reviews.
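To make the classifier step concrete, the sketch below shows how a fitted logistic regression turns a feature vector into a probability. The weights and bias are invented for illustration and are not taken from the notebook's trained model; a real model learns one weight per vocabulary entry from the labeled reviews.

```python
import math

# Hypothetical learned weights: positive values push toward "positive".
weights = {"excellent": 2.1, "moving": 1.2, "boring": -1.8, "awful": -2.4}
bias = 0.0


def predict_proba_positive(features):
    """features maps token -> count or TF-IDF weight for one review."""
    z = bias + sum(weights.get(token, 0.0) * value for token, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes the score into (0, 1)


p_pos = predict_proba_positive({"excellent": 1, "moving": 1})
p_neg = predict_proba_positive({"boring": 1, "awful": 1})
print(round(p_pos, 3), round(p_neg, 3))
```

A probability above 0.5 is read as a positive prediction and one below 0.5 as negative; tokens the model has no weight for contribute nothing to the score.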

Why use a pipeline?¶

A scikit-learn pipeline keeps preprocessing and modeling together.

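This matters because the vectorizer must be fit only on training data during evaluation. If the vectorizer sees the test set (or a validation fold) before evaluation, its vocabulary and IDF weights carry information from held-out data into training, and the reported score becomes too optimistic.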

The pipeline helps enforce the right workflow.
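The fit-on-train, transform-on-test discipline can be shown with a toy vectorizer. CountFeatures below is a hand-written stand-in invented for this example, not a scikit-learn class; it only mimics the fit/transform split that a pipeline relies on.

```python
from collections import Counter


class CountFeatures:
    """Toy vectorizer: fit learns the vocabulary, transform only reuses it."""

    def fit(self, docs):
        self.vocabulary_ = sorted({t for doc in docs for t in doc.lower().split()})
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            counts = Counter(doc.lower().split())
            # Tokens outside the learned vocabulary are silently ignored.
            rows.append([counts.get(token, 0) for token in self.vocabulary_])
        return rows


train_docs = ["great film", "awful film"]
test_docs = ["great soundtrack"]

vect = CountFeatures().fit(train_docs)  # the vocabulary comes from training data only
print(vect.vocabulary_)                 # ['awful', 'film', 'great']
print(vect.transform(test_docs))        # [[0, 0, 1]]
```

The word soundtrack never becomes a feature because it was not seen during fitting. Calling fit on the combined train and test documents would add it, which is exactly the kind of leak a pipeline is meant to prevent.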

Grid search¶

Grid search tries different combinations:

tokenizer vs stemmed tokenizer
stop words vs no stop words
TF-IDF vs raw counts
L1 vs L2 regularization
different regularization strengths

The benefit is systematic comparison. The cost is time. Text vectorization over 50,000 reviews can be expensive, and grid search trains many models.
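A quick count makes the cost tangible. The grid below is hypothetical and only mirrors the kinds of choices listed above; the notebook's actual grid and fold count may differ.

```python
from itertools import product

# Hypothetical grid covering the choices listed above.
param_grid = {
    "tokenizer": ["plain", "stemmed"],
    "stop_words": [None, "english"],
    "use_idf": [True, False],
    "penalty": ["l1", "l2"],
    "C": [1.0, 10.0, 100.0],
}
cv_folds = 5

combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv_folds
print(len(combinations), n_fits)  # 48 240
```

Five small lists already produce 48 settings and 240 model fits under 5-fold cross-validation, and each fit re-vectorizes its share of the reviews.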

Out-of-core learning¶

Sometimes the dataset is too large to hold in memory or vectorize all at once.

Out-of-core learning streams examples in batches and updates the model incrementally.

The notebook uses:

HashingVectorizer: converts tokens into a fixed-size feature space without storing a vocabulary
SGDClassifier: supports incremental learning with partial_fit

The trade-off:

much more memory efficient
usually less interpretable (hashed feature indices cannot be mapped back to tokens) and sometimes slightly less accurate
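The idea behind a hashing vectorizer can be sketched in plain Python. This is an illustration of the hashing trick only: it uses md5 from the standard library as a stand-in hash, whereas scikit-learn's HashingVectorizer is documented as using a MurmurHash3-based scheme, and the tiny n_features value is chosen so the output is readable.

```python
import hashlib


def hashed_counts(tokens, n_features=16):
    """Map tokens to a fixed-size count vector without storing a vocabulary."""
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Turn the hash into a bucket index in the range 0..n_features-1.
        bucket = int.from_bytes(digest[:8], "big") % n_features
        vector[bucket] += 1
    return vector


print(hashed_counts("the movie was the best".split(), n_features=8))
```

The vector length is fixed in advance, so memory does not grow with the number of distinct tokens. The price is that different tokens can collide in the same bucket and that a bucket index cannot be traced back to a word.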

Topic modeling is a related but different task¶

The notebook later introduces Latent Dirichlet Allocation (LDA). This is not Linear Discriminant Analysis from the dimensionality reduction lesson, which shares the same abbreviation.

Here, LDA means topic modeling:

discover hidden groups of words that tend to appear together across documents.

Sentiment analysis is supervised classification. Topic modeling is unsupervised structure discovery.

Explained code examples¶

Classic TF-IDF sentiment pipeline¶

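from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Assumes preprocessor (from "Cleaning text") and the splits
# X_train, y_train, X_test, y_test are already defined.
model = Pipeline([
    # A custom preprocessor replaces the vectorizer's built-in lowercasing.
    # Note: the default token pattern keeps only runs of 2+ word characters,
    # so emoticons survive only if a whitespace tokenizer is also passed.
    ("vect", TfidfVectorizer(preprocessor=preprocessor)),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)             # fits vectorizer and classifier on training data
accuracy = model.score(X_test, y_test)  # mean accuracy on the held-out test set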

What this teaches:

TfidfVectorizer turns text into numerical features
LogisticRegression learns the positive/negative decision boundary
Pipeline keeps vectorization and classification together
score evaluates on held-out test data

Grid search shape¶

from sklearn.model_selection import GridSearchCV

param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__stop_words": [None, "english"],
    "clf__C": [1.0, 10.0],
}

search = GridSearchCV(
    model,
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1,
)

search.fit(X_train, y_train)

What this teaches:

vect__... parameters belong to the vectorizer step
clf__... parameters belong to the classifier step
cross-validation estimates performance for each setting
the search can be slow because it trains many pipelines

Out-of-core sketch¶

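from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Assumes tokenizer (from "Tokenization, stemming, and stop words") is defined.
vect = HashingVectorizer(
    decode_error="ignore",
    n_features=2**21,  # fixed number of hash buckets, so no vocabulary is stored
    preprocessor=None,
    tokenizer=tokenizer,
)

clf = SGDClassifier(loss="log_loss")  # logistic regression trained with SGD

# stream_batches is a placeholder helper, not defined here; it is assumed to
# yield (list of review strings, list of 0/1 labels) one batch at a time.
for X_batch, y_batch in stream_batches("Data/movie_data.csv"):
    X_batch = vect.transform(X_batch)  # stateless, so no fit step is needed
    # classes must be given on the first partial_fit call; repeating it is harmless.
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])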

What this teaches:

HashingVectorizer avoids storing the full vocabulary
partial_fit updates the model batch by batch
this is useful when data is too large for a normal in-memory workflow
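The stream_batches call in the sketch above is not defined in this lesson. One way such a helper could look is shown below. It is a hypothetical implementation, assuming the CSV has the review and sentiment columns described earlier; the notebook's own streaming code may be organized differently.

```python
import csv


def stream_batches(path, batch_size=1000):
    """Yield (reviews, labels) batches from a CSV with review,sentiment columns."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        reviews, labels = [], []
        for row in reader:
            reviews.append(row["review"])
            labels.append(int(row["sentiment"]))
            if len(reviews) == batch_size:
                yield reviews, labels
                reviews, labels = [], []
        if reviews:  # emit the final, possibly smaller, batch
            yield reviews, labels
```

Because it is a generator, only one batch of reviews is held in memory at a time, which is the point of the out-of-core approach.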

Common traps¶

Trap: "Sentiment analysis understands emotion like a person."

A classic model learns statistical patterns in labeled text. It can work well but still miss sarcasm, context, and domain-specific meaning.

Trap: "Bag-of-words preserves sentence meaning."

Bag-of-words ignores word order. That keeps it simple and often effective, but it loses syntax.

Trap: "More preprocessing is always better."

Stemming and stop-word removal can help or hurt. Test them instead of assuming.

Trap: "TF-IDF is magic."

TF-IDF mostly reweights terms so common words count less and document-specific words count more.

Trap: "Grid search gives the true best model."

It gives the best model among the parameter choices you tried, under the validation setup you used.

Trap: "High accuracy means the model is ready for every review site."

IMDb movie reviews are one domain. Customer support tickets, tweets, or product reviews may need new evaluation and possibly retraining.

Trap: "Topic modeling is sentiment analysis."

Topic modeling discovers themes without labels. Sentiment classification predicts labels like positive or negative.

Check yourself¶

What is the target label in the IMDb sentiment task?

Whether the review is positive or negative, represented as 1 or 0.

Why do we need vectorization?

Traditional machine-learning models need numerical feature vectors, not raw text strings.

What does bag-of-words ignore?

Word order. It represents documents by token counts or weights.

What problem does TF-IDF address?

It downweights words that occur in many documents and are less useful for distinguishing documents.

Why use a pipeline?

It keeps vectorization and classification together, reducing leakage risk and making training/evaluation cleaner.

Why can grid search be expensive for text classification?

Each parameter combination may require rebuilding text features and training a model across multiple cross-validation folds.

What is out-of-core learning?

Training incrementally on batches so the full dataset does not need to be held in memory.

How is topic modeling different from sentiment classification?

Topic modeling is unsupervised and discovers themes; sentiment classification is supervised and predicts sentiment labels.

Next¶

Next: Cluster Analysis: First Intuition

Source anchors¶

Source file: notebooks/Module2/07-Sentiment Analysis.ipynb
Source datasets: notebooks/Module2/Data/aclImdb, notebooks/Module2/Data/movie_data.csv
Key source concepts: IMDb sentiment classification, bag-of-words, CountVectorizer, term frequency, TF-IDF, text preprocessing, emoticon handling, tokenization, stemming, stop-word removal, logistic regression, scikit-learn pipelines, grid search, out-of-core learning, HashingVectorizer, SGDClassifier, topic modeling with Latent Dirichlet Allocation
