Sentiment Analysis

Why this matters

Sentiment analysis turns text into a prediction about attitude: positive, negative, or sometimes neutral. Businesses use it for reviews and customer feedback. Social scientists use it for opinions and public discourse. Product teams use it to summarize complaints and praise.

The notebook builds a classic machine-learning sentiment pipeline on IMDb movie reviews:

raw review text -> cleaned text -> numerical features -> classifier -> sentiment prediction

This lesson focuses on that pipeline. The details matter because text is not naturally numerical, and machine-learning models need numbers.

Mental model

A sentiment classifier does not "read" a review like a human. In the classic approach, it learns statistical associations between words and labels.

Example:

"excellent, moving, unforgettable" -> likely positive
"boring, awful, waste" -> likely negative

The model sees many labeled examples, converts each review into a feature vector, and learns which patterns tend to indicate positive or negative sentiment.

Core ideas

  • Sentiment analysis is supervised text classification when labels are available.
  • Text must be converted into numerical features before a traditional ML model can use it.
  • Bag-of-words represents documents by token counts, ignoring word order.
  • TF-IDF downweights words that appear in many documents and are therefore less discriminative.
  • Text preprocessing removes or normalizes noise such as HTML, punctuation, casing, and sometimes stop words.
  • Tokenization splits text into units such as words.
  • Stemming maps related word forms to a shared root, but can also distort words.
  • Pipelines combine vectorization and classification into one trainable workflow.
  • Grid search compares preprocessing and model hyperparameters.
  • Out-of-core learning trains incrementally when the dataset is too large to fit comfortably in memory.
  • Topic modeling is related, but it is unsupervised and tries to discover themes, not sentiment labels.

Walkthrough

The IMDb task

The dataset contains 50,000 movie reviews:

  • positive reviews: IMDb rating greater than 6
  • negative reviews: IMDb rating less than 5

The goal is to train a model that predicts whether a new review is positive or negative.

The notebook first assembles individual text files into a CSV with two columns:

review, sentiment

where sentiment is 1 for positive and 0 for negative.
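As a rough sketch of that assembly step, the function below walks the extracted archive and writes one CSV row per review. It assumes the dataset's usual aclImdb/{train,test}/{pos,neg}/*.txt layout. The name build_csv and the use of the standard-library csv module are choices made here to keep the sketch dependency-free; the notebook may do this differently.

```python
import csv
from pathlib import Path

LABELS = {"pos": 1, "neg": 0}  # sentiment encoding used in this lesson


def build_csv(base_dir, out_path):
    """Write every review under base_dir to a two-column CSV; return the row count."""
    n_rows = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["review", "sentiment"])
        for split in ("train", "test"):
            for folder, label in LABELS.items():
                # sorted() keeps the row order reproducible across runs
                for path in sorted(Path(base_dir, split, folder).glob("*.txt")):
                    writer.writerow([path.read_text(encoding="utf-8"), label])
                    n_rows += 1
    return n_rows
```

This loop writes rows grouped by split and label, so shuffling the rows before training is a sensible follow-up step.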

Why text needs vectorization

Models such as logistic regression cannot directly use raw text.

This is not a valid model input:

"This movie was surprisingly good."

The text must become numbers:

[0, 2, 0, 1, 0, 0, 1, ...]

Each position in the vector corresponds to a token in the vocabulary.

Bag-of-words

Bag-of-words has two steps:

  1. Build a vocabulary of unique tokens.
  2. Count how often each token appears in each document.

For these documents:

The sun is shining
The weather is sweet

the vocabulary might be:

is, shining, sun, sweet, the, weather

Each document becomes a count vector over that vocabulary.
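The two steps can be sketched in plain Python for the two example documents. This is a hand-rolled illustration with a naive whitespace tokenizer, not the notebook's code; the helper name count_vector is made up here.

```python
from collections import Counter

docs = ["The sun is shining", "The weather is sweet"]

# Lowercase and split on whitespace; real tokenizers are more careful.
tokenized = [doc.lower().split() for doc in docs]

# Step 1: vocabulary of unique tokens, sorted so each token gets a fixed column.
vocabulary = sorted({token for tokens in tokenized for token in tokens})


# Step 2: one count vector per document, aligned with the vocabulary.
def count_vector(tokens):
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]


vectors = [count_vector(tokens) for tokens in tokenized]

print(vocabulary)  # ['is', 'shining', 'sun', 'sweet', 'the', 'weather']
print(vectors[0])  # [1, 1, 1, 0, 1, 0]
print(vectors[1])  # [1, 0, 0, 1, 1, 1]
```

scikit-learn's CountVectorizer performs the same two steps and returns the counts as a sparse matrix.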

The trade-off is simple:

  • bag-of-words is easy and effective
  • it ignores word order

So:

not good

can be hard to distinguish from:

good

unless the model uses n-grams or other features.
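A small sketch of word n-grams shows how the negation becomes its own feature. The function name ngrams and its parameters are illustrative; in scikit-learn the same idea is controlled by the vectorizer's ngram_range setting.

```python
def ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max as strings."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(ngrams("not good"))  # ['not', 'good', 'not good']
print(ngrams("good"))      # ['good']
```

With unigrams only, both texts share the feature "good". With bigrams added, "not good" is a separate feature the classifier can learn a negative weight for.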

TF-IDF

Raw word counts can overvalue common words. Words such as "is", "the", and "movie" may occur often without helping much.

TF-IDF adjusts word counts using this idea:

A word is more useful when it appears often in this document but not in almost every document.

So a word like "excellent" may carry more signal than "the", even if "the" appears more often.

The practical scikit-learn shortcut is TfidfVectorizer, which combines token counting and TF-IDF weighting.
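To make the weighting concrete, here is a hand computation on three invented documents. It uses the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, which is scikit-learn's default variant. scikit-learn also L2-normalizes each document vector afterwards; that step is left out here to keep the arithmetic visible. The helper names idf and tfidf are made up for this sketch.

```python
import math

docs = [
    "the sun is shining",
    "the weather is sweet",
    "the sun is shining and the weather is sweet",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)


def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf(term, tokens):
    """Raw term count in one document, scaled by the term's idf."""
    return tokens.count(term) * idf(term)


print(round(idf("the"), 2))  # 1.0  (appears in every document)
print(round(idf("and"), 2))  # 1.69 (appears in one document only)

last = tokenized[2]
print(round(tfidf("the", last), 2))  # 2.0  (two occurrences, weight 1.0 each)
print(round(tfidf("and", last), 2))  # 1.69 (one occurrence, weight 1.69)
```

Each occurrence of "and" counts about 1.7 times as much as each occurrence of "the", because "and" is rarer across the collection.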

Cleaning text

IMDb reviews contain HTML markup, punctuation, capitalization, and emoticons.

A simple cleaner can:

  • remove HTML tags
  • lowercase text
  • remove many non-word characters
  • preserve emoticons because they can carry sentiment

Example:

import re


def preprocessor(text):
    text = re.sub(r"<[^>]*>", "", text)
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    text = re.sub(r"[\W]+", " ", text.lower())
    text = text + " " + " ".join(emoticons).replace("-", "")
    return text.strip()

This is a teaching cleaner, not a universal NLP cleaner. In real projects, preprocessing choices should match the data and task.
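A quick check on a made-up string shows what the cleaner keeps and drops. The definition is repeated so the snippet runs on its own; the sample input is invented for illustration.

```python
import re


def preprocessor(text):
    text = re.sub(r"<[^>]*>", "", text)                            # drop HTML tags
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)  # remember emoticons
    text = re.sub(r"[\W]+", " ", text.lower())                     # non-word runs -> one space
    text = text + " " + " ".join(emoticons).replace("-", "")       # re-attach, "noses" removed
    return text.strip()


sample = "</a>This :) is :( a test :-)!"  # made-up input
print(preprocessor(sample).split())
# ['this', 'is', 'a', 'test', ':)', ':(', ':)']
```

The emoticons end up at the end of the text. That is harmless for bag-of-words features, which ignore word order anyway.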

Tokenization, stemming, and stop words

Tokenization splits text into tokens:

def tokenizer(text):
    return text.split()

Stemming reduces related word forms to a shared stem. With a Porter-style stemmer, for example:

running, runs -> run
runners -> runner

This can reduce vocabulary size, but it can also remove useful nuance or produce unnatural stems.
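A stemmed tokenizer might look like the sketch below. It assumes NLTK is installed and uses its Porter stemmer; the function name tokenizer_porter is chosen here for illustration.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()


def tokenizer_porter(text):
    """Split on whitespace, then reduce each token to its Porter stem."""
    return [porter.stem(word) for word in text.split()]


print(tokenizer_porter("runners like running and thus they run"))
# ['runner', 'like', 'run', 'and', 'thu', 'they', 'run']
```

The stem "thu" for "thus" is an example of an unnatural stem: the result is not a real word.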

Stop-word removal drops common words such as "is", "and", and "the". It can help with raw counts, but TF-IDF already downweights frequent words. Removing stop words is not always a win.

The notebook's grid search compares some of these choices instead of assuming one is always best.

Train a classifier

The main classifier is logistic regression. Despite the name, logistic regression is a classification model.

The pipeline is:

TfidfVectorizer -> LogisticRegression

That means:

  1. Convert text reviews into TF-IDF feature vectors.
  2. Train logistic regression to predict sentiment labels.

The notebook reports roughly 90% accuracy with the tuned model on IMDb reviews.

Why use a pipeline?

A scikit-learn pipeline keeps preprocessing and modeling together.

This matters because the vectorizer must be fit only on training data during evaluation. If the vectorizer is also fit on the test set, its vocabulary and document frequencies carry information from the test set into training.

The pipeline helps enforce the right workflow.
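The sketch below shows the fit/transform split on invented text. It uses CountVectorizer rather than TF-IDF only to keep the output easy to read; a Pipeline applies the same split automatically, including inside each cross-validation fold.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["a moving film", "a boring film"]
test_docs = ["a dazzling film"]  # "dazzling" never occurs in training

vect = CountVectorizer()
vect.fit(train_docs)  # vocabulary comes from training data only

print(sorted(vect.vocabulary_))             # ['boring', 'film', 'moving']
print(vect.transform(test_docs).toarray())  # [[0 1 0]]
```

The test-only word "dazzling" has no column, so nothing about the test set shaped the features. The single-letter token "a" is dropped by the vectorizer's default token pattern.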

Grid search tries different combinations:

  • tokenizer vs stemmed tokenizer
  • stop words vs no stop words
  • TF-IDF vs raw counts
  • L1 vs L2 regularization
  • different regularization strengths

The benefit is systematic comparison. The cost is time. Text vectorization over 50,000 reviews can be expensive, and grid search trains many models.

Out-of-core learning

Sometimes the dataset is too large to hold in memory or vectorize all at once.

Out-of-core learning streams examples in batches and updates the model incrementally.

The notebook uses:

  • HashingVectorizer: converts tokens into a fixed-size feature space without storing a vocabulary
  • SGDClassifier: supports incremental learning with partial_fit

The trade-off:

  • much more memory efficient
  • usually less interpretable and sometimes slightly less accurate
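The idea behind feature hashing can be sketched without any library. This toy version uses MD5 from the standard library as a stand-in hash; scikit-learn's HashingVectorizer uses MurmurHash3 and a much larger feature space. The names hashed_index and hash_vector are made up here.

```python
import hashlib

N_FEATURES = 2 ** 4  # tiny on purpose; real settings use far more columns


def hashed_index(token):
    """Map a token to a column index without consulting any vocabulary."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_FEATURES


def hash_vector(text):
    """Count tokens into a fixed-length vector; colliding tokens share a column."""
    vector = [0] * N_FEATURES
    for token in text.lower().split():
        vector[hashed_index(token)] += 1
    return vector


vec = hash_vector("good movie good cast")
print(len(vec), sum(vec))  # 16 4
```

Because the column is computed from the token itself, no vocabulary has to be built or stored. The price is that two different tokens can land in the same column, and a column index cannot be mapped back to a word.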

Topic modeling

The notebook later introduces Latent Dirichlet Allocation, also called LDA. This is not Linear Discriminant Analysis from the dimensionality reduction lesson.

Here, LDA means topic modeling:

discover hidden groups of words that tend to appear together across documents.

Sentiment analysis is supervised classification. Topic modeling is unsupervised structure discovery.
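A minimal sketch of the unsupervised workflow, using scikit-learn's LatentDirichletAllocation on an invented four-document corpus. The number of topics and the corpus are arbitrary choices for illustration; with data this small the topics themselves are not meaningful, so only the shapes are shown.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented mini-corpus; no sentiment labels are used anywhere below.
docs = [
    "space ship alien planet",
    "alien invasion space battle",
    "wedding romance love story",
    "love letter romance kiss",
]

counts = CountVectorizer().fit_transform(docs)  # LDA works on raw counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)          # one row of topic weights per document

print(doc_topics.shape)       # (4, 2)
print(lda.components_.shape)  # (2, 12)
```

Note that fit_transform receives only the documents. There is no y argument, which is the practical difference from the sentiment classifier.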

Explained code examples

Classic TF-IDF sentiment pipeline

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

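# preprocessor is the cleaning function defined earlier in this lesson
model = Pipeline([
    ("vect", TfidfVectorizer(preprocessor=preprocessor)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# X_train / X_test hold raw review strings; y_train / y_test hold 0/1 labels
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # fraction of test reviews classified correctly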

What this teaches:

  • TfidfVectorizer turns text into numerical features
  • LogisticRegression learns the positive/negative decision boundary
  • Pipeline keeps vectorization and classification together
  • score evaluates on held-out test data
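To try the same pipeline shape without the IMDb data, the sketch below trains on six invented reviews. It is a toy: the data, the names X_toy, y_toy, and toy_model, and the default vectorizer settings are all assumptions made here, and the result says nothing about accuracy on real reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Six invented reviews; 1 = positive, 0 = negative.
X_toy = [
    "excellent moving unforgettable",
    "excellent acting",
    "a moving and unforgettable story",
    "boring awful waste",
    "awful plot",
    "a boring waste of time",
]
y_toy = [1, 1, 1, 0, 0, 0]

toy_model = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
toy_model.fit(X_toy, y_toy)

# The pipeline accepts raw strings; vectorization happens inside it.
print(toy_model.predict(["an excellent story", "an awful waste"]))  # [1 0]
```

The word "an" never appears in training, so the vectorizer ignores it and the prediction rests on the remaining words.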

Grid search shape

from sklearn.model_selection import GridSearchCV

param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__stop_words": [None, "english"],
    "clf__C": [1.0, 10.0],
}

search = GridSearchCV(
    model,
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1,
)

search.fit(X_train, y_train)

What this teaches:

  • vect__... parameters belong to the vectorizer step
  • clf__... parameters belong to the classifier step
  • cross-validation estimates performance for each setting
  • the search can be slow because it trains many pipelines
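Counting the fits makes the cost concrete. The sketch below uses the same grid and fold count as the example above and needs only the standard library.

```python
from itertools import product

param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__stop_words": [None, "english"],
    "clf__C": [1.0, 10.0],
}
cv_folds = 5

# Every combination of one value per parameter is a separate candidate.
candidates = list(product(*param_grid.values()))
n_fits = len(candidates) * cv_folds

print(len(candidates))  # 8
print(n_fits)           # 40
```

By default GridSearchCV also refits the best candidate once on the full training set, so this small grid already means 41 pipeline fits. Adding one more two-valued parameter doubles the count.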

Out-of-core sketch

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(
    decode_error="ignore",
    n_features=2**21,
    preprocessor=None,
    tokenizer=tokenizer,
)

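clf = SGDClassifier(loss="log_loss")  # logistic regression trained with SGD

# stream_batches is a placeholder for a generator that yields (texts, labels) batches
for X_batch, y_batch in stream_batches("Data/movie_data.csv"):
    X_batch = vect.transform(X_batch)  # stateless, so no fit step is needed
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])  # classes are required on the first call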

What this teaches:

  • HashingVectorizer avoids storing the full vocabulary
  • partial_fit updates the model batch by batch
  • this is useful when data is too large for a normal in-memory workflow
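The sketch above leaves stream_batches undefined. One possible implementation, assuming a CSV with review and sentiment columns, is shown below. The batch_size parameter and its default are choices made here; the notebook's own helper may differ.

```python
import csv


def stream_batches(path, batch_size=1000):
    """Yield (reviews, labels) lists of up to batch_size rows from a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)  # expects "review" and "sentiment" columns
        texts, labels = [], []
        for row in reader:
            texts.append(row["review"])
            labels.append(int(row["sentiment"]))
            if len(texts) == batch_size:
                yield texts, labels
                texts, labels = [], []
        if texts:  # final, possibly smaller batch
            yield texts, labels
```

Because this is a generator, only one batch of reviews is held in memory at a time.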

Common traps

Each statement below is a common misconception, followed by a correction.

Sentiment analysis understands emotion like a person.

A classic model learns statistical patterns in labeled text. It can work well but still miss sarcasm, context, and domain-specific meaning.

Bag-of-words preserves sentence meaning.

Bag-of-words ignores word order. That is simple and powerful, but it loses syntax.

More preprocessing is always better.

Stemming and stop-word removal can help or hurt. Test them instead of assuming.

TF-IDF is magic.

TF-IDF mostly reweights terms so common words count less and document-specific words count more.

Grid search gives the true best model.

It gives the best model among the parameter choices you tried, under the validation setup you used.

High accuracy means the model is ready for every review site.

IMDb movie reviews are one domain. Customer support tickets, tweets, or product reviews may need new evaluation and possibly retraining.

Topic modeling is sentiment analysis.

Topic modeling discovers themes without labels. Sentiment classification predicts labels like positive or negative.

Check yourself

What is the target label in the IMDb sentiment task?

Whether the review is positive or negative, represented as 1 or 0.

Why do we need vectorization?

Traditional machine-learning models need numerical feature vectors, not raw text strings.

What does bag-of-words ignore?

Word order. It represents documents by token counts or weights.

What problem does TF-IDF address?

It downweights words that occur in many documents and are less useful for distinguishing documents.

Why use a pipeline?

It keeps vectorization and classification together, reducing leakage risk and making training/evaluation cleaner.

Why can grid search be expensive for text classification?

Each parameter combination may require rebuilding text features and training a model across multiple cross-validation folds.

What is out-of-core learning?

Training incrementally on batches so the full dataset does not need to be held in memory.

How is topic modeling different from sentiment classification?

Topic modeling is unsupervised and discovers themes; sentiment classification is supervised and predicts sentiment labels.

Source anchors

  • Source file: notebooks/Module2/07-Sentiment Analysis.ipynb
  • Source datasets: notebooks/Module2/Data/aclImdb, notebooks/Module2/Data/movie_data.csv
  • Key source concepts: IMDb sentiment classification, bag-of-words, CountVectorizer, term frequency, TF-IDF, text preprocessing, emoticon handling, tokenization, stemming, stop-word removal, logistic regression, scikit-learn pipelines, grid search, out-of-core learning, HashingVectorizer, SGDClassifier, topic modeling with Latent Dirichlet Allocation