Movie review classification - Part 1
Posted by Maxime Kan in posts
In this project, let us tackle a very well known ML problem: how to classify movie reviews from the IMDB platform. Our data set is composed of reviews labeled in a binary way, that is, whether the grade the reviewer associated with the text is positive or negative.
A lot of resources can be found about this problem online, and we want to use this very classic ML problem to go through different Natural Language Processing methodologies and compare their performance. Our results can then be compared to other benchmarks in the ML literature.
In this post, we will only use a linear model - the Logistic Regression - to perform the classification task. In a later post, we will investigate what more powerful techniques such as Neural Networks can add to this problem.
1. Load data¶
As always, let us first state what packages we will be using. The data set comes from keras.datasets, making it easy to use.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import string
from keras.datasets import imdb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.metrics import accuracy_score
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=3000)
word_to_index = imdb.get_word_index()
2. Data preparation¶
Reformatting the reviews¶
There is not much data preparation to be done here because this Keras package does most of the work for us beforehand. However, it does a little too much preprocessing for our taste, because we are loading... numeric indices instead of words!
print(train_data[0])
This is because Keras has already done a word-to-index matching, referenced in the word_to_index dictionary we loaded above. Let us map the indices back to words to get an idea of what these reviews actually look like.
word_to_index = dict([(key,(value+3)) for key,value in word_to_index.items()])
word_to_index["<PAD>"] = 0
word_to_index["<START>"] = 1
word_to_index["<UNK>"] = 2 # unknown
word_to_index["<UNUSED>"] = 3
index_to_word = dict([(value, key) for (key, value) in word_to_index.items()])
def to_words(review):
    return [index_to_word[i] for i in review]
This is what the index list above looks like... slightly more intuitive! It can easily be seen that the label will be 1 (positive) because of the presence of words like "brilliant", "amazing", "loved", "great", "would recommend", "lovely" etc.
print("label: %s" % train_labels[0])
print("review: %s" % " ".join(to_words(train_data[0])))
We could work directly with vectors of indices (this is actually the intended use of the Keras data set). However, this is limiting in terms of preprocessing, so let us convert the data back to words as above.
# dtype=object is needed because the reviews have different lengths (ragged arrays)
train_reviews = np.array([to_words(review) for review in train_data], dtype=object)
test_reviews = np.array([to_words(review) for review in test_data], dtype=object)
Quick data exploration¶
Let us quickly check if the train data is balanced between negative and positive reviews:
print("Number of positive reviews in the train data: %i" % sum(train_labels == 1))
print("Number of negative reviews in the train data: %i" % sum(train_labels == 0))
if sum(train_labels == 1) == sum(train_labels == 0):
    print("\n The data is perfectly balanced!")
else:
    print("\n The data is not balanced")
Now let us just analyze how many words these reviews usually have to get a rough idea of what we are actually talking about here!
# subtract 1 to discount the <START> token at the beginning of each review
length_reviews = np.array([len(review) - 1 for review in train_data])
plt.hist(length_reviews, bins=50)
plt.title("Length of reviews in train set (number of words)")
plt.show()
print("On average, there are %i words in negative reviews" % np.mean(length_reviews[train_labels == 0]))
print("On average, there are %i words in positive reviews" % np.mean(length_reviews[train_labels == 1]))
As we could have expected, positive reviews tend to be a little longer. More generally, we can see from the histogram that the vast majority of reviews contain fewer than 1,000 words.
Now that we are all set, let's go!
3. Preparing the bag of words¶
Bag of words (BoW) is the most essential feature representation in Natural Language Processing. Basically, it works like disassembling a Lego castle: you throw all the bricks in a box, and once everything is in the box, you don't care about which brick was where anymore. The same goes for bags of words: all words are thrown into a bag without remembering where each word was located in the sentence. Then, we vectorize each bag of words (in our case, each review) by associating a frequency to each word. Good reviews will have larger frequencies for words like "great", bad reviews will have larger frequencies for words like "disappointing" etc.
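To make this more concrete, here is a minimal sketch of a bag of words, using scikit-learn's CountVectorizer on two made-up mini reviews (not our actual data):
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up mini reviews, purely for illustration
toy_reviews = ["a great great movie", "a disappointing movie"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_reviews)

# Each review becomes a vector of word counts and word order is lost
# (single-letter tokens like "a" are dropped by the default tokenizer)
print(vectorizer.get_feature_names_out())  # ['disappointing' 'great' 'movie']
print(counts.toarray())  # rows: [0 2 1] and [1 0 1]
The count of 2 for "great" in the first review is exactly the kind of frequency signal the classifier will rely on.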
However, before using the BoW, we need to make sure each word is a uniform unit. Let us quickly go through a couple of text processing techniques that serve this purpose.
Length of the reviews¶
As we"ve seen from the data exploration part, some reviews are really long and we probably do not need to analyse thousands of words to figure out what the sentiment of a review is. Let us cap the number of words we allow our reviews to have so that it gets easier to process the data (and more reasonable).
cap = 500
train_reviews = np.array([review[:cap] for review in train_reviews], dtype=object)
test_reviews = np.array([review[:cap] for review in test_reviews], dtype=object)
Word format¶
Easy! Keras has done everything for us already: the data is all lowercase, there are no punctuation signs... Who could ask for more? However, there are still numbers (like years) that we would like to remove, because we want words, not numbers, to play a role in our analysis.
def remove_numeric(review):
    return [word for word in review if word.isalpha()]

train_reviews = np.array([remove_numeric(review) for review in train_reviews], dtype=object)
test_reviews = np.array([remove_numeric(review) for review in test_reviews], dtype=object)
Stopwords¶
Stopwords are words we eliminate from the bag of words because we decide they are not relevant. First, let us eliminate the four special tokens we introduced in the Data preparation step. Then, we can also add English stopwords ("he", "be", "i" etc). This step does not necessarily help, so we'll try with and without later on.
stopwords_keras = ["<PAD>","<START>","<UNK>","<UNUSED>"]
stopwords_english = stopwords.words("english")
def remove_stopwords(review):
    # keep only words that are not in the Keras special-token list
    return [word for word in review if word not in stopwords_keras]

train_reviews = np.array([remove_stopwords(review) for review in train_reviews], dtype=object)
test_reviews = np.array([remove_stopwords(review) for review in test_reviews], dtype=object)
Stemming¶
Stemming... Stemming is a really nice idea. Shouldn't "loving" and "love" be considered the same word, since they convey the same meaning? Stemmers aim to remove grammatical suffixes and reduce words to their roots. There are several stemming algorithms out there. The SnowballStemmer from nltk is widely used, but you might prefer other algorithms.
However sensible this may sound, stemming does not always improve model performance significantly. We'll see for ourselves whether it adds anything to our project!
stemmer = SnowballStemmer("english")
print(stemmer.stem("loving"))
def stemming(review):
    return [stemmer.stem(word) for word in review]

train_reviews_stemmed = np.array([stemming(review) for review in train_reviews], dtype=object)
test_reviews_stemmed = np.array([stemming(review) for review in test_reviews], dtype=object)
train_reviews = np.array([" ".join(review) for review in train_reviews])
test_reviews = np.array([" ".join(review) for review in test_reviews])
train_reviews_stemmed = np.array([" ".join(review) for review in train_reviews_stemmed])
test_reviews_stemmed = np.array([" ".join(review) for review in test_reviews_stemmed])
4. Classification with the Logistic Regression¶
Now that we have preprocessed the review data, we are ready to start with the modeling part. In this project, we will use the Logistic Regression for this. We will follow a two-step approach:
First, we need to convert our Bag of Words to a vectorized form. Intuitively, it is like assigning frequencies to each word of our sentences and putting them together in one (very) long vector. Here, instead of just computing a word count, we will use the celebrated Tf-idf vectorizer, which also takes into account how often a word appears in other reviews (see the short sketch below).
Once this is done, these vectorized versions of the reviews are pushed to our Logistic Regression model.
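As a quick illustration of the Tf-idf weighting (again on made-up sentences rather than our actual reviews), note how a word that appears in every document gets downweighted relative to more distinctive words:
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up mini reviews; "movie" appears in all of them
toy_reviews = ["a great movie", "a disappointing movie", "a boring movie"]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(toy_reviews)

# "movie" is present everywhere, so its idf (and hence its weight) is lower
# than that of the distinctive words "great", "disappointing" and "boring"
print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))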
To optimize the performance of our model, we will gridsearch the best parameters:
For the Tf-idf vectorization
- ngram_range: how many successive words the vectorization is performed on - what size of ngrams we allow
- min_df: the minimum fraction of documents an ngram must appear in to be kept
- max_df: the maximum fraction of documents an ngram may appear in before it is discarded
- stop_words: whether we use the stopword list we defined above
For the Logistic Regression
- C: the inverse of the regularization strength of the logistic regression - the smaller C, the stronger the regularization (see the sketch below)
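To see what C does in practice, here is a small sketch on random toy data (not our reviews), showing that a smaller C, i.e. stronger regularization, shrinks the coefficients of the logistic regression towards zero:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random toy data, purely to illustrate the effect of C
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)

for C in [0.01, 0.1, 1]:
    lr = LogisticRegression(solver="lbfgs", C=C).fit(X, y)
    # smaller C = stronger regularization = smaller coefficients
    print("C=%.2f, mean absolute coefficient: %.3f" % (C, np.abs(lr.coef_).mean()))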
These two steps are put together in a sklearn pipeline. We will train this pipeline on the non-stemmed data first and run the gridsearch on this dataset.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(solver="lbfgs"), memory="cache_folder")
param_grid = {"tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)],
"tfidfvectorizer__min_df": [0.025, 0.05],
"tfidfvectorizer__max_df": [0.9, 0.8],
"tfidfvectorizer__stop_words": [None, stopwords_english],
"logisticregression__C": [0.1, 1]}
Non-stemmed reviews¶
grid = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=3, return_train_score=False)
grid.fit(train_reviews, train_labels)
best_score_non_stemmed = grid.best_score_
print("The logistic regression achieves a %0.2f accuracy on the validation set" % best_score_non_stemmed)
Our logistic regression achieved 85% accuracy on the validation set, which is good given that this is only a linear model! This score is achieved with the following parameters:
best_params_non_stemmed = grid.best_params_
pd.DataFrame({
"parameter": list(best_params_non_stemmed.keys()),
"best value": list(best_params_non_stemmed.values())
})
grid_results = pd.DataFrame(grid.cv_results_)
grid_results["param_tfidfvectorizer__stop_words"] = grid_results["param_tfidfvectorizer__stop_words"].apply(
lambda x: "remove stopwords" if x else "with stopwords"
)
parameter_names = ["param_logisticregression__C", "param_tfidfvectorizer__max_df", "param_tfidfvectorizer__min_df",
"param_tfidfvectorizer__ngram_range", "param_tfidfvectorizer__stop_words"]
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes_list = axes.ravel()
axes[-1, -1].axis("off")
fig.suptitle("Average validation score for each choice of parameter")
for i in range(5):
    sns.boxplot(x=parameter_names[i], y="mean_test_score", data=grid_results, ax=axes_list[i])
    axes_list[i].set_xlabel(parameter_names[i].split("_", 1)[1])
The results of the gridsearch highlighted in these boxplots allow us to draw the following conclusions:
- The logistic regression works best with less regularization. In other words, it is beneficial to assign weights to a large number of words
- Ignoring words appearing with a low frequency in the reviews results in much poorer prediction accuracy
- On the other hand, removing words that appear with a very high frequency in the reviews does not impact our prediction accuracy
- Adding bigrams to the word vectorization improves the model, but extending further to trigrams does not
- Removing stopwords impacts prediction accuracy negatively
pipe_best = make_pipeline(TfidfVectorizer(ngram_range=best_params_non_stemmed["tfidfvectorizer__ngram_range"],
max_df=best_params_non_stemmed["tfidfvectorizer__max_df"],
min_df=best_params_non_stemmed["tfidfvectorizer__min_df"],
stop_words=best_params_non_stemmed["tfidfvectorizer__stop_words"]),
LogisticRegression(solver="lbfgs", C=best_params_non_stemmed["logisticregression__C"]),
memory="cache_folder")
Stemmed reviews¶
Now, let us use the stemmed reviews. As mentioned above, stemming usually gives a small boost in Natural Language Processing. However, we will not run the gridsearch again on the stemmed reviews here. Of course, it could be done, but it is likely to bring little value for a lot of computational time (running the gridsearch above already takes a while!). Instead, let us just check how the best model identified above performs.
cross_val_stemmed = cross_validate(
pipe_best, train_reviews_stemmed, train_labels, scoring="accuracy", cv=3, return_train_score=False
)
cross_val_stemmed_score = np.mean(cross_val_stemmed["test_score"])
print("Using stemming for the reviews, we now have a validation accuracy score of %0.2f" %cross_val_stemmed_score)
if cross_val_stemmed_score > best_score_non_stemmed:
    print("Stemming has increased our prediction accuracy!")
else:
    print("Stemming has decreased our prediction accuracy!")
As expected, stemming does increase our validation score a little, but the improvement is not that substantial either.
Now, at this stage, we have finished building the model. We have found the parameters to use for the vectorization of our reviews, that it is better not to remove stopwords, that stemming helps, and that weaker regularization of the logistic regression works best.
The reason why Logistic Regression is such a nice tool for NLP problems is that you can quantify how much each word contributes to a negative or a positive sentiment, thanks to the weights that get assigned to each of the words after the vectorization. Let us look at this on the non-stemmed dataset, as it is slightly more readable!
vect = TfidfVectorizer(ngram_range=best_params_non_stemmed["tfidfvectorizer__ngram_range"],
max_df=best_params_non_stemmed["tfidfvectorizer__max_df"],
min_df=best_params_non_stemmed["tfidfvectorizer__min_df"],
stop_words=best_params_non_stemmed["tfidfvectorizer__stop_words"])
vect_train_reviews = vect.fit_transform(train_reviews)
# get_feature_names() was renamed to get_feature_names_out() in scikit-learn 1.0
feature_names = vect.get_feature_names_out()
lr = LogisticRegression(solver="lbfgs", C=best_params_non_stemmed["logisticregression__C"])
lr.fit(vect_train_reviews, train_labels)
#All credits for this plot go to Prof A. Mueller, who teaches Applied Machine Learning at Columbia University
def plot_important_features(coef, feature_names, top_n=20, ax=None, rotation=40):
    if ax is None:
        ax = plt.gca()
    inds = np.argsort(coef)
    low = inds[:top_n]
    high = inds[-top_n:]
    important = np.hstack([low, high])
    myrange = range(len(important))
    colors = ["red"] * top_n + ["blue"] * top_n
    ax.bar(myrange, coef[important], color=colors)
    ax.set_xticks(myrange)
    ax.set_xticklabels(feature_names[important], rotation=rotation, ha="right")
    ax.set_xlim(-.7, 2 * top_n)
    ax.set_frame_on(False)
plt.figure(figsize=(15, 6))
plot_important_features(lr.coef_.ravel(), np.array(feature_names), top_n=20, rotation=40)
plt.title("Top 20 most and less significant words in the IMDB reviews")
All of this makes a lot of sense! Note that the vast majority of these words are unigrams, with the exception of "the worst" and "the best". This shows that the benefit of using bigrams is quite limited and that a straightforward unigram vectorization could have worked just as well.
Before wrapping up, let us discuss the main downside of this model (because yes, it has some...). Here are two reviews:
pipe_best.fit(train_reviews_stemmed, train_labels)
reviews = ["i loved this movie it was amazing",
"this movie was not good at all"]
The first is straightforward. But the second expresses the opposite sentiment through a negation. The word "good" is still in it, and our model is not great at understanding how "not" reverses the meaning of "good". Hence, the first review gets correctly labeled as positive with a probability of 97%, whereas the second review gets labeled as negative with only 51% probability.
# stemming() expects a list of words, so split each review into words first
pipe_best.predict_proba([" ".join(stemming(review.split())) for review in reviews])
This illustrates that our model only looks at word blocks and is not able to retrieve any semantics from the syntax of the reviews. It is the main downside of this model.
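One way to see this limitation: with unigrams only, the two toy reviews below differ by a single feature ("not"), even though their sentiments are opposite. Bigrams mitigate this somewhat by making "not good" a feature of its own, but the model still has no real notion of syntax. A small sketch, on made-up sentences:
from sklearn.feature_extraction.text import TfidfVectorizer

toy = ["this movie was good", "this movie was not good"]

# With unigrams only, the two reviews share all features except "not"
print(TfidfVectorizer(ngram_range=(1, 1)).fit(toy).get_feature_names_out())

# With bigrams, "not good" becomes its own feature the model can weight
print(TfidfVectorizer(ngram_range=(1, 2)).fit(toy).get_feature_names_out())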
5. Testing the model¶
In the previous steps, we have evaluated our model using cross validation. This has allowed us to identify which modeling steps yielded the best results. Now, let us see how our model performs on reviews from the test set.
test_score = accuracy_score(test_labels, pipe_best.predict(test_reviews_stemmed))
print("Our classifier has an accuracy of %0.2f on the test set" % test_score)
This is really good, isn't it? Just for fun, let us try out our model on two Hitchcock reviews found online (both are very positive reviews).
a) A review of Notorious (Alfred Hitchcock, 1946) by Frank Cottrell Boyce, The Guardian 2012 (full review here)
"Notorious is perfect. Everyone knows that. It"s a testament to Ben Hecht"s complex, headlong script that so many people have tried to rip it off and a testament to Hitchcock"s genius that no one has ever succeeded. Take a look at the gabby, inconsequential, forgotten Mission Impossible: II and you"ll see what I mean. The more obvious glories of Notorious include a revelatory performance from Cary Grant as the morally exhausted American agent Devlin, a terrifying Nazi-mother super-villain played by Leopoldine Konstantin and cinema"s most cunningly prolonged kiss."
b) A review of Vertigo (Alfred Hitchcock, 1958) by Peter Bradshaw, The Guardian 2018 (full review here)
"When I watched this again, I felt more strongly than ever that Hitchcock’s decision to give us a story in which the Clouzot-esque twist is given away well before the end is no misjudgment. It is a brilliant way of putting us inside Judy’s tormented, guilty soul, and of avoiding, just for a while, that male gaze. I also realised what it is Vertigo has been subtly reminding me of for many years: Graham Greene’s The End of the Affair. A treat to see this back on the big screen."
notorious_review = "Notorious is perfect. Everyone knows that. It's a testament to Ben Hecht's complex, headlong script that so many people have tried to rip it off and a testament to Hitchcock's genius that no one has ever succeeded. Take a look at the gabby, inconsequential, forgotten Mission Impossible: II and you'll see what I mean. The more obvious glories of Notorious include a revelatory performance from Cary Grant as the morally exhausted American agent Devlin, a terrifying Nazi-mother super-villain played by Leopoldine Konstantin and cinema's most cunningly prolonged kiss."
vertigo_review = "When I watched this again, I felt more strongly than ever that Hitchcock’s decision to give us a story in which the Clouzot-esque twist is given away well before the end is no misjudgment. It is a brilliant way of putting us inside Judy’s tormented, guilty soul, and of avoiding, just for a while, that male gaze. I also realised what it is Vertigo has been subtly reminding me of for many years: Graham Greene’s The End of the Affair. A treat to see this back on the big screen."
hitchcock_reviews = [notorious_review, vertigo_review]
hitchcock_reviews = ["".join(ch for ch in review.lower() if ch not in string.punctuation) for review in hitchcock_reviews]
pipe_best.predict_proba(stemming(hitchcock_reviews))
The Notorious review is labeled as positive with 71% probability, and the Vertigo review with 92% probability.
6. Conclusion¶
In this post, we have gone through some of the most widely used Natural Language Processing techniques and trained a linear model to classify movie reviews. There are fancier ways of doing this, and we will cover other techniques soon, but the topics covered in this project are solid basics to start with and relatively easy to implement.
We achieved an 87% test accuracy. Even though the model has some shortcomings that we addressed, this is in line with what is usually achieved with a linear model in other ML sources.