Note: The code can be found on github.
The objective of this project is to apply various NLP sentiment analysis techniques on reviews of the Yelp Dataset and assess their effectivenes on correctly identifiyng them as positive or negative.
Yelp has released an anonymized part of their stored data to the public. This was accompanied by a challenge with various awards in order to incentivize research and generate insights for the use of the data.
Here follows a brief explanation of the dataset, from their website.
The Challenge Dataset includes data from Phoenix, Las Vegas, Madison, Waterloo and Edinburgh: * 42,153 businesses * 320,002 business attributes * 31,617 check-in sets * 252,898 users * 955,999 edge social graph * 403,210 tips * 1,125,458 reviews
This project is not a participation in the challenge.
The reviews in the Yelp dataset are stored in a single json file. To extract the reviews a parser has been implemented. It takes as input a business category and a number of samples for each category (positive/negative).
First I extracted the anonymized ids for all the businesses that belong to the requested category. These ids are found in a different file than the reviews. Then I parsed the reviews and extracted the requested amount of positive and negative reviews. A review is considered positive when the rating is of 4 or 5 stars and negative when it has 1 or 2 stars. Ratings of 3 stars are not taken into account. The reviews are saved for future processing to two distinct files, one for each class.
Three different techniques will be assesed and compared:
Due to processing power limitations, only a subset of the dataset will be used.
The training on the dataset is done using the bag of words model and a NaiveBayes classifier. The evaluation of the classifier is done by stratified k-fold cross validation. Also the samples are randomly shuffled for each run.
Four different feature extraction algorithms where implemented:
For the bigrams extraction all the bigrams are computed and the n best are selected according to an association measure. By default n = 500 and the metric is based on the Chi-square test.
Sentiment analysis can be done through the lexical analysis of text.
Pros: The advantage of this technique is that it doesn’t require any training and can be applied to any text without any a priori knowledge of the domain.
Cons: On the other hand this technique utilizes heavier sentence manipulation and requires pre-constructed lexicons.
The main steps of this techique are shown in the following diagram. This procedure was formulated in “Reviews Classification Using SentiWordNet Lexicon” by A. Hamouda and M. Rohaim.
The sentence is tokenized and each token is assigned a part-of-speech tag. The POS tag can be adjective, noun, verb etc. A comprehensive list can be found here.
Since a word can have muliple senses(or meanings), a word sense disambiguation(WSD) process is needed. A technique for doing this was adapted from a twitter sentiment analysis tool. It makes use of WordNet which provides various senses for a word and each sense’s definition. For example for the word good as a noun, WordNet contains the following senses and their definitions:
Sense Description ('good.n.01', 'benefit'), ('good.n.02', 'moral excellence or admirableness'), ('good.n.03', 'that which is pleasing or valuable or useful'), ('commodity.n.01', 'articles of commerce')
For each word it checks all the meanings and then selects the one whose definition matches closer the context of the word. That matching is done by simply comparing the frequency of common words between the context and the definition.
In the original implementation the context taken into account for the matching, was the whole review. This would be ok in the case of a small text, e.g. a twitter. Indeed the original author was trying to analyze tweets. But a whole review can be too broad of a context. I tried to improved the algorithm by evaluating each sentence of the review seperately and aggregating over the scores. This provides the WSD algorithm with a more compact, thus relevant, context to work with.
After the meaning has been identified then SentiWordNet is used to derive the final negative or positive score of the word. The score for the text is the summation over the scores of all of it’s words.
Next you can see an example of the objective, positive and negative score for each sense of the word good as a noun.
Sense Obj Pos Neg ('good.n.01', 0.5, 0.5, 0.0), ('good.n.02', 0.125, 0.875, 0.0), ('good.n.03', 0.375, 0.625, 0.0), ('commodity.n.01', 1.0, 0.0, 0.0)
Another WSD technique was adapted from another open source sentiment classifier for twitter. This one was very slow and wasn’t considered at all in the results.
Stanford’s CoreNLP integrates many tools for doing NLP in a cohesive library. It is written in Java and provides a part-of-speech (POS) tagger, a named entity recognizer (NER), a parser, a coreference resolution system, sentiment analysis, and bootstrapped pattern learning tools.
It’s sentiment analysis tool is using deep learning techniques and is trained on 215k phrases extracted from 12k sentences (Penn Treebank format).
For our purposes the provided parser is used to extract the sentences of the text and then the sentiment analysis tool scores them one by one. The score provided is in the range of 0 (very negative) to 4 (very positive) with 2 being neutral.
The estimation of the whole review’s sentiment score is done by aggregating the score of it’s individual sentences. I’ve tried three different techniques on this one. The first one is averaging the score of the sentences, the second one is a weighted average where the weight is the length of the sentence and the third being a count of how many sentences have been positive/negative.
All the results regard the reviews of restaurants. The reviews are extracted in series from the dataset. That means that the experiment with 2000 samples contains the first 1000 positive and the first 1000 negative reviews and so on.
I ran the naive bayes on various sample quantities and for the four differenct feature extraction algorithms. The results of the accuracy can be found in the following table and graph.
|Samples||Single Words||Stopwords Removal||Bigrams||Bigrams & Stopwords|
We can see that using stopwords improves the results in the whole range of number of samples. Also using stopword and bigrams provide similar results when the sample number is high. The best results are achieved by the combination of stopwords removal and bigrams.
From the graph we note that there is a small decline for 4000 samples. This could be attributed to overfitting in low sample numbers. After that the classifier improves and stabilizes as more data is provided.
We have also timed the run for each feature extraction algorithm.
As seen in the following table each run of the single words and the stopword removal take more or less the same time. Bigram extraction is much more time consuming resulting about 7 times slower. All the times refer to the training and testing of all the five datasets.
|Single Words||1m 6s|
|Stopwords Removal||1m 3s|
|Bigrams & Stopwords||6m 13s|
It’s also interesting to see which where the most informative features in each case. These are the feature with the most discrimitative power.
Most Informative Features flavorless = True neg : pos = 65.0 : 1.0 disgusting = True neg : pos = 49.0 : 1.0 tasteless = True neg : pos = 35.6 : 1.0 microwave = True neg : pos = 31.0 : 1.0 filthy = True neg : pos = 26.3 : 1.0 YUM = True pos : neg = 25.0 : 1.0 zero = True neg : pos = 25.0 : 1.0 Gross = True neg : pos = 24.3 : 1.0 inedible = True neg : pos = 23.8 : 1.0 worst = True neg : pos = 22.6 : 1.0
Most Informative Features awful = True neg : pos = 42.3 : 1.0 tasteless = True neg : pos = 39.3 : 1.0 Worst = True neg : pos = 36.3 : 1.0 response = True neg : pos = 31.0 : 1.0 Meh = True neg : pos = 31.0 : 1.0 flavorless = True neg : pos = 28.7 : 1.0 worst = True neg : pos = 27.3 : 1.0 refused = True neg : pos = 24.3 : 1.0 redeeming = True neg : pos = 22.3 : 1.0 lousy = True neg : pos = 22.3 : 1.0
Most Informative Features (u'rude', u'and') = True neg : pos = 50.3 : 1.0 (u'terrible', u'.') = True neg : pos = 40.2 : 1.0 inedible = True neg : pos = 39.0 : 1.0 (u'was', u'bland') = True neg : pos = 37.4 : 1.0 (u'the', u'worst') = True neg : pos = 37.3 : 1.0 downhill = True neg : pos = 35.0 : 1.0 awful = True neg : pos = 34.5 : 1.0 (u'well', u'worth') = True pos : neg = 33.7 : 1.0 (u'was', u'dry') = True neg : pos = 33.4 : 1.0 (u'rude', u'.') = True neg : pos = 33.0 : 1.0
Bigrams & Stopwords
tasteless = True neg : pos = 48.6 : 1.0 disgusting = True neg : pos = 47.8 : 1.0 (u"won't", u'back') = True neg : pos = 37.4 : 1.0 awful = True neg : pos = 35.7 : 1.0 flavorless = True neg : pos = 35.4 : 1.0 (u'perfectly', u'cooked') = True pos : neg = 33.0 : 1.0 (u'.', u'Yum') = True pos : neg = 32.3 : 1.0 YUM = True pos : neg = 31.7 : 1.0 (u'inedible', u'.') = True neg : pos = 30.3 : 1.0 inedible = True neg : pos = 28.6 : 1.0
It’s evident that most of the features characterize negative sentiments.
The results for the SentiWordNet approach can be found in the following table.
|Samples||WSD on review||Time||WSD on sentence||Time|
|1000||0.621||4m 42||0.644||4m 59s|
|2000||0.605||11m 34s||0.638||9m 44s|
|4000||0.602||59m 16s||0.638||54m 9s|
Comparing the two WSD approaches it becomes apparent that using the sentence as the context the results are better and more robust. This confirms the original hypothesis that the sentence provides more relevant information for word sense disambiguation than the whole review.
The samples (positive and negative) are read from the same files that were created by
python. Since they are formated in json there is no need to recreate them.
The accuracy is compared for the three methods of score aggregation mentioned before (average, weighted, counts). Since this was implemented a while after the first two methods I also considered computing the accuracy of each class (pos/neg) and not only the average one.
|Samples||Average (pos/neg/avg)||Weighted (pos/neg/avg)||Counts (pos/neg/avg)||Time|
|1000||0.544 / 0.884 / 0.714||0.552 / 0.930 / 0.741||0.520 / 0.886 / 0.703||41m 3s|
|2000||0.545 / 0.899 / 0.722||0.532 / 0.939 / 0.736||0.521 / 0.898 / 0.710||1h 33m|
|4000||0.565 / 0.910 / 0.737||0.543 / 0.943 / 0.743||0.538 / 0.911 / 0.724||2h 52m|
From the results it becomes evident that CoreNLP gives better results for the negative class rather than the positive. Whether this is a bias needs further investigation.
Comparing the three aggregation methods we see that the weighted averages gives the best overall results and also the best results for the negative class. The simple average gives better results for the positive class.
In these experiments it seems that the the Naive Bayes classifier gave the best overall results. This is expected since it was trained with samples from the the Yelp dataset. The SentiWordNet results where somewhat dissapointing, barely better than random guessing. CoreNLP gave rather good results, given that it came pretrained, but it’s accuracy for the positive class was marginally better than random guessing.
All the methods were timed. Naive Bayes was much faster (even including training time), SentiWordNet was slower by an order of magnitude and CoreNLP was even slower by another order of magnitude.Stathis Fotiadis 12 April 2015