Problem statement: Most restaurants ask their customers for reviews and use that feedback to improve customer satisfaction.
Reviews therefore play a vital role in the successful growth of a restaurant.




# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/reviews/Restaurant_Reviews.tsv

1. Text Preprocessing

2. Text Visualization

3. Sentiment Analysis

4. Sentiment Modeling

In [2]:
from warnings import filterwarnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV, cross_validate
from sklearn.preprocessing import LabelEncoder
from textblob import Word, TextBlob
from wordcloud import WordCloud
/opt/conda/lib/python3.7/site-packages/nltk/twitter/__init__.py:20: UserWarning: The twython library has not been installed. Some functionality from the twitter package will not be available.
  warnings.warn("The twython library has not been installed. "
In [3]:
filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
In [4]:
df = pd.read_csv("../input/reviews/Restaurant_Reviews.tsv",delimiter="\t")

1. Text Preprocessing

In [5]:
df.head()
Out[5]:
Review Liked
0 Wow… Loved this place. 1
1 Crust is not good. 0
2 Not tasty and the texture was just nasty. 0
3 Stopped by during the late May bank holiday of… 1
4 The selection on the menu was great and so wer… 1
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  1000 non-null   object
 1   Liked   1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB
In [7]:
df.shape
Out[7]:
(1000, 2)

Normalizing Case Folding

In [8]:
df['Review'] = df['Review'].str.lower()
In [9]:
df.head()
Out[9]:
Review Liked
0 wow… loved this place. 1
1 crust is not good. 0
2 not tasty and the texture was just nasty. 0
3 stopped by during the late may bank holiday of… 1
4 the selection on the menu was great and so wer… 1

Punctuations

In [10]:
df['Review'] = df['Review'].str.replace(r'[^\w\s]', '', regex=True)  # strip punctuation
In [11]:
df['Review'] = df['Review'].str.replace(r'\d', '', regex=True)  # strip digits

Stopwords

Removing stopwords gets rid of very common words (such as "the", "is", "and") that carry little meaning on their own.

In [12]:
sw = stopwords.words('english')
df['Review'] = df['Review'].apply(lambda x: " ".join(x for x in str(x).split() if x not in sw))
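One side effect worth checking: NLTK's English stop-word list includes negations such as "not", which is why "crust is not good" shows up as "crust good" in the later outputs. A minimal sketch of a negation-preserving filter follows. The tiny `STOPWORDS` set, the `NEGATIONS` set, and the `remove_stopwords` helper are assumptions made for illustration and are not part of the notebook.

```python
# Hypothetical mini stop-word list for illustration only; the notebook uses
# nltk's stopwords.words('english'), which also contains "not".
STOPWORDS = {"is", "the", "was", "and", "this", "not"}
NEGATIONS = {"not", "no", "never"}  # assumed set of words worth keeping


def remove_stopwords(text, stopwords, keep=frozenset()):
    """Drop every token found in `stopwords` unless it is listed in `keep`."""
    return " ".join(
        tok for tok in text.split() if tok not in stopwords or tok in keep
    )


review = "crust is not good"
plain = remove_stopwords(review, STOPWORDS)
negation_aware = remove_stopwords(review, STOPWORDS, keep=NEGATIONS)

print(plain)           # crust good
print(negation_aware)  # crust not good
```

Whether keeping negations helps depends on the downstream scorer, so it may be worth comparing both variants.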

Rarewords

We drop the rarest words: here, the 250 least frequent words in the whole corpus.

In [13]:
drops = pd.Series(' '.join(df['Review']).split()).value_counts()[-250:]
df['Review'] = df['Review'].apply(lambda x: " ".join(x for x in x.split() if x not in drops))
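`value_counts()` sorts by descending frequency, so `[-250:]` takes the 250 least frequent words; among words tied at the same count, which ones fall inside the cut is arbitrary. As an alternative, a frequency threshold could be used instead. The sketch below uses only the standard library; the toy `reviews` list and the `min_count` value are hypothetical.

```python
from collections import Counter

# Toy corpus standing in for df['Review'] (hypothetical data).
reviews = ["great food great service", "food was cold", "great place rowdy crowd"]

# Count every word across the whole corpus.
counts = Counter(word for review in reviews for word in review.split())

min_count = 2  # assumed threshold: keep words seen at least twice
rare = {word for word, n in counts.items() if n < min_count}

# Rebuild each review without the rare words.
cleaned = [
    " ".join(word for word in review.split() if word not in rare)
    for review in reviews
]

print(sorted(rare))  # ['cold', 'crowd', 'place', 'rowdy', 'service', 'was']
print(cleaned)       # ['great food great', 'food', 'great']
```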

Tokenization

Tokenization breaks each review into its individual words (tokens).

In [14]:
df["Review"].apply(lambda x: TextBlob(x).words).head()
Out[14]:
0                                  [wow, loved, place]
1                                        [crust, good]
2                              [tasty, texture, nasty]
3    [stopped, late, may, bank, holiday, rick, reco...
4                     [selection, menu, great, prices]
Name: Review, dtype: object

Lemmatization / Stemming

Lemmatization reduces each word to its root form (lemma), for example "prices" becomes "price". Only lemmatization is applied here.

In [15]:
df['Review'] = df['Review'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
In [16]:
df["Review"].head()
Out[16]:
0                                      wow loved place
1                                           crust good
2                                  tasty texture nasty
3    stopped late may bank holiday rick recommendat...
4                           selection menu great price
Name: Review, dtype: object

2. Text Visualization

Calculate the term frequencies

In [17]:
tf = df["Review"].apply(lambda x: pd.value_counts(x.split(" "))).sum(axis=0).reset_index()
In [18]:
tf.columns = ["words", "tf"]
tf.head()
Out[18]:
words tf
0 loved 10.00
1 wow 3.00
2 place 111.00
3 crust 2.00
4 good 95.00
In [19]:
tf.shape
Out[19]:
(1591, 2)
In [20]:
tf["words"].nunique()
Out[20]:
1591
In [21]:
tf["tf"].describe([0.05, 0.10, 0.25, 0.50, 0.75, 0.80, 0.90, 0.95, 0.99]).T
Out[21]:
count   1591.00
mean       3.38
std        7.15
min        1.00
5%         1.00
10%        1.00
25%        1.00
50%        1.00
75%        3.00
80%        4.00
90%        7.00
95%       12.00
99%       27.00
max      124.00
Name: tf, dtype: float64
In [22]:
tf.sort_values("tf", ascending = False)
Out[22]:
words tf
76 food 124.00
2 place 111.00
4 good 95.00
40 service 83.00
16 great 70.00
896 handling 1.00
897 rowdy 1.00
903 rest 1.00
904 ache 1.00
1590 wound 1.00

1591 rows × 2 columns

Barplot

In [23]:
tf[tf["tf"] > 25].plot.bar(x="words", y="tf")
plt.show()

Wordcloud

In [24]:
text = " ".join(i for i in df.Review)
wordcloud = WordCloud().generate(text)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()


wordcloud = WordCloud(max_font_size=50,
                      max_words=100,
                      background_color="white").generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

wordcloud.to_file("wordcloud.png")
Out[24]:
<wordcloud.wordcloud.WordCloud at 0x7f35238aebd0>

3. Sentiment Analysis

In [25]:
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("The food was awesome")
Out[25]:
{'neg': 0.0, 'neu': 0.423, 'pos': 0.577, 'compound': 0.6249}
In [26]:
df["Review"].apply(lambda x: x.upper())
Out[26]:
0                                        WOW LOVED PLACE
1                                             CRUST GOOD
2                                    TASTY TEXTURE NASTY
3      STOPPED LATE MAY BANK HOLIDAY RICK RECOMMENDAT...
4                             SELECTION MENU GREAT PRICE
                             ...                        
995                    THINK FOOD FLAVOR TEXTURE LACKING
996                              APPETITE INSTANTLY GONE
997                      OVERALL IMPRESSED WOULD GO BACK
998    WHOLE EXPERIENCE UNDERWHELMING THINK WELL GO S...
999    WASTED ENOUGH LIFE POURED SALT WOUND DRAWING T...
Name: Review, Length: 1000, dtype: object
In [27]:
df["Review"][0:10].apply(lambda x: sia.polarity_scores(x))
Out[27]:
0    {'neg': 0.0, 'neu': 0.115, 'pos': 0.885, 'comp...
1    {'neg': 0.0, 'neu': 0.256, 'pos': 0.744, 'comp...
2    {'neg': 0.643, 'neu': 0.357, 'pos': 0.0, 'comp...
3    {'neg': 0.141, 'neu': 0.37, 'pos': 0.489, 'com...
4    {'neg': 0.0, 'neu': 0.423, 'pos': 0.577, 'comp...
5    {'neg': 0.645, 'neu': 0.215, 'pos': 0.14, 'com...
6    {'neg': 0.395, 'neu': 0.605, 'pos': 0.0, 'comp...
7    {'neg': 0.0, 'neu': 0.598, 'pos': 0.402, 'comp...
8    {'neg': 0.0, 'neu': 0.196, 'pos': 0.804, 'comp...
9    {'neg': 0.0, 'neu': 0.196, 'pos': 0.804, 'comp...
Name: Review, dtype: object
In [28]:
df["polarity_score"] = df["Review"].apply(lambda x: sia.polarity_scores(x)["compound"])
In [29]:
df.head()
Out[29]:
Review Liked polarity_score
0 wow loved place 1 0.83
1 crust good 0 0.44
2 tasty texture nasty 0 -0.56
3 stopped late may bank holiday rick recommendat… 1 0.69
4 selection menu great price 1 0.62

4. Sentiment Modeling

Create the target

In [30]:
df["Review"][0:10].apply(lambda x: "pos" if sia.polarity_scores(x)["compound"] > 0 else "neg")
Out[30]:
0    pos
1    pos
2    neg
3    pos
4    pos
5    neg
6    neg
7    pos
8    pos
9    pos
Name: Review, dtype: object

Apply for all data

In [31]:
df["sentiment_label"] = df["Review"].apply(lambda x: "pos" if sia.polarity_scores(x)["compound"] > 0 else "neg")
df.head(20)
Out[31]:
Review Liked polarity_score sentiment_label
0 wow loved place 1 0.83 pos
1 crust good 0 0.44 pos
2 tasty texture nasty 0 -0.56 neg
3 stopped late may bank holiday rick recommendat… 1 0.69 pos
4 selection menu great price 1 0.62 pos
5 getting angry want damn pho 0 -0.69 neg
6 honeslty didnt taste fresh 0 -0.24 neg
7 potato like rubber could tell made time kept w… 0 0.57 pos
8 fry great 1 0.62 pos
9 great touch 1 0.62 pos
10 service prompt 1 0.00 neg
11 would go back 0 0.00 neg
12 cashier care ever say still ended wayyy overpr… 0 0.49 pos
13 tried cape cod ravoli chicken cranberrymmmm 1 0.00 neg
14 disgusted pretty sure human hair 0 0.27 pos
15 shocked indicate 0 -0.32 neg
16 highly recommended 1 0.27 pos
17 waitress little slow service 0 0.00 neg
18 place worth time let alone vega 0 -0.03 neg
19 like 0 0.36 pos
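Because the comparison is strict, the labeling rule used here sends every review with a compound score of exactly 0 to "neg" (for example "service prompt"). The sketch below isolates that rule and, as an assumed alternative that the notebook does not use, adds a three-way variant with a neutral band. The 0.05 cut-off is a commonly cited VADER convention, and `label_three_way` is a hypothetical helper. The scores are the rounded values from the table above.

```python
def label_binary(compound):
    """Notebook rule: strictly positive compound -> 'pos', everything else -> 'neg'."""
    return "pos" if compound > 0 else "neg"


def label_three_way(compound, band=0.05):
    """Hypothetical variant: scores strictly inside (-band, band) count as neutral."""
    if compound >= band:
        return "pos"
    if compound <= -band:
        return "neg"
    return "neu"


# Rounded compound scores copied from the notebook output.
scores = {
    "wow loved place": 0.83,
    "service prompt": 0.00,
    "place worth time let alone vega": -0.03,
    "tasty texture nasty": -0.56,
}

binary = {text: label_binary(s) for text, s in scores.items()}
three_way = {text: label_three_way(s) for text, s in scores.items()}

print(binary["service prompt"])     # neg
print(three_way["service prompt"])  # neu
```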
In [32]:
df.groupby("sentiment_label")["Liked"].mean()
Out[32]:
sentiment_label
neg   0.22
pos   0.75
Name: Liked, dtype: float64
In [33]:
df["sentiment_label"] = LabelEncoder().fit_transform(df["sentiment_label"])
In [34]:
X = df["Review"]
y = df["sentiment_label"]

Count Vectors

A count vectorizer converts each review into a vector of term counts, with one column per word in the vocabulary.

In [35]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_count = vectorizer.fit_transform(X)

vectorizer.get_feature_names()[10:15]  # newer scikit-learn releases replace this with get_feature_names_out()
X_count.toarray()[10:15]
Out[35]:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
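The all-zero slice above is expected: the matrix is sparse, and most reviews contain only a handful of the vocabulary words. A small sketch on a hypothetical three-document corpus makes the layout easier to see; `docs` and `vocab` are illustrative names, not taken from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus.
docs = ["great food great price", "food cold", "service great"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: rows = docs, columns = words

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)        # ['cold', 'food', 'great', 'price', 'service']
print(X.toarray())  # first row is [0 1 2 1 0]: "great" appears twice
```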

TF-IDF

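TF-IDF (term frequency-inverse document frequency) is a statistical weight that shows how important a term is to a document: it grows with the term's count in that document and shrinks as the term appears in more documents of the corpus.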
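The code for this cell is not present in the export. As a sketch only, word-level TF-IDF features could be built with scikit-learn's `TfidfVectorizer` as shown below; the toy `docs` corpus and the names `tfidf` and `X_tfidf` are assumptions, not the notebook's own.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus.
docs = ["great food great price", "food cold", "service great"]

tfidf = TfidfVectorizer()  # default analyzer works at word level
X_tfidf = tfidf.fit_transform(docs)

col = tfidf.vocabulary_       # word -> column index
row = X_tfidf.toarray()[1]    # weights for the document "food cold"
weight_cold = row[col["cold"]]
weight_food = row[col["food"]]

# "cold" occurs in one document, "food" in two, so "cold" gets the larger weight.
print(weight_cold > weight_food)

# With default settings each row is L2-normalised.
row_norms = np.linalg.norm(X_tfidf.toarray(), axis=1)
print(row_norms)
```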

In [36]:

Logistic Regression

In [38]:
Out[38]:
0.813
In [39]:
Out[39]:
0    great food price high quality house made
dtype: object
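The code for this section did not survive the export; only a score (0.813) and a one-row Series with a review-like text remain. A minimal sketch of how such a score could be produced, assuming cross-validated accuracy on the count vectors, is shown below. The toy reviews and labels, the names `log_model` and `new_review`, and the choice of `cv=3` are all assumptions, so the numbers it prints are unrelated to the 0.813 above. Note that in the notebook the target `y` is the VADER-derived `sentiment_label`, not the `Liked` column.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical toy data: 6 positive and 6 negative reviews.
reviews = [
    "great food", "loved place", "great service", "good price",
    "amazing food", "good service",
    "bad food", "terrible service", "nasty texture", "awful place",
    "bad price", "terrible food",
]
y = [1] * 6 + [0] * 6

vectorizer = CountVectorizer()
X_count = vectorizer.fit_transform(reviews)

# Cross-validated accuracy of a logistic regression on the count features.
log_model = LogisticRegression()
scores = cross_val_score(log_model, X_count, y, scoring="accuracy", cv=3)
mean_accuracy = scores.mean()
print(mean_accuracy)

# Fit on all data, then score an unseen review with the same vectorizer.
log_model.fit(X_count, y)
new_review = ["great food price high quality house made"]
prediction = log_model.predict(vectorizer.transform(new_review))
print(prediction)
```

Words in the new review that the vectorizer never saw during fitting are simply ignored by `transform`.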

Random Forests

0.8319999999999999

TF-IDF Word-Level

0.8059999999999998
0.8089999999999999
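The cells behind the Random Forests and TF-IDF Word-Level scores are also missing, so it is not clear which model produced which number. As a sketch only, one way to compare feature representations is to cross-validate the same classifier on count features and on word-level TF-IDF features. The helper `mean_cv_accuracy`, the toy data, `n_estimators=50`, and `cv=3` are assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score

# Hypothetical toy data: 6 positive and 6 negative reviews.
reviews = [
    "great food", "loved place", "great service", "good price",
    "amazing food", "good service",
    "bad food", "terrible service", "nasty texture", "awful place",
    "bad price", "terrible food",
]
y = [1] * 6 + [0] * 6


def mean_cv_accuracy(vectorizer, texts, labels, cv=3):
    """Vectorize the texts, then return the mean cross-validated accuracy."""
    X = vectorizer.fit_transform(texts)
    model = RandomForestClassifier(n_estimators=50, random_state=17)
    return cross_val_score(model, X, labels, scoring="accuracy", cv=cv).mean()


results = {
    "count": mean_cv_accuracy(CountVectorizer(), reviews, y),
    "tfidf_word": mean_cv_accuracy(TfidfVectorizer(), reviews, y),
}
print(results)
```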

Hyperparameter Optimization

In [44]:
rf_model = RandomForestClassifier(random_state=17)

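# "sqrt" is equivalent to the former "auto" for classifiers, which recent scikit-learn releases no longer accept
rf_params = {"max_depth": [5, 8, None],
             "max_features": [5, 7, "sqrt"],
             "min_samples_split": [2, 5, 8, 20],
             "n_estimators": [100, 200, 500]}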

rf_best_grid = GridSearchCV(rf_model,
                            rf_params,
                            cv=5,
                            n_jobs=-1,
                            verbose=True).fit(X_count, y)

rf_best_grid.best_params_

rf_final = rf_model.set_params(**rf_best_grid.best_params_, random_state=17).fit(X_count, y)

cv_results = cross_validate(rf_final, X_count, y, cv=3, scoring=["accuracy", "f1", "roc_auc"])

# Only the last expression in a cell is displayed, so Out[44] is the mean ROC AUC
cv_results['test_accuracy'].mean()
cv_results['test_f1'].mean()
cv_results['test_roc_auc'].mean()
Fitting 5 folds for each of 108 candidates, totalling 540 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    9.1s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   43.3s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed:  2.9min finished
Out[44]:
0.9004796656164906
