Rating Prediction of Restaurants

Description

Restaurants from all over the world can be found in Bengaluru. From the United States to Japan, Russia to Antarctica, you get all types of cuisine here. Delivery, dine-out, pubs, bars, drinks, buffet, desserts: you name it and Bengaluru has it. The number of restaurants is increasing day by day and currently stands at approximately 12,000. Even with such a high number of restaurants, the industry has not been saturated yet, and new restaurants are opening every day. However, it has become difficult for them to compete with already established restaurants. The key issues that continue to pose a challenge to them include high real estate costs, rising food costs, a shortage of quality manpower, a fragmented supply chain and over-licensing. This Zomato data aims at analyzing the demography of each location. Most importantly, it will help new restaurants decide their theme, menu, cuisine, cost, etc. for a particular location. It also aims at finding similarities between neighborhoods of Bengaluru on the basis of food. The dataset also contains reviews for each restaurant, which will help in finding the overall rating for the place.

The basic idea of analyzing the Zomato dataset is to get a fair idea of the factors affecting the establishment of different types of restaurants at different places in Bengaluru, and of the aggregate rating of each restaurant. Bengaluru has more than 12,000 restaurants serving dishes from all over the world. With new restaurants opening each day, the industry has not been saturated yet and demand is increasing day by day. In spite of the increasing demand, it has become difficult for new restaurants to compete with established ones, since most of them serve the same food. Bengaluru is the IT capital of India, and most people here depend mainly on restaurant food as they do not have time to cook for themselves. With such overwhelming demand for restaurants, it has become important to study the demography of a location. What kind of food is more popular in a locality? Does the entire locality love vegetarian food? If yes, is that locality populated by a particular community, e.g. Jains, Marwaris or Gujaratis, who are mostly vegetarian?

Objective: Design a machine learning model to predict the rating of restaurants that accept orders through Zomato.

Prerequisites: This post assumes familiarity with basic machine learning concepts such as Linear Regression, Decision Trees, Random Forest, Gradient Boosted Decision Trees, One-vs-Rest classifiers, multicollinearity, model-based imputation, CNN, CNN-LSTM, hyperparameter tuning and mean squared error.

  1. Reading Data: Reading the CSV file and storing it in a dataframe
  2. Missing Value Imputation: Replacing NULL values using model-based, mean-based and frequency-based imputation
  3. Exploratory Data Analysis: Plots such as pie plots, count plots and bar plots
  4. Data Preprocessing: Removing stopwords and unnecessary characters from the text data
  5. Vectorization: Using CountVectorizer, TfidfVectorizer and Normalizer to vectorize the data
  6. Building Models: Building different machine learning and deep learning models

Dataset Overview: Each row describes one restaurant listing, with the following features.

1. url: contains the url of the restaurant in the zomato website

2. address: contains the address of the restaurant in Bengaluru

3. name: contains the name of the restaurant

4. online_order: whether online ordering is available in the restaurant or not

5. book_table: table book option available or not

6. votes: contains the total number of ratings for the restaurant at the time the data was collected

7. phone: contains the phone number of the restaurant

8. location: contains the neighborhood in which the restaurant is located

9. rest_type: restaurant type, such as Quick Bites or Casual Dining

10. dish_liked: dishes people liked in the restaurant

11. cuisines: food styles, separated by comma

12. approx_cost(for two people): contains the approximate cost for meal for two people

13. reviews_list: list of tuples containing reviews for the restaurant, each tuple consists of two values, rating and review by the customer

14. menu_item: contains list of menus available in the restaurant

15. listed_in(type): type of meal

16. listed_in(city): contains the neighborhood in which the restaurant is listed

Real-world/Business objectives and constraints:

-> No strict latency requirement.

-> Interpretability is not important.

Performance Metrics: Since this is a regression problem, our performance metric is Mean Squared Error (MSE). We will try to reduce the MSE value as much as possible.
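As a quick reminder of what the metric measures, MSE is the average of the squared differences between the true and the predicted ratings. The sketch below is a minimal plain-Python version; the function name and the toy ratings are ours and are not taken from the original notebook.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted ratings."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy ratings, made up for illustration only.
true_ratings = [4.0, 3.5, 3.0]
predicted_ratings = [4.0, 3.0, 4.0]

# Squared errors are 0, 0.25 and 1.0, so the MSE is 1.25 / 3.
mse = mean_squared_error(true_ratings, predicted_ratings)
```

Because the errors are squared, a single prediction that is off by a full rating point costs as much as four predictions that are off by half a point.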

(51717 , 17)

This dataset has 51717 rows and 17 columns.

We are using 3 different approaches to fill the missing values, i.e. model-based, mean-based and frequency-based imputation.

i. Model-based imputation: To fill the missing values of the columns “rate” and “dish_liked”, we use model-based imputation.

First we divided the original dataframe into two dataframes: one containing the rows with no null values and one containing only the rows with null values. We built the model using the first dataframe and used it to predict the missing values of the second.

Here are the values predicted by the model for the missing entries of the “dish_liked” column

array(['Murgh Ghee Roast, Egg Fried Rice, Thali, Mutton Biryani, Naan, Andhra Meal', 'Pizza, Mocktails, Coffee, Nachos, Salad, Pasta, Sandwiches', 'Pizza, Potato Wedges, Country Feast, Pasta, Garlic Bread, Lemonade', …, 'Ferrero Rocher Cake, Chocolate Cake', 'Ferrero Rocher Cake, Chocolate Cake', 'Ferrero Rocher Cake, Chocolate Cake'], dtype='<U134')

Here are the values predicted by the model for the missing entries of the “rate” column

array([3.47331694, 3.48851577, 3.44792981, …, 3.58956974, 3.58956974, 3.58956974])
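The split, train and predict procedure described above can be sketched as follows. The post does not show which model or which input columns were used, so this sketch assumes an ordinary least-squares linear model on two numeric columns (votes and cost) to fill “rate”. The arrays are made-up toy values, and the helper name is hypothetical.

```python
import numpy as np

# Toy stand-in for the restaurant table: two numeric predictors and a
# "rate" target in which np.nan marks a missing rating (values are made up).
votes = np.array([775.0, 787.0, 918.0, 88.0, 166.0, 0.0, 324.0])
cost = np.array([800.0, 800.0, 800.0, 300.0, 600.0, 250.0, 700.0])
rate = np.array([4.1, 4.1, 3.8, 3.7, 3.8, np.nan, np.nan])


def impute_with_linear_model(features, target):
    """Fill NaNs in `target` with predictions of a least-squares linear model."""
    known = ~np.isnan(target)
    # Prepend a column of ones so the model has an intercept.
    design = np.column_stack([np.ones(len(target)), features])
    # Fit only on the rows where the target is known.
    coef, *_ = np.linalg.lstsq(design[known], target[known], rcond=None)
    filled = target.copy()
    # Predict only for the rows where the target is missing.
    filled[~known] = design[~known] @ coef
    return filled


features = np.column_stack([votes, cost])
rate_filled = impute_with_linear_model(features, rate)
```

For a text column such as “dish_liked” the same split applies, but the model would have to be a classifier rather than a regressor.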

ii. Mean-based imputation to fill missing values

iii. Frequency-based imputation to fill missing values

We checked the most frequently occurring value in each of these columns and replaced the missing values with it.
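Both of these simpler strategies fit in a few lines. The sketch below uses only the standard library and represents a missing value as None. The post does not say which columns were filled by which method, so the two example columns here are assumptions chosen for illustration.

```python
from statistics import mean, mode


def fill_with_mean(values):
    """Replace None entries with the mean of the known numeric values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def fill_with_most_frequent(values):
    """Replace None entries with the most frequently occurring known value."""
    known = [v for v in values if v is not None]
    fill = mode(known)
    return [fill if v is None else v for v in values]


# Hypothetical columns: a numeric one and a categorical one.
costs = [800, 300, None, 600, 300]
rest_types = ["Quick Bites", None, "Casual Dining", "Quick Bites", None]

costs_filled = fill_with_mean(costs)                      # None -> mean of 800, 300, 600, 300
rest_types_filled = fill_with_most_frequent(rest_types)   # None -> "Quick Bites"
```

With pandas the same idea is usually written as a `fillna` call with the column mean or mode.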

Exploratory Data Analysis

Fig-1

Conclusion: The number of restaurants varies by location. BTM has the highest number of restaurants in Bangalore, with 3108. New BEL Road has the fewest restaurants, followed by Banashankari. BTM accounts for 17.24% of the total restaurants in Bangalore.

Conclusion: More restaurants accept online orders than do not. There are 29342 restaurants in Bangalore that accept online orders and 20098 that do not, so 59.35% of restaurants allow online ordering.

Conclusion: The majority of restaurants have ratings between 3.6 and 3.9. 15% of the restaurants have a rating of approximately 3.7. The minimum rating is 1.8, and not a single restaurant in Bangalore has a rating of 5.

Conclusion: The number of stores per restaurant varies. CCD has the maximum number of stores in Bangalore, followed by Onesta and Just Bake. Many restaurants have only 1 store, such as SV Juice Corner Tiffin, Brown Box, etc. CCD's stores make up 9.26% of all the stores in Bangalore.

Conclusion: There are 6320 restaurants that accept table bookings and 43120 that do not, so 87.22% of the restaurants do not allow table booking. Many of these may be street-food-type restaurants, which do not take bookings.

Conclusion: North Indian and Chinese are the two most widely offered cuisines in Bangalore. North Indian cuisine is available in close to 20,000 restaurants and Chinese food in close to 14,000.

Conclusion: Biryani is the dish most liked by the people of Bangalore. There are around 12000 restaurants where biryani is one of the most popular dishes. Chicken is the second most liked dish in Bangalore.

Conclusion: The average cost for two people at restaurants in Bangalore is 561. The minimum cost of dining is 40 and the maximum is 6000, which shows that food is available in Bangalore at a wide range of prices.

Conclusion: Restaurants in Bangalore have an average of 296.76 votes. The minimum number of votes is 0 and the maximum is 16832. Very few restaurants in Bangalore have more than 1700 votes.

Conclusion: Only among restaurants rated 3.7 do the restaurants that do not accept online orders outnumber those that do. At every other rating, more restaurants accept online orders than do not.

Conclusion: Around 50% of the restaurants in Bangalore are of the delivery type, 24728 in total. Many restaurants (34%) also offer dine-out service. The least common types are pubs and bars, buffet, and drinks and nightlife. Pubs and bars number 669, the minimum among all restaurant types.

Conclusion from this pairplot

  1. In the plot of votes vs rate, most restaurants with a higher number of votes also have better ratings.
  2. In the plot of approx_cost vs rate, restaurants with higher ratings tend to have higher prices.
  3. In the plots of rate vs cost and rate vs votes, the data points are linearly separable.

EDA Summary

  • BTM alone has 3108 restaurants, the highest number of any location in Bangalore, making up 17% of the total. New BEL Road has the fewest restaurants, i.e. 725.
  • More restaurants accept online orders than do not: 29342 restaurants accept online orders and 20098 do not.
  • Restaurant ratings vary from 1.8 to 4.9. The average rating is 3.7.
  • CCD has 93 stores in Bangalore, the highest number for any restaurant in the city, followed by Onesta with 85.
  • There are 6320 restaurants that accept table bookings and 43120 that do not. Overall, 87.22% of the restaurants do not allow table booking. Many of these may be street-food-type restaurants, which do not take bookings.
  • North Indian, Chinese and South Indian are the top 3 cuisines, available in most restaurants.
  • Chicken is the dish most liked by the people of Bangalore, followed by biryani and rice.
  • The average cost of dining is 561. The minimum cost is 40 and the maximum cost is 6000.
  • Only among restaurants rated 3.7 do the restaurants that do not accept online orders outnumber those that do. At every other rating, more restaurants accept online orders than do not.
  • Around 50% of the restaurants in Bangalore are of the delivery type, 24728 in total. Many restaurants (34%) also offer dine-out service. The least common types are pubs and bars, buffet, and drinks and nightlife. Pubs and bars number 669, the minimum among all restaurant types.
  • The largest group of restaurants that allow table booking has a rating of 4.2, while the largest group of those that do not has a rating of 3.7. Irrespective of rating, fewer restaurants allow table booking than do not.

Defining a function to check multicollinearity using the VIF (variance inflation factor) method

Label encoding is used for the categorical columns.

Conclusion: From the VIF values, we can conclude that there is no multicollinearity among the independent variables, because the VIF value is very small for each of them.
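The original VIF function is not shown in this post, so here is a minimal sketch of one way to compute it. It relies on the identity that the VIF of column j equals the j-th diagonal entry of the inverse of the correlation matrix of the features. The function name and the synthetic columns are ours. The statsmodels library also provides a `variance_inflation_factor` function that could be used instead.

```python
import numpy as np


def vif_scores(X):
    """Return one VIF per column of X (rows are samples, columns are features).

    Uses VIF_j = [R^-1]_jj, where R is the correlation matrix of the columns.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Synthetic features, for illustration only.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of `a`

independent = vif_scores(np.column_stack([a, b]))   # both values close to 1
collinear = vif_scores(np.column_stack([a, b, c]))  # large values for `a` and `c`
```

A VIF close to 1 means a feature is barely explained by the others. A common rule of thumb treats values above 5 or 10 as a sign of multicollinearity.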

Feature Engineering

  1. Total number of cuisines available in each restaurant.
  2. Total number of dishes liked by the customers. It may be directly proportional to the rating.
  3. Facilities offered by restaurants: the 2 major facilities that a restaurant can provide are online ordering and table booking, so we sum both of them as a measure of the overall quality of service of the restaurant.
  4. Response coding: categorical features are converted into response-coded features, which simply performs mean value replacement.
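The first three features can be sketched in plain Python as shown here. We assume that cuisines and dish_liked are comma-separated strings and that online_order and book_table hold the strings "Yes" or "No". The helper names and the sample row are hypothetical.

```python
def count_items(comma_separated):
    """Count the non-empty items in a comma-separated string."""
    if not comma_separated:
        return 0
    return len([item for item in comma_separated.split(",") if item.strip()])


def facilities_offered(online_order, book_table):
    """Return 0, 1 or 2: how many of the two facilities are available."""
    return int(online_order == "Yes") + int(book_table == "Yes")


# Hypothetical row, for illustration only.
row = {
    "cuisines": "North Indian, Mughlai, Chinese",
    "dish_liked": "Pasta, Lunch Buffet",
    "online_order": "Yes",
    "book_table": "No",
}

num_cuisines = count_items(row["cuisines"])                            # 3
num_dishes_liked = count_items(row["dish_liked"])                      # 2
facilities = facilities_offered(row["online_order"], row["book_table"])  # 1
```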

Feature Engineering Summary

  1. Mean value replacement for dish_liked: we first applied response coding and then mean value replacement to the dish_liked column. We found that its value is almost the same as the rate column.
  2. Mean value replacement for cuisines: here too, we first applied response coding and then mean value replacement to the cuisines column.
  3. Number of cuisines available: this column contains the total number of cuisines available in each restaurant.
  4. Number of dish_liked: this column contains the total number of dishes liked by the customers in each restaurant.
  5. Facilities offered: if the restaurant allows both online_order and book_table, the value is 2. If it allows only one of them, the value is 1. If it allows neither, the value is 0.
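Mean value replacement, as used for dish_liked and cuisines above, can be sketched like this: each category is replaced by the mean rating of the training rows that share it. The function names and toy data are ours. Falling back to the global mean for categories not seen in training is our assumption, since the post does not say how unseen categories were handled.

```python
from collections import defaultdict


def fit_response_coding(categories, targets):
    """Learn the mean target per category, plus the global mean as a fallback."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean


def apply_response_coding(categories, means, global_mean):
    """Replace each category by its learned mean (global mean if unseen)."""
    return [means.get(cat, global_mean) for cat in categories]


# Toy training data, for illustration only.
train_cuisines = ["North Indian", "Chinese", "North Indian", "Cafe"]
train_rates = [4.0, 3.5, 3.0, 4.5]

means, global_mean = fit_response_coding(train_cuisines, train_rates)
encoded = apply_response_coding(["North Indian", "Italian"], means, global_mean)
```

The means should be learned on the training split only. If they are computed on the full data, the encoded feature leaks the target, which may be one reason an encoded column ends up looking almost identical to the rate column.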

Preprocessing of Features

We remove stopwords and other non-essential special characters from the reviews and store the result in a preprocessed_reviews column. Finally, we replace the original review column with the preprocessed_reviews column.
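A minimal sketch of this cleaning step is given below. The stopword set is a tiny hand-written sample and not the list used in the original notebook, and the function name is hypothetical.

```python
import re

# Tiny illustrative stopword set; a real pipeline would use a fuller list.
STOPWORDS = {"the", "a", "an", "and", "is", "was", "it", "to", "of", "in",
             "for", "this", "we", "i"}


def preprocess_review(text):
    """Lowercase the text, drop non-letter characters and remove stopwords."""
    text = text.lower()
    # Replace everything except lowercase letters and whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


cleaned = preprocess_review("The food was GREAT!!! We loved it :)")
# cleaned == "food great loved"
```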

Here we use CountVectorizer for categorical features, TF-IDF for text features and Normalizer for numerical features.

CountVectorizer for categorical features:

['no', 'yes']
Shape of training dataset one hot encoding & corresponding class label (23215, 2) (23215,)
Shape of cv dataset one hot encoding & corresponding class label (11435, 2) (11435,)
Shape of test dataset one hot encoding & corresponding class label (17067, 2) (17067,)

Normalizer for numerical features:

TF-IDF for text features:
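The three vectorization steps can be sketched with scikit-learn as follows. The toy columns are made up, and the exact parameters used in the original notebook are not shown in this post, so defaults are assumed throughout. Each vectorizer is fitted on the training split only and then applied to the other splits.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import Normalizer

# Toy train/test columns, for illustration only.
train_online = ["Yes", "No", "Yes"]
test_online = ["No", "Yes"]
train_reviews = ["food great loved", "service slow food cold", "great ambience"]
test_reviews = ["food cold"]
train_votes = np.array([775.0, 88.0, 166.0])

# Categorical feature: one column per category (text is lowercased by default).
cat_vec = CountVectorizer(binary=True)
X_train_online = cat_vec.fit_transform(train_online)
X_test_online = cat_vec.transform(test_online)

# Text feature: TF-IDF weights over the vocabulary learned from the training reviews.
tfidf = TfidfVectorizer()
X_train_reviews = tfidf.fit_transform(train_reviews)
X_test_reviews = tfidf.transform(test_reviews)

# Numerical feature: Normalizer rescales each ROW to unit norm, so the column
# is reshaped into a single row first and reshaped back into a column afterwards.
normalizer = Normalizer()
votes_norm = normalizer.fit_transform(train_votes.reshape(1, -1)).reshape(-1, 1)
```

The encoded pieces can then be stacked side by side, for example with `scipy.sparse.hstack`, to form the final feature matrix.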

Here we try to find the values of n_estimators and max_depth that give the minimum MSE for the regression model.

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)

0.027927709527412244
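A search of this kind can be sketched as a simple loop over candidate values, scoring each model on a held-out cross-validation split. The synthetic data and the candidate grids below are assumptions made for illustration. They are not the data or the grid used to produce the result above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the vectorized restaurant features.
rng = np.random.default_rng(42)
X = rng.uniform(size=(300, 4))
y = 3.0 + X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=300)

X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Hypothetical candidate values; try every combination and record the CV MSE.
results = {}
for n_estimators in (10, 50):
    for max_depth in (3, None):
        model = RandomForestRegressor(
            n_estimators=n_estimators, max_depth=max_depth, random_state=42
        )
        model.fit(X_train, y_train)
        results[(n_estimators, max_depth)] = mean_squared_error(
            y_cv, model.predict(X_cv)
        )

# Keep the combination with the lowest cross-validation MSE.
best_params = min(results, key=results.get)
best_mse = results[best_params]
```

scikit-learn's `GridSearchCV` automates the same loop with k-fold cross-validation, at the cost of fitting more models.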

Next, we used a few deep learning models to predict the ratings: LSTM, LSTM-CNN and CNN with Conv1D. In this problem, however, the machine learning models perform better than the deep learning models.

Finally, we compare the MSE values of all the models that we built for predicting the ratings.
