Restaurants from all over the world can be found in Bengaluru. From the United States to Japan, from Russia to Antarctica, you can find all types of cuisine here. Delivery, dine-out, pubs, bars, drinks, buffets, desserts: you name it and Bengaluru has it. The number of restaurants is increasing day by day and currently stands at approximately 12,000. Even with so many restaurants, the industry is not yet saturated, and new restaurants open every day. However, it has become difficult for them to compete with established restaurants. The key issues that continue to challenge them include high real estate costs, rising food costs, a shortage of quality manpower, a fragmented supply chain and over-licensing. This analysis of the Zomato data aims to study the demography of each location. Most importantly, it should help new restaurants decide their theme, menu, cuisine, cost and so on for a particular location. It also aims to find similarities between neighborhoods of Bengaluru on the basis of food. The dataset also contains reviews for each restaurant, which help in finding an overall rating for the place.
The basic idea of analyzing the Zomato dataset is to get a fair idea of the factors that affect the establishment of different types of restaurants in different parts of Bengaluru, and of the aggregate rating of each restaurant. Bengaluru has more than 12,000 restaurants serving dishes from all over the world. With new restaurants opening every day, the industry is not yet saturated and demand keeps increasing. In spite of this growing demand, it has become difficult for new restaurants to compete with established ones, since most of them serve the same food. Bengaluru is the IT capital of India, and most people here depend mainly on restaurant food because they do not have time to cook for themselves. With such overwhelming demand, it has become important to study the demography of a location. What kind of food is more popular in a locality? Does the entire locality love vegetarian food? If so, is that locality populated by a particular community, e.g. Jains, Marwaris or Gujaratis, who are mostly vegetarian?
Objective: Design a machine learning model to predict the rating of restaurants that accept orders through Zomato.
Prerequisites: This post assumes familiarity with basic machine learning concepts such as linear regression, decision trees, random forests, gradient boosted decision trees, one-vs-rest classifiers, multicollinearity, model-based imputation, CNN, CNN-LSTM, hyperparameter tuning and mean squared error.
- Reading Data- Reading the CSV file and storing it in a dataframe
- Missing Value Imputation- Replacing NULL values using model-based, mean-based and frequency-based imputation
- Exploratory Data Analysis- Plots such as pie plots, count plots and bar plots
- Data Preprocessing- Removing stopwords and unnecessary characters from the text data
- Vectorization- Using CountVectorizer, TfidfVectorizer and Normalizer to vectorize the data
- Building Models- Building different machine learning and deep learning models
Dataset Overview: Each row describes one restaurant listing, with the following features.
1. url: contains the url of the restaurant in the zomato website
2. address: contains the address of the restaurant in Bengaluru
3. name: contains the name of the restaurant
4. online_order: whether online ordering is available at the restaurant
5. book_table: whether table booking is available
6. votes: contains the total number of ratings for the restaurant as of the above mentioned date
7. phone: contains the phone number of the restaurant
8. location: contains the neighborhood in which the restaurant is located
9. rest_type: restaurant type, such as Quick Bites or Casual Dining
10. dish_liked: dishes people liked in the restaurant
11. cuisines: food styles, separated by comma
12. approx_cost(for two people): contains the approximate cost of a meal for two people
13. reviews_list: list of tuples containing reviews for the restaurant, each tuple consists of two values, rating and review by the customer
14. menu_item: contains the list of menu items available in the restaurant
15. listed_in(type): type of meal
16. listed_in(city): contains the neighborhood in which the restaurant is listed
Real-world/Business objectives and constraints:
-> No strict latency requirement.
-> Interpretability is not important.
Performance Metrics: Since this is a regression problem, our performance metric is Mean Squared Error (MSE). We will try to reduce the MSE as much as possible.
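As a reminder of what this metric computes, here is a minimal sketch of MSE in plain Python. The function name and the sample ratings are made up for illustration; in practice a library routine such as scikit-learn's mean_squared_error would normally be used.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual vs predicted ratings for three restaurants
actual = [4.0, 3.5, 3.0]
predicted = [3.5, 3.5, 4.0]
mse = mean_squared_error(actual, predicted)
print(mse)
```

Because the errors are squared, one prediction that is off by a full rating point costs as much as four predictions that are each off by half a point.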
(51717 , 17)
This dataset has 51717 rows and 17 columns.
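The shape above is what you get after loading the CSV into a dataframe. A minimal sketch of that step, assuming pandas; the file name zomato.csv and the tiny in-memory sample (with made-up restaurant names) are placeholders, not the real data:

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real file, so the sketch runs on its own.
sample_csv = io.StringIO(
    "name,online_order,book_table,rate,votes\n"
    "Cafe A,Yes,Yes,4.1,775\n"
    "Diner B,Yes,No,3.8,120\n"
    "Tiffin C,No,No,,15\n"
)

# With the real data this would be something like: df = pd.read_csv("zomato.csv")
df = pd.read_csv(sample_csv)

print(df.shape)  # (number of rows, number of columns)
```

The empty rate field in the last row is read as a missing value, which is the kind of gap the imputation steps later deal with.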
Checking the percentage of NULL values for each feature
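One common way to get these percentages with pandas, sketched on a toy dataframe. The values below are made up; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with some missing entries
df = pd.DataFrame({
    "rate": [4.1, np.nan, 3.8, np.nan],
    "dish_liked": ["Pasta, Pizza", None, None, None],
    "votes": [775, 0, 120, 15],
})

# isnull() gives True/False per cell; the column mean of that is the null fraction
null_pct = df.isnull().mean().mul(100).round(2).sort_values(ascending=False)
print(null_pct)
```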
Filling the Missing values
We use three different approaches to fill the missing values: model-based imputation, mean-based imputation and frequency-based imputation.
i. Model-based imputation: To fill the missing values of the columns “rate” and “dish_liked”, we use model-based imputation.
First, we divide the original dataframe into two dataframes: the first contains the rows with no null values and the second contains only the rows with null values. We build the model using the first dataframe and use it to predict the missing values of the second.
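The split, fit and predict idea described above can be sketched as follows for a numeric column such as rate. This is an illustration under assumptions: the post does not say which model or predictor columns were used, so the least-squares linear model, the predictor columns votes and cost, and all values here are stand-ins. A text column such as dish_liked would need a classifier rather than a regressor.

```python
import numpy as np
import pandas as pd

# Toy data: "rate" is missing for two rows (all values are made up)
df = pd.DataFrame({
    "votes": [10, 200, 50, 400, 120, 30],
    "cost": [300, 800, 400, 1200, 600, 350],
    "rate": [3.2, 4.1, np.nan, 4.5, np.nan, 3.4],
})
features = ["votes", "cost"]  # assumed predictor columns

# Split into rows where the target is known and rows where it is missing
known = df[df["rate"].notnull()]
missing = df[df["rate"].isnull()]


def design(frame):
    """Feature matrix with a leading column of ones for the intercept."""
    return np.column_stack([np.ones(len(frame)), frame[features].to_numpy(dtype=float)])


# Fit on the known rows (ordinary least squares as a stand-in model)
coef, *_ = np.linalg.lstsq(design(known), known["rate"].to_numpy(), rcond=None)

# Predict the missing rows and write the predictions back
df.loc[missing.index, "rate"] = design(missing) @ coef
print(df)
```

Rows that already had a rating are left untouched; only the rows in the second dataframe receive predicted values.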
Here is the model to predict the missing values of “dish_liked” column
array(['Murgh Ghee Roast, Egg Fried Rice, Thali, Mutton Biryani, Naan, Andhra Meal', 'Pizza, Mocktails, Coffee, Nachos, Salad, Pasta, Sandwiches', 'Pizza, Potato Wedges, Country Feast, Pasta, Garlic Bread, Lemonade', …, 'Ferrero Rocher Cake, Chocolate Cake', 'Ferrero Rocher Cake, Chocolate Cake', 'Ferrero Rocher Cake, Chocolate Cake'], dtype='<U134')
Here is the model to predict the missing values of “rate” column
array([3.47331694, 3.48851577, 3.44792981, …, 3.58956974, 3.58956974, 3.58956974])
ii. Mean-based imputation to fill missing values
iii. Frequency-based approach to fill missing values
We checked the most frequently occurring value for each of these columns and replaced the missing values with it.
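Both simple strategies are one-liners in pandas. A minimal sketch follows; the post does not list which columns received which treatment, so the two columns and their values here are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "approx_cost": [300.0, np.nan, 500.0, 700.0],                         # numeric
    "rest_type": ["Quick Bites", "Casual Dining", None, "Quick Bites"],   # categorical
})

# Mean-based: replace missing numeric values with the column mean
df["approx_cost"] = df["approx_cost"].fillna(df["approx_cost"].mean())

# Frequency-based: replace missing categories with the most frequent value (the mode)
df["rest_type"] = df["rest_type"].fillna(df["rest_type"].mode()[0])

print(df)
```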
Exploratory Data Analysis
i. Analysis on Location of restaurant
Conclusion- The number of restaurants varies by location. BTM has the highest number of restaurants in Bangalore, with 3108. New BEL Road has the fewest restaurants, followed by Banashankari. BTM accounts for 17.24% of the total restaurants in Bangalore.
ii. Analysis on online_order
Conclusion- More restaurants allow online ordering than do not. There are 29342 restaurants in Bangalore that accept online orders and 20098 that do not, so 59.35% of restaurants allow online ordering.
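Counts and percentages like these can be read off a column with value_counts. A small sketch on made-up data, assuming the column holds "Yes"/"No" strings:

```python
import pandas as pd

# Toy online_order column (values are made up)
online_order = pd.Series(["Yes", "Yes", "No", "Yes", "No"], name="online_order")

counts = online_order.value_counts()                                  # absolute counts
shares = online_order.value_counts(normalize=True).mul(100).round(2)  # percentages

summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```

The same two calls work for any of the categorical columns analysed in this section, such as location or book_table.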
iii. Analysis on ratings
Conclusion- The majority of restaurants have ratings between 3.6 and 3.9. About 15% of restaurants have a rating of approximately 3.7. The minimum rating is 1.8. Not a single restaurant in Bangalore has a rating of 5.
iv. Analysis on number of stores for each restaurant
Conclusion- The number of stores per restaurant varies. CCD has the maximum number of stores in Bangalore, followed by Onesta and Just Bake. Various restaurants have only one store, such as SV Juice Corner Tiffin and Brown Box. CCD's stores make up 9.26% of all stores in Bangalore.
v. Analysis on restaurants that allow table booking
Conclusion- There are 43120 restaurants that do not accept table bookings and 6320 that do. The majority of restaurants may be street-food type restaurants, since they do not allow table booking. 87.22% of restaurants do not allow table booking.
vi. Types of cuisines sold by most of the restaurants
Conclusion- North Indian and Chinese are the two most widely sold cuisines in Bangalore. North Indian cuisine is available at close to 20,000 restaurants and Chinese food at close to 14,000.
vii. Items liked by people in Bangalore
Conclusion- Biryani is the dish most liked by the people of Bangalore. There are around 12000 restaurants where biryani is one of the most popular dishes. Chicken is the second most liked dish in Bangalore.
viii. Analysis on cost of dining
Conclusion- The average cost for two people at restaurants in Bangalore is 561. The minimum cost of dining is 40 and the maximum is 6000. This shows that food is available at all sorts of prices in Bangalore.
ix. Analysis on votes
Conclusion- Restaurants in Bangalore have an average of 296.76 votes. The minimum number of votes is 0 and the maximum is 16832. Very few restaurants in Bangalore have more than 1700 votes.
x. Rating of restaurants vs online_order
Conclusion- Only among restaurants rated 3.7 do the restaurants that do not accept online orders outnumber those that do. For every other rating, more restaurants accept online orders than do not.
xi. Type of restaurant
Conclusion- Around 50% of restaurants in Bangalore are of the delivery type. The least common types are pubs and bars, buffet, and drinks and nightlife. A lot of restaurants (34%) also offer dine-out service. In total, 24728 restaurants belong to the delivery type. The number of pubs and bars is 669, which is the minimum among all restaurant types.
Conclusion from this pairplot
- In the plot of votes vs rate, most restaurants with a higher number of votes also have better ratings.
- In the plot of approx_cost vs rate, restaurants with higher ratings tend to have higher prices.
- In the graphs of rate vs cost and rate vs votes, the data points are linearly separable.
- BTM alone has 3108 restaurants, the highest number of any location in Bangalore. BEL has the fewest, i.e. 725. Restaurants in BTM make up 17% of the total.
- More restaurants take online orders than do not: 29342 restaurants accept online orders and 20098 do not.
- Restaurant ratings vary between 1.8 and 4.9. The average rating is 3.7.
- CCD has 93 stores in Bangalore, the highest number for any restaurant, followed by Onesta with 85.
- There are 43120 restaurants that do not accept table bookings and 6320 that do. The majority of restaurants may be street-food type restaurants, since they do not allow table booking.
- North Indian, Chinese and South Indian are the top 3 cuisines, available in most restaurants.
- Chicken is the dish most liked by the people of Bangalore, followed by biryani and rice.
- The average cost of dining is 561. The minimum cost is 40 and the maximum cost is 4000. Overall, 87.22% of restaurants do not allow table booking.
- Only among restaurants rated 3.7 do the restaurants that do not accept online orders outnumber those that do. For every other rating, more restaurants accept online orders than do not.
- Around 50% of restaurants in Bangalore are of the delivery type. The least common types are pubs and bars, buffet, and drinks and nightlife. A lot of restaurants (34%) also offer dine-out service. In total, 24728 restaurants belong to the delivery type. The number of pubs and bars is 669, which is the minimum among all restaurant types.
- Among restaurants that allow table booking, the largest group has a rating of about 4.2. Among restaurants that do not allow table booking, the largest group has a rating of about 3.7. Irrespective of rating, fewer restaurants allow table booking than do not.
Checking for multicollinearity
Defining a function to check multicollinearity using the VIF (variance inflation factor) method
Using label encoding as shown below
Conclusion- From the VIF values, we can conclude that there is no multicollinearity among the independent variables, because the VIF value is very small for each of them.
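For readers who have not met VIF before: the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. Values near 1 mean the feature is not explained by the others, and values above roughly 5 to 10 are a common rule-of-thumb warning sign. The following is a plain NumPy sketch of that definition on synthetic data, not the function used in this analysis; the helper name vif and the random features are assumptions.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of X (rows are samples)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    factors = []
    for j in range(n_features):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept
        others = np.column_stack([np.ones(n_samples), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r_squared = 1.0 - (residual @ residual) / ((target - target.mean()) ** 2).sum()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)  # almost a copy of a

vif_independent = vif(np.column_stack([a, b]))
vif_collinear = vif(np.column_stack([a, b, c]))
print(vif_independent)  # both values close to 1
print(vif_collinear)    # a and c get very large values, b stays near 1
```

Note that a perfectly duplicated column would give R² = 1 and a division by zero, so exact duplicates should be dropped before running a check like this.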
1. Total number of cuisines available in each restaurant
2. Total number of dishes liked by the customers. It may be directly proportional to the rating.
3. Facilities offered by restaurants: the two major facilities that a restaurant can provide are online ordering and table booking, so here we sum both of them to measure the overall quality of service of the restaurant.
4. This function converts categorical features into response-coded features. It simply performs mean value replacement.
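Mean value replacement can be sketched as below: each category is replaced by the average target value observed for it. The function names, the fallback for unseen categories and the toy values are assumptions made for this illustration.

```python
import pandas as pd


def fit_mean_encoding(train, column, target):
    """Learn the mean target per category, plus a global mean for unseen categories."""
    return train.groupby(column)[target].mean(), train[target].mean()


def apply_mean_encoding(frame, column, means, fallback):
    """Replace each category with its learned mean; unseen categories get the fallback."""
    return frame[column].map(means).fillna(fallback)


train = pd.DataFrame({
    "cuisines": ["North Indian", "Chinese", "North Indian", "Cafe"],
    "rate": [4.0, 3.0, 3.6, 4.2],
})
test = pd.DataFrame({"cuisines": ["Chinese", "Italian"]})  # "Italian" is unseen

means, fallback = fit_mean_encoding(train, "cuisines", "rate")
train["cuisines_enc"] = apply_mean_encoding(train, "cuisines", means, fallback)
test["cuisines_enc"] = apply_mean_encoding(test, "cuisines", means, fallback)
print(train)
print(test)
```

The means are learned from the training rows only and then applied to the other splits. Learning them from rows that are also used for evaluation would leak the target into the feature.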
Feature Engineering Summary
- Mean value replacement for dish_liked- First we applied response coding, followed by mean value replacement, to the dish_liked column. We found that its values are very similar to those of the rate column.
- Mean value replacement for cuisines- Here too, we first applied response coding, followed by mean value replacement, to the cuisines column.
- Number of cuisines available- This column contains the total number of cuisines available in each restaurant.
- Number of dish_liked- This column contains the total number of dishes liked by the customers in each restaurant.
- Facilities offered- If the restaurant allows both online_order and book_table, the value is 2. If it allows only one of them, the value is 1. If it allows neither, the value is 0.
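The count features and the facilities score summarised above could be derived roughly like this. The new column names, the comma-separated layout of the lists and the "Yes"/"No" encoding are assumptions for the sketch, and the rows are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "cuisines": ["North Indian, Chinese", "Cafe", "South Indian, Chinese, Biryani"],
    "dish_liked": ["Pasta, Pizza", "Coffee", "Biryani, Dosa, Idli"],
    "online_order": ["Yes", "No", "Yes"],
    "book_table": ["Yes", "No", "No"],
})


def count_items(cell):
    """Number of non-empty comma-separated entries in a cell (0 if missing)."""
    if pd.isna(cell):
        return 0
    return len([item for item in str(cell).split(",") if item.strip()])


df["num_cuisines"] = df["cuisines"].apply(count_items)
df["num_dish_liked"] = df["dish_liked"].apply(count_items)

# 2 = both facilities, 1 = exactly one of them, 0 = neither
df["facilities_offered"] = (
    df["online_order"].eq("Yes").astype(int) + df["book_table"].eq("Yes").astype(int)
)
print(df[["num_cuisines", "num_dish_liked", "facilities_offered"]])
```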
Preprocessing of Features
We remove the stopwords and other non-essential special characters from the reviews, producing a preprocessed_reviews column. Finally, we replace the original review column with the preprocessed_reviews column.
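A minimal sketch of this kind of cleaning, using only the standard library. The tiny stopword set is a stand-in chosen for the example (a real run would use a fuller list, for instance from NLTK), and the function name and sample review are made up.

```python
import re

# Deliberately small stopword list for illustration only
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "in", "it", "this"}


def preprocess_review(text):
    """Lowercase, drop everything except letters, then remove stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # digits, punctuation and emoji become spaces
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)


cleaned = preprocess_review("The biryani was GREAT!!! 10/10, and the service is quick :)")
print(cleaned)  # biryani great service quick
```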
Here we use CountVectorizer for categorical features, TF-IDF for text features and Normalizer for numerical features.
CountVectorizer for categorical features:
Shape of training dataset one hot encoding & corresponding class label (23215, 2) (23215,)
Shape of cv dataset one hot encoding & corresponding class label (11435, 2) (11435,)
Shape of test dataset one hot encoding & corresponding class label (17067, 2) (17067,)
Normalizer for numerical features:
TF-IDF for text features:
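The three vectorization steps can be sketched together with scikit-learn as follows. The toy rows and variable names are made up, and stacking the pieces with scipy's hstack is an assumption about how the final feature matrix is assembled. Each vectorizer is fitted on the training split only and then reused on the other splits.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import Normalizer

# Toy train/test splits (values are made up)
train_online = ["Yes", "No", "Yes"]
test_online = ["No", "Yes"]
train_reviews = ["great biryani quick service", "cold food slow service", "great coffee"]
test_reviews = ["slow service", "great pizza"]
train_votes = np.array([775.0, 12.0, 300.0])
test_votes = np.array([40.0, 900.0])

# Categorical feature: one column per category ("no", "yes")
cat_vectorizer = CountVectorizer()
X_train_cat = cat_vectorizer.fit_transform(train_online)
X_test_cat = cat_vectorizer.transform(test_online)

# Numerical feature: Normalizer rescales each ROW to unit norm, so the column
# is passed as a single row and reshaped back into a column afterwards
normalizer = Normalizer()
X_train_num = normalizer.transform(train_votes.reshape(1, -1)).reshape(-1, 1)
X_test_num = normalizer.transform(test_votes.reshape(1, -1)).reshape(-1, 1)

# Text feature: TF-IDF weights, vocabulary learned from the training reviews
tfidf = TfidfVectorizer()
X_train_text = tfidf.fit_transform(train_reviews)
X_test_text = tfidf.transform(test_reviews)

# Stack everything side by side into one sparse matrix per split
X_train = hstack([X_train_cat, csr_matrix(X_train_num), X_train_text]).tocsr()
X_test = hstack([X_test_cat, csr_matrix(X_test_num), X_test_text]).tocsr()
print(X_train.shape, X_test.shape)
```

Words that appear only in the test reviews (such as "pizza" here) are simply ignored at transform time, which keeps the number of columns identical across splits.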
Hyperparameter tuning for the random forest algorithm
Here we try to find the values of n_estimators and max_depth that give the minimum MSE for the regression model.
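One straightforward way to run such a search is to loop over candidate values, fit on the training split and score on the cross-validation split. The sketch below does that on synthetic data; the candidate grid, the data and the split sizes are all assumptions, and the post does not confirm that the search was implemented this way.

```python
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data standing in for the real feature matrix
rng = np.random.default_rng(42)
X = rng.uniform(size=(120, 3))
y = 3.0 + X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=120)

X_train, y_train = X[:80], y[:80]
X_cv, y_cv = X[80:], y[80:]

results = {}
for n_estimators, max_depth in product([10, 50], [2, 5]):
    model = RandomForestRegressor(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0
    )
    model.fit(X_train, y_train)
    # Score each combination on the held-out cross-validation split
    results[(n_estimators, max_depth)] = mean_squared_error(y_cv, model.predict(X_cv))

best_params = min(results, key=results.get)
print("best (n_estimators, max_depth):", best_params)
print("cv mse:", results[best_params])
```

scikit-learn's GridSearchCV with scoring="neg_mean_squared_error" automates the same idea using k-fold cross-validation instead of a single held-out split.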
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)
Applying Random forest model with best hyperparameters
Deep learning models:
Next, we used a few deep learning models to predict the ratings: LSTM, LSTM-CNN and CNN with Conv1D. In this problem, however, the machine learning models perform better than the deep learning models.
Finally, we compare the MSE values of all the models that we built for predicting the ratings.