Please give me a five-star rating

Introduction

This project analyzes the “Recipes and Ratings” dataset.
This dataset has 17 features and 234429 examples. The relevant features are listed below:

Number	Feature	Description
1	id	The ID of the recipe
2	minutes	The time used in minutes to make the meal
3	n_steps	The number of steps used to make the meal
4	nutrition	The detailed amount of nutrition components, including calories, total fat, sugar, sodium, protein, saturated fat, and carbohydrates
5	rating	The rating given to the recipe (0 if no rating was given)

This project will be divided into two parts:

Part 1: Analyze the relationship between the minutes used in making the meals and the ratings.
Part 2: Build a multiclass classification model for predicting the ratings.

Data Cleaning and Exploratory Data Analysis

Data Cleaning

Step 1: Split the nutrition feature

The detailed amounts of nutrition components have been saved as strings in the feature “nutrition”, so the first step of data cleaning will be splitting this feature into several quantitative features.
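A minimal sketch of this split is shown here. It assumes each nutrition entry is a string holding a bracketed list of seven numbers, in the order given in the feature table; the exact storage format is not stated in this report, so treat that as an assumption. The helper name `split_nutrition` is hypothetical.

```python
import ast

import pandas as pd

# Column names follow the order listed in the feature table.
NUTRITION_COLUMNS = [
    "calories", "total_fat", "sugar", "sodium",
    "protein", "saturated_fat", "carbohydrates",
]


def split_nutrition(df: pd.DataFrame) -> pd.DataFrame:
    """Replace the string column `nutrition` with seven numeric columns."""
    # Assumption: each entry looks like "[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]".
    parsed = df["nutrition"].apply(ast.literal_eval)
    values = pd.DataFrame(parsed.tolist(), columns=NUTRITION_COLUMNS, index=df.index)
    return pd.concat([df.drop(columns="nutrition"), values], axis=1)


# One illustrative row; the numbers match the first row of the cleaned table.
example = pd.DataFrame({
    "id": [333281],
    "nutrition": ["[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]"],
})
cleaned = split_nutrition(example)
```

Using `ast.literal_eval` rather than `eval` keeps the parsing restricted to plain literals, which is the safer choice for strings read from a file.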

Step 2: fill NaN values

Many examples have a rating of 0 because not everyone leaves a rating, so the second step of data cleaning is to change the 0 ratings to NaN and fill them with 5.0, the most common rating. Below is the data after cleaning.

id	minutes	n_steps	calories	total_fat	sugar	sodium	protein	saturated_fat	carbohydrates	rating	minutes_interval
333281	40	10	138.4	10	50	3	3	19	6	4	[40, 60)
453467	45	12	595.1	46	211	22	13	51	26	5	[40, 60)
306168	40	6	194.8	20	6	32	22	36	3	5	[40, 60)
306168	40	6	194.8	20	6	32	22	36	3	5	[40, 60)
306168	40	6	194.8	20	6	32	22	36	3	5	[40, 60)

Univariate Analysis

First, the number of examples in each minutes interval is shown below:

The distribution of n_steps is shown below:

Bivariate Analysis

Below is the mean of the ratings at every minute interval:

The mean rating does not strictly decrease as the minutes increase: it is slightly higher for meals taking 60 minutes or more than for meals in the [40, 60) interval. This suggests that some people are willing to give a high rating to meals that take more time and are more exquisite.

Next, the mean of n_steps in each minutes interval is shown below:

The number of steps increases as the minutes increase, but more slowly beyond 20 minutes.

Aggregation Analysis

Here is the pivot table of the number of ratings and the mean rating in each minutes interval.

minutes_interval	count	mean
(0, 20)	52165	4.73114
[20, 40)	71600	4.69728
[40, 60)	43866	4.68821
[60, ∞)	56949	4.69752
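A table of this shape can be produced by binning the minutes and then grouping, as in the sketch below. The bin edges are read off the interval labels in the table; the function name `summarize_by_interval` and the sample data are hypothetical, and the handling of a value of exactly 0 minutes is an assumption.

```python
import numpy as np
import pandas as pd

# Bin edges and labels mirror the intervals in the pivot table.
BINS = [0, 20, 40, 60, np.inf]
LABELS = ["(0, 20)", "[20, 40)", "[40, 60)", "[60, ∞)"]


def summarize_by_interval(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and mean of `rating` for each minutes interval."""
    # right=False makes every bin closed on the left, e.g. [20, 40).
    # Assumption: a value of exactly 0 minutes would land in the first bin.
    binned = df.assign(
        minutes_interval=pd.cut(df["minutes"], bins=BINS, labels=LABELS, right=False)
    )
    grouped = binned.groupby("minutes_interval", observed=False)["rating"]
    return grouped.agg(["count", "mean"])


# Hypothetical sample, not taken from the real data.
sample = pd.DataFrame({
    "minutes": [10, 25, 45, 45, 90, 15],
    "rating": [5, 4, 5, 4, 5, 4],
})
summary = summarize_by_interval(sample)
```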

Imputation

Here we choose to fill the NaN ratings with 5.0 since it is the most common rating. This choice assumes that most people tend to give a 5.0 rating, even when the meal is bad.

Before the imputation, the distribution of ratings is shown below:

After the imputation, the distribution of ratings is shown below:

There is no 0 rating after imputation.
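The imputation described in this section can be sketched as follows. The function name `impute_ratings` and the sample values are hypothetical; the sketch fills with the most common observed rating, which the report states is 5.0 for this dataset.

```python
import numpy as np
import pandas as pd


def impute_ratings(ratings: pd.Series) -> pd.Series:
    """Treat 0 as 'no rating' and fill it with the most common rating."""
    with_nan = ratings.replace(0, np.nan)
    # mode() ignores NaN, so this is the most common real rating.
    most_common = with_nan.mode().iloc[0]
    return with_nan.fillna(most_common)


# Hypothetical ratings column, not taken from the real data.
ratings = pd.Series([5, 0, 4, 5, 0, 3, 5])
filled = impute_ratings(ratings)
```

Replacing 0 with NaN before computing the mode matters: otherwise a large number of unrated examples could make 0 itself the most common value.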

Framing a Prediction Problem

The rest of this project focuses on building a multiclass classification model that predicts the ratings from some features in the dataset.

Baseline Model

This part aims to set up a prediction model based on 2 features. Here is how the model is set up:

Step 1: Select features

The quantitative features chosen for the basic model are “minutes” and “n_steps”, which represent the time and complexity of making a meal. No categorical features are selected.

Step 2: Split Training and Testing Data

Set 80% of examples as training data, and 20% of examples as testing data.

Step 3: Set up model

Use the K-Neighbors model to classify the ratings, and use GridSearchCV to find the optimal number of neighbors in the range [3, 5]. The number of neighbors is tuned rather than fixed because the ratings are noisy: somebody may give a 5.0 rating even though the meal only deserves 3.0 or 2.0, or give a 3.0 to a perfect meal deserving a 5.0 rating. Note that the number of neighbors is a hyperparameter of the model and is independent of the number of rating classes.
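Steps 1 to 3 can be sketched with scikit-learn as below. This is an illustration under stated assumptions, not the project's actual code: the data is a synthetic stand-in, "[3, 5]" is read as the candidate values 3, 4, and 5, five-fold cross-validation is assumed, and `random_state=23` is borrowed from the value the report gives for the final model.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the cleaned data (hypothetical values).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "minutes": rng.integers(5, 120, size=500),
    "n_steps": rng.integers(1, 20, size=500),
    "rating": rng.choice([3, 4, 5], size=500, p=[0.1, 0.2, 0.7]),
})

# Step 1: select the two baseline features.
X = df[["minutes", "n_steps"]]
y = df["rating"]

# Step 2: 80% training data, 20% testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=23
)

# Step 3: tune the number of neighbors by cross-validation.
# Assumption: the candidate values are 3, 4, and 5, with 5 folds.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 4, 5]}, cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
# With the default scoring, score() returns the mean accuracy.
train_score = search.score(X_train, y_train)
test_score = search.score(X_test, y_test)
```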

Step 4: Fit Data and Evaluate Performance

The scores of the model on training and testing data are shown below:

Data	Score
Training	0.7546698281236085
Testing	0.7488200195921275

Final Model

The final model will predict the ratings based on more variables.

The steps of setting up this model are shown below:

Step 1: Add Features

These features are added to the selected features:

calories
total_fat
sugar
sodium
protein
saturated_fat
carbohydrates

With these features, the model predicts the ratings based not only on the time and complexity but also on the nutrition.

Step 2: Split Training and Testing Data

Set 80% of examples as training data and 20% of examples as testing data, with the random state set to 23 as before. Keeping the same random state means both models are trained and evaluated on the same split.

Step 3: Set up model

Again use the K-Neighbors model to classify the ratings, and use GridSearchCV to find the optimal number of neighbors in the range [3, 5].

Step 4: Fit Data and Evaluate Performance

The scores of the model on training and testing data improve, as shown below:

Data	Score
Training	0.799998886810936
Testing	0.7517143111586072

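The testing score rises by about 0.003 (from 0.7488 to 0.7517), a small but still valuable improvement from the added features. The training score rises more, by about 0.045, so the gap between the training and testing scores also widens.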
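One way to make such a comparison repeatable is to wrap the procedure in a helper and run it on both feature sets with the same split. The sketch below is a hypothetical illustration: `knn_scores` is not a name from the project, the data is synthetic, and the candidate values 3, 4, and 5 with five-fold cross-validation are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

BASELINE_FEATURES = ["minutes", "n_steps"]
NUTRITION_FEATURES = [
    "calories", "total_fat", "sugar", "sodium",
    "protein", "saturated_fat", "carbohydrates",
]


def knn_scores(df, features, random_state=23):
    """Return (training score, testing score) for a tuned K-Neighbors model."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["rating"], test_size=0.2, random_state=random_state
    )
    search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 4, 5]}, cv=5)
    search.fit(X_train, y_train)
    return search.score(X_train, y_train), search.score(X_test, y_test)


# Synthetic stand-in data (hypothetical values, not the recipes dataset).
rng = np.random.default_rng(0)
columns = BASELINE_FEATURES + NUTRITION_FEATURES
df = pd.DataFrame(rng.uniform(0, 100, size=(400, len(columns))), columns=columns)
df["rating"] = rng.choice([3, 4, 5], size=400, p=[0.1, 0.2, 0.7])

baseline_scores = knn_scores(df, BASELINE_FEATURES)
final_scores = knn_scores(df, BASELINE_FEATURES + NUTRITION_FEATURES)
```

Because the same `random_state` is used for both calls, the two models see the same training and testing examples, so any difference in scores comes from the features alone.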

Conclusion

This project analyzes the relationship between the minutes taken to make a meal and its rating, and also tries to predict the rating from the minutes, the number of steps, and the nutrition. In the future, more complex methods will be used and more features will be added to the model.