Predicting Prices with Machine Learning and Deep Learning Regression

High Level Overview:

Today we’re going to investigate whether deep learning is a viable alternative to classical machine learning for modeling smaller datasets. We’ll create regression models using various ensemble and regression methods as well as a regression-based deep learning model. We’ll evaluate them using Root Mean Squared Error (RMSE) and will finish by seeing which model is better.

Problem: We are starting to build more robust models that will eventually be deployed for our mobile phone business to develop the next great smartphone at a good price in Nigerian Naira. We want to leverage machine learning to do so. However, some of our colleagues have been asking why we should be using older techniques when classical machine learning is being outclassed by deep learning. If we could use deep learning for projects, we would be a more versatile data science organization. Understanding the challenge, we’ll test whether deep learning is worth using for this project.

The conditions are clear: using data that has price ranges, create a regression model for those price ranges. It doesn’t need to be pretty, but it needs to make at least some sense as a regression on the data, since the price bucketing in place is obscuring the true nature of the data, and getting the unbucketed values will take some convincing of stakeholders.

Data: The data comes from the Kaggle Mobile Price Classification dataset, which “emulates” the data our colleagues will normally see.

Exploratory Data Analysis:

First, let’s do some exploratory data analysis.

We’re first going to look at the target variable to see the distribution.

In this distribution, we see that it’s fairly bucketed. We know from the company that this bucketing is really there to hide the true data.

We see that many of these variables have a wide range of numbers, some of which are categorical. We also saw that the shape of the data is 2000 rows by 21 columns.
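To make the shape and target checks concrete, here is a minimal sketch using a synthetic stand-in frame (the column names mirror the Kaggle data, but the values here are made up; the real frame has 21 columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in for the Kaggle training set: 2000 rows, with price_range as the
# target. Values are hypothetical placeholders, not the real data.
df = pd.DataFrame({
    "battery_power": rng.integers(500, 2000, 2000),
    "blue": rng.integers(0, 2, 2000),           # binary: Bluetooth
    "ram": rng.integers(256, 4000, 2000),
    "price_range": rng.integers(0, 4, 2000),    # target buckets
})

print(df.shape)                           # (2000, 4) here; (2000, 21) for the real data
print(df["price_range"].value_counts())  # distribution of the target buckets
```

The same `shape` and `value_counts()` calls are what you would run against the real frame.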

Description of Data: As we can see, the variables focus on the smartphone experience. For instance, Bluetooth (blue) is one of the features, as well as how much RAM the phone has (ram), how many cores are in the phone (n_cores), and the screen dimensions in pixels (px_height, px_width).

The target variable, price_range, is on a value scale of 0 to 3. This scale is important because it is scaled to units of 100,000 Nigerian Naira.

Complication: During the EDA process we initially treat all variables as continuous values:

Seeing these descriptive values, we know that variables like Bluetooth are binary and price_range is a strict categorization, so let’s separate them out as categorical values.
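One simple way to do that split is by the number of distinct values per column. A sketch on a toy frame (column names from the Kaggle data, values and the cutoff of 4 are our assumptions):

```python
import pandas as pd

# Toy frame standing in for the dataset.
df = pd.DataFrame({
    "blue": [0, 1, 1, 0, 1, 0, 0, 1],                        # binary: Bluetooth
    "ram": [512, 700, 1024, 1500, 2048, 2600, 3071, 3800],   # continuous
    "price_range": [0, 1, 2, 3, 0, 1, 2, 3],                 # strict categories
})

# Treat any column with only a handful of distinct values as categorical.
categorical = [c for c in df.columns if df[c].nunique() <= 4]
continuous = [c for c in df.columns if c not in categorical]

print(categorical)  # ['blue', 'price_range']
print(continuous)   # ['ram']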

Let’s focus on seeing what the non-categorical data summary statistics look like.

We’ve done the easy part of looking into this data, but we still need to examine the categorical data, so let’s separate it out and see the frequency of its values.

Looking at the categorical data, most of it leans towards phones having a given feature rather than not.

Potential Complication: With the outcome variable price_range, we might have to worry about an unbalanced dataset. Luckily, looking at the values, all of them have the same frequency, so the classes are balanced.
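The balance check itself is a one-liner. A sketch with a hypothetical, perfectly balanced target column (the real Kaggle training set happens to have equal bucket counts as well):

```python
import pandas as pd

# Hypothetical target column with four equally frequent buckets.
price_range = pd.Series([0, 1, 2, 3] * 500)

counts = price_range.value_counts()
print(counts)

# Balanced when every class appears equally often.
print(counts.nunique() == 1)  # True
```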

Data Preprocessing: Now we need to do ourselves a favor and look into feature engineering. In this case, we don’t need to worry about engineering features or processing data because this is toy data, and it’s already been done by our data engineer. However, we need to focus on feature selection, and we can approach it by first looking to see correlated variables.

So we first look at highly correlated variables and see that most of the variables we have are not really correlated, except for price_range and ram. We will drop it, and we are left with a dataset of moderately, but not strongly, correlated variables.
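A sketch of that correlation check and drop, on synthetic data constructed so that ram tracks price_range strongly (mirroring the pattern observed above; the numbers and threshold are our assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic frame where 'ram' strongly tracks 'price_range' while
# 'battery_power' does not.
price_range = rng.integers(0, 4, 500)
df = pd.DataFrame({
    "price_range": price_range,
    "ram": price_range * 1000 + rng.normal(0, 100, 500),
    "battery_power": rng.integers(500, 2000, 500),
})

corr = df.corr().abs()
print(corr.loc["price_range", "ram"])  # close to 1: strongly correlated

# Drop the feature flagged by the correlation check.
reduced = df.drop(columns=["ram"])
print(list(reduced.columns))
```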

That’s good because it means we can focus on triaging low-variance variables. So now let’s look into variance with the data set.

Now we see that variance is generally high across the board, but there are a couple of stragglers whose variance is not as high as the others. Let’s remove those features by keeping only the high-variance features.
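scikit-learn’s VarianceThreshold does exactly this filtering. A minimal sketch on synthetic data (the 0.5 cutoff is a hypothetical choice, not the blog’s actual value):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(2)

# Two high-variance features and one zero-variance "straggler".
X = np.column_stack([
    rng.normal(0, 10, 200),   # high variance
    rng.normal(0, 5, 200),    # high variance
    np.full(200, 1.0),        # constant: zero variance
])

selector = VarianceThreshold(threshold=0.5)  # hypothetical cutoff
X_reduced = selector.fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # (200, 3) -> (200, 2)
print(selector.get_support())          # [ True  True False]
```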

Modeling:

Now we’re into the thick of it: we need to focus on modeling, and the best way to handle modeling is to create a pipeline, so let’s create one that can run multiple different machine learning algorithms.
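A sketch of one way to set that up with scikit-learn: a shared scaler-plus-model pipeline skeleton, with the model step swapped per algorithm (the candidate names and settings are ours, not the blog’s exact configuration):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor

# One pipeline skeleton per candidate algorithm; swap the 'model' step.
candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(),
    "elasticnet": ElasticNet(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

pipelines = {
    name: Pipeline([("scale", StandardScaler()), ("model", est)])
    for name, est in candidates.items()
}

for name, pipe in pipelines.items():
    print(name, "->", [step for step, _ in pipe.steps])
```

Each pipeline can then be fit and scored with the same two lines of code, which keeps the model comparison honest.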

Initial Model

Now, to check out an initial regression, we implement a simple linear regression with RMSE as the evaluation metric:

After implementing the initial linear regression, we see an RMSE of 1.06 for training and 1.09 for testing. Let’s create a histogram of the predicted test results (we will also generate histograms like these throughout the rest of the blog).
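For reference, here is a minimal sketch of that train/test RMSE computation. The features and target below are synthetic stand-ins, so the numbers will not match the 1.06/1.09 reported for the real data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Synthetic features and target standing in for the phone data.
X = rng.normal(size=(2000, 5))
y = X @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + rng.normal(0, 1.0, 2000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# RMSE = sqrt(MSE), computed on both splits to spot over/underfitting.
rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(round(rmse_train, 2), round(rmse_test, 2))
```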

We see that we have values that are within the range of the price_range variable. We also know that the RMSE may be a bit too high. It’s worth exploring more models and more parameters.

Hyperparameter Tuning

We have a whole host of regression-based models (Linear, ElasticNet, and Ridge), but we also have extreme gradient boosting and random forest to test out. These are common techniques used to model data, and now we can see if they will do the trick in our modeling case, so let’s model the data. We pass multiple inputs for each method’s parameters to tune the hyperparameters.
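A sketch of that tuning with scikit-learn’s GridSearchCV, on synthetic data and with hypothetical parameter grids (the blog’s actual grids live in its notebook):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))
y = X[:, 0] * 2 + rng.normal(0, 0.5, 200)

# One grid search per model family; grids here are illustrative only.
searches = {
    "ridge": GridSearchCV(
        Ridge(), {"alpha": [0.1, 1.0, 10.0]},
        cv=3, scoring="neg_root_mean_squared_error"),
    "random_forest": GridSearchCV(
        RandomForestRegressor(random_state=0),
        {"n_estimators": [100, 500]},
        cv=3, scoring="neg_root_mean_squared_error"),
}

for name, search in searches.items():
    search.fit(X, y)
    # best_score_ is negated RMSE, so flip the sign for reporting.
    print(name, search.best_params_, round(-search.best_score_, 2))
```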

I remember some of our colleagues also insisting that deep learning does the trick. Well, we’re going to grab a method from recent literature called LocalGLMnet. Created by Ronald Richman and Mario V. Wüthrich, this method creates a model similar to a generalized linear model, but implementable within deep learning networks. It will allow us to create a deep learning model that can do regression-based work. It’s very simple to build up, but it’s worth going through with it. We will borrow from the paper’s implementation extensively.

We’re going to use Keras to do the work. As you can tell from the structure, it’s a few dense layers with the maximum number of nodes being the number of features, in the form of maxlen.

We created a series of dense layers before a final activation layer, which produces the regression value. We have the loss set to mse, which stands for mean squared error. We can take that optimized quantity and then just apply a square root function to the output. Now let’s run the results and see what we get.
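To make the LocalGLMnet idea concrete without the full Keras scaffolding, here is a framework-free NumPy sketch of the forward pass: a small network maps the features x to per-feature attention weights beta(x), and the prediction is the GLM-style skip connection beta_0 + ⟨beta(x), x⟩. The layer sizes and the (untrained, random) weights below are placeholders, not the blog’s trained model:

```python
import numpy as np

rng = np.random.default_rng(5)

n_features = 5

def dense(x, W, b):
    # One hidden dense layer with tanh activation.
    return np.tanh(x @ W + b)

# Placeholder (untrained) weights for a 5 -> 16 -> 5 network.
W1 = rng.normal(size=(n_features, 16))
b1 = np.zeros(16)
W2 = rng.normal(size=(16, n_features))  # output: one beta per feature
b2 = np.zeros(n_features)
beta0 = 0.0                             # intercept term

def local_glmnet_predict(x):
    # beta(x): feature-wise regression attentions produced by the network.
    beta = dense(x, W1, b1) @ W2 + b2
    # GLM-style prediction: intercept plus inner product <beta(x), x>.
    return beta0 + np.sum(beta * x, axis=-1)

x = rng.normal(size=(3, n_features))
print(local_glmnet_predict(x).shape)  # (3,) — one prediction per row
```

In the actual Keras version, the same structure is expressed as dense layers feeding a dot-product skip connection, trained with the mse loss.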

We’ve built the model, but how do we evaluate it? We know that our models are focused predominantly on regression, even though the data for the price range is in a classification-style format. The answer is to do both: we will evaluate using root mean squared error, for both bucketed and non-bucketed predictions. For the sake of this work, we’ll focus primarily on the non-bucketed version. You can check out more in the GitHub Jupyter notebook.

Metrics with Justification: We will evaluate the models using Root Mean Squared Error (RMSE). It tells us how well the data fits the regression line produced by the model: the smaller the RMSE, the closer the predictions are to the actual values.
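Spelled out in code, RMSE is just the square root of the mean squared residual; the bucketed variant first snaps predictions back onto the 0–3 price buckets. The example values below are made up for illustration:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: sqrt of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = [0, 1, 2, 3]
y_pred = [0.4, 1.2, 2.6, 2.4]  # hypothetical regression outputs

print(rmse(y_true, y_pred))

# Bucketed variant: round and clip predictions onto the price buckets first.
bucketed = np.clip(np.round(y_pred), 0, 3)
print(rmse(y_true, bucketed))
```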

Results:

Machine Learning Model

Now let’s look at the results.

First what was the best model from the machine learning grid search:

We see that Random Forest with 500 estimators is the best machine learning model. Now let’s look at how the model created its predictions, visually, through a histogram.

Deep Learning Model:

Now we will put its results into a comparison table to inspect.

Comparison Table:

Let’s put the comparison into a table

| Evaluation | Linear Regression | Random Forest | Deep Learning (LocalGLMnet) |
| --- | --- | --- | --- |
| Train RMSE | 1.06 | 0.40 | 1.24 |
| Test RMSE | 1.09 | 1.12 | 1.39 |
| Out-of-bound predictions | No | No | Yes |

Table presenting the evaluation metrics of the models.

We see that overall, Random Forest has a low root mean squared error for both training and testing, meaning the model is producing good results. The deep learning model has a higher root mean squared error than the other models. The distribution of its predicted values is off compared to the reference data and even the regular machine learning predictions. The deep learning frequency histogram of predicted values has negative-value bins, which is impossible in our data set.

Conclusion

What this tells us is that maybe deep learning isn’t the right tool to use for this data set. This makes sense because we don’t have much data, and deep learning takes a whole lot of data to work properly. Thus the best technique, Random Forest, should be used to aid in our creation of the next smartphone. The Random Forest model will make predictions in line with our expectations for the price range and produce values that make sense.

Now, there’s a lot more to explore in the data set, which you can check in the Jupyter notebook in the GitHub repository. What we were able to demonstrate today is that sometimes it’s not about how powerful a technique is, but rather what you use the technique for. We saw that with Random Forest we were able to generate the model that best makes the predictions we need for mobile phone price prediction.

Improvements:

There are many improvements we can make to this model. We can see what other deep learning and machine learning techniques we can leverage. For instance, light gradient boosting as a machine learning technique may improve regression results, making it the best regression model. We can also make the deep learning network more suited for regression by looking for newer techniques that improve upon the architecture of deep learning for regression. Finally, more data would be helpful. The problem with this data set is that the labeled training set is small. Small data does not suit deep learning well, so if we got more labeled data we could potentially make a new model. I say labeled data because we also have a test set that is unlabeled. In the GitHub repository, we also check out how the model predicts labels for it, which could validate the model further if we got live data to verify against.

Acknowledgment:

I’d like to acknowledge the following:

My family

Kaggle

Udacity
