Comparing machine learning models for a regression problem


Comparing different machine learning models for a regression problem is necessary to find out which model is the most efficient and provides the most accurate results. There are many test criteria for comparing the models. In this article, we will take a regression problem, fit different popular regression models to it and select the best one of them.

We have already discussed how to compare different machine learning models when we have a classification problem in hand (the article is here). In such cases the response variable is a categorical one, and different popular classification algorithms are compared to come out with the best algorithm.


Comparing regression models

So, what if the response variable is a continuous one and not categorical? This is then a regression problem, and we have to use regression models to estimate the predicted values. In this case, too, there are several candidate regression models, and our task is to find the one which serves our purpose best.

So, in this article, we are taking a regression problem of predicting the value of a continuous variable. We will fit several regression models and compare their performance by calculating the prediction accuracy and several goodness-of-fit statistics.

Here I have used the five most prominent and popular regression models and compared them according to their prediction accuracy. The supervised models used here are:

  • Multiple Linear Regression (MLR)
  • Decision Tree regression
  • Random Forest regression
  • Support Vector Regression (SVR)
  • Deep learning (an artificial neural network built with Keras)

The models were compared using two very popular model comparison metrics, namely Mean Absolute Error (MAE) and Mean Square Error (MSE). The expressions for these two metrics are given below.

Mean Absolute Error (MAE)

Comparing different machine learning models for a regression problem involves an important part: comparing the original and estimated values. If $y$ is the response variable and $\hat{y}$ is its estimate, then MAE is the mean error over the $n$ pairs of true and estimated values, calculated with this equation:

\[ MAE = \frac{1}{n}\sum_{i=1}^{n}\left| y_i - \hat{y}_i \right| \]

MAE is a scale-dependent metric, which means it has the same unit as the original variable. So it is not a very reliable statistic when comparing models applied to series with different units. It measures the mean of the absolute error between the true and estimated values of the same variable.

Mean Square Error (MSE)

This metric of model comparison, as the name suggests, is the mean of the squares of the errors between the true and estimated values:

\[ MSE = \frac{1}{n}\sum_{i=1}^{n}\left( y_i - \hat{y}_i \right)^2 \]

Because the errors are squared, MSE penalizes large deviations more heavily than MAE does.
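To make the two metrics concrete before applying them to real data, here is a minimal sketch on made-up numbers (the values below are purely illustrative), computing both metrics by hand and with the sklearn functions used throughout this article.

import numpy as np
from sklearn import metrics

y_true = np.array([10.0, 12.0, 15.0, 11.0])  # hypothetical true values
y_pred = np.array([11.0, 11.5, 14.0, 12.5])  # hypothetical estimates

mae_manual = np.mean(np.abs(y_true - y_pred))  # (1/n) * sum of |y - y_hat|
mse_manual = np.mean((y_true - y_pred) ** 2)   # (1/n) * sum of (y - y_hat)^2

# The same values via sklearn
print(mae_manual, metrics.mean_absolute_error(y_true, y_pred))
print(mse_manual, metrics.mean_squared_error(y_true, y_pred))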

Python code for comparing the models

So, now the comparison between the different machine learning models is conducted using Python. We will see the step-by-step application of all the models and how their performance can be compared.

Loading required libraries

All the required libraries are first loaded here.

import numpy as np # linear algebra
import pandas as pd # data processing
from pandas import DataFrame, Series
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns # plotting library
import missingno as msno # plotting missing data
from sklearn import linear_model, metrics, svm
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
# The deep learning section below also needs TensorFlow, Keras and the
# tensorflow_docs helpers (installed with: pip install git+https://github.com/tensorflow/docs)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_docs as tfdocs
import tensorflow_docs.plots
import tensorflow_docs.modeling

The example data and its preprocessing

The data set used here is a car data set from GitHub, and you can access the data file from this link. The data set has the following independent variables:

  • Age
  • Gender
  • Average miles driven per day
  • Personal debt
  • Monthly income

Based on these independent variables, we have to predict the potential sale value of a car. So, here the response variable is the sale value of the car, and it is a continuous variable. That is why the problem at hand is a regression problem.

Importing the data

The piece of code below uses the pandas read_csv() function to import the data set into the workspace. The describe() function gives a brief idea about the data.

dataset = pd.read_csv("cars.csv")
dataset.describe()

Displaying the last few rows of the data set to have a glimpse of the data and variables.

Last few rows of the data set
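The screenshot above was most likely produced with a call like the one below (an assumption, since the original code is not shown); tail() is the standard pandas way to look at the last five rows.

dataset.tail()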

Check the data for missing values

The following code checks whether there are any missing values in the data set. Missing values create problems in the analysis process, so we should filter them out in the data pre-processing stage. Here we find out which columns contain missing values, and the corresponding rows are simply dropped from the data set.

# Finding all the columns with NULL values
dataset.isna().sum()
# Drop the rows with missing values
dataset = dataset.dropna()

Creating basic plots with the data

Here we create the joint distribution plots of the variables, including the response variable sales.

sns.pairplot(dataset[['age', 'miles', 'debt', 'income', 'sales']], diag_kind="kde")
Joint distribution plot of the variables

Splitting the data set

Data splitting is required to create training and testing data sets from the same car data. I have taken 80% of the whole data set as training data and the remaining 20% as the test data set. The following Python code does this splitting.

train_dataset = dataset.sample(frac=0.8,random_state=0)
test_dataset = dataset.drop(train_dataset.index)

Normalizing the training data set

First of all, we look at the summary statistics of all the variables using the describe() function of the pandas library.

# Calculating basic statistics with the train data
train_stats = train_dataset.describe()
train_stats.pop("sales") # excluding the dependent variable
train_stats = train_stats.transpose()
train_stats

From the stats below, we can see that the different variables in the data set have very large ranges and deviations, which may create problems during model fitting. So, before we use these variables in the model building process, we will normalize them.

Summary statistics of the training data set

Creating a function for normalization

Using the mean and standard deviation of each of the variables, we will convert them into standard normal variates. For that purpose, we create the function below.

# Creating the normalizing function with mean and standard deviation
def norm(x):
  return (x - train_stats['mean']) / train_stats['std']
# The response variable sales is excluded before normalizing, since
# train_stats was computed without it
normed_train_data = norm(train_dataset.drop(columns="sales"))
normed_test_data = norm(test_dataset.drop(columns="sales"))

Separating the response variable and creating other variables

Now comes a most important step: storing the response variable in a separate variable.

train_labels = train_dataset.pop("sales") # using .pop function to store only the dependent variable
test_labels = test_dataset.pop("sales")
x_train=normed_train_data
x_test=normed_test_data
y_train=train_labels
y_test=test_labels

As we are now finished with the data pre-processing stage, we will start with the modelling steps. So, let’s start coding for all the five models I have mentioned to predict the car sale price.

Application of Multiple Linear Regression

First of all, Multiple Linear Regression (MLR). This is simply linear regression, but we include all the independent variables to estimate the car sale price. The LinearRegression() function from the linear_model module of the sklearn library is used here for the purpose.

lin_reg = LinearRegression()
lin_reg.fit(x_train,y_train)
#Prediction using test set 
y_pred = lin_reg.predict(x_test)
mae=metrics.mean_absolute_error(y_test, y_pred)
mse=metrics.mean_squared_error(y_test, y_pred)
# Printing the metrics
print('R2 square:',metrics.r2_score(y_test, y_pred))
print('MAE: ', mae)
print('MSE: ', mse)
Metrics for MLR

Application of Decision Tree regression

dt_regressor = DecisionTreeRegressor(random_state = 0)
dt_regressor.fit(x_train,y_train)
#Predicting using test set 
y_pred = dt_regressor.predict(x_test)
mae=metrics.mean_absolute_error(y_test, y_pred)
mse=metrics.mean_squared_error(y_test, y_pred)
# Printing the metrics
print('Decision Tree Regression score: ', dt_regressor.score(x_test,y_test))
print('R2 square:',metrics.r2_score(y_test, y_pred))
print('MAE: ', mae)
print('MSE: ', mse)
Metrics for Decision Tree regression

Application of Random Forest Regression

rf_regressor = RandomForestRegressor(n_estimators = 300 ,  random_state = 0)
rf_regressor.fit(x_train,y_train)
#Predicting the SalePrices using test set 
y_pred = rf_regressor.predict(x_test)
mae=metrics.mean_absolute_error(y_test, y_pred)
mse=metrics.mean_squared_error(y_test, y_pred)
# Printing the metrics
print('Random Forest Regression score: ', rf_regressor.score(x_test,y_test))
print('R2 square:',metrics.r2_score(y_test, y_pred))
print('MAE: ', mae)
print('MSE: ', mse)
Metrics for Random Forest regression

Application of Support Vector Regression

from sklearn.svm import SVR
regressor= SVR(kernel='rbf')
regressor.fit(x_train,y_train)
y_pred_svm=regressor.predict(x_test)
mae=metrics.mean_absolute_error(y_test, y_pred_svm)
mse=metrics.mean_squared_error(y_test, y_pred_svm)
# Printing the metrics
print('Support Vector Regression score: ', regressor.score(x_test,y_test))
print('R2 square:',metrics.r2_score(y_test, y_pred_svm))
print('MAE: ', mae)
print('MSE: ', mse)
Metrics for Support Vector Regression

Application of Deep Learning using Keras library

Here is the deep learning model mentioned earlier. A sequential model has been used. The model is created as a function named build_model so that we can call it whenever it is required in the process. The model has two fully connected hidden layers with a Rectified Linear Unit (relu) activation function and an output layer with a linear activation.

The hidden layers have 12 and 8 neurons respectively, and the input layer takes all the independent variables. Mean Squared Error is the loss function here, as it is the most common loss function for regression problems.

def build_model():
  model = keras.Sequential([
    layers.Dense(12,kernel_initializer='normal', activation='relu', input_shape=[len(train_dataset.keys())]),
    layers.Dense(8, activation='relu'),
    layers.Dense(1, activation='linear')
  ])

  optimizer = tf.keras.optimizers.RMSprop(0.001)

  model.compile(loss='mse',
                optimizer=optimizer,
                metrics=['mae', 'mse'])
  return model
model = build_model()

Displaying the model summary

This part of the code shows the summary of the model we built. All the specifications mentioned above can be seen in the screenshot of the output below.

model.summary()
Deep learning model summary

Trying out the model

Before training, we run the model on the first 10 rows of the training data to check that it produces output of the expected shape. As the result seems satisfactory, we will proceed with this model.

example_batch = normed_train_data[:10]
example_result = model.predict(example_batch)
example_result

Fitting the model

Now we will fit the model with 1000 epochs and store the training and validation metrics in the object named history.

EPOCHS = 1000

history = model.fit(
  normed_train_data, train_labels,
  epochs=EPOCHS, validation_split = 0.2, verbose=0,
  callbacks=[tfdocs.modeling.EpochDots()])
History of the model fit

Here we will produce a glimpse of the history stats to understand how the training process progresses.

hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

Plotting the MAE score during the training process

As we are using 1000 epochs to train the model, there are 1000 forward and backward passes while the model is trained. We expect that with each pass the loss will decrease and the model's prediction accuracy will increase as the training process progresses.

plotter = tfdocs.plots.HistoryPlotter(smoothing_std=2)
plotter.plot({'Basic': history}, metric = "mae")
plt.ylim([0, 10000])
plt.ylabel('MAE [sales]')

In the above plot, we can see that both the training and validation loss decrease in an exponential fashion as the number of epochs increases.

test_predictions = model.predict(normed_test_data).flatten()
a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [sales]')
plt.ylabel('Predictions [sales]')
lims = [0, 40000]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims, lims)

Plotting the result

Here we have plotted the predicted sale prices against the true sale prices. From the plot, it is clear that the estimates are quite close to the original values.

Original vs predicted values of the sale price of cars

Plotting the error

error = test_predictions - test_labels
plt.hist(error, bins = 125)
plt.xlabel("Prediction Error [Average_Fruit_fly_population]")
_ = plt.ylabel("Count")

Here we have plotted the errors. Although the distribution of the errors is not truly Gaussian, we can expect it to tend towards a Gaussian distribution as the sample size increases.
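One way to check this visually, as an addition to the original analysis and assuming scipy is available in the environment, is a normal Q-Q plot of the errors:

from scipy import stats
# Points falling close to the straight line indicate near-Gaussian errors
stats.probplot(error, dist="norm", plot=plt)
plt.title("Q-Q plot of prediction errors")
plt.show()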

mae=metrics.mean_absolute_error(test_labels, test_predictions)
mse=metrics.mean_squared_error(test_labels, test_predictions)
# Printing the metrics
print('R2 square:',metrics.r2_score(test_labels, test_predictions))
print('MAE: ', mae)
print('MSE: ', mse)
Metrics for the deep learning model

Conclusion

So, here we can compare the performance of all the models using the calculated metrics. Let's see all the models used to predict the car sale price together, along with their metrics, for ease of comparison.

Model type                 MAE     R square
MLR                        2821    0.80
Decision Tree              2211    0.84
Random Forest              1817    0.88
Support Vector Machine     7232    0
Deep learning/ANN          2786    0.8

Comparison table for all the models used
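Such a comparison table can also be assembled in code. The sketch below is a minimal way to do it, assuming the fitted model objects from the sections above (lin_reg, dt_regressor, rf_regressor and the SVR regressor) and the Keras predictions (test_labels, test_predictions) are still in scope.

# Collecting the test-set metrics of all the models in one DataFrame
models = {
    'MLR': lin_reg,
    'Decision Tree': dt_regressor,
    'Random Forest': rf_regressor,
    'Support Vector Machine': regressor,
}
rows = []
for name, m in models.items():
    pred = m.predict(x_test)
    rows.append({'Model': name,
                 'MAE': metrics.mean_absolute_error(y_test, pred),
                 'R square': metrics.r2_score(y_test, pred)})
# The deep learning row reuses the Keras predictions computed earlier
rows.append({'Model': 'Deep learning/ANN',
             'MAE': metrics.mean_absolute_error(test_labels, test_predictions),
             'R square': metrics.r2_score(test_labels, test_predictions)})
comparison = pd.DataFrame(rows)
print(comparison)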

From the table above, it is clear that for the present problem the best performing model is Random Forest, with the highest R square (coefficient of determination) and the lowest MAE. But we should keep in mind that deep learning is not far behind on these metrics. And the beauty of deep learning is that its accuracy keeps improving as the training sample size increases.

The other models, in contrast, reach a plateau in prediction accuracy after a certain point, beyond which even a larger training sample cannot further improve their performance. So, although deep learning occupies the third position in the present situation, it has the potential to improve further if the availability of training data is not a constraint.

If the data set is small and we need good predictions for the response variable, as is the case here, it is a good idea to go for models like Random Forest or Decision Tree, as they are capable of generating good predictions with less training (labelled) data.

So, finally, it is the call of the researcher or modeller to select the model best suited to the situation and the field of knowledge, as different fields of science generate experimental data of distinct natures, and a very good model in one field may fail completely to predict in another.
