machine learning models Archives

Comparing machine learning models for a regression problem

May 10, 2021July 11, 2020 by Dibyendu Deb

Comparing different machine learning models for a regression problem is necessary to find out which model is the most efficient and provide the most accurate result. There are many test criteria to compare the models. In this article, we will take a regression problem, fit different popular regression models and select the best one of them.

We have discussed how to compare different machine learning problems when we have a classification problem in hand (the article is here). That means in such cases the response variable is a categorical one. Different popular classification algorithms are compared to come out with the best algorithm.

NB: Being a non-native English speaker, I always take extra care to proofread my articles with Grammarly. It is the best grammar and spellchecker available online. Read here my review of using Grammarly for more than two years.

Try Grammarly here

Comparing regression models

So, what if the response variable is a continuous one and not categorical. This is a problem of regression then and we have to use regression models to estimate the predicted values. In this case also several candidate regression models. Our task is to find the one which serves our purpose.

So, in this article, we are taking a regression problem of predicting the value of a continuous variable. We will compare several regression models, compare their performance calculating the prediction accuracy and several goodnesses of fit statistics.

Here I have used five most prominent and popular regression models and compared them according to their prediction accuracy. The supervised models used here are

The models were compared using two very popular model comparison metrics namely Mean Absolute Error(MAE) and Mean Square Error (MSE). The expressions for these two metrics are as below:

Mean Absolute Error(MAE)

Comparing different machine learning models for a regression problem involves an important part of comparing original and estimated values. If $\dpi{150} y$ is the response variable and $\dpi{150} \widehat{y}$ is the estimate then MAE is the error between these $\dpi{150} n$ pair of variables and calculated with this equation:

MAE is a scale-dependent metric that means it has the same unit as the original variables. So, this is not a very reliable statistic when comparing models applied to different series with different units. It measures the mean of the absolute error between the true and estimated values of the same variable.

Mean Square Error (MSE)

This metric of model comparison is as the name suggests calculate the mean of the squares of the error between true and estimated values. So, the equation is as below:

Python code for comparing the models

So, now the comparison between different machine learning models is conducted using python. We will see step by step application of all the models and how their performance can be compared.

Loading required libraries

All the required libraries are first loaded here.

import numpy as np # linear algebra
import pandas as pd # data processing
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn import metrics
from pandas import DataFrame,Series
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import matplotlib
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split,cross_val_score, cross_val_predict
import missingno as msno # plotting missing data
import seaborn as sns # plotting library
from sklearn import svm

The example data and its preprocessing

The data set used here is the car data set from Github and you can access the data file from this link. The data set has the following independent variables:

Age
Gender
Average miles driven per day
Personal debt and
Monthly income

Based on these independent variables we have to predict the potential sale value of a car. So, here the response variable is the sale value of the car and it is a continuous variable. That is why the problem in hand is a regression problem.

Importing the data

The below piece of code will use the panda library read() function to import the data set into the working space. The describe() function is for a brief idea about the data.

dataset = pd.read_csv("cars.csv")
dataset.describe()

Displaying the last few columns of the data set to have a glimpse of the data and variables.

Check the data for missing values

The following code is to check if there any missing value in the data set. Missing value creates a problem in the analysis process. So, we should filter these values in the data pre-processing stage. Here we will find out which columns contain missing values and the corresponding rows will be simply dropped from the data set.

# Finding all the columns with NULL values
dataset.isna().sum()
# Drop the rows with missing values
dataset = dataset.dropna()

Creating basic plots with the data

Here we create the joint distribution plot of the independent variables

sns.pairplot(dataset[['age', 'miles', 'debt', 'income', 'sales']], diag_kind="kde")

Splitting the data set

Data splitting is required to create training and testing data sets from the same car data. I have taken 80% of the whole data set as training data and the rest 20% of data as the test data set. The following python code is for this splitting purpose.

train_dataset = dataset.sample(frac=0.8,random_state=0)
test_dataset = dataset.drop(train_dataset.index)

Normalizing the training data set

First of all we will see the summary statistics of all the variables using the describe() function of sklearn library.

# Calculating basic statistics with the train data
train_stats = train_dataset.describe()
train_stats.pop("sales") # excluding the dependent variable
train_stats = train_stats.transpose()
train_stats

Here from the below stats about the data set, we can see that different variables in the data set has very large range and deviations, which may create problem during model fitting. So, before we use this variables in model building process, we will normalize the variables.

Summary statistics of the training data set

Creating a function for normalization

Using the mean and standard deviation of each of the variables, we will convert them into standard normal variates. For that purpose, we will create the function as in below.

# Creating the normalizing function with mean and standard deviation
def norm(x):
  return (x - train_stats['mean']) / train_stats['std']
normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)

Separating the response variable and creating other variables

Now a most important step to store the response variable in a separate variable.

train_labels = train_dataset.pop("sales") # using .pop function to store only the dependent variable
test_labels = test_dataset.pop("sales")
x_train=normed_train_data
x_test=normed_test_data
y_train=train_labels
y_test=test_labels

As we are now finished with the data pre-processing stage, we will start with the modelling steps. So, let’s start coding for all the five models I have mentioned to predict the car sale price.

Application of Multiple Linear Regression

First of all Multiple Linear Regression (MLR). This simple linear regression only but we will include all the independent variables to estimate the car sale price. The LinearRegression() function from LinearModel module of sklearn library has been used here for the purpose.

lin_reg = LinearRegression()
lin_reg.fit(x_train,y_train)
#Prediction using test set 
y_pred = lin_reg.predict(x_test)
mae=metrics.mean_absolute_error(y_test, y_pred)
mse=metrics.mean_squared_error(y_test, y_pred)
# Printing the metrics
print('R2 square:',metrics.r2_score(y_test, y_pred))
print('MAE: ', mae)
print('MSE: ', mse)

Application of Decision Tree regression

dt_regressor = DecisionTreeRegressor(random_state = 0)
dt_regressor.fit(x_train,y_train)
#Predicting using test set 
y_pred = dt_regressor.predict(x_test)
mae=metrics.mean_absolute_error(y_test, y_pred)
mse=metrics.mean_squared_error(y_test, y_pred)
# Printing the metrics
print('Suppport Vector Regression Accuracy: ', dt_regressor.score(x_test,y_test))
print('R2 square:',metrics.r2_score(y_test, y_pred))
print('MAE: ', mae)
print('MSE: ', mse)

Application of Random Forest Regression

rf_regressor = RandomForestRegressor(n_estimators = 300 ,  random_state = 0)
rf_regressor.fit(x_train,y_train)
#Predicting the SalePrices using test set 
y_pred = rf_regressor.predict(x_test)
mae=metrics.mean_absolute_error(y_test, y_pred)
mse=metrics.mean_squared_error(y_test, y_pred)
# Printing the metrics
print('Suppport Vector Regression Accuracy: ', rf_regressor.score(x_test,y_test))
print('R2 square:',metrics.r2_score(y_test, y_pred))
print('MAE: ', mae)
print('MSE: ', mse)

Application of Support Vector Regression

from sklearn.svm import SVR
regressor= SVR(kernel='rbf')
regressor.fit(x_train,y_train)
y_pred_svm=regressor.predict(x_test)
#y_pred_svm = cross_val_predict(regressor, x, y)
mae=metrics.mean_absolute_error(y_test, y_pred_svm)
mse=metrics.mean_squared_error(y_test, y_pred_svm)
# Printing the metrics
print('Suppport Vector Regression Accuracy: ', regressor.score(x_test,y_test))
print('R2 square:',metrics.r2_score(y_test, y_pred_svm))
print('MAE: ', mae)
print('MSE: ', mse)

Application of Deep Learning using Keras library

Here is the deep learning model mentioned. A sequential model has been used. The model has been created as a function named build_model so that we can call it anytime it is required in the process. The model has two connected hidden layers with a Rectified Linear Unit (relu) function and an output layer with a linear function.

The hidden layers have 12 and 8 neurons respectively with all the 8 input variables. Mean Squared Error is the loss function here as it is the most common loss function in case of regression problems.

def build_model():
  model = keras.Sequential([
    layers.Dense(12,kernel_initializer='normal', activation='relu', input_shape=[len(train_dataset.keys())]),
    layers.Dense(8, activation='relu'),
    layers.Dense(1, activation='linear')
  ])

  optimizer = tf.keras.optimizers.RMSprop(0.001)

  model.compile(loss='mse',
                optimizer=optimizer,
                metrics=['mae', 'mse'])
  return model

model = build_model()

Displaying the model summary

This part of code is to show the summary of model we built. All the specifications mentioned above has been shown in the below screenshot of the output.

model.summary()

Training the model

We have used 10 rows of the training data set to check the model performance. As the result seems satisfactory so, we will proceed with the same model.

example_batch = normed_train_data[:10]
example_result = model.predict(example_batch)
example_result

Fitting the model

Now we will fit the model with 1000 epochs and store the model training and validation accuracy in the object named history.

EPOCHS = 1000

history = model.fit(
  normed_train_data, train_labels,
  epochs=EPOCHS, validation_split = 0.2, verbose=0,
  callbacks=[tfdocs.modeling.EpochDots()])

Here we will produce a glimpse of the history stats to understand how the training process progresses.

hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

Plotting the MAE score during the training process

As we are using 1000 epochs to train the model. It necessarily suggests that there are 1000 forward and backward passes while the model is trained. And we expect that with each passes the the loss will decrease and model’s prediction accuracy will increase as the training process progresses.

plotter = tfdocs.plots.HistoryPlotter(smoothing_std=2)

plotter.plot({'Basic': history}, metric = "mae")
plt.ylim([0, 10000])
plt.ylabel('MAE [sales]')

In the above plot, we can see that both the training and validation loss decreases in a exponential fashion with the increase in number of epochs.

test_predictions = model.predict(normed_test_data).flatten()
a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [sales]')
plt.ylabel('Predictions [sales]')
lims = [0, 40000]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims, lims)

Plotting the result

Here we have plotted the predicted sale prices against the true sale prices. And from the plot it is clear that the estimate is quite close to the original one.

Original Vs predicted values of sale price of cars

Plotting the error

error = test_predictions - test_labels
plt.hist(error, bins = 125)
plt.xlabel("Prediction Error [Average_Fruit_fly_population]")
_ = plt.ylabel("Count")

Here we have plotted the error. Although the distribution of error is not a true Gaussian, but as the sample size increases, we can expect it will tend to a Gaussian distribution.

mae=metrics.mean_absolute_error(test_labels, test_predictions)
mse=metrics.mean_squared_error(test_labels, test_predictions)
# Printing the metrics
#print('Suppport Vector Regression Accuracy: ', lin_reg_pl.score(x_poly,y_train))
print('R2 square:',metrics.r2_score(test_labels, test_predictions))
print('MAE: ', mae)
print('MSE: ', mse)

Conclusion

So, here we can compare the performance of all the models using the metrics calculated. Let’s see all the models used to predict the car sale price together along with the metrics for the ease of comparison.

Model type	MAE	R Square
MLR	2821	0.80
Decision Tree	2211	0.84
Random Forest	1817	0.88
Support Vector Machine	7232	0
Deep learning/ANN	2786	0.8

Comparison table for all the models used

From the table above, it is clear that for the present problem, the best performing model is Random Forest with the highest R square (Coefficient of Determination) and least MAE. But we have to keep in mind that the deep learning is also not far behind with respect to the metrics. And the beauty of deep learning is that with the increase in the training sample size, the accuracy of the model also increases.

Whereas in case of other models after a certain phase it attains a plateau in terms of model prediction accuracy. Even increasing training sample size also can not further improve the model’s performance. So, although deep learning occupies the third position in present situation, it has the potential to improve itself further if availability of training data is not a constrain.

If the data set is small and we need a good prediction for the response variable as the case here; it is a good idea to go for models like Random Forest or Decision tree. As they are capable of generating good prediction with lesser training data or labelled data.

So, finally it is the call of the researcher or modeler to select the best suited model judging his situation and field of knowledge. As different fields of science generate experimental data with distinct nature and a very good model of another field may fail completely to predict.

References

https://www.tensorflow.org/
https://www.deeplearning.ai
https://machinelearningmastery.com
https://datascienceplus.com

Comparing the performance of different machine learning algorithms

May 10, 2021June 15, 2020 by Dibyendu Deb

Comparing Machine Learning Algorithms (MLAs) are important to come out with the best-suited algorithm for a particular problem. This post discusses comparing different machine learning algorithms and how we can do this using scikit-learn package of python. You will learn how to compare multiple MLAs at a time using more than one fit statistics provided by scikit-learn and also creating plots to visualize the differences.

Machine Learning Algorithms (MLA) are very popular to solve different computational problems. Especially when the data set is huge and complex with no parameters known MLAs are like blessings to data scientists. The algorithms quickly analyze the data to learn the dependencies and relations between the variables and produce estimation with lot more accuracy than the conventional regression models.

Most common and frequently used machine learning models are supervised models. These models tend to learn about the data from experience. Its like the labelled data acts as teacher to train it to be perfect. As the training data size increases the model estimation gets more accurate.

Here are some recommended articles to know the basics of machine learning

NB: Being a non-native English speaker, I always take extra care to proofread my articles with Grammarly. It is the best grammar and spellchecker available online. Read here my review of using Grammarly for more than two years.

Try Grammarly for free here

Why we should compare machine learning algorithms

Other types of MLAs are the unsupervised and semi-supervised type which are helpful when the training data is not available and still we have to make some estimation. As these models are not trained using labelled data set naturally, these algorithms are not as accurate as supervised ones. But still, they have their own advantages.

All these MLAs are useful depending on situations and data types and to have the best estimation. That’s why selecting a particular MLA is essential to come with a good estimation. There are several parameters which we need to compare to judge the best model. After that, the best found model need to be tested on an independent data set for its performance. Visualization of the performance is also a good way to compare between the models quickly.

So, here we will compare most of the MLAs using resampling methods like cross validation technique using scikit-learn package of python. And then model fit statistics like accuracy, precision, recall value etc will be calculated for comparison. ROC (Receiver Operating Characteristic) curve is also a easy to understand process for MLA comparison; so finally in a single figure all ROCs will be put to for the ease of model comparison.

Data set used

The same data set used here for application of all the MLAs. The example dataset I have used here for demonstration purpose is from kaggle.com. The data collected by “National Institute of Diabetes and Digestive and Kidney Diseases” contains vital parameters of diabetes patients belong to Pima Indian heritage.

Here is a glimpse of the first ten rows of the data set:

Diabetes data set for logistic regression — Diabetes data set for ANN

The data set has independent variables as several physiological parameters of a diabetes patient. The dependent variable is if the patient is suffering from diabetes or not. Here the dependent column contains binary variable 1 indicating the person is suffering from diabetes and 0 he is not a patient of diabetes.

Code for comparing different machine learning algorithms

Lets jump to coding part. It is going to be a little lengthy code and a lot of MLAs will be compared. So, I have break down the complete code in segments. You can directly copy and pest the code and make little changes to suit your data.

Importing required packages

The first part is to load all the packages needed in this comparison. Besides the basic packages like pandas, numpy, matplotlib we will import some of the scikit-learn packages for application of the MLAs and their comparison.

#Importing basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Importing sklearn modules
from sklearn.metrics import mean_squared_error,confusion_matrix, precision_score, recall_score, auc,roc_curve
from sklearn import ensemble, linear_model, neighbors, svm, tree, neural_network
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn import svm,model_selection, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Importing the data set and checking if there is any NULL values

This part of code will load the diabetes data set and check for any null values in the data frame.

#Loading the data and checking for missing values
dataset=pd.read_csv('diabetes.csv')
dataset.isnull().sum()

Checking the data set for any NULL values is very essential, as MLAs can not handle NULL values. We have to either eliminate the records with NULL values or replace them with the mean/median of the other values. we can see each of the variables are printed with number of null values. This data set has no null values so all are zero here.

Storing the independent and dependent variables

As we can see that the data frame contains nine variables in nine columns. The first eight columns contain the independent variables. These are some physiological variables having a correlation with diabetes symptoms. The ninth column shows if the patient is diabetic or not. So, here the x stores the independent variables and y stores the dependent variable diabetes count.

# Creating variables for analysis
x=dataset.iloc[:,: -1]
y=dataset.iloc[:,-1]

Splitting the data set

Here the data set has been divided into train and test data set. The test data set size is 20% of the total records. This test data will not be used in model training and work as an independent test data.

# Splitting train and split data
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.2, random_state=0)

Storing machine learning algorithms (MLA) in a variable

Some very popular MLAs we have selected here for comparison and stored in a variable; so that they can be used at later part of the process. The MLAs first we have taken up for comparison are Logistic Regression, Linear Discriminant Analysis, K-nearest neighbour classifier, Decision tree classifier, Naive-Bayes classifier and Support Vector Machine.

# Application of all Machine Learning methods
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

Creating a box plot to compare there accuracy

This part of code creates a box plot for all the models against their cross validation score.

# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
	kfold = model_selection.KFold(n_splits=10, random_state=seed)
	cv_results = model_selection.cross_val_score(model, x_train, y_train, cv=kfold, scoring=scoring)
	results.append(cv_results)
	names.append(name)
	msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
	print(msg)
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Comparison between different MLAs')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

The cross validation score are printed below and it is clearly suggesting Logistic Regression and Linear Discriminant Analysis to the two most accurate MLAs.

Below is a box-whisker plot to visualize the same result.

Comparing all machine learning algorithms

# Application of all Machine Learning methods
MLA = [
    #GLM
    linear_model.LogisticRegressionCV(),
    linear_model.PassiveAggressiveClassifier(),
    linear_model. RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),
    
    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    #Gaussian Processes
    gaussian_process.GaussianProcessClassifier(),
    
    #SVM
    svm.SVC(probability=True),
    svm.NuSVC(probability=True),
    svm.LinearSVC(),
    
    #Trees    
    tree.DecisionTreeClassifier(),
  
    #Navies Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),
    
    #Nearest Neighbor
    neighbors.KNeighborsClassifier(),
    ]

MLA_columns = []
MLA_compare = pd.DataFrame(columns = MLA_columns)

row_index = 0
for alg in MLA:  
    
    predicted = alg.fit(x_train, y_train).predict(x_test)
    fp, tp, th = roc_curve(y_test, predicted)
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index,'MLA used'] = MLA_name
    MLA_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(x_train, y_train), 4)
    MLA_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(x_test, y_test), 4)
    MLA_compare.loc[row_index, 'Precission'] = precision_score(y_test, predicted)
    MLA_compare.loc[row_index, 'Recall'] = recall_score(y_test, predicted)
    MLA_compare.loc[row_index, 'AUC'] = auc(fp, tp)

    row_index+=1
    
MLA_compare.sort_values(by = ['MLA Test Accuracy'], ascending = False, inplace = True)    
MLA_compare

Comparison of all machine learning algorithms

# Creating plot to show the train accuracy
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Train Accuracy",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA Train Accuracy Comparison')
plt.show()

# Creating plot to show the test accuracy
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Test Accuracy",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('Accuraccy of different machine learning models')
plt.show()

Accuracy of different machine learning algorithms

# Creating plots to compare precission of the MLAs
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Precission",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('Comparing different Machine Learning Models')
plt.show()

Comparing different machine learning algorithms

# Creating plots for MLA recall comparison
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Recall values",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA Recall Comparison')
plt.show()

Recall comparison of all Machine learning algorithms

# Creating plot for MLA AUC comparison
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="AUC values",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA AUC Comparison')
plt.show()

Creating ROC for all the models applied

Receiver Operating Characteristic (ROC) curve is a very important tool to diagnose the performance of MLAs by plotting the true positive rates against the false-positive rates at different threshold levels. The area under ROC curve often called AUC and it is also a good measure of the predictability of the machine learning algorithms. A higher AUC is an indication of more accurate prediction.

# Creating plot to show the ROC for all MLA
index = 1
for alg in MLA:
    
    
    predicted = alg.fit(x_train, y_train).predict(x_test)
    fp, tp, th = roc_curve(y_test, predicted)
    roc_auc_mla = auc(fp, tp)
    MLA_name = alg.__class__.__name__
    plt.plot(fp, tp, lw=2, alpha=0.3, label='ROC %s (AUC = %0.2f)'  % (MLA_name, roc_auc_mla))
   
    index+=1

plt.title('ROC Curve comparison')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.plot([0,1],[0,1],'r--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')    
plt.show()

Conclusion

This post presents a detailed discussion on how we can compare several machine learning algorithms at a time to fund out the best one. The comparison task has been completed using different functions of scikit-learn package of python. We took help of some popular fit statistics to draw a comparison between the models. Additionally, the Receiver Operating Characteristic (ROC) is also a good measure of comparing several MLAs.

I hope this guide will help you to conclude your problem in hand and to proceed with the best MLA chosen through a rigorous comparison method. Please feel free to try the python code given here, copy-pest the code in your python compiler, run and apply on your data. In case of any problem faced in executing the comparison process write me in the comment below.