Comparing machine learning models using Python

Comparing Machine Learning Algorithms (MLAs) is important for finding the best-suited algorithm for a particular problem. This post discusses how to compare different machine learning algorithms using the scikit-learn package of Python. You will learn how to compare multiple MLAs at a time using more than one fit statistic provided by scikit-learn, and how to create plots to visualize the differences.

Machine Learning Algorithms (MLAs) are very popular for solving different computational problems. Especially when the data set is huge and complex and its underlying parameters are unknown, MLAs are a blessing to data scientists. The algorithms analyze the data to learn the dependencies and relations between the variables and can often produce estimates with more accuracy than conventional regression models.

The most common and frequently used machine learning models are supervised models. These models learn about the data from experience: the labelled data acts as a teacher that trains the model. As the training data size increases, the model estimates generally get more accurate.

Why we should compare machine learning algorithms

Other types of MLAs are the unsupervised and semi-supervised types, which are helpful when labelled training data is not available and we still have to make some estimation. As these models are not trained using a labelled data set, they are naturally not as accurate as supervised ones at predicting a known outcome. Still, they have their own advantages.

All these MLAs are useful depending on the situation and the data type. That is why selecting a particular MLA is essential for arriving at a good estimation. There are several metrics which we need to compare to judge the best model. After that, the best model found needs to be tested on an independent data set for its performance. Visualizing the performance is also a good way to compare the models quickly.

So, here we will compare most of the MLAs using a resampling method, cross-validation, with the scikit-learn package of Python. Model fit statistics like accuracy, precision and recall will then be calculated for comparison. The ROC (Receiver Operating Characteristic) curve is also an easy-to-understand tool for MLA comparison, so finally all the ROC curves will be put in a single figure for ease of model comparison.
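
To make the fit statistics mentioned above concrete, here is a minimal sketch that derives accuracy, precision and recall from a confusion matrix. The ten labels are hypothetical and are not taken from the diabetes data set; they only illustrate how the numbers relate to each other.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for ten patients (1 = diabetic, 0 = not diabetic)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that were found

print("accuracy=%.2f precision=%.2f recall=%.2f" % (accuracy, precision, recall))
# prints: accuracy=0.80 precision=0.75 recall=0.75

# The hand-computed values agree with the scikit-learn helpers
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```

A model can have high accuracy and still miss many positive cases, which is why looking at recall and precision alongside accuracy is useful.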

Data set used

The same data set is used here for all the MLAs. The example dataset I have used here for demonstration purposes is from kaggle.com. The data, collected by the “National Institute of Diabetes and Digestive and Kidney Diseases”, contains vital parameters of diabetes patients of Pima Indian heritage.

Here is a glimpse of the first ten rows of the data set:

Figure: the first ten rows of the diabetes data set

The independent variables of the data set are several physiological parameters of a patient. The dependent variable indicates whether the patient is suffering from diabetes or not. The dependent column is binary: 1 indicates that the person has diabetes and 0 that the person does not.

Code for comparing different machine learning algorithms

Let's jump to the coding part. The code is going to be a little lengthy, as a lot of MLAs will be compared, so I have broken down the complete code into segments. You can directly copy and paste the code and make small changes to suit your data.

Importing required packages

The first part is to load all the packages needed in this comparison. Besides the basic packages like pandas, numpy and matplotlib, we will import some of the scikit-learn modules for applying the MLAs and comparing them.

#Importing basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Importing sklearn modules (used as module.ClassName in the larger comparison)
from sklearn import svm, model_selection, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
#Fit statistics and data splitting helpers
from sklearn.metrics import confusion_matrix, precision_score, recall_score, auc, roc_curve
from sklearn.model_selection import train_test_split, cross_val_score
#Classifiers used by name in the first comparison
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Importing the data set and checking if there are any NULL values

This part of code will load the diabetes data set and check for any null values in the data frame.

#Loading the data and checking for missing values
dataset=pd.read_csv('diabetes.csv')
dataset.isnull().sum()

Checking the data set for NULL values is essential, as most MLAs cannot handle them. We have to either eliminate the records with NULL values or replace them with the mean/median of the other values. The output lists each variable along with its number of null values. This data set has no null values, so all the counts are zero here.
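
The diabetes data set needs no cleaning, but the two options mentioned above can be sketched on a small toy data frame. The column names and values below are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0],
    "BMI": [33.6, np.nan, 23.3, 28.1],
})

# Count the missing values per column, as done for the diabetes data
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Option 1: eliminate every record that has at least one missing value
dropped = toy.dropna()

# Option 2: replace each missing value with the median of its column
filled = toy.fillna(toy.median())
print(filled)
```

Dropping records is the simpler choice but loses data (here two of the four rows), while filling with the median keeps every record at the cost of introducing estimated values.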

Storing the independent and dependent variables

As we can see, the data frame contains nine variables in nine columns. The first eight columns contain the independent variables. These are physiological variables that are correlated with diabetes. The ninth column shows whether the patient is diabetic or not. So here x stores the independent variables and y stores the dependent variable, the diabetes outcome.

# Creating variables for analysis
x = dataset.iloc[:, :-1]   # all columns except the last one
y = dataset.iloc[:, -1]    # the last column

Splitting the data set

Here the data set is divided into train and test data sets. The test data set size is 20% of the total records. This test data will not be used in model training and will serve as independent test data.

# Splitting train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
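
When the two classes are not equally frequent, a purely random split can leave the test set with a different share of positive cases than the training set. As an optional variation, and not what the code above does, train_test_split accepts a stratify argument that keeps the class proportions the same in both parts. The sketch below uses hypothetical data with 35 positive cases out of 100.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 records with two features, 35 of them positive
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 35 + [0] * 65)

# stratify=y_demo keeps the 35% positive share in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("train positives:", y_tr.sum(), "of", len(y_tr))
print("test positives:", y_te.sum(), "of", len(y_te))
```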

Storing machine learning algorithms (MLA) in a variable

We have selected some very popular MLAs for comparison and stored them in a variable so that they can be used later in the process. The MLAs we take up first are Logistic Regression, Linear Discriminant Analysis, K-nearest neighbour classifier, Decision tree classifier, Naive Bayes classifier and Support Vector Machine.

# Storing the models to compare as (short name, estimator) pairs
models = []
# max_iter is raised from its default (an assumption) to help the solver converge
models.append(('LR', linear_model.LogisticRegression(max_iter=1000)))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

Creating a box plot to compare their accuracy

This part of the code creates a box plot of the cross-validation scores of all the models.

# evaluate each model in turn
seed = 7  # arbitrary seed (an assumption); any fixed integer makes the folds reproducible
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    # shuffle=True is required when random_state is set
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, x_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    # print the mean and standard deviation of the ten fold accuracies
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Comparison between different MLAs')
ax = fig.add_subplot(111)
ax.boxplot(results)
ax.set_xticklabels(names)
plt.show()

The cross-validation scores are printed below, and they clearly suggest Logistic Regression and Linear Discriminant Analysis to be the two most accurate MLAs.

Below is a box-whisker plot to visualize the same result.
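
The box plot compares accuracy only. If you want several fit statistics from the same cross-validation run, scikit-learn's cross_validate function accepts a list of scoring names. The following is a minimal sketch on synthetic data generated with make_classification, so the numbers it prints say nothing about the diabetes data; the choice of five stratified folds and of max_iter=1000 is an assumption made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic binary classification data with eight features (stand-in for real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Five stratified folds; shuffling with a fixed seed keeps the run reproducible
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "precision", "recall", "roc_auc"]

# One cross-validation run that scores every fold with all four metrics
scores = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring=metrics
)

# Average each metric over the folds
summary = {m: scores["test_" + m].mean() for m in metrics}
for m, value in summary.items():
    print("%s: %.3f" % (m, value))
```

Wrapping the call in a loop over the models list would give one such summary per MLA.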

Comparing all machine learning algorithms

# Application of all Machine Learning methods
MLA = [
    #GLM
    linear_model.LogisticRegressionCV(),
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),

    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    #Gaussian Processes
    gaussian_process.GaussianProcessClassifier(),

    #SVM
    svm.SVC(probability=True),
    svm.NuSVC(probability=True),
    svm.LinearSVC(),

    #Trees
    tree.DecisionTreeClassifier(),

    #Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),

    #Nearest Neighbor
    neighbors.KNeighborsClassifier(),
    ]

# Collecting the fit statistics of each model, one row per MLA
rows = []
for alg in MLA:
    # predict() returns hard 0/1 labels, so this ROC curve is built from a single threshold
    predicted = alg.fit(x_train, y_train).predict(x_test)
    fp, tp, th = roc_curve(y_test, predicted)
    rows.append({
        'MLA used': alg.__class__.__name__,
        'Train Accuracy': round(alg.score(x_train, y_train), 4),
        'Test Accuracy': round(alg.score(x_test, y_test), 4),
        'Precision': precision_score(y_test, predicted),
        'Recall': recall_score(y_test, predicted),
        'AUC': auc(fp, tp),
    })

# Building the comparison table and sorting it by test accuracy
MLA_compare = pd.DataFrame(rows)
MLA_compare.sort_values(by=['Test Accuracy'], ascending=False, inplace=True)
MLA_compare

Comparison of all machine learning algorithms

# Creating plot to show the train accuracy
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Train Accuracy",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA Train Accuracy Comparison')
plt.show()

# Creating plot to show the test accuracy
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Test Accuracy",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('Accuracy of different machine learning models')
plt.show()

Accuracy of different machine learning algorithms

# Creating plots to compare precision of the MLAs
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Precision",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA Precision Comparison')
plt.show()

Comparing different machine learning algorithms

# Creating plots for MLA recall comparison
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Recall",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA Recall Comparison')
plt.show()

Recall comparison of all Machine learning algorithms

# Creating plot for MLA AUC comparison
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="AUC",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA AUC Comparison')
plt.show()

Creating ROC for all the models applied

The Receiver Operating Characteristic (ROC) curve is a very important tool for diagnosing the performance of MLAs. It plots the true positive rate against the false positive rate at different threshold levels. The area under the ROC curve is often called AUC, and it is also a good measure of the predictive power of a machine learning algorithm. A higher AUC indicates more accurate prediction.
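
To illustrate what "different threshold levels" means, the sketch below compares a ROC curve built from hard 0/1 predictions with one built from predicted probabilities. It uses synthetic data from make_classification and a logistic regression with max_iter=1000, both assumptions made only for this example. Hard labels give the curve a single real threshold, whereas probabilities let roc_curve sweep over many thresholds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for real data)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC from hard 0/1 labels: at most three points on the curve
fpr_lab, tpr_lab, thr_lab = roc_curve(y_te, clf.predict(X_te))

# ROC from the predicted probability of the positive class: many thresholds
proba = clf.predict_proba(X_te)[:, 1]
fpr_sc, tpr_sc, thr_sc = roc_curve(y_te, proba)
auc_scores = auc(fpr_sc, tpr_sc)

print("points on curve from labels:", len(thr_lab))
print("points on curve from probabilities:", len(thr_sc))
print("AUC from probabilities: %.3f" % auc_scores)

# roc_auc_score computes the same area directly from the scores
print("roc_auc_score: %.3f" % roc_auc_score(y_te, proba))
```

Not every classifier offers predict_proba; for those that do not, the output of decision_function can usually be passed to roc_curve in the same way.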

# Creating plot to show the ROC for all MLA
for alg in MLA:
    # predict() returns hard 0/1 labels, so each curve is built from a single threshold
    predicted = alg.fit(x_train, y_train).predict(x_test)
    fp, tp, th = roc_curve(y_test, predicted)
    roc_auc_mla = auc(fp, tp)
    MLA_name = alg.__class__.__name__
    plt.plot(fp, tp, lw=2, alpha=0.3, label='ROC %s (AUC = %0.2f)' % (MLA_name, roc_auc_mla))

plt.title('ROC Curve comparison')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
# dashed diagonal: the ROC curve of a random classifier
plt.plot([0,1],[0,1],'r--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Conclusion

This post presents a detailed discussion of how we can compare several machine learning algorithms at a time to find out the best one. The comparison has been carried out using different functions of the scikit-learn package of Python. We used some popular fit statistics to draw a comparison between the models. Additionally, the Receiver Operating Characteristic (ROC) curve is a good tool for comparing several MLAs.

I hope this guide will help you to solve the problem at hand and to proceed with the best MLA chosen through a rigorous comparison. Please feel free to try the Python code given here: copy and paste it into your Python environment, run it and apply it to your data. If you face any problem in executing the comparison, write to me in the comments below.
