Comparing machine learning algorithms (MLAs) is important for finding the best-suited algorithm for a particular problem. This post discusses how to compare different machine learning algorithms using Python's scikit-learn package. You will learn how to compare multiple MLAs at a time using more than one fit statistic provided by scikit-learn, and how to create plots to visualize the differences.
Machine Learning Algorithms (MLAs) are very popular for solving different computational problems. Especially when the data set is huge and complex with no known parameters, MLAs are a blessing to data scientists. The algorithms quickly analyze the data, learn the dependencies and relations between the variables, and produce estimates, often with greater accuracy than conventional regression models.
The most common and frequently used machine learning models are supervised models. These models learn about the data from experience: the labelled data acts as a teacher that trains them. As the size of the training data increases, the model's estimates become more accurate.
Here are some recommended articles covering the basics of machine learning:
- Machine learning with python: a detailed discussion
- Supervised machine learning
- Unsupervised machine learning
- Getting started with python and IDE
Why we should compare machine learning algorithms
Other types of MLAs are unsupervised and semi-supervised algorithms, which are helpful when labelled training data is not available and we still have to make an estimation. As these models are not trained on a labelled data set, they are naturally not as accurate as supervised ones, but they still have their own advantages.
All of these MLAs are useful depending on the situation and the type of data, and the goal is the best possible estimation. That is why selecting a particular MLA is essential for a good estimate. There are several metrics we need to compare to judge which model is best. After that, the best model needs to be tested on an independent data set to assess its performance. Visualizing the performance is also a good way to compare models quickly.
So, here we will compare most of the MLAs using a resampling method, namely cross-validation, with Python's scikit-learn package. Model fit statistics such as accuracy, precision and recall will then be calculated for comparison. The ROC (Receiver Operating Characteristic) curve is also an easy-to-understand tool for MLA comparison, so finally all ROC curves will be put in a single figure for easy model comparison.
Data set used
The same data set is used here for all the MLAs. The example data set used for demonstration comes from kaggle.com. The data, collected by the National Institute of Diabetes and Digestive and Kidney Diseases, contains vital parameters of diabetes patients of Pima Indian heritage.
Here is a glimpse of the first ten rows of the data set:
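If you want to reproduce this view yourself once the CSV has been read into a pandas data frame (the loading code appears a few sections below), a one-liner does it:

# Print the first ten rows of the data frame
dataset.head(10)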
The independent variables of the data set are several physiological parameters of a diabetes patient. The dependent variable indicates whether the patient is suffering from diabetes or not: the dependent column contains a binary variable, with 1 meaning the person is diabetic and 0 meaning the person is not.
Code for comparing different machine learning algorithms
Let's jump to the coding part. It is going to be a fairly lengthy piece of code, and a lot of MLAs will be compared, so I have broken the complete code into segments. You can directly copy and paste the code and make small changes to suit your data.
Importing required packages
The first part loads all the packages needed for this comparison. Besides basic packages like pandas, numpy and matplotlib, we import several scikit-learn modules for applying the MLAs and comparing them.
# Importing basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing sklearn modules
from sklearn.metrics import mean_squared_error, confusion_matrix, precision_score, recall_score, auc, roc_curve
from sklearn import ensemble, linear_model, neighbors, svm, tree, naive_bayes, gaussian_process, model_selection
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
Importing the data set and checking for NULL values
This part of the code loads the diabetes data set and checks for any null values in the data frame.
# Loading the data and checking for missing values
dataset = pd.read_csv('diabetes.csv')
dataset.isnull().sum()
Checking the data set for NULL values is essential, as MLAs cannot handle them. We either have to eliminate the records with NULL values or replace them with the mean/median of the other values. Each of the variables is printed with its number of null values; this data set has no null values, so all counts are zero.
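This particular data set is already complete, but if your own data did contain missing values, a minimal sketch of the replacement approach mentioned above (median imputation of the numeric columns) would be:

# Only needed when isnull().sum() reports non-zero counts:
# replace missing values with each column's median
dataset = dataset.fillna(dataset.median(numeric_only=True))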
Storing the independent and dependent variables
As we can see, the data frame contains nine variables in nine columns. The first eight columns contain the independent variables, physiological variables that correlate with diabetes symptoms. The ninth column shows whether the patient is diabetic or not. So, here x stores the independent variables and y stores the dependent variable, the diabetes outcome.
# Creating variables for analysis
x = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]
Splitting the data set
Here the data set is divided into train and test sets. The test set is 20% of the total records. This test data is not used in model training and serves as an independent test set.
# Splitting train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
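The split above is purely random. As an optional refinement, not used in the original code, you can pass stratify=y so that the 20% test set keeps the same proportion of diabetic and non-diabetic patients as the full data:

# Optional: stratified split that preserves the class proportions of y
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0, stratify=y)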
Storing machine learning algorithms (MLA) in a variable
Some very popular MLAs have been selected here for comparison and stored in a variable so that they can be used in a later part of the process. The MLAs we first take up for comparison are Logistic Regression, Linear Discriminant Analysis, the K-nearest neighbour classifier, the Decision tree classifier, the Naive Bayes classifier and the Support Vector Machine.
# Storing the selected Machine Learning methods
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
Creating a box plot to compare their accuracy
This part of the code creates a box plot of the cross-validation scores of all the models.
# Evaluate each model in turn with 10-fold cross validation
results = []
names = []
scoring = 'accuracy'
seed = 7
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, x_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# Boxplot for algorithm comparison
fig = plt.figure()
fig.suptitle('Comparison between different MLAs')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
The cross-validation scores are printed below, and they clearly suggest that Logistic Regression and Linear Discriminant Analysis are the two most accurate MLAs.
Below is a box-whisker plot to visualize the same result.
Comparing all machine learning algorithms
# Application of all Machine Learning methods
MLA = [
    # Generalized Linear Models
    linear_model.LogisticRegressionCV(),
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),

    # Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    # Gaussian Processes
    gaussian_process.GaussianProcessClassifier(),

    # SVM
    svm.SVC(probability=True),
    svm.NuSVC(probability=True),
    svm.LinearSVC(),

    # Trees
    tree.DecisionTreeClassifier(),

    # Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),

    # Nearest Neighbors
    neighbors.KNeighborsClassifier(),
]
# Comparing all MLAs on train and test data
MLA_columns = ['MLA used', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'AUC']
MLA_compare = pd.DataFrame(columns=MLA_columns)

row_index = 0
for alg in MLA:
    predicted = alg.fit(x_train, y_train).predict(x_test)
    fp, tp, th = roc_curve(y_test, predicted)
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA used'] = MLA_name
    MLA_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(x_train, y_train), 4)
    MLA_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(x_test, y_test), 4)
    MLA_compare.loc[row_index, 'Precision'] = precision_score(y_test, predicted)
    MLA_compare.loc[row_index, 'Recall'] = recall_score(y_test, predicted)
    MLA_compare.loc[row_index, 'AUC'] = auc(fp, tp)
    row_index += 1

MLA_compare.sort_values(by=['Test Accuracy'], ascending=False, inplace=True)
MLA_compare
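The confusion_matrix function imported at the beginning is not used in the table above. If you also want the raw confusion matrix for any single model, a short sketch (using the cross-validated logistic regression as an example) looks like this:

# Confusion matrix of one model on the test set:
# rows are the true classes, columns the predicted classes
best_model = linear_model.LogisticRegressionCV().fit(x_train, y_train)
print(confusion_matrix(y_test, best_model.predict(x_test)))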
# Creating plot to show the train accuracy
plt.subplots(figsize=(13, 5))
sns.barplot(x="MLA used", y="Train Accuracy", data=MLA_compare, palette='hot', edgecolor=sns.color_palette('dark', 7))
plt.xticks(rotation=90)
plt.title('MLA Train Accuracy Comparison')
plt.show()
# Creating plot to show the test accuracy
plt.subplots(figsize=(13, 5))
sns.barplot(x="MLA used", y="Test Accuracy", data=MLA_compare, palette='hot', edgecolor=sns.color_palette('dark', 7))
plt.xticks(rotation=90)
plt.title('Accuracy of different machine learning models')
plt.show()
# Creating plot to compare precision of the MLAs
plt.subplots(figsize=(13, 5))
sns.barplot(x="MLA used", y="Precision", data=MLA_compare, palette='hot', edgecolor=sns.color_palette('dark', 7))
plt.xticks(rotation=90)
plt.title('Comparing precision of different Machine Learning Models')
plt.show()
# Creating plot for MLA recall comparison
plt.subplots(figsize=(13, 5))
sns.barplot(x="MLA used", y="Recall", data=MLA_compare, palette='hot', edgecolor=sns.color_palette('dark', 7))
plt.xticks(rotation=90)
plt.title('MLA Recall Comparison')
plt.show()
# Creating plot for MLA AUC comparison
plt.subplots(figsize=(13, 5))
sns.barplot(x="MLA used", y="AUC", data=MLA_compare, palette='hot', edgecolor=sns.color_palette('dark', 7))
plt.xticks(rotation=90)
plt.title('MLA AUC Comparison')
plt.show()
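The five bar plots above can also be condensed into a single figure. This is not part of the original workflow, just a sketch that uses pandas' melt to reshape MLA_compare into long format:

# Optional: show all fit statistics in one grouped bar plot
MLA_long = MLA_compare.melt(id_vars='MLA used', var_name='Metric', value_name='Score')
MLA_long['Score'] = MLA_long['Score'].astype(float)   # metrics were stored as objects
plt.subplots(figsize=(13, 5))
sns.barplot(x='MLA used', y='Score', hue='Metric', data=MLA_long)
plt.xticks(rotation=90)
plt.title('All fit statistics side by side')
plt.show()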
Creating ROC for all the models applied
The Receiver Operating Characteristic (ROC) curve is a very important tool for diagnosing the performance of MLAs: it plots the true positive rate against the false positive rate at different threshold levels. The area under the ROC curve, often called the AUC, is also a good measure of the predictive power of a machine learning algorithm; a higher AUC indicates more accurate prediction.
# Creating plot to show the ROC for all MLA
index = 1
for alg in MLA:
    predicted = alg.fit(x_train, y_train).predict(x_test)
    fp, tp, th = roc_curve(y_test, predicted)
    roc_auc_mla = auc(fp, tp)
    MLA_name = alg.__class__.__name__
    plt.plot(fp, tp, lw=2, alpha=0.3, label='ROC %s (AUC = %0.2f)' % (MLA_name, roc_auc_mla))
    index += 1

plt.title('ROC Curve comparison')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
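One caveat: the curves above are computed from hard 0/1 predictions, so each one effectively has a single operating point. For smoother curves you can score class probabilities instead. The sketch below assumes each classifier exposes either predict_proba (the SVMs do because probability=True was set) or decision_function:

# Sketch: ROC curves from predicted probabilities / decision scores
for alg in MLA:
    alg.fit(x_train, y_train)
    if hasattr(alg, "predict_proba"):
        scores = alg.predict_proba(x_test)[:, 1]   # probability of the diabetic class
    else:
        scores = alg.decision_function(x_test)     # e.g. RidgeClassifierCV, LinearSVC, Perceptron
    fp, tp, th = roc_curve(y_test, scores)
    plt.plot(fp, tp, lw=2, alpha=0.3,
             label='ROC %s (AUC = %0.2f)' % (alg.__class__.__name__, auc(fp, tp)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title('ROC curves from probability scores')
plt.show()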
This post presented a detailed discussion of how we can compare several machine learning algorithms at a time to find out the best one. The comparison task was completed using different functions of Python's scikit-learn package. We took the help of some popular fit statistics to draw a comparison between the models. Additionally, the Receiver Operating Characteristic (ROC) curve is a good measure for comparing several MLAs.
I hope this guide helps you tackle the problem at hand and proceed with the best MLA chosen through a rigorous comparison. Please feel free to try the Python code given here: copy-paste it into your Python environment, run it and apply it to your data. In case of any problem executing the comparison process, write to me in the comments below.