Naive Bayes classifier application using Python

The Naive Bayes classifier is a straightforward, easy-to-use and fast machine learning technique. It is one of the most popular supervised machine learning techniques for classifying high-dimensional data sets. In this article, you will get a thorough idea of how the algorithm works, along with a step-by-step implementation in Python. Naive Bayes is actually a simplified application of Bayes’ theorem, so we will cover that too.

“Under Bayes’ theorem, no theory is perfect. Rather, it is a work in progress, always subject to further refinement and testing.” ~ Nate Silver

In real life, classification problems are everywhere. We make many decisions in our daily lives by judging the probability of several factors, either consciously or unconsciously. When we need to analyse large amounts of data and take a decision on that basis, we need a tool. The Naive Bayes classifier is one of the simplest and fastest supervised learning algorithms, and it is often accurate enough in practice. So, it can make our lives far easier when taking vital decisions.

The concept of Bayes’ theorem

To understand the Naive Bayes classification concept, we first have to understand Bayes’ theorem. Bayes’ theorem describes the relationship between the conditional probabilities of different events: it gives the probability of a hypothesis given the information that some event has occurred.

For example, cricket lovers try to guess whether they will be able to play today depending on the weather variables. A banker tries to judge whether a customer is too risky to be given credit depending on his financial transaction history, and a businessman tries to judge whether his newly launched product is going to be a hit or a flop depending on the customers’ buying behaviour.

Models of this type, which deal with conditional probabilities, are called generative models. They are generative because they specify the hypothetical random process that generates the data. But training such a generative model in full generality is a very difficult task.

So how do we tackle this issue? Here comes the concept of the Naive Bayes classifier. It is called naive because it makes a very simple assumption about the Bayes model: within any class, the presence of one feature does not depend on any other feature. It simply overlooks the relationships between the features and considers that all the features contribute independently toward the target variable.
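In symbols, the independence assumption can be sketched as follows. This is the standard textbook form; the notation ($y$ for the class, $x_1, \dots, x_n$ for the features) is introduced here for illustration and is not taken from the article.

```latex
P(y \mid x_1, \dots, x_n) = \frac{P(y)\,P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
\quad\Longrightarrow\quad
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The left-hand expression is Bayes’ theorem; the right-hand one follows once the joint likelihood is replaced by a product of per-feature likelihoods, which is exactly the naive step.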

Consider a data set in which the feature variable is a test report with the values “positive” and “negative”, whereas the binomial target variable is “sick” with the values “yes” and “no”. Let us assume the data set has 20 cases of test results, which are shown below:

The data set

Creating a frequency table of the attributes of the data set

If we create the frequency table for the above data set, it will look like this:

The frequency table

With the help of this frequency table, we can now prepare the likelihood table to calculate the prior and posterior probabilities. See the figure below.

Calculating probabilities using likelihood tables

With the help of the above table, we can now calculate the probability that a person is really suffering from the disease when his test report is positive.

So the probability we want to compute is 

$$P(\text{yes} \mid \text{positive}) = \frac{P(\text{positive} \mid \text{yes})\,P(\text{yes})}{P(\text{positive})}$$

We have already calculated the probabilities, so, we can directly put the values in the above equation and get the probability we want to calculate.

$$P(\text{yes} \mid \text{positive}) = \frac{0.73 \times 0.55}{0.55} = 0.73$$

In the same fashion, we can also calculate the probability that a person does not have the disease in spite of the test report being positive.

$$P(\text{no} \mid \text{positive}) = \frac{P(\text{positive} \mid \text{no})\,P(\text{no})}{P(\text{positive})} = \frac{0.33 \times 0.45}{0.55} = 0.27$$
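The two calculations above can be checked with a few lines of Python. This is a minimal sketch that simply plugs in the probabilities quoted in the text; the function name `posterior` and the variable names are made up for illustration.

```python
# Probabilities quoted in the text (read from the likelihood table)
p_positive_given_yes = 0.73   # P(positive | yes)
p_positive_given_no = 0.33    # P(positive | no)
p_yes = 0.55                  # prior P(yes)
p_no = 0.45                   # prior P(no)
p_positive = 0.55             # evidence P(positive)


def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(class | test) = P(test | class) * P(class) / P(test)."""
    return likelihood * prior / evidence


p_yes_given_positive = posterior(p_positive_given_yes, p_yes, p_positive)
p_no_given_positive = posterior(p_positive_given_no, p_no, p_positive)

print(round(p_yes_given_positive, 2))
print(round(p_no_given_positive, 2))
```

The two posteriors add up to one, as they should, because “yes” and “no” are the only possible classes.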

Application of Naive Bayes’ classification with python

Now comes the most interesting part of the article. We will implement Naive Bayes classification using Python. To do that, we will use the popular scikit-learn library and its functions.

About the data

We will use the same diabetes data that we used earlier for other classification problems.

The purpose of using the same data for all classification problems is to let you compare the different algorithms. You can judge each algorithm by its accuracy in classifying the same data.

Here the target variable has two classes: whether the person has diabetes or not. On the other hand, we have eight independent (feature) variables influencing the target variable.

Importing required libraries

The first step in coding is to import all the libraries we are going to use. The basic libraries for any kind of data science project are pandas, numpy, matplotlib and so on. The purpose of these libraries is discussed at length in the article on simple linear regression with Python.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.naive_bayes import GaussianNB
import seaborn as sns 

About the data

The example dataset I have used here for demonstration purposes is from kaggle.com. The data, collected by the “National Institute of Diabetes and Digestive and Kidney Diseases”, contains vital parameters of diabetes patients of Pima Indian heritage.

Here is a glimpse of the first ten rows of the data set:

The diabetes data set

The independent variables in the data set are several physiological parameters of a diabetes patient. The dependent variable indicates whether the patient is suffering from diabetes or not. The dependent column is binary: 1 indicates that the person is suffering from diabetes and 0 that he is not.

dataset=pd.read_csv('diabetes.csv')
dataset.shape
dataset.head()
# Printing data details
dataset.info()                    # quick view of column types and non-null counts
print(dataset.head())             # first few rows of the data
print(dataset.tail())             # last few rows of the data
print(dataset.sample(10))         # a random sample of 10 rows from the data
print(dataset.describe())         # summary statistics of the data
print(pd.isnull(dataset).sum())   # number of null values in each column
Checking if the dataset has any null value

Creating variables

As we can see, the data frame contains nine variables in nine columns. The first eight columns contain the independent variables, which are physiological variables correlated with diabetes symptoms. The ninth column shows whether the patient is diabetic or not. So, here x stores the independent variables and y stores the dependent variable, the diabetes outcome.

x=dataset.iloc[:,: -1]
y=dataset.iloc[:,-1]

Splitting the data for training and testing

Here we will split the data set into training and testing sets in an 80:20 ratio. We will use the train_test_split function of the scikit-learn library. The test_size mentioned in the code decides what proportion of the data will be kept aside to test the trained model. The test data will remain unused in the training process and will act as independent data during testing.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y, test_size=0.2, random_state=0)
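To see what test_size and random_state do, here is a small sketch on toy data. The arrays `X_toy` and `y_toy` are hypothetical and only stand in for the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 rows, 2 features, binary labels
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.2 keeps 20% of the rows aside for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(np.array_equal(X_te, X_te2))
```

If the classes are imbalanced, passing stratify=y to train_test_split keeps the class proportions similar in both parts; whether that is needed depends on the data.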

Fitting the Naive Bayes’ model

Here we fit the model with the training set.

model=GaussianNB()
model.fit(x_train,y_train)
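To get a feel for what fitting a GaussianNB model estimates, the following sketch fits it on synthetic data and inspects the result. The data and the names `X_demo`, `y_demo` and `nb` are assumptions made for illustration; they are not the diabetes data used in this article.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic two-class data (hypothetical): class 0 centred at 0, class 1 centred at 3
X0 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
X1 = rng.normal(loc=3.0, scale=1.0, size=(50, 2))
X_demo = np.vstack([X0, X1])
y_demo = np.array([0] * 50 + [1] * 50)

nb = GaussianNB().fit(X_demo, y_demo)

print(nb.class_prior_)   # estimated prior probability of each class
print(nb.theta_)         # per-class mean of each feature

# Posterior class probabilities for two new points
proba = nb.predict_proba([[0.0, 0.0], [3.0, 3.0]])
print(proba)
```

The model stores a prior for each class and a mean (plus a variance) for each feature within each class; prediction combines them through Bayes’ theorem under the independence assumption.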

Using the Naive Bayes’ model for prediction

Now that the model has been fitted using the training set, we will use the test data to make predictions.

y_pred=model.predict(x_test)

Checking the accuracy of the fitted model

As we already have the observations corresponding to the test data set, we can compare them with the predictions to check how accurate the model is. Scikit-learn’s metrics module has a function called accuracy_score, which we will use here.

from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
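Accuracy alone can hide which kind of mistakes a classifier makes. As a hedged illustration, the sketch below computes accuracy together with the confusion matrix, precision and recall on a small hand-made example; the label lists are invented and are not output from the diabetes model.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hand-made labels (hypothetical), 1 = diabetic, 0 = not diabetic
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true_demo, y_pred_demo)
cm = confusion_matrix(y_true_demo, y_pred_demo)   # rows: true class, columns: predicted class
tn, fp, fn, tp = cm.ravel()
prec = precision_score(y_true_demo, y_pred_demo)  # tp / (tp + fp)
rec = recall_score(y_true_demo, y_pred_demo)      # tp / (tp + fn)

print("Accuracy:", acc)
print("Confusion matrix:")
print(cm)
print("Precision:", prec, "Recall:", rec)
```

The same calls can be applied to y_test and y_pred from the code above.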

Conclusion

So, we have completed the whole process of applying Naive Bayes classification using Python, and we have also gone through its basic concepts. It may be a little confusing at first. As you solve more practical problems with this technique, you will become more confident.

This classification technique is based on Bayes’ theorem. It gets the name “naive” from its simplifying assumption: the Naive Bayes classifier assumes that every pair of features is conditionally independent given the value of the target variable.

The Naive Bayes classifier can be a good choice for many types of classification problem, be it binomial or multinomial. The algorithm is extremely fast and straightforward, so it can help us take a quick decision. If the result of this classifier is accurate enough (which is often the case), then that is fine; otherwise we can always turn to other classifiers such as a decision tree or random forest.

I hope this article helps you gain a good understanding of Naive Bayes theory and its application to real-world problems. In case of any doubts or queries, please let me know through the comments below.

Comparing the performance of different machine learning algorithms

Comparing machine learning algorithms

Comparing Machine Learning Algorithms (MLAs) is important for arriving at the best-suited algorithm for a particular problem. This post discusses how to compare different machine learning algorithms using the scikit-learn package of Python. You will learn how to compare multiple MLAs at a time using more than one fit statistic provided by scikit-learn, and also how to create plots to visualize the differences.

Machine Learning Algorithms (MLAs) are very popular for solving different computational problems. Especially when the data set is huge and complex with no parameters known, MLAs are a blessing to data scientists. The algorithms quickly analyze the data to learn the dependencies and relations between the variables, and they often produce more accurate estimates than conventional regression models.

The most common and frequently used machine learning models are supervised models. These models learn about the data from experience: the labelled data acts as a teacher that trains the model. As the training data size increases, the model’s estimates generally get more accurate.

Why we should compare machine learning algorithms

Other types of MLAs are the unsupervised and semi-supervised types, which are helpful when labelled training data is not available and we still have to make some estimation. As these models are not trained using a labelled data set, they are naturally not as accurate as supervised ones. But still, they have their own advantages.

All these MLAs are useful depending on the situation and the data type. That’s why selecting a particular MLA is essential to arrive at a good estimation. There are several metrics which we need to compare to judge the best model. After that, the best model found needs to be tested on an independent data set for its performance. Visualizing the performance is also a good way to compare the models quickly.

So, here we will compare most of the MLAs using a resampling method, cross-validation, with the scikit-learn package of Python. Then model fit statistics like accuracy, precision and recall will be calculated for comparison. The ROC (Receiver Operating Characteristic) curve is also an easy-to-understand tool for MLA comparison, so finally all the ROC curves will be put in a single figure for ease of model comparison.

Data set used

The same data set is used here for the application of all the MLAs. The example dataset I have used here for demonstration purposes is from kaggle.com. The data, collected by the “National Institute of Diabetes and Digestive and Kidney Diseases”, contains vital parameters of diabetes patients of Pima Indian heritage.

Here is a glimpse of the first ten rows of the data set:

The diabetes data set

The independent variables in the data set are several physiological parameters of a diabetes patient. The dependent variable indicates whether the patient is suffering from diabetes or not. The dependent column is binary: 1 indicates that the person is suffering from diabetes and 0 that he is not.

Code for comparing different machine learning algorithms

Let’s jump to the coding part. It is going to be a fairly lengthy piece of code, and a lot of MLAs will be compared. So, I have broken the complete code down into segments. You can directly copy and paste the code and make small changes to suit your data.

Importing required packages

The first part is to load all the packages needed in this comparison. Besides the basic packages like pandas, numpy, matplotlib we will import some of the scikit-learn packages for application of the MLAs and their comparison.

#Importing basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Importing sklearn modules
from sklearn.metrics import mean_squared_error, confusion_matrix, precision_score, recall_score, auc, roc_curve
from sklearn import svm, model_selection, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process, neural_network
from sklearn.linear_model import LogisticRegression, Ridge  # LogisticRegression is used in the models list later
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Importing the data set and checking if there is any NULL values

This part of code will load the diabetes data set and check for any null values in the data frame.

#Loading the data and checking for missing values
dataset=pd.read_csv('diabetes.csv')
dataset.isnull().sum()

Checking the data set for NULL values is essential, as most MLAs cannot handle them. We have to either eliminate the records with NULL values or replace them with the mean/median of the other values. In the output, each variable is printed with its number of null values. This data set has no null values, so all the counts are zero here.
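The two options mentioned above, dropping records or filling with the median, can be sketched with pandas as follows. The small data frame `df_demo` and its column names are invented for illustration, since the diabetes data itself has no missing values.

```python
import numpy as np
import pandas as pd

# Toy data frame with missing values (hypothetical columns)
df_demo = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0, 5.0],
    "feature_b": [10.0, 20.0, np.nan, 40.0],
})

null_counts = df_demo.isnull().sum()        # number of nulls per column
dropped = df_demo.dropna()                  # option 1: eliminate rows containing nulls
filled = df_demo.fillna(df_demo.median())   # option 2: replace nulls with the column median

print(null_counts)
print(dropped)
print(filled)
```

Dropping rows is simplest but discards data; filling keeps every record at the cost of introducing imputed values.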

No NULL values in the data set

Storing the independent and dependent variables

As we can see, the data frame contains nine variables in nine columns. The first eight columns contain the independent variables, which are physiological variables correlated with diabetes symptoms. The ninth column shows whether the patient is diabetic or not. So, here x stores the independent variables and y stores the dependent variable, the diabetes outcome.

# Creating variables for analysis
x=dataset.iloc[:,: -1]
y=dataset.iloc[:,-1]

Splitting the data set

Here the data set is divided into train and test sets. The test set is 20% of the total records. The test data will not be used in model training and will work as independent test data.

# Splitting train and test data
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.2, random_state=0)

Storing machine learning algorithms (MLA) in a variable

We have selected some very popular MLAs for comparison and stored them in a variable so that they can be used later in the process. The MLAs we take up first are Logistic Regression, Linear Discriminant Analysis, K-nearest neighbour classifier, Decision tree classifier, Naive Bayes classifier and Support Vector Machine.

# Application of all Machine Learning methods
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

Creating a box plot to compare their accuracy

This part of code creates a box plot for all the models against their cross validation score.

# evaluate each model in turn
seed = 0  # assumed value; the original code used `seed` without defining it
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    # shuffle=True is needed for random_state to take effect in KFold
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, x_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Comparison between different MLAs')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

The cross-validation scores are printed below, and they clearly suggest that Logistic Regression and Linear Discriminant Analysis are the two most accurate MLAs.

Below is a box-whisker plot to visualize the same result.

Comparison between different MLAs
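For a classification target, a stratified splitter is a common alternative to plain KFold because it keeps the class proportions similar in every fold. The following is a minimal, self-contained sketch on synthetic data; the data from make_classification and the variable names are assumptions, not the diabetes data. Note that recent scikit-learn releases raise an error if random_state is set on a splitter while shuffle is left at False.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data (hypothetical stand-in for the diabetes data)
X_cv, y_cv = make_classification(n_samples=200, n_features=8, random_state=0)

# 10 folds; shuffling with a fixed seed makes the folds reproducible
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(GaussianNB(), X_cv, y_cv, cv=cv, scoring="accuracy")

print(len(scores))                   # one accuracy value per fold
print(scores.mean(), scores.std())   # summary used to compare models
```

The mean and standard deviation of the fold scores are the same two numbers printed for each model in the loop above.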

Comparing all machine learning algorithms

# Application of all Machine Learning methods
MLA = [
    #GLM
    linear_model.LogisticRegressionCV(),
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),
    
    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    #Gaussian Processes
    gaussian_process.GaussianProcessClassifier(),
    
    #SVM
    svm.SVC(probability=True),
    svm.NuSVC(probability=True),
    svm.LinearSVC(),
    
    #Trees    
    tree.DecisionTreeClassifier(),
  
    #Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),
    
    #Nearest Neighbor
    neighbors.KNeighborsClassifier(),
    ]
MLA_columns = []
MLA_compare = pd.DataFrame(columns = MLA_columns)

row_index = 0
for alg in MLA:  
    
    predicted = alg.fit(x_train, y_train).predict(x_test)
    fp, tp, th = roc_curve(y_test, predicted)
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index,'MLA used'] = MLA_name
    MLA_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(x_train, y_train), 4)
    MLA_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(x_test, y_test), 4)
    MLA_compare.loc[row_index, 'Precission'] = precision_score(y_test, predicted)
    MLA_compare.loc[row_index, 'Recall'] = recall_score(y_test, predicted)
    MLA_compare.loc[row_index, 'AUC'] = auc(fp, tp)

    row_index+=1
    
MLA_compare.sort_values(by = ['Test Accuracy'], ascending = False, inplace = True)
MLA_compare
Comparison of all machine learning algorithms

# Creating plot to show the train accuracy
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Train Accuracy",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA Train Accuracy Comparison')
plt.show()
MLA train accuracy comparison
# Creating plot to show the test accuracy
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Test Accuracy",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('Accuracy of different machine learning models')
plt.show()
Accuracy of different machine learning algorithms
# Creating plots to compare precision of the MLAs
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Precission",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('Comparing different Machine Learning Models')
plt.show()
Comparing different machine learning algorithms
# Creating plots for MLA recall comparison
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Recall",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA Recall Comparison')
plt.show()
Recall comparison of all Machine learning algorithms
# Creating plot for MLA AUC comparison
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="AUC",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA AUC Comparison')
plt.show()
MLA AUC comparison

Creating ROC for all the models applied

The Receiver Operating Characteristic (ROC) curve is a very important tool for diagnosing the performance of MLAs: it plots the true positive rate against the false positive rate at different threshold levels. The area under the ROC curve is often called AUC, and it is also a good measure of the predictive ability of a machine learning algorithm. A higher AUC indicates better separation between the two classes.

# Creating plot to show the ROC for all MLA
index = 1
for alg in MLA:
    
    
    predicted = alg.fit(x_train, y_train).predict(x_test)
    fp, tp, th = roc_curve(y_test, predicted)
    roc_auc_mla = auc(fp, tp)
    MLA_name = alg.__class__.__name__
    plt.plot(fp, tp, lw=2, alpha=0.3, label='ROC %s (AUC = %0.2f)'  % (MLA_name, roc_auc_mla))
   
    index+=1

plt.title('ROC Curve comparison')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.plot([0,1],[0,1],'r--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')    
plt.show()
ROC curve comparison
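The code above passes hard 0/1 predictions to roc_curve, which yields only a single operating point per model. A ROC curve traced over many thresholds needs a continuous score. The sketch below shows one possible way to obtain such scores; the helper `continuous_scores`, the synthetic data and the two example models are assumptions chosen for illustration, not part of the original comparison.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


def continuous_scores(model, X):
    """Hypothetical helper: return a continuous score for the positive class."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]   # probability of class 1
    return model.decision_function(X)         # signed distance from the boundary


# Synthetic data (hypothetical stand-in for the diabetes data)
X_roc, y_roc = make_classification(n_samples=300, n_features=8, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_roc, y_roc, test_size=0.2, random_state=0)

auc_by_model = {}
for model in (LogisticRegression(max_iter=1000), LinearSVC()):
    model.fit(X_a, y_a)
    scores = continuous_scores(model, X_b)
    fpr, tpr, thresholds = roc_curve(y_b, scores)   # curve over all score thresholds
    auc_by_model[type(model).__name__] = roc_auc_score(y_b, scores)

print(auc_by_model)
```

Models such as LinearSVC have no predict_proba method, which is why the helper falls back to decision_function; either kind of score is accepted by roc_curve.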

Conclusion

This post presents a detailed discussion of how we can compare several machine learning algorithms at a time to find out the best one. The comparison was carried out using different functions of the scikit-learn package of Python. We used some popular fit statistics to draw a comparison between the models. Additionally, the Receiver Operating Characteristic (ROC) curve is a good means of comparing several MLAs.

I hope this guide will help you with the problem in hand and let you proceed with the best MLA, chosen through a rigorous comparison. Please feel free to try the Python code given here: copy and paste it into your Python environment, run it and apply it to your data. If you face any problem in executing the comparison, write to me in the comments below.

Decision tree for classification and regression using Python

Decision tree

The decision tree is a popular supervised machine learning algorithm, frequently used to classify categorical data as well as to regress continuous data. In this article, we will learn how to implement decision tree classification using the scikit-learn package of Python.

Decision tree classification helps to take vital decisions in the banking and finance sectors, such as whether a credit/loan should be given to a customer depending on his risk-bearing credentials; in medical settings, such as whether a new medicine should be tried on a patient depending on his/her medical history; and in many more fields.

The above two cases are ones where the target variable is binary, i.e. with only two categories of response. There can be cases where the target variable has more than two categories, and the decision tree can be applied in such multinomial cases too. The decision tree can also handle both numerical and categorical data. So, no doubt a decision tree gives a lot of liberty to its users.

Introduction to decision tree

Decision tree problems generally consist of some existing conditions which determine a categorical response. If we arrange the conditions and the decisions that depend on them, with some of those decisions leading to further decisions, the whole structure of decision making resembles a tree. Hence the name decision tree.

The first and topmost condition, which initiates the decision-making process, is called the root node. The nodes below the root are called either decision nodes or leaf nodes, depending on whether they take part in further decision making. This recursive process continues until all the elements are grouped into particular categories and the final nodes are all leaf nodes.

An example of decision tree

Here we can take the example of the recent COVID-19 epidemic and the problem of testing for positive cases. We all know that the main problem with this disease is that it is very infectious. So, identifying COVID-positive patients and isolating them is essential to stop its further spread. This needs rigorous testing. But COVID testing is a time-consuming and resource-intensive process. It becomes even more of a challenge in countries like India, with a population of about 1.3 billion.

So, if we can categorize which persons actually need testing, it can save a lot of time and resources. We can straightaway downsize the testing population significantly. So, it is a kind of divide-and-conquer policy. See the decision tree below for classifying persons who need to be tested.

An example of decision tree

The whole classification process is very similar to how a human being judges a situation and makes a decision. That’s why this machine learning technique is simple to understand and easy to implement. Further, being a non-parametric approach, this algorithm is applicable to many kinds of data even when the distribution is not known.

The distinct character of a decision tree, which makes it special among machine learning algorithms, is that unlike many of them it is a white-box technique. That means the logic used in the classification process is visible to us. Due to its simple logic, the training time for this algorithm is comparatively low even when the data size is huge with high dimensionality. Moreover, the decision tree forms the foundation of advanced machine learning techniques like the random forest, bagging, gradient boosting etc.

Advantages of decision tree

  • The decision tree has the great advantage of being able to handle both numerical and categorical variables. Many other modelling techniques can handle only one kind of variable.
  • Little data preprocessing is required. Apart from handling missing values, steps like data standardization are not needed, which saves a lot of the user’s time. (Note that scikit-learn’s implementation expects numeric input, so categorical features still have to be encoded there.)
  • The assumptions are not too rigid, and the model can deviate slightly from them.
  • The decision tree model can be validated through statistical tests, so its reliability can be established easily.
  • As it is a white-box model, the logic behind it is visible to us and we can easily interpret the result, unlike a black-box model such as an artificial neural network.

No technique is without flaws; there is always a flip side, and the decision tree is no exception.

Disadvantages of Decision tree

  • A very serious problem with a decision tree is that it is very prone to overfitting. That means the tree can become so complex that it fits the training data almost perfectly but generalizes poorly to new data.
  • Classification by decision tree generally uses an algorithm which finds a locally optimal result at each node. As this process is applied recursively at every node, the whole procedure ends up with a locally optimal rather than a globally optimal decision tree.
  • The result obtained from a decision tree can be very unstable. A little variation in the data can lead to a completely different classification/regression result. That’s why ensemble techniques such as the random forest came about: they combine the results obtained from a number of models instead of relying on a single one.

Classification and Regression Tree (CART)

Decision trees fall into two main categories: classification trees and regression trees. Together, the two are referred to as CART (Classification and Regression Trees). The term was introduced in 1984 by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone.

Classification

When the response is categorical in nature, the decision tree performs classification. Examples are the ones I gave before: whether a person is sick or not, or whether a product passes or fails a quality test. In all these cases, the problem in hand is to assign each case to a group of the target variable.

The target variable can be binomial, that is, with only two categories like yes-no, male-female, sick-not sick etc., or it can be multinomial, that is, with more than two categories. An example of a multinomial variable is the economic status of people. It can have categories like very rich, rich, middle class, lower-middle class, poor, very poor etc. The benefit of the decision tree is that it is capable of handling both binomial and multinomial variables.

Regression

On the other hand, the decision tree is applied to regression problems when the target variable is continuous. An example is predicting the rainfall on a future date depending on other weather parameters. Here the target variable is continuous, so it is a regression problem.

Application of Decision tree with Python

Here we will use the scikit-learn package to implement the decision tree. The package has a class called DecisionTreeClassifier() which is capable of classifying both binomial (target variable with only two classes) and multinomial (target variable with more than two classes) variables.

Performing classification using decision tree

Importing required libraries

The first step in coding is to import all the libraries we are going to use. The basic libraries for any kind of data science project are pandas, numpy, matplotlib and so on. The purpose of these libraries is discussed at length in the article on simple linear regression with Python.

# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

About the data

The example dataset I have used here for demonstration purposes is from kaggle.com. The data, collected by the “National Institute of Diabetes and Digestive and Kidney Diseases”, contains vital parameters of diabetes patients of Pima Indian heritage.

Here is a glimpse of the first ten rows of the data set:

The diabetes data set

The independent variables in the data set are several physiological parameters of a diabetes patient. The dependent variable indicates whether the patient is suffering from diabetes or not. The dependent column is binary: 1 indicates that the person is suffering from diabetes and 0 that he is not.

dataset=pd.read_csv('diabetes.csv')
dataset.head()
# Printing data details
dataset.info()                    # quick view of column types and non-null counts
print(dataset.head())             # first few rows of the data
print(dataset.tail())             # last few rows of the data
print(dataset.sample(10))         # a random sample of 10 rows from the data
print(dataset.describe())         # summary statistics of the data
print(pd.isnull(dataset).sum())   # number of null values in each column
Checking if the dataset has any null value

Creating variables

As we can see, the data frame contains nine variables in nine columns. The first eight columns contain the independent variables, which are physiological variables correlated with diabetes symptoms. The ninth column shows whether the patient is diabetic or not. So, here x stores the independent variables and y stores the dependent variable, the diabetes outcome.

x=dataset.iloc[:,:-1].values
y=dataset.iloc[:,-1].values

Performing the classification

To do the classification we need to import DecisionTreeClassifier from the tree module of sklearn. This classifier can handle binary variables, i.e. variables with only two classes, as well as multiclass variables.

# Use of the classifier
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x, y)
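To illustrate the multiclass case, here is a minimal sketch that fits the same classifier on a three-class problem. It assumes the iris data bundled with scikit-learn as a stand-in, since the diabetes data has only two classes; the name clf_iris is used only for this illustration.

```python
# Minimal sketch: DecisionTreeClassifier on a three-class problem.
# The iris data set is an assumption used only for illustration.
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()                      # 150 samples, 4 features, 3 classes
clf_iris = tree.DecisionTreeClassifier(random_state=0)
clf_iris = clf_iris.fit(iris.data, iris.target)

print(clf_iris.classes_)                # the class labels the tree has learned
print(clf_iris.predict(iris.data[:5]))  # predictions for the first five samples
```

No change to the code is needed when moving from two classes to three; the classifier infers the classes from the values in y.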

Plotting the tree

Now that the model is ready, we can draw the tree. The line below will create the tree plot.

tree.plot_tree(clf)

Generally, the plot created this way has a very low resolution and gets distorted when used as an image. One solution to this problem is to save it in PDF format, which preserves the resolution.

# The decision tree creation
tree.plot_tree(clf)
plt.savefig('DT.pdf')

Another way to get a high-resolution, good-quality image of the tree is to export it in Graphviz format using export_graphviz() from the tree module.

# Creating better graph
import graphviz 
dot_data = tree.export_graphviz(clf, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("diabetes") 
Decision tree to classify the data
Decision tree created using Graphviz

The tree represents the logic of classification in a very simple way. We can easily understand how the data has been classified and the steps to achieve that.
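The same logic can also be read as plain text rules. The sketch below uses export_text() from the tree module of scikit-learn. It assumes the bundled iris data and a shallow tree (max_depth=2) only to keep the printout short; small_clf and rules are illustrative names.

```python
# Minimal sketch: print the decision rules of a small tree as text.
# The iris data and max_depth=2 are assumptions chosen to keep the output short.
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
small_clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
small_clf.fit(iris.data, iris.target)

# Each indented line is one split; leaf lines show the predicted class
rules = tree.export_text(small_clf, feature_names=list(iris.feature_names))
print(rules)
```

Each level of indentation in the printout corresponds to one level of the tree, so the path from the top line to a leaf is exactly the sequence of conditions a sample has to satisfy.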

Performing regression using decision tree

About the data set

The dataset I have used here for demonstration is from https://www.kaggle.com. It contains the height and weight of persons and a column with their gender. The original dataset has thousands of rows, but for this regression I have used only the first 50 rows, containing data on 25 males and 25 females.

Importing libraries

In addition to the basic libraries we imported for the classification problem, here we need to import DecisionTreeRegressor from sklearn.

# Import the necessary modules and libraries
import numpy as np
import pandas as pd  # needed for read_csv below
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

Reading the dataset

I have already described the dataset used here for demonstration. The code below imports the data and stores it in a dataframe called dataset.

dataset=pd.read_csv('weight-height.csv')
print(dataset)

Here is a glimpse of the dataset

Dataset for random forest regression

Creating variables

As we can see, the dataframe contains three variables in three columns. Only the last two columns are of interest to us. We want to regress the weight of a person on his or her height. So, here the independent variable height is x and the dependent variable weight is y.

x=dataset.iloc[:,1:2].values
y=dataset.iloc[:,-1].values

Splitting the dataset

Splitting the whole data set into training and testing data sets is a common practice. Here we have set test_size to 20%, which means the training data set will consist of 80% of the total data. The test data set works as an independent data set when we need to test the regressor after it has been trained with the training data.

# Splitting the data for training and testing
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y, test_size=0.20, random_state=0)
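As a quick check of what the split produces, the sketch below runs the same call on a hypothetical 50-row stand-in for the height and weight data (x_demo and y_demo are made-up arrays, not the real file) and prints the resulting shapes.

```python
# Minimal sketch: sizes of the training and test sets for a 50-row data set.
# x_demo and y_demo are hypothetical stand-ins for the real height/weight columns.
import numpy as np
from sklearn.model_selection import train_test_split

x_demo = np.arange(50).reshape(-1, 1)   # one feature column, 50 rows
y_demo = np.arange(50)                  # 50 target values

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.20, random_state=0)

print(x_tr.shape, x_te.shape)   # 40 rows for training, 10 rows for testing
```

Fixing random_state makes the split reproducible, so the same rows land in the test set on every run.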

Fitting the decision tree regression

Here we have fitted decision tree regressions with two different depth values to draw a comparison between them.

# Creating regression models with two different depths
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(x_train, y_train)
regr_2.fit(x_train, y_train)

Prediction

The lines of code below give predictions from both regression models, with their two different depth values, using a new set of independent variable values, X_test.

# Making prediction
X_test = np.arange(50,75, 0.5)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)

Visualizing prediction performance

The lines of code below generate a height vs weight scatter plot along with two prediction lines created from the two regression models.

# Plot the results
plt.figure()
plt.scatter(x, y, s=20, edgecolor="black",
            c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue",
         label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("Height")
plt.ylabel("Weight")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
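The two depths can also be compared numerically. The sketch below is only an illustration under stated assumptions: the height and weight values are synthetic (generated with a made-up linear relation plus noise), and shallow and deep are illustrative names. It prints the number of leaves and the training mean squared error of each tree.

```python
# Minimal sketch: compare a shallow and a deeper regression tree.
# The data are synthetic; the linear relation and noise level are assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
height = rng.uniform(58, 76, size=(50, 1))                        # synthetic heights
weight = 4.0 * height.ravel() - 100 + rng.normal(0, 10, size=50)  # synthetic weights

shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(height, weight)
deep = DecisionTreeRegressor(max_depth=5, random_state=0).fit(height, weight)

for name, model in [("max_depth=2", shallow), ("max_depth=5", deep)]:
    mse = mean_squared_error(weight, model.predict(height))
    print(name, "leaves:", model.get_n_leaves(), "training MSE:", round(mse, 2))
```

A tree of depth 2 can have at most 4 leaves, so its prediction line has at most 4 steps, while a tree of depth 5 can have up to 32. The deeper tree follows the training points more closely, which is also why it is more prone to overfitting.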

Conclusion

In this post, you have learned about the decision tree and how it can be applied to classification as well as regression problems using scikit-learn in python.

The decision tree is a popular supervised machine learning algorithm that is frequently used by data scientists. Its simple logic and easy algorithm are the main reasons behind its popularity. As it is a white box type of algorithm, we can clearly understand how it does its work.

DecisionTreeClassifier and DecisionTreeRegressor of scikit-learn are two very useful classes for applying decision trees, and I hope you are confident about their use after reading this article.

If you have any question regarding this article or any confusion about its application in python, post it in the comments below and I will try my best to answer.


Artificial Neural Network with Python using Keras library

Artificial Neural Network

An Artificial Neural Network (ANN), as its name suggests, mimics the neural network of our brain, hence it is artificial. The human brain has a highly complicated network of nerve cells that carry sensations to their designated sections of the brain. The nerve cells, or neurons, form a network and pass the sensation from one to another. Similarly, in an ANN a number of inputs pass through several layers of neuron-like units and ultimately produce an estimate.

Schematic diagram of Artificial Neural Network

Perceptron: the simplest Artificial Neural Network

When an ANN consists of only one neuron, it is called a perceptron. A perceptron takes its inputs, combines them as a weighted sum and produces a single output. It is analogous to a neuron in our brain, with its dendrites and axon.

Depending on your problem, there can be more than one neuron and even several layers of neurons. In that case it is called a multi-layer perceptron. In the above figure, we can see that there are two hidden layers. Generally we use ANNs with 2-3 hidden layers, but theoretically there is no limit.

Layers of an Artificial Neural Network

In the above figure you can see that the complete network consists of several layers. Before you start applying ANN, understanding these layers is essential. So, here is a brief idea about the layers an ANN has.

Input layer

The independent variables, which have real values, are the components of the input layer. There can be more than one input variable, and they can be discrete or continuous. They may need standardization before being fed into the ANN if their scales are very different.

Hidden layer

The layers between the input and output are called hidden layers. Here the inputs are associated with weights, and ultimately the weighted sum of all these values is calculated.

The information passed on by one layer of neurons acts as the input for the next layer of neurons. The inputs propagate through the neural network, the activation function and the cost function, and finally yield the output.

Activation function

The weighted sum is then passed through an activation function, which has a very important role in an ANN. This function controls the threshold for the output of the ANN. Similar to a biological neuron, which fires only when the impulse exceeds a particular threshold value, the ANN gives a particular output only when the weighted sum crosses a threshold value.

The output

This is the output of ANN. The activation function yields this output from the weighted sum of the inputs.
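The path from inputs to output for a single neuron can be sketched in a few lines of numpy. All numbers here are hypothetical, the bias term is an assumption that the text above does not discuss, and the sigmoid is just one possible choice of activation function.

```python
# Minimal sketch of one neuron: weighted sum followed by an activation function.
# Inputs, weights and bias are hypothetical values chosen for illustration.
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.5, -1.0, 2.0])    # values arriving from the previous layer
weights = np.array([0.4, 0.3, 0.1])    # one weight per input
bias = 0.1                             # assumed bias term

weighted_sum = np.dot(inputs, weights) + bias   # 0.2 - 0.3 + 0.2 + 0.1
output = sigmoid(weighted_sum)

print(weighted_sum)   # the weighted sum, about 0.2
print(output)         # the neuron's output, between 0.5 and 0.6
```

A layer of neurons repeats this calculation once per neuron, each with its own set of weights, and hands the resulting outputs to the next layer.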

ANN: a deep learning process

ANN is a deep learning process, the burning topic of data science. Deep learning is basically a subfield of machine learning. You may be familiar with the machine learning process; if not, you can refer to this article for a quick working knowledge of it. Deep learning has recently found application in almost all ambitious projects. From basic pattern recognition and voice recognition to face recognition, self-driving cars and high-end projects in robotics and artificial intelligence, deep learning is revolutionizing modern applied science.

Read about supervised machine learning here

ANN is a very efficient and popular method of pattern recognition. But the process involves complex computations and many iterations. The advent of high-end computing devices and machine learning technologies has made our task much easier than ever. Users and researchers can now focus on their research problem without taking the pain of implementing a complex ANN algorithm.

As time passes, easier-to-use modules are being developed in various languages, encapsulating the complexity of such computations. Keras is one such framework in Python. It is built on top of popular frameworks like TensorFlow and Theano, and it has made deep learning and artificial intelligence accessible to a much wider audience.

Here is an exhaustive article on python and how to use it

Here we are going to use this high-level API, Keras, to apply ANN.

Application of ANN using Keras library

Importing the libraries

The first step is to import all the libraries we are going to use. The basic libraries for almost any data science project are pandas, numpy and matplotlib. The purpose of these libraries has been discussed before in the article simple linear regression with python.

# first neural network with keras tutorial
import pandas as pd
import matplotlib.pyplot as plt  # needed later for the heat map and training curves
from keras.models import Sequential
from keras.layers import Dense

About the data

The example dataset I have used here for demonstration has been downloaded from kaggle.com. The data, collected by the “National Institute of Diabetes and Digestive and Kidney Diseases”, contains vital parameters of diabetes patients of Pima Indian heritage.

Here is a glimpse of the first ten rows of the data set:

Diabetes data set for ANN

The independent variables in the data set are several physiological parameters of a patient. The dependent variable indicates whether the patient is suffering from diabetes or not. The dependent column is binary: 1 indicates that the person has diabetes and 0 that he or she does not.

dataset = pd.read_csv('diabetes.csv')
# Printing data details
dataset.info()                 # quick view of the data (info() prints its own report)
print(dataset.head())          # printing first few rows of the data
print(dataset.tail())          # to show last few rows of the data
print(dataset.sample(10))      # display a sample of 10 rows from the data
print(dataset.describe())      # printing summary statistics of the data
print(dataset.isnull().sum())  # count the null values in each column
Checking if the dataset has any null value

Creating variables

As we can see, the data frame contains nine variables in nine columns. The first eight columns contain the independent variables, which are physiological variables correlated with diabetes symptoms. The ninth column shows whether the patient is diabetic or not. So, here the independent variables are stored in x and the dependent variable, the diabetes status, is stored in y.

x=dataset.iloc[:,:-1].values
y=dataset.iloc[:,-1].values
print(x)
print(y)

Preprocessing the data

Preprocessing is standard practice before we start analysing any data set, especially if the data set has variables with different scales. In this data, too, the variables have completely different scales: some of them are fractions whereas others are large whole numbers.

To do away with such differences between the variables, data standardization is very effective. The preprocessing module of the sklearn package has a class called StandardScaler which does the work for us.

# Standardizing the data (zero mean and unit variance for each column)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = sc.fit_transform(x)
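To see what the scaler does, the sketch below applies it to a tiny hypothetical array with one small-scale and one large-scale column. The numbers are made up for illustration and are not taken from the diabetes data.

```python
# Minimal sketch: effect of StandardScaler on columns with very different scales.
# The values in raw are hypothetical.
import numpy as np
from sklearn.preprocessing import StandardScaler

raw = np.array([[0.2, 150.0],
                [0.5,  90.0],
                [0.8, 120.0]])

scaled = StandardScaler().fit_transform(raw)

print(scaled.mean(axis=0))   # each column now has a mean close to 0
print(scaled.std(axis=0))    # and a standard deviation close to 1
```

After the transformation both columns are on a comparable scale, so neither dominates the weighted sums inside the network simply because of its units.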

Create a heat map

Before we proceed with the analysis, we should have a thorough idea about the variables under study and their interrelationships. A very handy way to get a quick overview of the variables is to create a heat map.

The following code will make a heat map. The seaborn package has the required function to do this.

# Creating heat map for correlation study
import seaborn as sns
corr = dataset.corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.show()
Heat map for correlation study among the variables

The heat map is a very good visualization technique for easily grasping the relations between variables. The colour shades indicate the correlation here: lighter shades depict a high correlation, and as the shades get darker the correlation decreases.

The diagonal elements of the heat map are always one, as each is the correlation of a variable with itself. As expected, we can find some variables here with higher correlations, which was not possible to identify from the raw data. For example, pregnancies and age, insulin and glucose, and skin thickness show higher correlations.
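The numbers behind such a heat map come from the correlation matrix itself. The sketch below builds one for a small synthetic data frame; the columns a, b and c are hypothetical, with c constructed to depend strongly on a.

```python
# Minimal sketch: the correlation matrix that a heat map visualizes.
# The data frame is synthetic; column "c" is deliberately built from column "a".
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
demo = pd.DataFrame({
    "a": rng.normal(size=100),
    "b": rng.normal(size=100),
})
demo["c"] = demo["a"] * 2 + rng.normal(scale=0.1, size=100)

corr_demo = demo.corr()      # pairwise Pearson correlations
print(corr_demo.round(2))
```

The diagonal of the printed matrix is one, and the entry for a and c is close to one because c was built from a. In a heat map these are the cells with the shade that marks the highest correlation.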

Splitting the dataset in training and test data

For testing, we need to set aside a part of the complete dataset which will not be used for model building. The rule of thumb is to use 80% of the data for modelling and keep the rest aside. It will work as an independent dataset, which we use to test the fitted model’s performance.

# Splitting the data for training and testing
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y, test_size=0.20, random_state=0)

Here the data splitting has been performed with the help of the model_selection module of the sklearn library. This module has a built-in function called train_test_split which automatically divides the dataset into two parts. The argument test_size controls the proportion of the test data. Here test_size is 0.2, so the test dataset will contain 20% of the complete data.

Modelling the data

So we have completed all the prerequisite steps before modelling the data. Here the response variable is binary, with 0 and 1 as its values. A multilayer perceptron ANN is well suited to model such data. In this type of ANN, each layer is fully connected to the next, and its output works as the input for the immediately following layer of neurons.

For using a multilayer perceptron, the Keras sequential model is the easiest way to start. To create a sequential model we have used model = Sequential(). The activation function for the hidden layers is relu, the one most commonly used when implementing neural networks with Keras. The output layer uses a sigmoid activation so that the output lies between 0 and 1.

# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
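It can help to work out how many trainable parameters this architecture has. The sketch below does the arithmetic in plain Python, assuming each Dense layer keeps its default bias term, so that a layer has (inputs x units) weights plus one bias per unit. The layers list simply restates the three layers defined above.

```python
# Minimal sketch: count the trainable parameters of the 8 -> 12 -> 8 -> 1 network.
# Assumes every Dense layer uses a bias term (the Keras default).
layers = [(8, 12), (12, 8), (8, 1)]   # (number of inputs, number of units)

total = 0
for n_inputs, n_units in layers:
    params = n_inputs * n_units + n_units   # weights plus one bias per unit
    print(n_inputs, "->", n_units, ":", params, "parameters")
    total += params

print("Total trainable parameters:", total)   # 108 + 104 + 9 = 221
```

Calling model.summary() on the Keras model prints a table of layers with their parameter counts, which is a convenient way to check the network against this kind of hand calculation.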

Compiling the model

Now that the model is defined, we will compile it with the adam optimizer and the loss function called binary_crossentropy. As training proceeds over several iterations, we can track the model’s accuracy thanks to ['accuracy'] passed in the metrics argument.

# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

While compiling the model, the two arguments loss and optimizer play an important role. The loss function generally depends on the particular problem you are addressing through ANN. For example, if you have a regression problem, the loss function you will typically use is the Mean Squared Error (MSE).

In this case we are dealing with a binary response variable, so the loss function is binary_crossentropy. If the response variable consists of more than two classes, the loss function should be categorical_crossentropy.
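To make the binary cross-entropy loss concrete, here is a minimal numpy sketch of the formula. It is an illustration of the calculation, not the Keras implementation; the labels and predicted probabilities are hypothetical, and the clipping constant eps is an assumption added to avoid taking the log of zero.

```python
# Minimal sketch of binary cross-entropy, written out with numpy.
# Labels and probabilities are hypothetical; eps guards against log(0).
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob)
                    + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 0])
confident = np.array([0.9, 0.1, 0.8, 0.2])   # probabilities close to the labels
unsure = np.array([0.6, 0.4, 0.5, 0.5])      # probabilities close to 0.5

print(binary_crossentropy(y_true, confident))   # smaller loss
print(binary_crossentropy(y_true, unsure))      # larger loss
```

The loss shrinks as the predicted probabilities move towards the true labels, which is what the optimizer tries to achieve while training.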

In a similar way, the optimization algorithm used here is adam. There are several others, such as RMSprop and Stochastic Gradient Descent (SGD), and the choice has an impact on tuning the model’s learning rate and momentum.

Fitting the model

Fitting the model again has two crucial parameters. Initializing them with good values determines the model’s efficiency and performance to a great extent. Here epochs decides how many passes will be made through the training set.

The batch_size, as the name suggests, is the number of input samples passed through the ANN at a time. It makes training more efficient, as the model does not have to process the whole input at once.

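# fit the keras model on the training set
# (validation_data is passed so that val_accuracy and val_loss are recorded for the plots later)
train = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=100, batch_size=10)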

Here I have set batch_size to 10, so ten samples enter the network at a time, and the total number of epochs to 100. See the output screenshot below, where the first 10 epochs are captured with the model’s accuracy at every epoch.
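The relation between batch_size and epochs comes down to simple arithmetic. In the sketch below, n_train is a hypothetical number of training rows and not a figure stated in this article; the other two values are the ones used in the fit call.

```python
# Minimal sketch: how many weight updates the batch_size and epochs settings imply.
# n_train is a hypothetical training set size chosen for illustration.
import math

n_train = 614        # assumed number of training rows
batch_size = 10
epochs = 100

steps_per_epoch = math.ceil(n_train / batch_size)   # the last batch may be smaller

print(steps_per_epoch)            # 62 batches, and so 62 weight updates, per epoch
print(steps_per_epoch * epochs)   # 6200 weight updates over the whole run
```

A smaller batch_size means more updates per epoch but noisier ones, while a larger batch_size gives fewer and smoother updates.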

Evaluating the model

Once the model has been compiled and trained, we can check its accuracy. Keras provides the model.evaluate() function for this, which here gives an accuracy of 68.24% on the training data. Keep in mind that this accuracy can vary each time the ANN is run.

# evaluate the keras model
_,accuracy = model.evaluate(x_train, y_train)
print('Accuracy: %.2f' % (accuracy*100))

Prediction using the model

Now the model is ready for making predictions. The values of x_test are provided as inputs to the ANN.

# make probability predictions with the model
predictions = model.predict(x_test)
# round predictions 
rounded = [round(x[0]) for x in predictions]
print(rounded[:10])
print(y_test[:10])

I have printed here both the predicted results for x_test and the original y_test values (first 10 values only), and the predictions are correct for all of them.

Comparing the predicted values and the original values of test set (first 10 values only)
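Looking at the first ten values is a quick check; a single accuracy figure over the whole test set is more informative. The sketch below shows one way to compute it with numpy. The probabilities and labels are hypothetical stand-ins for the output of model.predict(x_test) and for y_test, and the 0.5 cut-off is an assumption.

```python
# Minimal sketch: turn predicted probabilities into class labels and score them.
# probs and actual are hypothetical stand-ins for model.predict(x_test) and y_test.
import numpy as np

probs = np.array([[0.91], [0.12], [0.55], [0.40], [0.78]])   # one probability per row
actual = np.array([1, 0, 0, 0, 1])

labels = (probs.ravel() >= 0.5).astype(int)   # threshold the probabilities at 0.5
accuracy = (labels == actual).mean()          # share of correct predictions

print(labels)     # [1 0 1 0 1]
print(accuracy)   # 0.8, since four of the five labels match
```

Because the test rows were not used for fitting the weights, accuracy measured this way is a fairer estimate of performance on new data than accuracy measured on the training set.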

Visualizing the model’s performance

# Visualizing training and validation accuracy and loss
# (the 'val_...' keys exist only if validation data was passed to model.fit)
import matplotlib.pyplot as plt
plt.plot(train.history['accuracy'])
plt.plot(train.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
plt.plot(train.history['loss']) 
plt.plot(train.history['val_loss']) 
plt.title('Model loss') 
plt.ylabel('Loss') 
plt.xlabel('Epoch') 
plt.legend(['Train', 'Test'], loc='upper left') 
plt.show()

Conclusion

So we have just completed our first deep learning model to solve a real-world problem. This was a very simple problem with a small data set, just for demonstration. But the basic principle of fitting an ANN is the same everywhere, irrespective of data complexity and size. What matters is that you know how it works.

Future scope

We have obtained here an ANN accuracy of 68.24%, which leaves a lot of scope for improvement. So we need to put in further effort to improve the model. You can start by tweaking the number of layers in the network, the optimizer and loss function used when compiling the model, and also the epochs and batch_size. Changing these parameters may result in higher accuracy.

For example, in this particular case, if we increase the number of epochs from 100 to 200, the accuracy increases to 77%. It is quite a jump in the model’s performance. Likewise, simple changes in other parameters can also be very helpful.

If there is scope, using more sample data to train the model is also an effective way of increasing the model’s prediction performance. So, once you have a defined model in your hand, there is always ample scope to think about improving it.

I hope this article helps you take a big step forward into the vast, dynamic and very interesting world of deep learning and AI.

References: