The Naive Bayes classifier is a straightforward, easy-to-use and fast machine learning technique. It is one of the most popular supervised learning techniques for classifying data sets with high dimensionality. In this article, you will get a thorough idea of how the algorithm works, along with a step-by-step implementation in Python. Naive Bayes is actually a simplified application of Bayes’ theorem, so we will cover that too.
“Under Bayes’ theorem, no theory is perfect. Rather, it is a work in progress, always subject to further refinement and testing.” ~ Nate Silver
Classification problems are everywhere in real life. We make decisions every day by judging the probability of several factors, either consciously or unconsciously. When we need to analyse large amounts of data and make a decision based on it, we need a tool. The Naive Bayes classifier is one of the simplest and fastest supervised learning algorithms, and it is often accurate enough for practical use. So it can make vital decisions far easier.
The concept of Bayes’ theorem
To understand Naive Bayes’ classification, we first have to understand Bayes’ theorem. Bayes’ theorem describes the relationship between the conditional probabilities of different events. It gives the probability of a hypothesis given that some event (the evidence) has been observed.
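Written out as a formula, with H for the hypothesis and E for the observed event (both symbols are introduced here only for illustration), the theorem reads:

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
```

Here P(H) is the prior probability of the hypothesis, P(E | H) is the likelihood of the event under that hypothesis, and P(H | E) is the posterior probability.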
For example, we cricket lovers try to guess whether we will be able to play today based on the weather. A banker tries to judge whether a customer is too risky to be given credit based on his financial transaction history, and a businessman tries to judge whether his newly launched product is going to be a hit or a flop based on customers’ buying behaviour.
Models that describe the data through such conditional probabilities are called generative models. They are generative because they specify the hypothetical random process that generates the data. But training such generative models for each event is a very difficult task.
So how do we tackle this issue? Here comes the concept of the Naive Bayes’ classifier. It is called “naive” because it makes a very simple assumption about the Bayes’ model: within a class, the presence of any feature does not depend on any other feature. It simply overlooks the relationships between the features and considers that all the features contribute independently to the target variable.
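In symbols, the naive assumption can be sketched as follows, writing y for the class and x_1, ..., x_n for the features (this notation is assumed here and is not taken from the article):

```latex
P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)
\quad\Longrightarrow\quad
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The classifier then picks the class y for which the right-hand side is largest.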
In the data set, the feature variable is the test report, with values “positive” and “negative”, whereas the binary target variable is “sick”, with values “yes” or “no”. Let us assume the data set has 20 cases of test results, which are as below:
Creating a frequency table of the attributes of the data set
So if we create the frequency table for the above data set it will look like this
With the help of this frequency table, we can now prepare the likelihood table to calculate the prior and posterior probabilities. See the figure below.
With the help of the above table, we can now calculate the probability that a person is really suffering from the disease when his test report is positive.
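So the probability we want to compute is
P(sick = yes | test = positive) = P(test = positive | sick = yes) × P(sick = yes) / P(test = positive)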
We have already calculated these probabilities, so we can put the values directly into the above equation and get the probability we want.
In the same fashion, we can also calculate the probability that a person does not have the disease in spite of the test report being positive.
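The calculation described above can be sketched in a few lines of plain Python. The counts below are hypothetical and chosen only so that they add up to 20 cases; they are not the figures from the article's frequency table, and the helper name `posterior` is likewise made up for this illustration.

```python
# Hypothetical counts for 20 test results (illustrative only; these are
# NOT the figures from the article's frequency table).
# Keys are (test report, sick) pairs.
counts = {
    ("positive", "yes"): 6,
    ("positive", "no"): 2,
    ("negative", "yes"): 2,
    ("negative", "no"): 10,
}
total = sum(counts.values())


def posterior(sick, test):
    """Return P(sick | test) via Bayes' theorem, using the frequency counts."""
    n_sick = sum(n for (t, s), n in counts.items() if s == sick)
    n_test = sum(n for (t, s), n in counts.items() if t == test)
    prior = n_sick / total                      # P(sick)
    likelihood = counts[(test, sick)] / n_sick  # P(test | sick)
    evidence = n_test / total                   # P(test)
    return likelihood * prior / evidence


p_yes = posterior("yes", "positive")
p_no = posterior("no", "positive")
print(f"P(sick=yes | test=positive) = {p_yes:.2f}")
print(f"P(sick=no  | test=positive) = {p_no:.2f}")
```

With these made-up counts the two posteriors come to 0.75 and 0.25. They sum to 1, as they must, because a person with a positive report is either sick or not.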
Application of Naive Bayes’ classification with python
Now comes the most interesting part of the article. We will implement Naive Bayes’ classification using Python. To do that, we will use the popular scikit-learn library and its functions.
About the data
We will use the same diabetes data that we used earlier for other classification problems.
The purpose of using the same data for all classification problems is to let you compare the different algorithms. You can judge each algorithm by how accurately it classifies the same data.
So, here the target variable has two classes: whether the person has diabetes or not. On the other hand, we have eight independent (feature) variables influencing the target variable.
Importing required libraries
The first step is to import all the libraries we are going to use. The basic libraries for most data science projects are pandas, numpy, matplotlib and so on. The purpose of these libraries is discussed in detail in the article on simple linear regression with Python.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.naive_bayes import GaussianNB
import seaborn as sns
About the data
The example data set I have used here for demonstration is from kaggle.com. The data, collected by the “National Institute of Diabetes and Digestive and Kidney Diseases”, contains vital parameters of diabetes patients of Pima Indian heritage.
Here is a glimpse of the first ten rows of the data set:
The independent variables in the data set are several physiological parameters of a diabetes patient. The dependent variable is whether the patient is suffering from diabetes or not. The dependent column contains a binary variable: 1 indicates that the person is suffering from diabetes and 0 that he is not.
dataset = pd.read_csv('diabetes.csv')
dataset.shape
dataset.head()
# Printing data details
dataset.info()                 # quick view of columns, dtypes and non-null counts
print(dataset.head())          # first few rows of the data
print(dataset.tail())          # last few rows of the data
print(dataset.sample(10))      # a random sample of 10 rows
print(dataset.describe())      # summary statistics of the data
print(dataset.isnull().sum())  # number of null values in each column
As we can see, the data frame contains nine variables in nine columns. The first eight columns contain the independent variables. These are physiological variables that are correlated with diabetes. The ninth column shows whether the patient is diabetic or not. So, here x stores the independent variables and y stores the dependent variable, the diabetes outcome.
x = dataset.iloc[:, :-1]  # all columns except the last: the features
y = dataset.iloc[:, -1]   # the last column: the target
Splitting the data for training and testing
Here we will split the data set into a training set and a testing set in an 80:20 ratio. We will use the train_test_split function of the scikit-learn library. The test_size argument in the code decides what proportion of the data is kept aside to test the trained model. The test data remains unused during training and acts as independent data during testing.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
Fitting the Naive Bayes’ model
Here we fit the model with the training set.
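A minimal sketch of the fitting step is given here. So that it runs on its own, it uses a synthetic stand-in for the diabetes data (eight numeric features and a binary target, generated with scikit-learn's make_classification); the sample size and the variable name `model` are assumptions made for illustration. With the real data, x_train and y_train from the split above would be passed to fit in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the diabetes data (an assumption for illustration):
# 500 rows, 8 numeric features, binary target.
x, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_redundant=0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

# Fit a Gaussian Naive Bayes model on the training portion only.
model = GaussianNB()
model.fit(x_train, y_train)

# The fitted model stores the class labels and one prior probability per class.
print("Classes:", model.classes_)
print("Class priors:", model.class_prior_)
```

GaussianNB is used because it is the class imported at the top of the article. It models each feature within a class as normally distributed, which is a common choice for continuous measurements.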
Using the Naive Bayes’ model for prediction
Now that the model has been fitted using the training set, we will use the test data to make predictions.
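The prediction step might look like the sketch below. It repeats the set-up on synthetic stand-in data so that it runs on its own (the generated data and the names `model`, `y_pred` and `y_proba` are assumptions for illustration). With the real data, the model fitted on x_train would be applied to x_test in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the diabetes data (an assumption for illustration).
x, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_redundant=0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

model = GaussianNB().fit(x_train, y_train)

# Hard class labels (0 or 1) for every row of the held-out test set.
y_pred = model.predict(x_test)

# Posterior probability of each class, one row per test case.
y_proba = model.predict_proba(x_test)

print(y_pred[:10])
print(y_proba[:3].round(3))
```

predict returns the class with the highest posterior probability, while predict_proba exposes the probabilities themselves, which is useful when a decision threshold other than 0.5 is wanted.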
Checking the accuracy of the fitted model
As we already have the observations corresponding to the test data set, we can compare them with the predictions to check how accurate the model is. Scikit-learn’s metrics module has a function called accuracy_score, which we will use here.
from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
So, we have completed the whole process of applying Naive Bayes’ classification using Python, and we have also gone through its basic concepts. It may be a little confusing at first. As you solve more practical problems with it, you will become more confident.
This particular classification technique is based on Bayes’ theorem. It gets the name “Naive” from the simplifying assumption it adds to the theorem: the Naive Bayes classifier assumes that every pair of features is conditionally independent given the value of the target variable.
The Naive Bayes classifier can be a good choice for many classification problems, whether binary or multi-class. The algorithm is extremely fast and straightforward, so it can help us make quick decisions. If the result of this classifier is accurate enough (which is often the case), then that is fine; otherwise we can always turn to other classifiers such as decision trees or random forests.
So, I hope this article helps you gain in-depth knowledge of Naive Bayes’ theory and its application to real-world problems. In case of any doubts or queries, please let me know through the comments below.