Logistic regression: classification with Python

Logistic regression

Logistic regression is one of the most common supervised classification techniques. When the outcome we want to predict is categorical, logistic regression is often the first tool we reach for. Classification is a very popular prediction task: a large share of real-world prediction problems involve a categorical response and are therefore amenable to classification.


This article covers the basic idea of logistic regression and its implementation in Python. The reason for choosing Python is simple: it is the most widely used language among data scientists, and that is unlikely to change any time soon.


Why is it “regression” and not “classification”?

Why is the name “regression” when it performs classification? It is a natural question to ask. The answer is that logistic regression is fundamentally a regression process: it estimates a probability, and it becomes a classification process only when a decision threshold is applied to that probability. Choosing this threshold is an important, and often tricky, step.

The decision threshold must be chosen for the particular problem at hand. A binary classifier can produce four kinds of outcomes: “true positive”, “true negative”, “false positive” and “false negative” (we will discuss them in detail when we come to the confusion matrix). Depending on how severe each kind of error is, we tune the threshold to trade one type of mistake against another.
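To make the idea of a threshold concrete, here is a minimal sketch, assuming we already have predicted probabilities from some fitted classifier (the array p below is hypothetical):

# A minimal thresholding sketch; p is a hypothetical vector of
# predicted probabilities from an already-fitted classifier
import numpy as np

p = np.array([0.10, 0.40, 0.55, 0.90])
threshold = 0.5                         # the usual default cut-off
labels = (p >= threshold).astype(int)   # 1 where probability >= threshold
print(labels)                           # [0 0 1 1]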

For example, consider a court deciding a severe criminal case: should the accused be convicted or not? It is a binary classification problem with two outputs, guilty or not guilty. Here a true positive is the person being found guilty when he actually committed the crime. A false positive, on the other hand, is the person being found guilty when he did not commit the crime.

The false positive here is clearly the most serious outcome and should be avoided at almost any cost. Hence, while fixing the decision threshold, we should try to minimise the probability of false positives, even at the expense of letting some false negatives through.


Logistic regression: the basic idea

Though this process is used for classification, it is essentially a regression performed on the probability of a discrete outcome. Unlike linear regression, which predicts the response of a continuous variable, logistic regression predicts the probability of the positive outcome of a binary response variable.

Unlike linear regression, which fits a linear function, logistic regression fits a sigmoid (logistic) function:

Linear regression: $y = \beta_0 + \beta_1 x$

Logistic regression: $p = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$
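To see numerically what the sigmoid does, here is a tiny sketch (the sample values are illustrative only):

# A small sketch of the sigmoid: it squashes any real number into (0, 1)
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))    # 0.5 -- the decision boundary for a 0.5 threshold
print(sigmoid(3))    # ~0.95, a confident positive
print(sigmoid(-3))   # ~0.05, a confident negative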

Classification types in logistic regression

Binary/binomial classification

In binary classification, the response under study can generally be classified into two groups. Examples of binary classification problems are almost everywhere in the real world.

Be it a medical test identifying whether a patient is suffering from a disease, a clinical test declaring a product pass or fail against industrial quality-control parameters, or simply predicting whether it will rain or not: all of these are binary classification problems. The response can take only two values, positive (1) or negative (0), corresponding to every duality like “yes-no”, “pass-fail”, “male-female”, “win-loss” etc.

Multinomial classification

Here the response variable has more than two categories with no inherent order. For example, employees may be categorised into Group A, Group B and Group C, which cannot be arranged in any ascending or descending order.

A good example of such data is the famous iris data set of Sir Ronald A. Fisher, regarded as the father of modern statistics for his remarkable contributions. It is a very popular multivariate dataset and has long been used as an example for all kinds of pattern-recognition problems.

The data set contains measurements of 3 species of iris plant, with 50 instances of each species. The dependent variable here is the species of the plant, with no order among the three.
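As a quick illustration, here is a sketch of multinomial classification on the iris data using scikit-learn's built-in copy of the dataset (LogisticRegression handles the three classes automatically; the max_iter value is just a convergence safeguard):

# Multinomial logistic regression on the iris data (sketch)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
clf = LogisticRegression(max_iter=200)
clf.fit(iris.data, iris.target)
print(clf.predict(iris.data[:5]))   # predicted species codes for the first 5 rows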

Ordinal classification

In this case, as with the multinomial variable, the response has more than two classes, but here the classes can be ranked in some order: for example, the financial status of a citizen as “very poor”, “poor”, “lower middle class”, “middle class”, “rich” and “very rich”.

Any prediction problem is, at heart, either a classification or a regression problem. Which prediction tool to use depends on the type of the response variable. If the response is categorical with a binary outcome, binary classification is the solution; if the response is a continuous variable, we have to use regression.

For example, predicting the price of a product from its various specifications is a regression problem. But determining whether a customer will buy the product or not is certainly a binary classification problem, because the response is discrete with only two possible values, “buy” and “not buy”.


Application of logistic regression with Python

So, I hope the theoretical side of logistic regression is clear by now. It is time to apply this regression process using Python.

So, let's start coding…

About the data

We already know that logistic regression suits categorical responses. The example dataset used here for demonstration was downloaded from kaggle.com. The data, collected by the “National Institute of Diabetes and Digestive and Kidney Diseases”, contains vital parameters of diabetes patients belonging to the Pima Indian heritage.

Here is a glimpse of the first ten rows of the data set:

Diabetes data set for logistic regression

The data set's independent variables are several physiological parameters of a diabetes patient. The dependent variable indicates whether the patient suffers from diabetes: it is a binary variable, with 1 meaning the person has diabetes and 0 meaning he does not.

So, our task is to use logistic regression to predict, as accurately as possible, whether a person is a diabetes patient from these other vital parameters.

Importing the libraries

The first step is to import all the libraries we are going to use. The basic libraries for almost any data science project are pandas, numpy, matplotlib etc. Their purpose was discussed earlier in the article simple linear regression with python.

# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Reading the dataset

I have already described the dataset used here for demonstration. The code below imports the data and stores it in a dataframe called dataset.

dataset=pd.read_csv('diabetes.csv')
print(dataset)

Here is a glimpse of the dataset

Diabetes data frame in python

Creating variables

As we can see, the data frame contains nine variables in nine columns. The first eight columns hold the independent variables, physiological measurements correlated with diabetes symptoms. The ninth column shows whether the patient is diabetic or not. So here the independent variables are stored in x and the dependent variable, the diabetes outcome, is stored in y.

x=dataset.iloc[:,:-1].values   # the first eight columns: the predictors
y=dataset.iloc[:,-1].values    # the last column: the diabetes outcome
print(x)
print(y)
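Before splitting the data it is also worth checking how balanced the two classes are, since accuracy alone can be misleading on skewed data. A quick check, assuming y holds integer 0/1 labels as it does here:

# Count the 0s and 1s in the response variable
print(np.bincount(y))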

Splitting the dataset in training and test data

For testing purposes, we need to set aside a part of the complete dataset that will not be used for model building. The rule of thumb is to use 80% of the data for modelling and keep the rest aside as an independent dataset. We will then test the fitted model's performance on this independent dataset.

# Dividing the dataset into training and testing datasets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

Here the splitting has been performed with the model_selection module of the sklearn library. Its built-in function train_test_split automatically divides the dataset into two parts. The argument test_size controls the proportion of test data; here it is 0.2, so the test dataset will contain 20% of the complete data.
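One optional refinement, shown here only as a sketch: passing stratify=y keeps the proportion of diabetic and non-diabetic cases the same in the training and test sets.

# Optional: a stratified split that preserves the class proportions
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y)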

Application of logistic regression

Here we will be using the LogisticRegression class from scikit-learn.

# Importing the logistic regression class and fitting the model
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(x_train, y_train)

After importing LogisticRegression, we create an instance of the class and use it to fit the logistic regression on the training dataset.
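A practical aside: depending on your scikit-learn version, the default lbfgs solver may warn that it did not converge on this data. A common remedy, sketched below and not required to follow the rest of this article, is to raise max_iter or to standardise the features in a pipeline:

# Optional: more iterations plus feature standardisation,
# which helps the default lbfgs solver converge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(x_train, y_train)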

Predicting using the test data

# Using the fitted model to predict using the test data
y_pred=model.predict(x_test)

As the model has been trained on the training set, we use it to generate predictions for the test set. The fitted model produces a predicted vector y_pred from x_test. Since we already know the true values corresponding to x_test, stored in y_test, we can check how accurate the predictions are.
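To eyeball the result before computing any statistics, we can line up a few predictions against the true labels (a quick sketch):

# Compare the first few predictions with the true labels
comparison = pd.DataFrame({'actual': y_test, 'predicted': y_pred})
print(comparison.head(10))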

Calculating fit statistics

# Calculating different statistics to evaluate model fit
from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision score:", metrics.precision_score(y_test, y_pred))
print("Recall score:", metrics.recall_score(y_test, y_pred))

scikit-learn also has a metrics module with useful functions for calculating fit statistics like the accuracy score, precision score, recall score etc.

Model validation statistics

Here we have all three statistics calculated. The accuracy score of 0.82 suggests a good classification: out of every 10 observations, the model classifies roughly 8 correctly.

The precision and recall scores are also good measures of the classification process. Precision measures the proportion of positive predictions that are correct: here it indicates that when the model, using all the physical parameters of a person, predicts that he or she is going to suffer from diabetes, there is a 76% chance that the prediction is correct.

The recall score of 61% says that, of the actual diabetes patients present in the test set, the classification process correctly identifies 61%.
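Both scores can be verified by hand from the confusion-matrix counts reported later in this article (29 true positives, 9 false positives, 18 false negatives):

# Worked check of precision and recall from the confusion-matrix counts
tp, fp, fn = 29, 9, 18
print("Precision:", tp / (tp + fp))   # 29/38 ≈ 0.76
print("Recall:", tp / (tp + fn))      # 29/47 ≈ 0.62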

You can generate a more detailed report on the classification performance using the classification_report() function from scikit-learn. See below…

# Detailed classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
Detailed classification report

Creating confusion matrix

Creating a confusion matrix is another effective way to judge the model. In the binary case it is a 2×2 matrix holding the true negative, false positive, false negative and true positive counts in its four quadrants.

A confusion matrix example

The code below creates the confusion matrix using the metrics module of the scikit-learn library.

# Creating the confusion matrix to check the accuracy of prediction
# (the metrics module was imported above)
conf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(conf_matrix)
Confusion matrix

So, here is the desired confusion matrix. Comparing it with the example matrix above, we can say that the logistic regression has produced 98 true negatives, 9 false positives, 18 false negatives and 29 true positives.

Now, what do they mean? The terms are somewhat technical, so let me explain them with respect to this result. A true negative is a correct 0 prediction, and there are 98 of those here. Likewise, in 29 instances the 1 predictions are correct; these are the true positives. The 9 false positives are wrong 1 predictions, and the 18 false negatives are wrong 0 predictions.
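For binary labels, scikit-learn lays the matrix out as [[tn, fp], [fn, tp]], so the four counts can be unpacked directly:

# Unpack the four counts from the 2x2 confusion matrix
tn, fp, fn, tp = conf_matrix.ravel()
print(tn, fp, fn, tp)   # expected here: 98 9 18 29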

#Creating a heatmap for the confusion matrix
cm=conf_matrix
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='red')
plt.show()

Creating a ROC curve

A Receiver Operating Characteristic (ROC) curve is a good visual way to judge the efficiency of a classifier. The curve plots the true positive rate against the false positive rate at every possible decision threshold, and so shows the trade-off between sensitivity and specificity.

# Creating the Receiver Operating Characteristic (ROC) curve
y_pred_proba = model.predict_proba(x_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr, tpr, label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
ROC curve for logistic regression

Here you can see an AUC score of 0.87, which suggests a good classification. The score varies between 0 and 1: a score of 1 means perfect classification, a score of 0.5 is no better than random guessing, and anything below 0.5 is worse than random.
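The ROC output can also guide the threshold choice discussed at the start of this article. One common heuristic, sketched below (this time keeping the third value returned by roc_curve instead of discarding it), is to pick the threshold that maximises Youden's J statistic, tpr − fpr:

# Pick the decision threshold that maximises Youden's J (tpr - fpr)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_proba)
best_idx = np.argmax(tpr - fpr)
print("Suggested threshold:", thresholds[best_idx])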

Conclusion

Logistic regression is a very simple classification technique built on very simple logic, so the computational resources it requires are comparatively modest. Another big plus is that the basic technique does not require feature scaling (though, as noted above, scaling can help the numerical solver). So it is no surprise that logistic regression has long been a favourite choice among data scientists for classification problems.

On the flip side of this simplicity, logistic regression is not very efficient when the response has many classes, it can overfit when there are many features, and on its own it cannot capture non-linear relationships. Machine learning techniques like Naive Bayes, support vector machines, random forests and decision trees are often much more capable of handling such complex data.
