Decision tree for classification and regression using Python

Decision tree

The decision tree is a popular supervised machine learning algorithm, frequently used both to classify categorical data and to regress continuous data. In this article, we will learn how to implement decision trees using the scikit-learn package of Python.

Decision tree classification helps with vital decisions in the banking and finance sectors, such as whether a loan should be given to a customer depending on their risk-bearing credentials; in medicine, such as whether a new medicine should be tried on a patient depending on their medical history; and in many more fields.

In the above two cases the target variable is binary, i.e. it has only two categories of response. The target variable can also have more than two categories, and the decision tree can be applied in such multinomial cases too. The decision tree can also handle both numerical and categorical data. So a decision tree gives a lot of freedom to its users.

Introduction to decision tree

Decision tree problems generally consist of some existing conditions which determine a categorical response. If we arrange the conditions and the decisions that depend on them, with some of those decisions leading to further decisions, the whole structure of decision making resembles a tree. Hence the name decision tree.

The first and topmost condition, which initiates the decision-making process, is called the root node. The nodes below the root are either decision nodes, which split further, or leaf nodes, which give a final decision. The splitting continues recursively until all the elements are grouped into particular categories and the final nodes are all leaf nodes.
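To make the root, decision and leaf nodes concrete, here is a minimal hand-built sketch. The questions and labels (has_fever, has_cough, "test", "monitor") are made up for illustration and are not taken from any real screening rule; a fitted tree would learn its own conditions from data.

```python
# A hand-built decision tree as nested dictionaries (illustrative only).
# Decision nodes hold a question; leaf nodes hold a final label.
toy_tree = {
    "question": "has_fever",          # root node
    "yes": {
        "question": "has_cough",      # decision node
        "yes": {"label": "test"},     # leaf node
        "no": {"label": "monitor"},   # leaf node
    },
    "no": {"label": "no test"},       # leaf node
}


def classify(node, record):
    """Walk from the root down to a leaf and return the leaf's label."""
    while "label" not in node:
        answer = "yes" if record[node["question"]] else "no"
        node = node[answer]
    return node["label"]


print(classify(toy_tree, {"has_fever": True, "has_cough": False}))
```

Every record ends in exactly one leaf, which is why the logic of a tree can be read off as a chain of simple if/else questions.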

An example of decision tree

Here we can take an example from the recent COVID-19 epidemic, related to the testing of positive cases. The main problem with this disease is that it is very infectious, so identifying COVID-positive patients and isolating them is essential to stop its further spread. This needs rigorous testing. But COVID testing is a time-consuming and resource-intensive process. It becomes more of a challenge in countries like India with a population of 1.3 billion.

So, if we can categorize which persons actually need testing, it can save a lot of time and resources. We can straightaway downsize the testing population significantly. It is a kind of divide-and-conquer policy. See the below decision tree for classifying persons who need to be tested.

An example of decision tree

The whole classification process is very similar to how a human being judges a situation and makes a decision. That’s why this machine learning technique is simple to understand and easy to implement. Further, being a non-parametric approach, this algorithm does not assume any particular distribution of the data.

The character that sets a decision tree apart from many other machine learning algorithms is that it is a white box technique. That means the logic used in the classification process is visible to us. Because the logic is simple, training is usually fast even when the data set is large and has many dimensions. Moreover, the decision tree is the foundation of advanced machine learning techniques like random forest, bagging, gradient boosting etc.

Advantages of decision tree

  • The decision tree has the great advantage of being able to handle both numerical and categorical variables. Many other modelling techniques can handle only one kind of variable.
  • Little data preprocessing is required. Apart from handling missing values, steps like data standardization are not needed, which saves a lot of the user’s time. (Note that the scikit-learn implementation used below expects numeric input, so categorical columns still have to be encoded as numbers there.)
  • It makes few assumptions about the data, and the model tolerates moderate deviations from them.
  • The decision tree model can be validated through statistical tests, so its reliability can be established.
  • As it is a white box model, the logic behind it is visible to us and we can easily interpret the result, unlike a black-box model such as an artificial neural network.

No technique is without flaws, and the decision tree is no exception.

Disadvantages of Decision tree

  • A very serious problem with a decision tree is that it is very prone to overfitting. That means an overly complex tree fits the training data, including its noise, almost perfectly, but predicts poorly on new data.
  • Decision tree learning generally uses a greedy algorithm which finds the locally optimal split at each node. As this process is repeated recursively for each node, the whole process ends up with a locally optimal rather than a globally optimal decision tree.
  • The result obtained from a decision tree can be very unstable. A little variation in the data can lead to a completely different tree. That’s why ensemble techniques like random forest were developed; they combine the results from a number of trees instead of relying on a single one.
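The overfitting problem can be seen in a small experiment. The sketch below uses synthetic data from make_classification (not the data used later in this article) with deliberately noisy labels, and compares an unrestricted tree with one limited to max_depth=3. The depth value is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y assigns random labels to 20% of the samples (noise)
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# One tree grown without limits, one restricted to three levels
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("unrestricted", deep_tree), ("max_depth=3", shallow_tree)]:
    print(name,
          "| depth:", model.get_depth(),
          "| train accuracy:", model.score(X_train, y_train),
          "| test accuracy:", model.score(X_test, y_test))
```

The unrestricted tree memorizes the training set, noise included, so the gap between its training and test accuracy is the thing to watch.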

Classification and Regression Tree (CART)

The decision tree has two main categories: classification tree and regression tree. Together these are called CART (Classification and Regression Tree). The term was introduced in 1984 by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone.

Classification

When the response is categorical in nature, the decision tree performs classification. Examples include the cases I gave before, such as whether a person is sick or not, or whether a product passes or fails a quality test. In all these cases the problem in hand is to assign each observation to a group.

The target variable can be binomial, that is with only two categories like yes-no, male-female, sick-not sick etc., or multinomial, that is with more than two categories. An example of a multinomial variable is the economic status of people. It can have categories like very rich, rich, middle class, lower-middle class, poor, very poor etc. The benefit of the decision tree is that it can handle both binomial and multinomial variables.

Regression

On the other hand, the decision tree is applied to regression problems when the target variable is continuous. An example is predicting the rainfall on a future date from other weather parameters. Here the target variable is continuous, so it is a regression problem.

Application of Decision tree with Python

Here we will use the scikit-learn package to implement the decision tree. The package has a class called DecisionTreeClassifier() which is capable of classifying both binomial (target variable with only two classes) and multinomial (target variable having more than two classes) variables.

Performing classification using decision tree

Importing required libraries

The first step is to import all the libraries we are going to use. The basic libraries for any kind of data science project are pandas, numpy, matplotlib etc. The purpose of these libraries is discussed in detail in the article simple linear regression with python.

# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

About the data

The example dataset I have used here for demonstration purpose is from kaggle.com. The data, collected by the “National Institute of Diabetes and Digestive and Kidney Diseases”, contains vital parameters of diabetes patients of Pima Indian heritage.

Here is a glimpse of the first ten rows of the data set:

Diabetes data set

The independent variables in the data set are several physiological parameters of a patient. The dependent variable indicates whether the patient is suffering from diabetes or not. The dependent column is binary: 1 indicates that the person has diabetes and 0 that the person does not.

# reading the data into a data frame
dataset = pd.read_csv('diabetes.csv')
# Printing data details
dataset.info()                  # quick overview of columns and data types
print(dataset.head())           # first few rows of the data
print(dataset.tail())           # last few rows of the data
print(dataset.sample(10))       # a random sample of 10 rows from the data
print(dataset.describe())       # summary statistics of the data
print(dataset.isnull().sum())   # number of null values in each column
Checking if the dataset has any null value
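A cell-by-cell table of True/False values is hard to read for a large data frame, so it usually helps to count the nulls per column. The sketch below uses a small made-up frame (the column names only mimic the diabetes data) so that it runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the real data
demo = pd.DataFrame({
    "Glucose": [148, 85, np.nan, 89],
    "BMI": [33.6, 26.6, 23.3, np.nan],
    "Outcome": [1, 0, 1, 0],
})

missing_per_column = demo.isnull().sum()       # count of nulls in each column
has_any_missing = demo.isnull().values.any()   # True if at least one null exists

print(missing_per_column)
print("Any missing value at all?", has_any_missing)
```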

Creating variables

As we can see, the data frame contains nine variables in nine columns. The first eight columns contain the independent variables. These are some physiological variables correlated with diabetes. The ninth column shows whether the patient is diabetic or not. So, here x stores the independent variables and y stores the dependent variable, the diabetes outcome.

x=dataset.iloc[:,:-1].values
y=dataset.iloc[:,-1].values

Performing the classification

To do the classification we need to import DecisionTreeClassifier() from sklearn. This classifier can handle binary variables, i.e. variables with only two classes, as well as multiclass variables.

# Use of the classifier
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x, y)
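The classifier above is fitted on the complete data, so it cannot tell us how well the tree predicts unseen cases. A minimal sketch of a hold-out check follows. It uses the breast cancer data bundled with scikit-learn as a stand-in for diabetes.csv, and the 80/20 split and random_state values are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in binary classification data shipped with scikit-learn
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0)

# Fit on the training part only, then score on the held-out part
demo_clf = DecisionTreeClassifier(random_state=0)
demo_clf.fit(X_tr, y_tr)
test_accuracy = accuracy_score(y_te, demo_clf.predict(X_te))
print("Held-out accuracy:", round(test_accuracy, 3))
```

The same pattern should carry over to the diabetes data by replacing X_demo and y_demo with x and y.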

Plotting the tree

Now that the model is ready we can plot the tree. The below line will create the plot.

tree.plot_tree(clf)

Generally the plot thus created is of very low resolution and gets distorted when used as an image. One solution to this problem is to save it in PDF format, which preserves the resolution.

# The decision tree creation
tree.plot_tree(clf) 
plt.savefig('DT.pdf')

Another way to get a high-resolution image of the tree is to use the Graphviz format, via export_graphviz() from tree.

# Creating better graph
import graphviz 
dot_data = tree.export_graphviz(clf, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("diabetes") 
Decision tree created using Graphviz

The tree represents the logic of classification in a very simple way. We can easily understand how the data has been classified and the steps to achieve that.
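If Graphviz is not installed, the same logic can be read as plain text with export_text() from sklearn.tree. The sketch below fits a deliberately small tree (max_depth=2, an arbitrary choice) on the iris data bundled with scikit-learn, used here only as a stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# One line per node: the split condition for decision nodes, the class for leaves
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indentation level in the printed text corresponds to one level of the tree, so the rules can be followed from the root down to a leaf.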

Performing regression using decision tree

About the data set

The dataset I have used here for demonstration purpose is from https://www.kaggle.com. It contains the height and weight of persons and a column with their gender. The original dataset has thousands of rows, but for this regression example I have used only the first 50 rows, containing data on 25 males and 25 females.

Importing libraries

In addition to the basic libraries we imported for the classification problem, here we need to import DecisionTreeRegressor() from sklearn.

# Import the necessary modules and libraries
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

Reading the dataset

I have already described the dataset used here. The below code reads the data and stores it in a data frame called dataset.

dataset=pd.read_csv('weight-height.csv')
print(dataset)

Here is a glimpse of the dataset

Dataset for random forest regression

Creating variables

As we can see, the dataframe contains three variables in three columns. Only the last two columns are of interest to us. We want to regress the weight of a person on their height. So, here the independent variable height is x and the dependent variable weight is y.

x=dataset.iloc[:,1:2].values
y=dataset.iloc[:,-1].values

Splitting the dataset

It is common practice to split the whole data set into training and testing data sets. Here we have set test_size to 20%, which means the training data set will consist of 80% of the total data. The test data set works as an independent data set when we need to test the regressor after it has been trained with the training data.

# Splitting the data for training and testing
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y, test_size=0.20, random_state=0)
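To check what the split does to a data set of 50 rows, here is a small sketch with placeholder numbers in place of the real heights and weights. The array names are hypothetical; only the row count matches the example described above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 placeholder rows, matching the row count described for this example
heights = np.arange(50, dtype=float).reshape(-1, 1)
weights = np.arange(50, dtype=float)

h_train, h_test, w_train, w_test = train_test_split(
    heights, weights, test_size=0.20, random_state=0)

print(h_train.shape, h_test.shape)   # 40 training rows and 10 test rows
```

Fixing random_state makes the shuffle reproducible, so the same rows land in the test set on every run.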

Fitting the decision tree regression

Here we have fitted decision tree regression with two different depth values to draw a comparison between them.

# Creating regression models with two different depths
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(x_train, y_train)
regr_2.fit(x_train, y_train)

Prediction

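The below lines of code will give predictions from both regression models, with their two different depth values, using a new set of values of the independent variable, X_test. Note that X_test is an evenly spaced grid of heights created for plotting, distinct from the x_test produced by the split above.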

# Making prediction
X_test = np.arange(50,75, 0.5)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)

Visualizing prediction performance

The below lines of code will generate a height vs weight scatter plot along with two prediction lines from the two regression models.

# Plot the results
plt.figure()
plt.scatter(x, y, s=20, edgecolor="black",
            c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue",
         label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("Height")
plt.ylabel("Weight")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
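The plot compares the two depths visually; a numerical comparison on the held-out rows is a natural complement. The sketch below uses made-up height and weight values (generated with an arbitrary linear trend plus noise, not the Kaggle data) and reports the mean squared error and R² for each depth.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Made-up height/weight pairs (not the Kaggle data)
rng = np.random.default_rng(0)
height = rng.uniform(55, 75, size=50).reshape(-1, 1)
weight = 4.0 * height.ravel() - 100 + rng.normal(0, 10, size=50)

h_tr, h_te, w_tr, w_te = train_test_split(
    height, weight, test_size=0.20, random_state=0)

# Fit one tree per depth and score it on the rows it has not seen
scores = {}
for depth in (2, 5):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(h_tr, w_tr)
    pred = model.predict(h_te)
    scores[depth] = (mean_squared_error(w_te, pred), r2_score(w_te, pred))
    print(f"max_depth={depth}: MSE={scores[depth][0]:.1f}, R2={scores[depth][1]:.2f}")
```

A deeper tree always fits the training rows at least as closely, but only the held-out scores show whether the extra depth helps on new data.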

Conclusion

In this post, you have learned about the decision tree and how it can be applied to classification as well as regression problems using scikit-learn in Python.

The decision tree is a popular supervised machine learning algorithm, frequently used by data scientists. Its simple logic and easy algorithm are the main reasons behind its popularity. As it is a white box algorithm, we can clearly understand how it does its work.

DecisionTreeClassifier() and DecisionTreeRegressor() of scikit-learn are two very useful classes for applying decision trees, and I hope you are confident about their use after reading this article.

If you have any question regarding this article or any confusion about its application in Python, post it in the comments below and I will try my best to answer.

Artificial Neural Network with Python using Keras library

Artificial Neural Network

An Artificial Neural Network (ANN), as its name suggests, mimics the neural network of our brain, hence it is artificial. The human brain has a highly complicated network of nerve cells to carry sensations to the designated sections of the brain. The nerve cells, or neurons, form a network and pass the sensation from one to another. Similarly, in an ANN a number of inputs pass through several layers of neuron-like units and ultimately produce an estimate.

Schematic diagram of Artificial Neural Network

Perceptron: the simplest Artificial Neural Network

When an ANN consists of only one neuron it is called a perceptron. A perceptron takes one or more inputs, computes their weighted sum and produces a single output. It is analogous to a neuron in our brain with its dendrites and axon.

Depending on your problem, there can be more than one neuron and even several layers of neurons. In that situation, it is called a multi-layer perceptron. In the above figure, we can see that there are two hidden layers. Generally we use ANNs with 2-3 hidden layers, but theoretically there is no limit.

Layers of an Artificial Neural Network

In the above figure you can see that the complete network consists of several layers. Before you start applying ANN, understanding these layers is essential. So, here is a brief idea about the layers an ANN has.

Input layer

The independent variables, having real values, are the components of the input layer. There can be more than one input variable, discrete or continuous. They may need standardization before being fed into the ANN if their scales are very different.

Hidden layer

The layers between the input and output are called hidden layers. Here the inputs get associated with some weights and the weighted sum of all these values is calculated.

The information passed from one layer of neurons acts as input for the next layer of neurons. The inputs propagate through the neural network, passing through the activation functions, and finally yield the output. During training, a cost function measures how far this output is from the true value.

Activation function

The weighted sum is then passed through an activation function, which has a very important role in ANN. This function controls the threshold for the output. Similar to a biological neuron, which fires only when the impulse exceeds a particular threshold value, a neuron in the ANN gives a particular output only when the weighted sum crosses a threshold value.
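The weighted sum and the activation can be written out for a single neuron in a few lines of numpy. This is only a sketch: the input values, weights and the bias term are made-up numbers, and the bias is an addition that the description above does not mention.

```python
import numpy as np


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs followed by an activation function."""
    z = np.dot(inputs, weights) + bias
    return activation(z)


def step(z):
    """Hard threshold: output 1 only when the weighted sum is positive."""
    return 1.0 if z > 0 else 0.0


def sigmoid(z):
    """Smooth threshold that squeezes any value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    """Passes positive values through and sets negative values to zero."""
    return max(0.0, z)


x_in = np.array([0.5, -1.0, 2.0])   # made-up inputs
w = np.array([0.4, 0.3, 0.1])       # made-up weights
b = -0.2                            # made-up bias (an assumption)

# The weighted sum here is about -0.1, i.e. below the threshold of zero
for fn in (step, sigmoid, relu):
    print(fn.__name__, neuron(x_in, w, b, fn))
```

With a weighted sum just below zero, the step and relu neurons stay silent, while the sigmoid returns a value slightly under 0.5.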

The output

This is the output of the ANN. The activation function of the final layer yields this output from the weighted sum of its inputs.

ANN: a deep learning process

ANN is a deep learning process, the burning topic of data science. Deep learning is basically a subfield of machine learning. You may be familiar with the machine learning process; if not, you can refer to this article for a quick working knowledge of it. In recent times deep learning has found application in almost all ambitious projects. From basic pattern recognition and voice recognition to face recognition, self-driving cars and high-end projects in robotics and artificial intelligence, deep learning is revolutionizing modern applied science.

Read about supervised machine learning here

ANN is a very efficient and popular process for pattern recognition. But the process involves complex computations and several iterations. The advent of high-end computing devices and machine learning technologies has made our task much easier than ever. Users and researchers can now focus on their research problem without taking the pain of implementing a complex ANN algorithm.

As time passes, easier-to-use modules are developed in various languages, encapsulating the complexity of such computations. “Keras” is one such framework in Python; it is built on popular frameworks like TensorFlow, Theano etc. and has made deep learning and artificial intelligence accessible to everyone.

Here is an exhaustive article on python and how to use it

We are going to use this high-level API, Keras, to apply ANN.

Application of ANN using Keras library

Importing the libraries

The first step is to import all the libraries we are going to use. The basic libraries for any kind of data science project are pandas, numpy, matplotlib etc. The purpose of these libraries is discussed in the article simple linear regression with python.

# first neural network with keras tutorial
import pandas as pd
import matplotlib.pyplot as plt   # needed for the plots further below
from keras.models import Sequential
from keras.layers import Dense

About the data

The example dataset I have used here for demonstration purpose has been downloaded from kaggle.com. The data, collected by the “National Institute of Diabetes and Digestive and Kidney Diseases”, contains vital parameters of diabetes patients of Pima Indian heritage.

Here is a glimpse of the first ten rows of the data set:

Diabetes data set for ANN

The independent variables in the data set are several physiological parameters of a patient. The dependent variable indicates whether the patient is suffering from diabetes or not. The dependent column is binary: 1 indicates that the person has diabetes and 0 that the person does not.

# reading the data into a data frame
dataset = pd.read_csv('diabetes.csv')
# Printing data details
dataset.info()                  # quick overview of columns and data types
print(dataset.head())           # first few rows of the data
print(dataset.tail())           # last few rows of the data
print(dataset.sample(10))       # a random sample of 10 rows from the data
print(dataset.describe())       # summary statistics of the data
print(dataset.isnull().sum())   # number of null values in each column
Checking if the dataset has any null value

Creating variables

As we can see, the data frame contains nine variables in nine columns. The first eight columns contain the independent variables, which are some physiological variables correlated with diabetes. The ninth column shows whether the patient is diabetic or not. So, here the independent variables are stored in x and the dependent variable, the diabetes outcome, is stored in y.

x=dataset.iloc[:,:-1].values
y=dataset.iloc[:,-1].values
print(x)
print(y)

Preprocessing the data

This is standard practice before we start analysing any data set, especially if the data set has variables with different scales. In this data too we have variables on completely different scales; some are fractions whereas others are large whole numbers.

To do away with such differences between the variables, data standardization is very effective. The preprocessing module of the sklearn package has a class called StandardScaler() which does the work for us.

# Standardizing the data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = sc.fit_transform(x)
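What StandardScaler() does can be verified on a tiny made-up array with two columns on very different scales. After the transformation each column should have mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales
raw = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

scaled = StandardScaler().fit_transform(raw)

print(scaled.mean(axis=0))   # column means, close to 0
print(scaled.std(axis=0))    # column standard deviations, close to 1
```

A common refinement, not applied in this article, is to fit the scaler on the training rows only and then transform the test rows with it, so that no information from the test set leaks into training.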

Create a heat map

Before we proceed with the analysis, we should have a thorough idea about the variables under study and their inter-relationships. A very handy way to get a quick view of the variables is to create a heat map.

The following code will make a heat map. The “seaborn” package has the required function to do this.

# Creating heat map for correlation study
import seaborn as sns
corr = dataset.corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.show()
Heat map for correlation study among the variables

The heat map is a very good visualization technique for quickly grasping the relation between variables. The colour shades indicate the correlation here. The lighter shades depict a high correlation and as the shades get darker the correlation decreases.

The diagonal elements of a correlation heat map are always one, as they are the correlation of a variable with itself. As expected, we can find some variables here with higher correlation, which was not possible to identify from the raw data. For example pregnancies and age, insulin and glucose, skin thickness have a higher correlation.

Splitting the dataset in training and test data

For testing purposes, we need to set aside a part of the complete dataset which will not be used for model building. The rule of thumb is to use 80% of the data for modelling and keep aside the rest. It will work as an independent dataset, which we use to test the fitted model’s performance.

# Splitting the data for training and testing
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y, test_size=0.20, random_state=0)

Here the data splitting has been performed with the help of the model_selection module of the sklearn library. This module has a built-in function called train_test_split which divides the dataset into two parts. The argument test_size controls the proportion of the test data. Here the test size is 0.2, so the test dataset will contain 20% of the complete data.

Modelling the data

So we have completed all the prerequisite steps before modelling the data. Here the response variable is binary, with 0 and 1 as output. A multilayer perceptron ANN is well suited to model such data. In this type of ANN, each layer is fully connected to the next and works as the input layer for the immediately following layer.

For using a multilayer perceptron, the Keras Sequential model is the easiest way to start. To create a sequential model we have used model = Sequential(). The activation function in the hidden layers is relu, the most commonly used one in Keras neural networks, and the output layer uses sigmoid so that the output lies between 0 and 1.

# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
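To get a feel for the size of this network, we can count its trainable parameters by hand. A fully connected layer has one weight per input-unit pair plus one bias per unit. The helper dense_params below is a hypothetical name introduced only for this arithmetic; it is not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of one fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units


# (inputs, units) of the three Dense layers defined above: 8 -> 12 -> 8 -> 1
layer_sizes = [(8, 12), (12, 8), (8, 1)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layer_sizes]

print(counts)        # parameters per layer
print(sum(counts))   # total trainable parameters
```

The total should agree with the figure that model.summary() reports for the same architecture.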

Compiling the model

Now that the model is defined, we compile it with the adam optimizer and the loss function called binary_crossentropy. As training proceeds over several iterations, we can track the model’s accuracy by passing ['accuracy'] to the metrics argument.

# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

While compiling the model, the two arguments loss and optimizer play an important role. The loss function generally depends on the particular problem you are addressing through ANN. For example, if you have a regression problem then the loss function you will typically use is Mean Squared Error (MSE).

In this case, as we are dealing with a binary response variable, the loss function is binary_crossentropy. If the response variable consists of more than two classes then the loss function should be categorical_crossentropy.
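To see what binary_crossentropy measures, it can be computed by hand with numpy. This is a simplified sketch of the formula, not the Keras implementation; the probabilities are made up, and the eps clipping is an assumption added to avoid taking log(0).

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_prob = np.clip(y_prob, eps, 1 - eps)   # keep the logarithms finite
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


y_true = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.8, 0.2])   # made-up good predictions
confident_wrong = np.array([0.1, 0.9, 0.2, 0.8])   # made-up bad predictions

print(binary_crossentropy(y_true, confident_right))
print(binary_crossentropy(y_true, confident_wrong))
```

The loss is small when the predicted probabilities agree with the true labels and grows quickly when the model is confidently wrong, which is exactly what the optimizer tries to reduce.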

Similarly, the optimization algorithm used here is adam. There are several others like RMSprop, Stochastic Gradient Descent (SGD) etc., and the choice affects how the model’s weights are updated during training, for example through the learning rate and momentum.

Fitting the model

Fitting the model again has two crucial parameters. Initializing them with suitable values determines to a great extent the model’s efficiency and performance. Here epochs decides how many passes will be made through the training set.

And batch_size, as the name suggests, is the number of input samples passed through the ANN at a time. It makes training more efficient, as the model does not have to process the whole input at once.

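# fit the keras model on the training set and track performance on the test set
# (validation_data is needed for the val_accuracy and val_loss curves plotted later)
train = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=100, batch_size=10)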

Here I have set batch_size to 10, so 10 samples enter the network at a time, and the total number of epochs to 100. See the below output screenshot, where the first 10 epochs are captured with the model’s accuracy at every epoch.
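The relation between epochs, batch_size and the number of weight updates is simple arithmetic. The sketch below assumes 614 training rows, which is what an 80% split of a 768-row file would give; the row count is an assumption, as it is not stated in this article.

```python
import math

n_train = 614      # assumption: 80% of a 768-row dataset
batch_size = 10
epochs = 100

# The last batch of an epoch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)
```

So a smaller batch_size means more weight updates per epoch, and more epochs mean more passes over the same rows.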

Evaluating the model

Now that the model is compiled and trained, we can check its accuracy. For this, Keras has the model.evaluate function, which here gives an accuracy of 68.24%. But keep in mind that this accuracy can vary each time the ANN is run.

# evaluate the keras model
_,accuracy = model.evaluate(x_train, y_train)
print('Accuracy: %.2f' % (accuracy*100))

Prediction using the model

Now the model is ready for making predictions. The values of x_test are provided as ANN inputs.

# make probability predictions with the model
predictions = model.predict(x_test)
# round the probabilities to class labels 0 or 1
rounded = [round(p[0]) for p in predictions]
print(rounded[:10])
print(y_test[:10])

I have printed here both the predicted results for x_test and the original y_test values (first 10 values only), and the prediction is correct for all of them.

Comparing the predicted values and the original values of test set (first 10 values only)

Visualizing the model’s performance

# Visualizing training process with validation and accuracies
import matplotlib.pyplot as plt
plt.plot(train.history['accuracy'])
plt.plot(train.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
plt.plot(train.history['loss']) 
plt.plot(train.history['val_loss']) 
plt.title('Model loss') 
plt.ylabel('Loss') 
plt.xlabel('Epoch') 
plt.legend(['Train', 'Test'], loc='upper left') 
plt.show()

Conclusion

So we have just completed our first deep learning model to solve a real-world problem. This was a very simple problem with a small data size, just for demonstration purpose. But the basic principle of fitting an ANN is the same everywhere, irrespective of data complexity and size. The important thing is that you know how it works.

Future scope

We have obtained here an ANN accuracy of 68.24%, which leaves a lot of scope for improvement. You can start by tweaking the number of layers in the network, the optimizer and loss function used when compiling the model, and also the epochs and batch_size. Changing these parameters may result in higher accuracy.

For example, in this particular case, if we increase the number of epochs from 100 to 200 the accuracy increases to 77%. It is quite a jump in model performance. Likewise, simple changes in other parameters can also be very helpful.

Where possible, using more sample data to train the model is also an effective way of increasing the model’s prediction performance. So, once you have a defined model in your hand, there is always ample scope for improving it.

I hope this article will help you take a big step forward towards the vast, dynamic and very interesting world of deep learning and AI.

Logistic regression: classify with python

Logistic regression

Logistic regression is a very common and widely used supervised classification process. When we have categorical data in hand to predict, we tend to apply logistic regression. Classification is a very popular prediction technique. Almost 70% of real-world prediction problems involve categorical variables and are hence amenable to classification.

Read about supervised machine learning here

This article covers the basic idea of logistic regression and its implementation with Python. The reason for choosing Python to apply logistic regression is simply that Python is the most preferred language among data scientists, and it is likely to remain so in the near future.

Here is an exhaustive article on python and how to use it

Why logistic regression not “classification”?

So why is the name “regression” when it performs classification? It is a very natural question to ask. The answer is that it is basically a regression process, which becomes a classification process when a decision threshold is applied to the predicted probability. Deciding a threshold for the classification is very important, and tricky too.

We need to decide the decision threshold depending on the particular case in hand. There can be four types of outcome in classification problems: “true positive”, “true negative”, “false positive” and “false negative” (we will discuss them in a bit while discussing the confusion matrix). We have to limit the probability of one type of error, depending on its severity, while accepting more of another.

For example, take the case of a severe crime where it has to be decided whether the person should be hanged or not. It is a problem of binary classification with two outputs, guilty or not guilty. Here the true positive case is the person found guilty when he actually has committed the crime. On the other hand, the false positive is the person found guilty when he has not committed the crime.

So, no doubt the false positive case here is very serious and should be avoided at any cost. Hence, while fixing the decision threshold, you should try to reduce the probability of a false positive, even at the cost of missing some true positive cases.
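How the threshold trades one kind of error for another can be shown with a handful of made-up numbers. In the sketch below, prob_guilty and truth are hypothetical values, not model output; an innocent person predicted guilty counts as a false positive and a guilty person predicted innocent as a false negative.

```python
import numpy as np

# Hypothetical predicted probabilities of "guilty" and the true labels (1 = guilty)
prob_guilty = np.array([0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.10])
truth = np.array([1, 1, 0, 1, 0, 1, 0])


def count_errors(threshold):
    """Return (false positives, false negatives) at the given threshold."""
    predicted = (prob_guilty >= threshold).astype(int)
    false_pos = int(np.sum((predicted == 1) & (truth == 0)))
    false_neg = int(np.sum((predicted == 0) & (truth == 1)))
    return false_pos, false_neg


for t in (0.5, 0.7, 0.9):
    print("threshold", t, "-> (false positives, false negatives):", count_errors(t))
```

Raising the threshold removes the wrongful conviction in this toy example, but more guilty persons go free, so the right threshold depends on which error is costlier.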

Logistic regression the basic idea

Though this process is used for classification, it is basically a regression process applied to a discrete response. Unlike linear regression, which predicts the value of a continuous variable, logistic regression predicts the probability of the positive outcome of a binary response variable.

Linear regression follows a linear function, whereas logistic regression passes the linear combination of the predictors through a sigmoid function.

Equations for linear regression and logistic regression (figure)
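As a rough illustration of the difference, a linear model computes b0 + b1·x directly, while a logistic model passes the same quantity through the sigmoid 1 / (1 + e^(-z)), so the output always lies between 0 and 1. The sketch below uses only the standard library; the coefficient values are arbitrary assumptions, not estimates from any data.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, intercept, slope):
    """Logistic model: the sigmoid of a linear function of x."""
    return sigmoid(intercept + slope * x)


# Arbitrary illustrative coefficients
b0, b1 = -3.0, 1.5
for x in (0.0, 2.0, 4.0):
    linear_part = b0 + b1 * x
    print(x, linear_part, round(predict_probability(x, b0, b1), 3))
```

The linear part grows without bound, whereas the predicted probability flattens out near 0 and 1, which is what makes it usable as a class probability.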

Classification types in logistic regression

Binary/binomial classification

In binary classification, the response under study can generally be classified into two groups. Examples of binary classification problems are almost everywhere in the real world.

Be it a medical test to identify whether a patient is suffering from a disease, a quality control test to declare a product pass or fail, or simply predicting whether it will rain or not, all of them are binary classification problems. The response can be of only two types, positive (1) or negative (0), corresponding to a duality like “yes-no”, “pass-fail”, “male-female”, “win-loss” etc.

Multinomial classification

Here the response variable has more than two categories and they have no order. For example, employees can belong to Group A, Group B and Group C, which cannot be arranged in any ascending or descending order.

A good example of such data is the very famous iris data set of Sir Ronald A. Fisher, regarded as the father of statistics for his remarkable contributions. It is a very popular multivariate dataset and has long been used as an example data set for pattern recognition problems.

The data set contains information on 3 species of iris plant with 50 instances of each species. The dependent variable here is the species of iris plant, which has three classes without any order.

Ordinal classification

In this case, as with a multinomial variable, the response variable has more than two classes. But here the classes can be ranked in some order, like the financial status of a citizen: “very poor”, “poor”, “lower middle class”, “middle class”, “rich”, “very rich”.

Any prediction problem may be a problem of classification or regression. Which prediction tool you use depends on the type of the response variable. If the response variable is categorical with only two possible values, then binary classification is the solution. On the other hand, if the response is a continuous variable, then we have to use regression for prediction.

For example, predicting the price of a product from its different specifications is a regression problem. But when we have to determine whether a customer will buy the product or not, it is certainly a problem of binary classification, because the response is discrete with only two possible values: “buy” and “not buy”.

Application of logistic regression with python

So, I hope the theoretical part of logistic regression is already clear to you. Now it is time to apply this regression process using Python.

So, let's start coding…

About the data

We already know that logistic regression is suitable for categorical data. The example dataset used here for demonstration has been downloaded from kaggle.com. The data, collected by the “National Institute of Diabetes and Digestive and Kidney Diseases”, contains vital parameters of persons of Pima Indian heritage.

Here is a glimpse of the first ten rows of the data set:

Diabetes data set for logistic regression

The independent variables of the data set are several physiological parameters of a person. The dependent variable indicates whether the person is suffering from diabetes. This column is binary: 1 indicates that the person has diabetes and 0 that the person does not.

So, our task is to classify using logistic regression and to predict as accurately as possible whether a person is a diabetes patient from the other vital parameters.

Importing the libraries

The first step is to import all the libraries we are going to use. The basic libraries for any data science project are pandas, numpy, matplotlib etc. The purpose of these libraries is discussed in the article on simple linear regression with Python.

# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Reading the dataset

The dataset used here has already been described above. The code below imports the data and stores it in a dataframe called dataset.

dataset=pd.read_csv('diabetes.csv')
print(dataset)

Here is a glimpse of the dataset

Diabetes data frame in python

Creating variables

As we can see, the data frame contains nine variables in nine columns. The first eight columns contain the independent variables, which are physiological variables correlated with diabetes symptoms. The ninth column shows whether the patient is diabetic or not. So, here the eight independent variables are stored in x and the dependent variable, the diabetes outcome, is stored in y.

# all columns except the last one are the independent variables
x=dataset.iloc[:,:-1].values
y=dataset.iloc[:,-1].values
print(x)
print(y)

Splitting the dataset in training and test data

For testing, we need to set aside a part of the complete dataset that will not be used for model building. The rule of thumb is to use 80% of the data for modelling and keep the rest aside. The held-out part works as an independent dataset on which we test the fitted model's performance.

# Dividing the dataset into training and testing datasets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.2, random_state=0)

Here the data splitting has been performed with the help of the model_selection module of the sklearn library. This module has a built-in function called train_test_split which automatically divides the dataset into two parts. The argument test_size controls the proportion of the test data. Here the test size is 0.2, so the test dataset will contain 20% of the complete data.
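To get a feel for what such a split does, here is a minimal pure-Python sketch. It is not scikit-learn's implementation; the function name simple_train_test_split and the toy data are assumptions made only for illustration.

```python
import random


def simple_train_test_split(rows, test_size=0.2, seed=0):
    """Shuffle the row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_test = round(len(rows) * test_size)     # number of held-out rows
    test_idx = set(indices[:n_test])
    train = [rows[i] for i in range(len(rows)) if i not in test_idx]
    test = [rows[i] for i in range(len(rows)) if i in test_idx]
    return train, test


# Toy data: ten observations identified by their row number
rows = list(range(10))
train_rows, test_rows = simple_train_test_split(rows, test_size=0.2, seed=0)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as random_state in train_test_split: the same rows land in the test set every time the code is run.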

Application of logistic regression

Here we will be using the LogisticRegression class from scikit-learn.

# Importing the logistic regression class and fitting the model
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(x_train, y_train)

After importing LogisticRegression, we will create an instance of the class and then use it to fit the logistic regression on the training dataset.

Predicting using the test data

# Using the fitted model to predict using the test data
y_pred=model.predict(x_test)

As the model has been trained on the training data set, we will use it to get predictions for the test data set. The fitted model generates the predictions, called y_pred, from x_test. We already know the original values corresponding to x_test, which are in y_test, so we can check how accurate the prediction is.

Calculating fit statistics

# Calculating different statistics to evaluate model fit
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision score:", metrics.precision_score(y_test, y_pred))
print("Recall score:", metrics.recall_score(y_test, y_pred ))

scikit-learn also has a module called metrics which has some useful functions to calculate fit statistics like accuracy score, precision score, recall score etc.

Model validation statistics

Here we have all the three statistics calculated. The accuracy score of 0.82 suggests a good classification: out of every 10 observations, the model classifies about 8 correctly.

The precision and recall scores are also good measures of the classification process. The precision score measures the percentage of positive predictions that are correct. In this case, the precision score indicates that if the logistic regression, using all the physical parameters of a person, predicts that he/she has diabetes, then there is a 76% chance that the prediction is correct.

The recall score of 61% says that, of the diabetes patients actually present in the test data set, the classification process identifies 61%.

You can further generate a more detailed report on the classification performance using the classification_report() function from scikit-learn. See below…

# Detailed classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
Detailed classification report

Creating confusion matrix

Creating a confusion matrix is also an effective way to judge the model. In this case, a 2×2 matrix holds the true positive, true negative, false positive and false negative counts in its four quadrants.

A confusion matrix example

The code below creates the confusion matrix using the metrics module of the scikit-learn library.

# Creating confusion matrix to check the accuracy of prediction
# import the metrics module
from sklearn import metrics
conf_matrix = metrics.confusion_matrix(y_test, y_pred)
conf_matrix
Confusion matrix

So, here is the desired confusion matrix. If we compare this matrix with the example confusion matrix above, we can say that the logistic regression has produced 98 true negative, 9 false positive, 18 false negative and 29 true positive results.

Now, what do they mean? The terms are somewhat technical, so let me explain them with respect to this result. A true negative is a correct prediction of 0, and here there are 98 of them. Likewise, in 29 instances the prediction of 1 is correct; these are the true positives. The number of false positives is 9, that is, 9 predictions of 1 are wrong. Lastly, 18 predictions of 0 are wrong; these are the false negatives.
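The fit statistics reported earlier can be recomputed by hand from these four counts. The short sketch below only does the arithmetic; the variable names are chosen for this illustration.

```python
# Counts read from the confusion matrix above
tn, fp, fn, tp = 98, 9, 18, 29

total = tn + fp + fn + tp          # number of test observations
accuracy = (tp + tn) / total       # share of all predictions that are correct
precision = tp / (tp + fp)         # share of predicted 1s that are really 1
recall = tp / (tp + fn)            # share of actual 1s that were found
specificity = tn / (tn + fp)       # share of actual 0s that were found

print(total)
print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(specificity, 3))
```

The results (about 0.825, 0.763 and 0.617) agree with the accuracy, precision and recall scores discussed above.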

#Creating a heatmap for the confusion matrix
cm=conf_matrix
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='red')
plt.show()

Creating a ROC curve

A Receiver Operating Characteristic (ROC) curve is a good visualization technique to judge the efficiency of classification. The curve plots the true positive rate against the false positive rate at different decision thresholds and hence shows the trade-off between sensitivity and specificity.

# Creating Receiver Operating Characteristic (ROC) curve
y_pred_proba = model.predict_proba(x_test)[:,1]  # predicted probability of class 1
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
ROC curve for logistic regression

Here you can see an AUC score of 0.87, which suggests a good classification. The score varies between 0 and 1. A score of 1 suggests perfect classification, whereas a score of around 0.5 means the classifier does no better than random guessing.
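One way to interpret the AUC is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by brute force over all positive/negative pairs; the function name auc_by_pairs and the toy scores are assumptions for illustration, and for real data metrics.roc_auc_score remains the practical choice.

```python
def auc_by_pairs(actual, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities of class 1
actual = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(actual, scores))  # 3 of the 4 pairs are ranked correctly: 0.75
```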

Conclusion

Logistic regression is a very uncomplicated classification technique based on simple logic, so the computational resources it requires are comparatively small. Another plus is that it can be applied without feature scaling, although scaling often helps the solver to converge. So, no surprise that logistic regression has always been a favourite choice among data scientists for classification problems.

But as a flip side of such simplicity, logistic regression is not very efficient when the response variable has many classes. It can also overfit when there are many features, and it cannot capture non-linear relationships without extra feature engineering. There are other machine learning techniques like Naive Bayes, support vector machines, random forest, decision trees etc. which are often more capable than logistic regression in handling complex data.

Random forest regression and classification using Python

Random forest regression and classification

As you all know, in today's world of data explosion, machine learning plays a crucial role in analysing huge amounts of data. Several machine learning algorithms make it easier to handle large databases. The random forest algorithm is one of them and can be regarded as one of the most important and efficient supervised machine learning techniques.

Random forest is an ensemble learning technique: it makes a more accurate prediction by combining several models instead of relying on only one.

The speciality of the random forest is that it is applicable to both regression and classification problems. When the response variable is categorical, it is a classification problem; when the response is continuous, we should use random forest regression.

Random forest and decision tree

Random forest is a collection of decision trees where each decision tree is trained on a different sample of the data. Adding more decision trees generally makes the result more robust and stable, although the improvement levels off beyond a point. It is just as we consider a forest robust if it has many trees.

Random forest with n number of decision trees

Random forest makes its final prediction from the predictions obtained from each of the decision tree models, to overcome the weakness of a single decision tree model. In this sense, the random forest is a bagging type of ensemble technique.

Now to understand what bagging is, we need to know a little about the ensemble method.

Ensemble method

The random forest provides a much more precise result mainly because it is a kind of ensemble method, which combines more than one model at a time to improve the accuracy of the prediction.

A schematic diagram of ensemble method

Bagging

The full name is Bootstrap Aggregation. It is essentially a random sampling technique with replacement: once a sample unit is selected, it is put back and can be selected again. This method works best with algorithms which tend to have high variance, like the decision tree algorithm.

Bagging fits the different models separately and, for the final prediction, aggregates the estimates of all the models without favouring any of them.
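The two ingredients of bagging, sampling with replacement and aggregating the individual predictions, can be sketched in a few lines of pure Python. The helper names bootstrap_sample, majority_vote and average, as well as the toy predictions, are assumptions for illustration and not part of any library.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Aggregate class predictions from several models (classification)."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Aggregate numeric predictions from several models (regression)."""
    return sum(predictions) / len(predictions)


rng = random.Random(0)
data = list(range(10))

# Each model in the ensemble would be trained on its own bootstrap sample
sample = bootstrap_sample(data, rng)
print(sample)

# Made-up predictions from five classifiers and three regressors
print(majority_vote([1, 0, 1, 1, 0]))
print(average([100.0, 102.0, 98.0]))
```

Because the sampling is done with replacement, a bootstrap sample usually contains some observations more than once and leaves others out, so every tree sees a slightly different dataset.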

The other ensemble modelling technique is:

Boosting

As an ensemble learning method, boosting also combines a number of models for prediction. It assigns weights so that a weak learning algorithm becomes stronger, thus improving the prediction. The models are built one after another, and each one learns from the errors of the previous ones to boost the overall model performance.

In the case of decision tree, the main problem is that the prediction is hugely dependent on the training dataset. As soon as the training data changes, the prediction result also differs. And many a time the decision tree also suffers from the problem of overfitting.  

Advantages of random forest

Different modelling approaches have their own merits and demerits. The beauty of this modelling approach is that it is very efficient in handling tabular data of both numerical and categorical nature, provided that a categorical variable has no more than about one hundred categories.

It is a single algorithm which is capable of performing both classification and regression tasks depending on the nature of the data. 

Besides, as it combines a number of decision trees in its process, the prediction becomes much more accurate. If we imagine a decision tree as a single tree, then the random forest is literally a forest comprising many decision trees, hence the name random forest.

Random forest is capable of handling large databases and thousands of input variables.

The method can also deal effectively with missing observations in the dataset.

Application of random forest for regression using Python

This is what you must be waiting for: using Python libraries to apply random forest to your data. So let's start coding. We will start with random forest regression on continuous data, and then we will take an example of categorical data and apply the random forest classification technique.

The random forest regression algorithm of the scikit-learn library is a very popular ensemble modelling technique. We will use the RandomForestRegressor class here to perform the regression.

About the data set

The dataset used here for demonstration is downloaded from https://www.kaggle.com. It contains the height and weight of persons and a column with their gender. The original dataset has thousands of rows, but for this regression I have used only the first 50 rows, containing data on 25 males and 25 females.

So, let’s jump to the most fun part of the article, that is coding with python:

Importing libraries

The first step is to import all the libraries we are going to use. The basic libraries for any data science project are pandas, numpy, matplotlib etc. The purpose of these libraries is discussed in the article on simple linear regression with Python.

# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Reading the dataset

The dataset used here has already been described above. The code below imports the data and stores it in a dataframe called dataset.

dataset=pd.read_csv('weight-height.csv')
print(dataset)

Here is a glimpse of the dataset

Dataset for random forest regression

Creating variables

As we can see, the dataframe contains three variables in three columns. We are interested in only the last two columns. We want to regress the weight of a person on his/her height. So, here the independent variable height is stored in x and the dependent variable weight is stored in y.

x=dataset.iloc[:,1:2].values
y=dataset.iloc[:,-1].values
print(x)
print(y)

Fitting random forest regression

The code below uses the RandomForestRegressor class of sklearn to regress weight on height. Once the fit is ready, I have used it to predict for a value not used in the fitting process. The predicted weight of a person with height 45.8 is 100.50.

# Application of random forest regression  
from sklearn.ensemble import RandomForestRegressor # this is the required algorithm for the task 
regressor = RandomForestRegressor(n_estimators = 100, random_state = 0) 
  
# fitting the random forest regression with the data
regressor.fit(x, y)  
#predicting the output
Y_pred = regressor.predict(np.array([45.8]).reshape(1, 1))  
Y_pred
The predicted value for height 45.8

Creating a fit plot with the predicted values

The following code is to visualize the prediction result against the original values. This is a way through which we can visualize how good the regression is performing.

# Creating a plot with the predicted result
X_grid = np.arange(x.min(), x.max(), 0.01)  # scalar bounds taken from the 2-D array x
  
# Making the one dimensional X_grid a two dimensional variable                  
X_grid = X_grid.reshape((len(X_grid), 1)) 
  
# Create a scatter plot with the original variables
plt.scatter(x, y, color = 'blue')   
  
# Creating a line with the predicted data
plt.plot(X_grid, regressor.predict(X_grid),  
         color = 'blue')  
plt.title('Random Forest Regression') 
plt.xlabel('Height') 
plt.ylabel('Weight') 
plt.show()

So, here is the regression fit plot.

Fit plot for random forest regression

Application of random forest for classification using Python

So, we learned about random forest regression and how we can implement it with Python. Now it is time to implement random forest classification. The same scikit-learn library we used for regression also has a very efficient algorithm for performing this classification process. Here we will apply the RandomForestClassifier class of this library.

So, let’s start coding to perform classification using random forest algorithm.

About the data set

The data set used here is the very famous iris data set of Sir Ronald A. Fisher, regarded as the father of statistics for his remarkable contributions. It is a very popular multivariate dataset and has long been used as an example data set for pattern recognition problems.

The data set contains information on 3 species of iris plant with 50 instances of each species. One of the classes is linearly separable from the other two, while the other two are not linearly separable from each other. The dependent variable here is the species of iris plant and the four independent variables are sepal length, sepal width, petal length and petal width measured in cm.

The idea behind the data set is that the species of any iris plant can be identified from these four variables describing the flower characteristics. Here we are going to use the random forest classification algorithm to classify the data, and then use the fitted classification model to predict the species of an unknown iris plant from the independent variables.

So, let's start coding…

Importing libraries

The first step is to import all the libraries we are going to use. The basic libraries for any data science project are pandas, numpy, matplotlib etc., and with them the sklearn library for the random forest classification algorithm.

# importing libraries
import pandas as pd # for dataframe operations
import numpy as np # for matrix operations
from sklearn.model_selection import train_test_split # for splitting the dataset for training and testing dataset
from sklearn import datasets #importing the sklearn library for the iris dataset
from sklearn.ensemble import RandomForestClassifier # for applying random forest classification

Loading the dataset

The iris dataset, being a popular example dataset, is already provided with the sklearn library. We need to load the dataset into our workspace before using it. I am storing it under the name dataset.

# loading the iris dataset
dataset=datasets.load_iris() 

Now, to check the dataset, we look at the target and the features, i.e. the classes of the dependent variable and the names of the independent variables. Here we print this information.

print(dataset.target_names) #printing the target names
print(dataset.feature_names)#printing the feature names
the output view

Storing the data into a dataframe

The data is loaded into the workspace, but a dataframe is more convenient for applying other data analysis functions. So here let's store the data in a dataframe named test.

# creating a dataframe from the dataset
test=pd.DataFrame({'sepal length':dataset.data[:,0],
                  'sepal width':dataset.data[:,1],
                  'petal length':dataset.data[:,2],
                  'petal width':dataset.data[:,3],
                   'species':dataset.target})
test

Below is a view of a few rows of the newly created dataframe of dimension 150 × 5.

Dataset for random forest classification
View of the dataframe containing the iris dataset

Creating dependent and independent variables

To apply the classification algorithm, first of all we need the dependent and independent variables. So here we store these variables in x and y, fetching the data from the dataframe test. Note that x in the code below holds three of the four measurements (petal length is left out).

Once the two variables x and y hold the independent and dependent values respectively, we need to split them. This splitting creates training and testing datasets with 80% and 20% of the total data respectively.

# Dividing the data for training and testing 
x=test[['sepal length','sepal width', 'petal width']]
y=test['species']
x_train, x_test,y_train, y_test=train_test_split(x,y,test_size=0.2, random_state=0)

Application of Random Forest Classification

The below code does the main task of classifying the data using the RandomForestClassifier() of sklearn library. Then a variable pred is created to store the predicted values applying the classification fit on the test dataset.

# applying RandomForest classification algorithm
classify=RandomForestClassifier()
classify.fit(x_train, y_train)
pred=classify.predict(x_test)

Checking the accuracy of the classification fit

The sklearn library also has a function called accuracy_score() which tells how accurate the classification is. Here the accuracy value we get is 0.93, which is quite satisfactory.

# testing the accuracy of the result
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, pred))

Support Vector Regression using Python

Support Vector Regression using Python

Support vector regression (SVR) is a supervised machine learning technique. Though the underlying method is mainly popular for classification problems, where it is known as the Support Vector Machine, it is well capable of performing regression analysis too. The main emphasis of this article is to implement support vector regression using Python.

Python is chosen for this application because it is already among the most popular general-purpose programming languages among data scientists. Python is an old language that came into existence during the 90s, but it took decades for data science enthusiasts to pick it as their favourite tool. Around 2010 it started to gain popularity very rapidly.

When we use a support vector machine for a classification problem, it finds a hyperplane that separates the different classes existing in the data. On the other hand, if it is a regression problem, then the hyperplane is rather a continuous line predicting the response for some known predictors.

Support Vector Regression and hyperplane

See the above figure: here the two classes of observations, the red and blue classes, are separated using a hyperplane. It looks very easy, doesn't it? But sometimes a simple straight line is not enough to separate them. See the below figure.

In this case, no straight line can completely separate all the points. So here we have to create a third dimension.

As a new third axis has been introduced, we can see that the classes can now be separated easily. Now how will it look if the figure is converted back to its two dimensional version? See it below.

So, a curved boundary has now separated the classes very effectively. This is what a support vector machine does for classification: it finds a hyperplane to separate the points, and then any new point gets assigned its class depending on which side of the hyperplane it resides.
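The idea of adding a third dimension can be shown with a tiny sketch. Here the extra axis is taken to be z = x² + y², which is only one possible choice and an assumption of this example; the points and the function name lift are made up for illustration.

```python
def lift(point):
    """Map a 2-D point (x, y) to 3-D by adding z = x^2 + y^2."""
    x, y = point
    return (x, y, x * x + y * y)


# Made-up data: one class sits near the origin, the other surrounds it,
# so no straight line in the (x, y) plane can separate them
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

inner_z = [lift(p)[2] for p in inner]
outer_z = [lift(p)[2] for p in outer]

# In 3-D the flat plane z = 1 separates the two classes
threshold = 1.0
print(all(z < threshold for z in inner_z), all(z > threshold for z in outer_z))
```

Projected back to two dimensions, the plane z = 1 becomes the circle x² + y² = 1, which is the curved boundary described above.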

How SVR is different from traditional regression?

It is a very basic question. Why should one go for support vector regression? How is it different from the traditional way of doing regression, i.e. the OLS (Ordinary Least Squares) method?

In the OLS method our purpose is to minimize the error as much as possible. Here we try to find the line which has the least squared distance from all the points. In mathematical notation, the line should fulfil the following condition:

$$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Where $y_i$ is the observed response and $\hat{y}_i$ is the predicted response. So, the line should produce the minimum value for the sum of squares of the differences between these two values.

But support vector regression allows the user to select a range within which the error is tolerated. So, the hyperplane/line lies within the range set by the researcher. This range is enclosed by two decision boundaries.

So, in the above figure the green line in between is the hyperplane. The two black lines at the same distance from the hyperplane limit the error of the prediction. The task of support vector regression is to find the hyperplane with the maximum number of points between these two decision boundaries.
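The contrast between the two approaches can be made concrete with the loss each one penalizes. The sketch below compares the OLS sum of squared errors with the epsilon-insensitive loss commonly used in support vector regression, where errors smaller than a chosen margin epsilon cost nothing. The data and the value epsilon = 0.5 are arbitrary assumptions for illustration.

```python
def squared_error(y_true, y_pred):
    """OLS criterion: every deviation is penalized, by its square."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred))


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5):
    """Deviations inside the +/- epsilon band around the line cost nothing."""
    return sum(max(0.0, abs(y - p) - epsilon) for y, p in zip(y_true, y_pred))


# Made-up observed and predicted values (errors: 0.2, 1.0 and 0.4)
y_true = [3.0, 5.0, 7.0]
y_pred = [3.2, 4.0, 7.4]

print(round(squared_error(y_true, y_pred), 2))
print(epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5))
```

Only the second point lies outside the band, so it alone contributes to the epsilon-insensitive loss, whereas all three points contribute to the squared error.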

I think the theoretical idea discussed above will give you a clear enough idea about what is support vector regression and what purpose it serves. With this knowledge we will now dive into its implementation part.

Application of Support Vector Regression using Python

So let's start our main business, the application of Support Vector Regression using Python. To start coding we have to call the same libraries as we used in Simple Linear Regression and Multiple Linear Regression before.

Calling the libraries

We have to import pandas, numpy, matplotlib and seaborn.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

Importing the dataset

Here I have used an imaginary database which contains data on tree total biomass above the ground and several other tree physical parameters like tree commercial bole height,  diameter, height, first forking height, diameter at breast height, basal area. Tree biomass is the dependent variable here which depends on all other independent variables.

Here is a glimpse of the database:

Dataset for  regression

So here the dependent variable is total_biomass(kg) and we will regress it using all the independent variables here.

dataset=pd.read_csv('tree.csv')
dataset
Glimpse of the database

Describing the dataset

To get a first-hand idea about the data in hand, the describe function of pandas is very useful. The basic descriptive statistics help us to know the data well.

# take a look of the dataset
dataset.describe()
Descriptive statistics of the dataset

Removing the rows with missing values

This is an important step before you start analysing your data. A raw dataset may contain rows with missing values, and these are a big problem during analysis. Thankfully pandas has a very useful function called dropna() which removes all the rows with missing values.

dataset.columns
#printing the number of rows and columns of the dataset
print(dataset.shape)
# removing the rows with missing values
dataset.dropna(inplace=True)
# print the rows and columns again to see if the number of rows has changed
print(dataset.shape)
print(dataset.head(5))

In the output below, the number of rows and columns of the dataset is displayed twice. The values are the same in both cases because the dataset does not have any missing values. So, the number of rows is the same before and after applying dropna().

If there had been missing values, the number of rows in the second case would have been smaller.

Producing a heatmap

A heatmap is a very good way to get an idea about the relationship between the variables. The seaborn library has a function for producing a heatmap where, with the default colour map, the colour goes from a darker shade to a lighter one as the correlation between the variables increases.

# producing a heatmap to show the correlation between the variables
fig, ax = plt.subplots(figsize=(10,10))
sn.heatmap(dataset.corr(), annot=True, fmt='.1f', ax=ax)
plt.show()
Heat map of the variables showing the correlation between them

Creating variables from the dataset

If you check the data set, the last column holds the dependent variable and the rest are all independent variables. So, here I have stored all the independent variables in x and the dependent variable in y.

Here x is a two dimensional array whereas y is one dimensional. The scaler used in the next step expects two dimensional input, so the function reshape() has been used to make y a two dimensional array.

x=dataset.iloc[:,: -1].values
y=dataset.iloc[:,-1].values
# to convert the one dimensional array to a two dimensional array
y=y.reshape(-1,1)

Feature scaling of the variables

Before using the variables in support vector regression, they need to be feature scaled. The following code is for transforming both the variables.

# Feature scaling
from sklearn.preprocessing import StandardScaler
std_x=StandardScaler()
std_y=StandardScaler()
x2=std_x.fit_transform(x)
y2=std_y.fit_transform(y)
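To make clear what the scaler does, here is a small sketch on made-up numbers (not the tree data). It shows that the scaled values have zero mean and unit standard deviation, and that inverse_transform() brings scaled values back to the original units, which is useful if predictions made on the scaled response need to be reported in real units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical response values, reshaped to two dimensions as the scaler expects
y = np.array([10.0, 20.0, 30.0, 40.0]).reshape(-1, 1)

scaler = StandardScaler()
y_scaled = scaler.fit_transform(y)            # (value - mean) / standard deviation
y_back = scaler.inverse_transform(y_scaled)   # back to the original units

print("mean after scaling:", y_scaled.mean())
print("std after scaling:", y_scaled.std())
print(y_back.ravel())
```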

Fitting the Support Vector Regression

Here comes the most important part of the coding, where we will perform Support Vector Regression using the SVR class of the svm module of the sklearn library.

# fitting SVR 
from sklearn.svm import SVR
regressor= SVR(kernel='rbf')
regressor.fit(x2,y2.ravel()) # SVR expects a one-dimensional target

Visualizing the prediction result of SVR

As we get the model, the next step is to use the model for prediction.

# visualizing the model performance
plt.scatter(x2[:,0],y2,color='red')
plt.scatter(x2[:,0],regressor.predict(x2),color='blue')
plt.title('Prediction result of SVM')
plt.xlabel('Tree CBH')
plt.ylabel('Tree Biomass')

For plotting the predicted output I have selected the variable tree CBH. In the scatter diagram, the red points represent the observed values and the blue ones are the predicted values. The predicted values plotted against the independent variable show a close match with the observed values. So, we can conclude that the model performs well enough for predicting Tree Biomass based on different tree physical parameters.

Multiple Linear Regression with Python

Multiple linear regression

Multiple linear regression (MLR) is also a kind of linear regression, but unlike simple linear regression, here we have more than one independent variable. Multiple linear regression is sometimes loosely called multivariate regression. In real-world situations, almost all dependent variables are explained by more than one variable, so MLR is the most prevalent regression method and can be implemented through machine learning.

Mathematical equation for Multiple Linear Regression

An MLR model can be expressed as:

Y_n = a_0 + a_1 X_{n1} + a_2 X_{n2} + \dots + a_i X_{ni} + \epsilon_n

In the above model, Y_n represents the response for case n. It has a deterministic part, a_0 + a_1 X_{n1} + \dots + a_i X_{ni}, and a stochastic part, \epsilon_n. Here a_0 is the intercept, i is the number of independent variables, a_1, ..., a_i are the regression coefficients and X_{n1}, ..., X_{ni} are the values of the independent variables for case n.

The main purpose of applying this regression technique is to develop a model which can explain the variance in the response as much as possible using the independent variables. The ratio of the explained variance by the model to the total variance of the response is known as the coefficient of determination and denoted by R2. We will discuss this statistic in detail later. 

For now, it is enough to know that it is an important statistic in regression modelling for ascertaining how good the model is. The value of R2 varies between 0 and 1. Regarding the fit of the model, we may face three situations: an underfitted model, a good fit and an overfitted model.

Underfit model

This situation arises when the value of R2 is low. A low R2 value indicates that the proposed model is not explaining the variation of the response adequately. So, the model needs improvement.

Good-fit model

In this case, we have a good R2 value, which suggests a good fit of the model, and it can be used for prediction.

Overfit model

Sometimes models become too complex, with lots of variables and parameters. Such complex models fit the training data too well and give a very high R2 value, close to 1.0. But they cannot predict well when tested with a different set of data. This is because the model, being too complex, becomes too specific to the data it was trained on. Such models are called overfitted models.
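The difference between a reasonable fit and an overfit can be illustrated with a small sketch. Everything here is an assumption made for illustration: the data are synthetic (a straight line plus noise), and polynomial degree is used as a stand-in for model complexity. The complex model reproduces the training data almost perfectly, and its R2 on separate test data will typically be noticeably lower than its training R2.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(x):
    # Hypothetical "true" relationship: a straight line plus random noise
    return 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)

x_train = np.linspace(0.0, 1.0, 10)
y_train = make_data(x_train)
x_test = np.linspace(0.05, 0.95, 10)
y_test = make_data(x_test)

def r_squared(observed, predicted):
    # 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

results = {}
for degree in (1, 9):
    # degree 1 is a straight line; degree 9 passes through all 10 training points
    coefs = np.polyfit(x_train, y_train, degree)
    train_r2 = r_squared(y_train, np.polyval(coefs, x_train))
    test_r2 = r_squared(y_test, np.polyval(coefs, x_test))
    results[degree] = (train_r2, test_r2)
    print("degree", degree, "train R2:", train_r2, "test R2:", test_r2)
```

Comparing the training and test R2 in this way is a simple check for overfitting: a large gap between the two is a warning sign.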

Dataset used

The dataset used here is the same we used in the Simple Linear Regression. But in this case all the explanatory/independent variables were considered for modelling purpose. The database is an imaginary one and based on my experience of modelling tree data. 

The dataset contains data on tree total biomass above the ground and several other tree physical parameters like tree commercial bole height,  diameter, height, first forking height, diameter at breast height, basal area. Tree_biomass is the dependent variable here which depends on all other independent variables.

Here is a glimpse of the database:

If you find it difficult to understand the variables, don’t bother about their names. Take them as two categories of variables: one is the dependent variable, which I have denoted with y here, and the others are independent variables 1, 2, 3, etc. What is important is the relationship between these two categories of variables. Whatever their names may be, you just need some understanding of how they relate to each other.

Assumptions for multiple linear regression

We conduct the regression process assuming some conditions. If these conditions do not hold, the results of the regression cannot be relied upon. These are called regression assumptions and they are as below:

Assumption of linearity:

There must be a linear relationship between the independent variables and the response variable. The variables in this imaginary dataset have a linear relationship between them. You can easily check this property by plotting the response variable against each of the explanatory variables. 

Assumption of Homoscedasticity:

The residuals or errors, that is, the differences between the observed and estimated values, must have constant variance.

Assumption of multivariate normality:

The residuals should follow a normal distribution. We can prepare a normal quantile-quantile plot to check this assumption.

Assumption of absence of multicollinearity:

There should be no multicollinearity between the independent variables i.e. the independent variables should not be linearly related to each other.
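One simple way to screen for multicollinearity is to look at the pairwise correlations between the independent variables. The sketch below uses synthetic data with made-up variable names, and the 0.9 threshold is an arbitrary choice for illustration, not a fixed rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100

# Hypothetical predictors: basal_area is constructed to be almost a multiple of dbh
dbh = rng.normal(30.0, 5.0, n)
basal_area = 2.0 * dbh + rng.normal(0.0, 0.5, n)
height = rng.normal(15.0, 3.0, n)   # generated independently of the other two

predictors = pd.DataFrame({"dbh": dbh, "basal_area": basal_area, "height": height})
corr = predictors.corr()
print(corr.round(2))

# Flag every pair of predictors whose absolute correlation exceeds the threshold
threshold = 0.9
flagged = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print("highly correlated pairs:", flagged)
```

If a pair is flagged, one option is to keep only one of the two variables in the model.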

Application of Multiple Linear Regression using Python

The main purpose of this article is to apply multiple linear regression using Python. This is the most important and also the most interesting part. So let’s jump into writing some python code. Like simple linear regression here also the required libraries have to be called first.

Calling the required libraries

We will be using four main libraries here: NumPy and pandas for handling arrays and data frames, matplotlib for creating plots, and sklearn for metrics operations. These are the most important libraries for data science applications. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics

Importing the dataset

To import the tree dataset mentioned earlier, we will use the read_csv function of the pandas library.

#***** Importing the dataset ***********
dataset=pd.read_csv('tree.csv')

Defining variables

Now the next important task is to tell Python about the dependent and independent variables of the dataset. As the protocol says we will store the dependent variable in y and the independent variables in x. As I have already explained above the dataset contains one dependent variable and 7 independent variables.

So we will store the variables in two NumPy arrays. As x has to store 7 independent variables, it has to be a 2-dimensional array. Whereas being a variable with only one column, y can do with one dimension. So, the python code for this purpose is as below:

#***** Defining variables *************
x=dataset.iloc[:,: -1].values
y=dataset.iloc[:,-1].values

Here the “:” selects all the rows. As the dataset contains the dependent variable, i.e. the tree_biomass values, in the rightmost column, Python indexes it with -1.
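The slicing can be tried out on a tiny data frame. The columns and values below are made up for illustration; only the position of the dependent variable in the last column matters.

```python
import pandas as pd

# Hypothetical toy data with the dependent variable in the last column
toy = pd.DataFrame({
    "height": [10.0, 12.0, 15.0],
    "diameter": [0.20, 0.25, 0.31],
    "biomass": [50.0, 60.0, 82.0],
})

x = toy.iloc[:, :-1].values   # all rows, every column except the last
y = toy.iloc[:, -1].values    # all rows, only the last column

print(x.shape)   # two-dimensional: rows x independent variables
print(y.shape)   # one-dimensional: one value per row
```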

Checking the assumption of the linear relationship between variables

For example, here I have plotted tree_height against the dependent variable tree_biomass. Although it is evident that biomass will increase with tree height, a scatterplot is still a very handy visualization technique to double-check this property. You can prepare this plot very easily using the code below:

#********* Plotting dependent variable against any independent variable 
plt.scatter(x[:,2],y) # accessing the variable tree_height
plt.title("Checking linearity between dependent and independent variables")
plt.xlabel("Tree height")
plt.ylabel("Tree biomass")

I have stored the variables in NumPy arrays earlier. So, to access them we just have to mention which column we intend to plot. For plotting we have used the pyplot module of the matplotlib library, imported as plt.

And here is the plot:

The plot suggests almost a linear relationship between the variables.

Splitting the dataset in training and test data

For testing purposes, we need to set aside a part of the complete dataset which will not be used for model building. The rule of thumb is to use 80% of the data for modelling and keep the rest aside. It will work as an independent dataset once we come up with the model and need to test it.

#****** Dividing the dataset into training and testing dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.2, random_state=0)

Here this data splitting task has been performed with the help of model_selection module of sklearn library. This module has an inbuilt function called train_test_split which automatically divides the dataset into two parts. The argument test_size controls the proportion of the test data. Here it has been fixed to 0.2 so the test dataset will contain 20% of the complete data.
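A quick sketch on made-up arrays shows how the split sizes follow from test_size. The arrays below are placeholders and not the tree dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 independent variables
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)   # 80% of the rows go to training, 20% to testing
print(y_train, y_test)
```

Fixing random_state makes the split reproducible, so the same rows end up in the test set every time the code is run.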

Application of multiple linear regression

Here comes the main part of this article, that is, using regression to estimate the response from the known values of more than one independent variable. In the above section, we have already created the training dataset. The following code will use this training data for model building.

#********* Application of regression
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train, y_train)

As this is also a linear regression method, the linear_model module of the sklearn library is the one containing the required class LinearRegression. Here regressor is an instance of the LinearRegression class.

Getting the regression coefficients for the regression equation

As the regression is done, we need the regression equation. This equation is actually the relation between the dependent and independent variables defined by some coefficients. Using these coefficients we can determine how a unit change in any of the independent variables is going to affect the dependent variable.

#******** Getting the coefficients stored in a dataframe
#*****************************************************************
# storing the positions and names of the independent variables
# (every column except the last one, matching how x was defined above)
pos=list(range(dataset.shape[1]-1))
colnames=dataset.columns[pos]
print(colnames)
# printing the intercept
print(regressor.intercept_)
# creating a dataframe storing the coefficients along with the independent variable names
coef_df=pd.DataFrame(regressor.coef_,colnames,columns=['Coefficients'])
print(coef_df)

In the above section of code, you can see that first of all the positions of the independent variables are stored in a variable. Then the corresponding coefficients are fetched from the instance regressor of the LinearRegression class of the linear_model module of sklearn. The coefficients are in regressor.coef_ and the intercept is in regressor.intercept_.
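To check that coef_ and intercept_ really hold the terms of the regression equation, here is a sketch on synthetic data where the true equation is known. The data, the names x1 and x2 and the chosen coefficients are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data generated exactly from y = 3 + 2*x1 - 1*x2 (no noise)
x = rng.uniform(0.0, 10.0, size=(50, 2))
y = 3.0 + 2.0 * x[:, 0] - 1.0 * x[:, 1]

model = LinearRegression().fit(x, y)

coef_table = pd.DataFrame(model.coef_, index=["x1", "x2"], columns=["Coefficients"])
print("Intercept:", model.intercept_)
print(coef_table)
```

Because the data contain no noise, the fitted intercept and coefficients match the values used to generate y, up to floating-point precision. With real data they are estimates.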

Printing the regression coefficients

The regression equation

With the help of these coefficients, we can now write the multiple linear regression equation.

The multiple linear regression equation

So, this is the final equation for the multiple linear regression model.

Using the model to predict using the test dataset

Now we have the model in our hand. But how can we test its efficiency? If the model is a good one then it should have the capability to predict with precision. And to test that we will need independent data which was not involved during model building.

Here comes the role of test dataset that we kept aside at the very beginning. We will predict the response using the test dataset and compare the prediction with the observations we already have in our hand. The following code will do the trick for us.
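A minimal sketch of this prediction step is given below. To keep it self-contained, it uses a small synthetic dataset in place of tree.csv; the data and the coefficients used to generate it are assumptions. The variable names follow the ones used earlier in this article, and the data frame of observed and predicted values is built in the same way as in the simple linear regression article.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical data: 3 independent variables and a noisy linear response
x = rng.uniform(1.0, 20.0, size=(40, 3))
y = 5.0 + 1.5 * x[:, 0] + 0.8 * x[:, 1] + 2.0 * x[:, 2] + rng.normal(0.0, 1.0, 40)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

regressor = LinearRegression()
regressor.fit(x_train, y_train)

# Predicting the response for the test data, which was not used for fitting
y_predict = regressor.predict(x_test)

# Observed and predicted values side by side
comparison = pd.DataFrame({"Actual": y_test, "Predicted": y_predict})
print(comparison.head(15))
```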

And here is the comparison. I have created a dataframe with the observed and predicted values side by side for the ease of comparison.

Comparing the original and predicted values

In the above figure, I have shown only the first 15 values of the dataframe. But it is enough to show that the prediction is satisfactory. 

Goodness of fit of the model 

We have tested the model and got a good prediction. However, we have not quantified it yet: we do not have any number to ascertain how good the model is. Here I will discuss the fit statistics that are very useful in this respect. If we have to compare multiple models, these numbers play a crucial role in finding the best of them.

The code below will deliver the fit statistics popularly used to judge the goodness of any statistical model. The first of these is the coefficient of determination, denoted as R2, which is the proportion of the variance in the response variable explained by the proposed model. So, the higher its value, the better the model. 

Coefficient of determination (R2)

Suppose our test dataset has n sets of independent and dependent variables, i.e. (x1,x2,…,xn) and (y1,y2,…,yn) respectively. Using our developed model, we obtain the predicted values (v1,v2,…,vn). So, the total sum of squares will be:

This is the total existing variation in the response variable.

Now the variation explained by the model we developed is the regression sum of square and can be calculated as

So as the definition of the coefficient of determination goes, it can be calculated as:

Again, it can be further simplified by expressing the regression sum of squares as the total variance minus the unexplained variance. The unexplained variance is the variance the model is not able to explain. It is also known as the error or residual sum of squares and is calculated as:

So, now we can rewrite the equation of R2 as
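The expressions referred to above can be written out as follows. This is a reconstruction from the definitions given in the text, using the observed values y_j and the predicted values v_j. The symbol \bar{y} for the mean of the observed values and the names SS_tot, SS_reg and SS_res are notation introduced here and not taken from the original figures.

```latex
\begin{align*}
SS_{tot} &= \sum_{j=1}^{n} (y_j - \bar{y})^2
  && \text{total sum of squares} \\
SS_{reg} &= \sum_{j=1}^{n} (v_j - \bar{y})^2
  && \text{regression (explained) sum of squares} \\
R^2 &= \frac{SS_{reg}}{SS_{tot}} \\
SS_{res} &= \sum_{j=1}^{n} (y_j - v_j)^2
  && \text{residual (unexplained) sum of squares} \\
R^2 &= \frac{SS_{tot} - SS_{res}}{SS_{tot}} = 1 - \frac{SS_{res}}{SS_{tot}}
\end{align*}
```

The last form is the one most commonly used in practice. The two forms of R2 agree when SS_reg = SS_tot - SS_res, which holds for a least squares fit with an intercept evaluated on the data used for fitting.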

#****** Calculating fit statistics
# predicting the response for the test data
y_predict=regressor.predict(x_test)
# R square as returned by score(), here computed on the training data
r_square=regressor.score(x_train, y_train)
print('Coefficient of determination(R square):',r_square)
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_predict))
print('Mean Squared Error:', metrics.mean_squared_error(y_test,y_predict))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_predict)))

Mean Absolute Error(MAE)

This is another popular measure of model fit. As the name suggests, it is the mean of the absolute differences between the observed and predicted values. As we are only interested in the size of the deviations, we take the absolute value of the differences. So the expression will be:

As it measures the error of the estimated values, a lower MAE suggests a better model.

Mean Squared Error (MSE)

This is also a measure of the deviation of the model estimates from the original values. But instead of the absolute values, we take the squared values of the deviations. It is also often called Mean Squared Deviation (MSD) and is calculated as:

Root Mean Squared Error (RMSE)

As the name suggests, this measure of fit first calculates the difference between the observed and model-predicted values, takes the square of each error then calculates the mean and ultimately calculates the square root to get the RMSE. So its equation is:
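The fit statistics described above can be computed by hand with NumPy and checked against the functions of sklearn.metrics. The four observed and predicted values below are made up so that the arithmetic is easy to follow.

```python
import numpy as np
from sklearn import metrics

# Hypothetical observed and predicted values
y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_predict = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_test - y_predict        # differences between observed and predicted

mae = np.mean(np.abs(errors))      # mean of the absolute errors
mse = np.mean(errors ** 2)         # mean of the squared errors
rmse = np.sqrt(mse)                # square root of the MSE
ss_res = np.sum(errors ** 2)                       # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)     # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print("MAE:", mae, metrics.mean_absolute_error(y_test, y_predict))
print("MSE:", mse, metrics.mean_squared_error(y_test, y_predict))
print("RMSE:", rmse)
print("R2:", r2, metrics.r2_score(y_test, y_predict))
```

Note that RMSE is in the same units as the response variable, which often makes it easier to interpret than MSE.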

Fit statistics

How can the fitting further be improved?

There is always scope for improving the model so that it can give more precise predictions. As we already know, the main purpose of multiple linear regression is to attribute as much of the variance of the response variable as possible to the independent variables.

Here lies the trick to improving the prediction of a multiple linear regression model. The response variable you are dealing with is affected by a number of explanatory variables. Some of them are readily visible to us, and we can say with confidence that they are the main contributors to the response. Taken together, they can give you a good explanation too.

But with a good knowledge of the domain, one can identify many other variables that are not directly recognizable as causal factors. For example, in an agricultural experiment, crop yield is determined by many direct and indirect factors: physiological, chemical, weather variables, soil conditions, etc.

So, the skill and domain knowledge of the researcher play a vital role in choosing variables wisely in order to improve the model’s fit. Using too few variables will result in a poor R2, whereas using too many variables may produce a very complex model with a very high R2. In both of these scenarios, the model’s performance will not be up to the mark.

Simple linear regression with Python

Simple Linear Regression

Simple linear regression is the most basic form of regression. It is the foundation of statistical and machine learning modelling techniques. All the advanced techniques you may use in future are based on the ideas and concepts of linear regression. It is the most basic skill for exploring your data and having a first look into it. 

Simple linear regression is a statistical model which studies the relationship between two variables. These two variables will be such that one of them is dependent on the other. A simple example of such two variables can be the height and weight of the human body. From our experience, we know that the bodyweight of any person is correlated with his height.

The body weight changes as the height changes. So here body weight and height are the dependent and independent variables respectively. The task of simple linear regression is to quantify the change that happens in the dependent variable for a unit change in the independent variable.

Mathematical expression

We can express this relationship using a mathematical equation. If we express a person’s height and weight with X and Y respectively, then a simple linear regression equation will be:

Y=a.X+b

With this equation, we can estimate the dependent variable corresponding to any known value of the independent variable. Simple linear regression helps us to estimate the coefficients of this equation. Once a is known, we can say that for a one-unit change in X, Y changes by a units.

See the figure below: the a in the equation is the slope of the line and b is the intercept on the Y-axis.

Simple linear regression
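As a quick illustration of the equation Y = a.X + b, the sketch below estimates a and b from a handful of made-up points that lie exactly on a known line. It uses NumPy's polyfit rather than the sklearn approach followed later in this article, simply because it is the shortest way to show the idea.

```python
import numpy as np

# Hypothetical points lying exactly on the line Y = 2*X + 1
height = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
weight = 2.0 * height + 1.0

# Fitting a polynomial of degree 1 returns the slope first, then the intercept
a, b = np.polyfit(height, weight, 1)
print("slope a:", a)
print("intercept b:", b)

# Estimating Y for a new value of X using the fitted equation
estimate = a * 6.0 + b
print("estimate at X = 6:", estimate)
```

Here a comes out as the change in Y for a one-unit change in X, and b is the value of Y when X is zero.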

As the primary focus of this post is to implement simple linear regression through Python, I will not go deeper into the theoretical part of it. Rather, we will jump straight into its application. 

Before we start coding with Python, we should know about the essential libraries we will need to implement this. The three basic libraries are NumPy, pandas and matplotlib. I will discuss these libraries briefly in a bit.

Application of Python for simple linear regression

I know you were waiting for this part only. So, here is the main part of this post i.e. how we can implement simple linear regression using Python. For demonstration purpose I have selected an imaginary database which contains data on tree total biomass above the ground and several other tree physical parameters like tree commercial bole height,  diameter, height, first forking height, diameter at breast height, basal area. Tree biomass is the dependent variable here which depends on all other independent variables.

Here is a glimpse of the database:

Dataset for  regression

From this complete dataset, we will use only Tree_height_m and Tree_biomass (kg) for this present demonstration. So, here the dataset name is tree_height and has the look as below:

Dataset for Simple linear regression

Python code for simple linear regression

Importing required libraries

Before you start coding, the first task is to import the required libraries. Give them short names so that you can refer to them easily in the later part of the code.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

These are the topmost important libraries for data science applications. These libraries contain several classes and functions which make performing data analysis tasks in Python super easy. 

For example, NumPy and pandas are the two libraries which provide the array, matrix and data frame operations. They allow users to perform the complex matrix operations required for machine learning and artificial intelligence research in a very intuitive manner. The name NumPy comes from “Numerical Python”.

Matplotlib, on the other hand, is a full-fledged plotting library built on NumPy. The main function of this library is to provide an object-oriented API for embedding useful graphs and plots in applications.

These libraries get automatically installed if you are installing Python from Anaconda, which is a free and opensource resource for R and Python for data science computation. So as the libraries are already installed you have to just import them.

Importing dataset

dataset=pd.read_csv('tree_height.csv')
x=dataset.iloc[:,:-1].values
y=dataset.iloc[:, 1].values

Before you use this piece of code, make sure the .csv file you are about to import is located in the same working directory as the Python file. Otherwise, the interpreter will not be able to find the file.

Then we have to create two variables to store the independent and dependent data. Here the indexing needs special mention. Please keep in mind that the dataset I have used has the dependent (Y) variable in the last column. So, while storing the independent variable in x, the last column is excluded, and for the dependent variable y, the position of the last column is used.

Splitting the dataset in training and testing data

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.2, random_state=0)

This is of utmost importance when we are performing statistical modelling. Any model developed should be tested with an independent dataset which has not been used for model building. As we have only one dataset in hand, I have created two independent datasets in an 80:20 ratio. 

The training data consists of 80% of the data and is used for training the model, whereas the remaining 20% of the data is kept aside for testing the model. Luckily, the famous sklearn library for Python already has a module called model_selection which contains a function called train_test_split. We can easily get this data split task done using this library.

Application of linear regression

from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train,y_train)

This is the main part, where the regression takes place using the LinearRegression class of the sklearn library.

Printing coefficients

#To retrieve the intercept:
print(regressor.intercept_)
#For retrieving the slope:
print(regressor.coef_)

Here we get the slope and the intercept constant, from which we can write the linear regression equation.

Validation plot to check homoscedasticity assumption

#***** Plotting residual errors in training data
plt.scatter(regressor.predict(x_train), (regressor.predict(x_train)-y_train),
            color='blue', s=10, label = 'Train data')
# ******Plotting residual errors in testing data
plt.scatter(regressor.predict(x_test),regressor.predict(x_test)-y_test,
            color='red',s=10,label = 'Test data')
#******Plotting reference line for zero residual error
plt.hlines(y=0,xmin=0,xmax=60)
plt.title('Residual Vs Predicted plot for train and test data set')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')

For the data used here this part will create a plot like this:

This part is for checking an important assumption of linear regression, namely that the residuals are homoscedastic. That means the residuals have equal variance. If this assumption fails, the conclusions drawn from the regression are not reliable.

Predicting the test results

y_predict=regressor.predict(x_test)

The independent test dataset is now in use to predict the result using the newly developed model.

Printing actual and predicted values

new_dataset=pd.DataFrame({'Actual':y_test.flatten(), 'Predicted':y_predict.flatten()})
new_dataset

Creating scatterplot using the training set

plt.scatter(x_train, y_train, color='red')
plt.plot(x_train, regressor.predict(x_train), color='blue')
plt.title('Tree height vs tree weight')
plt.xlabel('Tree height (m)')
plt.ylabel('Tree weight (kg)')

Visualization of model’s performance using test set data

plt.scatter(x_test, y_test, color='red')
plt.plot(x_test, regressor.predict(x_test), color='blue')
plt.title('Tree height vs tree weight')
plt.xlabel('Tree height (m)')
plt.ylabel('Tree weight (kg)')

Calculating fit statistics for the model

r_square=regressor.score(x_train, y_train)
print('Coefficient of determination(R square):',r_square)
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_predict))
print('Mean Squared Error:', metrics.mean_squared_error(y_test,y_predict))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_predict)))

This is the final step: finding the goodness of fit of the model. This piece of code generates some statistics which tell you quantitatively how your model performs. Here the four most important and popular fit statistics are calculated. Except for the coefficient of determination, the lower the value of a statistic, the better the model.

Getting started with Python for Machine Learning: beginners guide

Getting started with Python for Machine Learning

If you are reading this article, then you are a Machine Learning enthusiast without any doubt. You must have already gone through the theoretical basics and are getting impatient to try your hand at your first Machine Learning application. Python is the most popular programming language for machine learning. I would suggest that if you want a career in data science, Python is the language you should bet on.

So, this article is for you. Here I will demonstrate how to complete the setup of Python and get started with your first simple program.

But first of all the question is….

Why Python for machine learning?

Why have I chosen Python for Machine Learning? There are lots of tools available and some of them are very popular too. For example, R is a very reputed language and has also been around for a long time. 

Especially people with a traditional statistical or mathematical background have a strong inclination towards R. One of the reasons behind this popularity is that R came into existence replacing S, which was a pure statistical programming language developed on the C platform and was hugely popular amongst statisticians.

Python Vs R

R was developed in 1992 and has a specific edge for data analysis tasks. Being a procedural language, it breaks down a task into a series of steps and procedures. Both R and Python are open source, so they are freely available to use, and the online resources for them are huge.

R is mainly helpful for core statistical and data analytics purposes. The language was developed by statisticians, mainly keeping the needs of statisticians in mind. It has very powerful graphical packages like ggplot, ggvis, shiny etc. If you want to create eye-catching plots from your data, R should be your best friend.

On the other hand, Python came a little earlier, in 1989, developed by Guido van Rossum, a Dutch scientist. It had slow, steady growth till 2010, but after that, with the start of the data explosion era, its popularity shot up quickly. 

The main reason behind its quick rise in popularity is its simplicity and versatility. Machine Learning and Artificial Intelligence have many complex algorithms to perform several complex tasks. But the beauty of Python is that it makes tasks easy for both machine learning and AI with its vast collection of simple-to-use functions.

Use of Python in data science is just one of its capabilities. Being a general-purpose language, Python can be used for developing web applications, software and mobile applications, and even for reading and modifying files and connecting to databases. This versatility has won the hearts of millions of people, irrespective of whether they are data scientists or computer science enthusiasts.

If you are a beginner in data science, you can jump-start the learning and application of Python even with little or no background in programming languages. It is also a far better performer compared to R when it comes to analyzing large databases.

The following chart from Economist.com will help you to realize how popular Python has become recently surpassing all other big names like Java, R, C++ etc.

Source: steelkiwi.com, economist.com

In the data science world these two programming languages are close competitors. Both of them are very popular and have their own pluses and minuses. Ultimately, which platform you should use is purely your choice. 

Having said this, I think the popularity and simplicity of Python in its application in machine learning will keep it slightly ahead of R. And if you are looking ahead to build a career as a data scientist, in my opinion, the future is brighter with Python skill.

Setting up Python in your computer

To start with Python applications, the first step is to install Python on your computer. If your desktop/laptop is a new one, then there is a chance that it might have Python preinstalled. You can check your start menu for it. If you find it there, skip this step.

Download Python

If it is not installed already then you have to download and install it from Python.org. 

So, from here, choose the specific Python version that suits your computer and download it. As of today, Python 3.8.2 is the latest version, so you can download it. If you have an old system running Windows XP, then you have to download an older compatible version, preferably lower than Python 3.5.

After you have downloaded the file, click it to start the installation. Just go with the recommended installation process. It is a quick process and within minutes Python is installed on your system.

The following window will appear as the Python installation is finished.

Now you can check your computer start menu and the python folder with associated applications will be there.

As I have installed it just now, it has a “New” tag on all its applications. Now that Python is installed, you can directly launch its application and start coding. 

Here in the above screenshot, you can see that the console is showing all the details of the Python version installed. I have also run some basic commands like print and a simple calculation.

But to start with your Python coding, we will need a good IDE which will help us write Python syntax in an intuitive way.
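If you want to try the console yourself, a few first commands of this kind are shown below. These are illustrative examples and not necessarily the exact commands from the screenshot.

```python
import sys

# Printing a message
print("Hello, Machine Learning!")

# A simple calculation
total = 2 + 3
print(total)

# Checking which Python version is running
print(sys.version)
```

In the interactive console you can type each line separately and see its result immediately.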

Selecting a Python IDE

Although a simple IDE called IDLE gets installed automatically while installing Python, we prefer to use a more popular and advanced IDE called PyCharm. The reason is that getting familiar with an IDE for any programming language takes significant time. So, we should choose a good IDE to start with so that we can continue our work in it.

PyCharm is currently the most popular IDE for Python. See the following table, which compares some popular Python IDEs. PyCharm also comes in a paid version, but you will get a full-featured integrated environment in both of them.

Table: comparison of popular Python IDEs (Source: www.softwaretestinghelp.com)

Apart from these IDEs, simple text editors like Notepad++ are also very popular amongst data scientists. The only issue with text editors is that you need additional plugins to run the code written in them. This is where IDEs come in handy: you can do the complete task there, from writing the code to running it.

That said, the final selection is completely your choice. Python being one of the most popular programming languages, users have the luxury of choosing from a vast collection of IDEs. And honestly speaking, it is difficult to judge any single IDE to be the best one. Every one of them has its own strengths and weaknesses.

So though I have selected PyCharm here, you can select any other too. It will not take you much time to switch between IDEs once your basics are strong.

So, let’s start installing PyCharm

PyCharm is a product of JetBrains. Open the download page by following this link:

https://www.jetbrains.com/pycharm/download/#section=windows

Step: 1

Click the download button under Community; it is the free, open-source edition of PyCharm.

Step: 2

Run the downloaded installer and click the “Next” button.

Step: 3

In the next step, you can change the installation location by providing a path, or go with the default path assigned. I am going with the default folder here. Then click “Next”.

Step: 4

The next window allows you to create a desktop icon for PyCharm and to update the PATH variable. To proceed, click “Next”.

Step: 5

Here you can change the start menu folder, then click “Install”.

Step: 6

The installation will start. It takes a few minutes. Once the installation is done, click “Next”.

Step: 7

In the next window click “Finish” to complete the installation process.

Starting PyCharm for the first time

Now PyCharm is installed on your computer. Go to your computer's start menu and launch the program. The first window that appears shows the privacy policy. Click to agree with the terms & conditions and then click “Continue”.

Next is the data sharing window. It’s completely your choice. Choose any of the options and proceed.

In the next window, you get to choose the appearance of your IDE. Choose whichever you feel comfortable with and click “Skip Remaining and Set Defaults”. You can change all these options later at any time.

The next window is important: it allows you to choose the location where you want to create your Python project. I like to save all my important files in the cloud, so I have provided that particular path there. You can change it here or later.

So now you are all set to start your journey with Python programming with PyCharm IDE. 

Unsupervised Machine Learning: a detailed discussion

Unsupervised Machine Learning

Unsupervised Machine Learning is a kind of Machine Learning where the algorithm identifies hidden patterns in the data on its own. It is used when no labeled data is available to train the algorithm.

Unlike in Supervised Machine Learning, the input dataset here is not tagged with known answers. This is because in many cases we have to deal with situations that are completely new. The experimenter has no prior experience of the data in hand, and its distribution and parameters are also unknown.

So, in this case, applying Supervised Learning is not feasible, and we have to go for Unsupervised Machine Learning. The main problem with this approach is that we have no test dataset labeled with the correct answers against which to check the accuracy of the learning process. That is why its results are harder to verify and are generally considered less accurate than those of supervised learning.

Learning process of a baby

The application of unsupervised learning resembles the learning process of babies. At first, they start learning by themselves. No one teaches them. They start identifying objects from their own experience.

Learning process of a baby is similar to the unsupervised learning
Photo by Guillaume de Germain on Unsplash

For example, babies see humans from birth, and no one teaches them the characteristics of a human. But whenever a baby sees a new person around, he matches the characteristics and recognizes the new object as a human being. This is a very basic example of unsupervised learning.

Application of Unsupervised Machine Learning

Although this approach suffers from lower accuracy, it is useful for finding hidden patterns in the data.

Speech recognition

You might have used Google’s speech recognition tool. It is such a handy tool to convert your speech into text. When you have to write a lot of text, you can certainly use it to your advantage. I also use it frequently while writing my articles in Google Docs.

The point is that unsupervised machine learning plays a part in the technology behind this handy tool. Annotating voice recordings with the corresponding text is very costly, so labeled data is scarce for training the algorithm.

Detection of anomaly

Unsupervised learning can also come in handy for detecting extreme values in a dataset. Such values are generally outliers: erroneous observations caused by mechanical error or mistakes during data collection, fraudulent entries in a bank transaction statement, and the like.
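One simple way to flag extreme values without any labels is the interquartile range (IQR) rule. The sketch below is only an illustration: the transaction amounts are made up, and the multiplier 1.5 is the conventional default rather than anything prescribed in this article.

```python
import statistics

def find_outliers(values, factor=1.5):
    """Return the values lying more than factor * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical transaction amounts; one of them is clearly unusual
amounts = [52, 48, 50, 51, 49, 47, 53, 50, 500]
print(find_outliers(amounts))
```

No labeled examples of fraud or error are needed here; the rule relies only on how far a value sits from the bulk of the data.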

Clustering of data

Clustering is a grouping of data on the basis of some uniformity. It reveals the data structure and helps to design the classifier. 

Finds hidden patterns and features of the data

Unsupervised learning finds all kinds of hidden patterns and features of the data, which consequently helps in categorization.

Issues with unsupervised machine learning

The process has some inherent issues which you must consider before applying it:

  • Unsupervised learning results are generally less accurate than those of supervised learning, which is to be expected since there are no known answers to learn from.
  • Performing unsupervised learning is much more complicated than supervised learning.
  • Validating the model is difficult due to the lack of labeled data.

Types of unsupervised machine learning

Unsupervised machine learning can be further grouped into two broad categories which are clustering and association problems.

Clustering

Clustering is of great importance when we discuss unsupervised learning. This technique finds similarities in uncategorized data and groups the data points into different clusters. The clustering process is hugely beneficial for gathering some basic information about the data in hand, and for finding patterns and features of a dataset that is otherwise completely unknown to the researchers.

Clustering in unsupervised machine learning

We can decide how many clusters we should create. The clusters are formed so that the within-cluster variance is lower than the between-cluster variance. In terms of similarity, the members of a cluster are similar to each other, whereas members of different clusters are dissimilar.
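To make the within-cluster versus between-cluster idea concrete, here is a minimal sketch that measures both as sums of squared deviations for one-dimensional data. The two small clusters are made-up values, and the function names are my own.

```python
def mean(values):
    return sum(values) / len(values)

def within_ss(clusters):
    """Sum of squared distances of every point from its own cluster mean."""
    total = 0.0
    for members in clusters:
        centre = mean(members)
        total += sum((x - centre) ** 2 for x in members)
    return total

def between_ss(clusters):
    """Size-weighted squared distances of the cluster means from the overall mean."""
    everything = [x for members in clusters for x in members]
    grand_mean = mean(everything)
    return sum(len(m) * (mean(m) - grand_mean) ** 2 for m in clusters)

# Two tight, well-separated groups (illustrative numbers)
clusters = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
print(within_ss(clusters), between_ss(clusters))
```

For a good clustering the first number is small compared with the second, which is exactly the situation in this toy example.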

We perform this clustering through several approaches. 

Hierarchical clustering

Here every data point is considered an individual cluster to start with. Then, on the basis of similarity, the most similar data points are merged to form a single cluster. This process continues until the decided number of clusters is reached.
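The merging procedure can be sketched in a few lines of plain Python. This is a simplified illustration on one-dimensional points. I am assuming "single linkage" (the distance between two clusters is the distance between their closest members) as the similarity measure, which is only one of several possible choices.

```python
def agglomerative(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    # Start with every point as its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters

points = [1.0, 1.5, 5.0, 5.2, 9.0, 9.1]
print(agglomerative(points, 3))
```

Calling `agglomerative(points, 1)` instead would continue the merging until a single cluster holds all the points.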

Probabilistic clustering

Here as the name suggests, we do the clustering on the basis of a probability distribution. For example, if there are keywords like 

“Boys’ school”

“Girls’ school”

“Girls’ college”

“Boys’ college”

Then the clusters can form two categories: either “boy” and “girl” or “school” and “college”.

Exclusive clustering

Some data points belong exclusively to one particular category. In that case, we form the clusters in a straightforward manner according to this exclusivity. Here no single data point can belong to more than one cluster.

Overlapping clustering

In contrast to exclusive clustering, in overlapping clustering one particular data point can belong to more than one cluster. To achieve such clustering, we use fuzzy sets.

Clustering algorithms

There are some popular algorithms to perform clustering. In this article, I will briefly discuss them. Each of them will have an elaborate discussion in separate articles.

K-means 

K-means clustering is a type of clustering where data points are grouped into k clusters. If the value of k is large, the clusters are small, and if k is small, the clusters are bigger.

Every cluster has a central value called the centroid. This is kind of the heart of the cluster. Each data point is assigned to the cluster whose centroid is nearest to it, so the distance from the centroids determines which cluster a point qualifies for.
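Here is a minimal sketch of the K-means idea in plain Python, assuming two-dimensional points and starting centroids supplied by the caller. Real implementations (for example in Scikit-learn) choose the starting centroids more carefully; the data and the function name below are purely illustrative.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain k-means on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    groups = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid
        groups = [[] for _ in centroids]
        for p in points:
            squared = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            groups[squared.index(min(squared))].append(p)
        # Update step: move each centroid to the mean of its members
        updated = []
        for members, old in zip(groups, centroids):
            if members:
                updated.append((sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members)))
            else:
                updated.append(old)  # an empty cluster keeps its old centroid
        if updated == centroids:
            break  # nothing moved, so the algorithm has converged
        centroids = updated
    return centroids, groups

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, groups = kmeans(points, [(1, 1), (8, 8)])
print(centroids)
print(groups)
```

With k = 2 the six points split into the two obvious groups, and each centroid ends up at the mean of its group.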

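K-Nearest Neighbors

It is a simple algorithm and performs well when there is a significant distance between the groups of sample data points. It is one of the simplest classification methods, but it takes considerable time when the dataset is large. Note that, strictly speaking, K-Nearest Neighbors is a supervised method, because it needs labeled examples to vote on the class of a new point; it is listed here because it relies on the same distance-based notion of similarity as clustering.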
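A minimal sketch of the nearest-neighbour vote is given below. Because the vote is taken among neighbours whose class is already known, the sketch uses a small labeled training set; the points, the labels "A" and "B", and the choice k = 3 are all illustrative assumptions.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the label of query by majority vote among its k nearest points."""
    # Sort the labeled points by squared Euclidean distance to the query
    by_distance = sorted(
        train,
        key=lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2,
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 2), "A"),
         ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]
print(knn_predict(train, (2, 1)))
print(knn_predict(train, (6, 5)))
```

Every prediction scans the whole training set, which is why the method becomes slow on large datasets.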

Principal Component Analysis

It is a variable reduction technique. The basic objective of PCA is to compute a smaller number of new variables that retain as much as possible of the variance explained by the original variables.
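To illustrate the idea, the sketch below reduces two correlated variables to a single new variable (the first principal component) and reports the share of the total variance it retains. It is restricted to two variables so that the eigenvalue of the covariance matrix can be written in closed form; the data points and the function name are my own assumptions.

```python
import math

def pca_2d(points):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centred = [(p[0] - mean_x, p[1] - mean_y) for p in points]
    # Sample covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x, _ in centred) / (n - 1)
    c = sum(y * y for _, y in centred) / (n - 1)
    b = sum(x * y for x, y in centred) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix (closed form)
    largest = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Matching eigenvector, normalised to unit length
    vx, vy = b, largest - a
    length = math.hypot(vx, vy)
    if length == 0:  # uncorrelated variables with x having the larger variance
        vx, vy = 1.0, 0.0
    else:
        vx, vy = vx / length, vy / length
    scores = [x * vx + y * vy for x, y in centred]
    explained = largest / (a + c)  # share of total variance retained
    return scores, explained

points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1)]
scores, explained = pca_2d(points)
print(scores)
print(explained)
```

Because the two variables are almost perfectly correlated, the single new variable keeps nearly all of the variance. For more than two variables you would normally rely on a library routine rather than a hand-written formula.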

Hierarchical clustering

Hierarchical in the sense that it starts by considering each data point as a cluster and then goes on forming larger clusters by merging the closest ones. This process continues until only one cluster remains, or until the desired number of clusters is reached.

Fuzzy K-means

This is a more generalized form of K-means clustering. Here also clusters are formed around centroid values. The difference is that in simple K-means clustering a data point either belongs to a cluster or it does not; there is no in-between position. The fuzzy K-means algorithm, by contrast, assigns each data point a degree of membership (much like a probability) in every cluster, depending on its distance from the centroid. K-means clustering is simply a special case of fuzzy K-means clustering where the membership is either 1 or 0.
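The membership calculation can be sketched as follows, using the standard fuzzy c-means membership formula for one-dimensional data. The "fuzzifier" m (set to 2 here) and the example centroids are assumptions on my part; the article does not fix them.

```python
def fuzzy_memberships(point, centroids, m=2.0):
    """Degree of membership of point in each cluster (values sum to 1)."""
    distances = [abs(point - c) for c in centroids]
    # A point sitting exactly on a centroid belongs fully to that cluster
    if 0.0 in distances:
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    power = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_k) ** power for d_k in distances)
        for d_j in distances
    ]

# The point 3.0 is much closer to the centroid at 2.0 than to the one at 6.0
memberships = fuzzy_memberships(3.0, [2.0, 6.0])
print(memberships)
```

The point belongs mostly to the nearer cluster but keeps a small membership in the farther one. Plain K-means would simply give it membership 1 in the first cluster and 0 in the second.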

Association

Association is also about identifying patterns or features in large databases. Unsupervised machine learning uses association rules to find interesting relationships between variables. For example, the students in a class can be a subject of association rule mining based on their choice of subjects.
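Two basic measures used to judge an association rule are support (how often the items occur together) and confidence (how often the rule holds when its left-hand side occurs). The sketch below computes both for the students example; the subject choices are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain all the given items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated chance of the consequent given that the antecedent is present."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical subject choices of five students
students = [
    {"maths", "physics"},
    {"maths", "physics", "chemistry"},
    {"maths", "biology"},
    {"physics", "chemistry"},
    {"maths", "physics"},
]
print(support(students, {"maths", "physics"}))
print(confidence(students, {"maths"}, {"physics"}))
```

Here three of the five students chose both maths and physics, and about three quarters of those who chose maths also chose physics.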

Summary

So, we can summarize some important points about unsupervised machine learning as follows:

  • Unsupervised machine learning is the type of machine learning where we don’t use any labeled data.
  • No labeled data means no supervision of the result and no straightforward validation.
  • It is generally less accurate than supervised machine learning.
  • Unsupervised learning is more complicated than supervised learning.
  • Unsupervised learning proves helpful when we have no idea about the data, and its distribution and parameters are also unknown.
  • Two main methods of conducting unsupervised machine learning are clustering and association.

Supervised Machine Learning: a beginner’s guide

Supervised Machine Learning

The most common type of Machine Learning is Supervised Machine Learning. The name comes from the fact that the learning process is supervised by results that are already known. The learning process goes through several iterations and continues until the difference between the actual and the estimated results falls to an acceptable level.

“Computers are able to see, hear and learn. Welcome to the future.”

~Dave Waters. Department of Earth Sciences, University of Oxford Associate Professor of Metamorphic Petrology (retired)

The data used in supervised machine learning are called “labelled data” because they are already tagged with the right answer. Once the training part is complete and a robust model is achieved, some new inputs are provided. The task of the model now is to predict the labels of these unseen inputs based on the labelled data used before.

In mathematical notation, it can be represented as the output variable Y which is a function of input variable X

Y=f(X)

During the training phase of supervised machine learning, both X and Y are known; it is the mapping function f that is unknown. The algorithm tries to find the mapping function that predicts Y most precisely.

Example of Supervised machine learning

You must have come across the term pattern recognition from some online or offline source. It is something of a buzzword today and is used to make our lives more sophisticated and comfortable. From very simple applications like your smartphone’s face recognition or handwriting recognition to advanced uses like cancer cell detection, supervised learning is the essence of pattern recognition.

Its simple applications are already making our lives easier, be it your smartphone’s face lock feature, handwriting recognition or voice recognition. The self-driving car concept also depends heavily on supervised learning. Nowadays you can find this approach at work in every sector of industry.

An application in agriculture

Now to understand how this system works we will take an example of its application in the agriculture field. 

Application of supervised machine learning
Photo by Roman Synkevych on Unsplash

Predicting the crop yield well before harvest is essential for proper policy planning. It helps the government to fix prices and arrange better storage for the produce, and farmers are also able to plan their marketing channels if there is a precise prediction of how much production is expected.

Crop yield is determined by several factors. Some are physical parameters of the crop itself, like crop height and number of tillers; some are weather parameters, like rainfall, humidity and sunshine hours. Besides these, soil health factors like carbon balance, organic matter and several others play an important role and contribute to the ultimate yield.

Now if we have a sufficient amount of labelled data, that is, a dataset containing all these independent variables affecting the yield along with the corresponding yield, we can train the algorithm with it. This is supervised learning: it is as if the learning process were supervised by a teacher.

The learning process stops only when a robust model is achieved and the prediction is of an acceptable level.

A real-world problem solved by Supervised Machine learning

Here I am going to cite an example of supervised learning in modern research and how it is being used to address complex problems of the real world.

A project was taken up by a group of scientists to identify endangered species in the Mojave Desert of California. The main objective of the study was to locate two threatened species of the area, the Mohave Ground Squirrel and the desert tortoise, by analyzing images captured by smartphones.

The challenge faced by the biologists was to track and rescue these two endangered species, as they are very tough to spot. Nature has given them such an ability to camouflage themselves against the desert background and vegetation that it becomes almost impossible for the human eye to see them.

So here the scientists used computer vision and developed a machine learning algorithm to identify the patterns, distinguish the animals from the desert backdrop and classify them according to their characteristics.

Types of supervised machine learning

There are two main categories of supervised machine learning.

  • Classification
  • Regression 
Supervised Machine Learning, its categories and popular algorithms

Classification

It is applicable when the target variable is categorical and the objective is to assign each observation to a category. If the algorithm classifies into two classes, it is called binary classification, and if the number of classes is more than two, it is called multiclass classification.

Classification in Supervised Machine Learning

In the given figure, a binary classification has been demonstrated. Here a group of people has been classified by gender using a dataset consisting of their heights and weights.

The task is done in the same way as discussed before. First of all, the algorithm is trained with a dataset in which each record has an assigned category. Then, based on this training, the algorithm categorizes new input data.

Example of classification

The most common example of a classification problem is identifying whether a new mail is spam or not spam. Identifying loan defaulters is also a classification problem.

The algorithm is provided with a dataset of mails and a corresponding column indicating whether each one is spam or not spam. Similarly, a list of customers, each labelled as a loan defaulter or not, is first provided to train the algorithm. Then the supervised learning model is used to identify the type of customer in an independent input dataset.
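As a rough illustration of the spam example, here is a minimal Naive Bayes style classifier in plain Python. It is trained on a handful of made-up messages labelled "spam" or "not spam" and then asked to label new ones. The messages, the function names and the use of add-one smoothing are my own assumptions, not a prescribed method.

```python
import math
from collections import Counter, defaultdict

def train(messages):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts = model
    vocabulary = {word for counts in word_counts.values() for word in counts}
    n_messages = sum(label_counts.values())
    scores = {}
    for label, n_label in label_counts.items():
        n_words = sum(word_counts[label].values())
        # Log prior plus log likelihood of each word, with add-one smoothing
        score = math.log(n_label / n_messages)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1)
                              / (n_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

labelled_mails = [
    ("win free prize now", "spam"),
    ("free money offer", "spam"),
    ("claim your free prize", "spam"),
    ("meeting at noon tomorrow", "not spam"),
    ("project report attached", "not spam"),
    ("lunch tomorrow at noon", "not spam"),
]
model = train(labelled_mails)
print(predict(model, "free prize offer"))
print(predict(model, "meeting tomorrow"))
```

The same two-step pattern, training on labelled records and then predicting on new ones, applies to the loan defaulter example as well.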

There are a number of algorithms for classification. The most popular ones are:

  • Naive Bayes classifier
  • Linear classifier
  • Support vector machine
  • Random forest
  • Decision tree
  • K-Nearest neighbour

Regression

Regression is a statistical process which tries to find the relationship between the dependent and independent variables. The major difference from classification is that in regression the dependent variable is continuous.

If the regression equation relates the dependent variable linearly to a single independent variable, it is a simple linear regression equation. If the regression equation of Y on X is linear, it does not necessarily follow that the regression equation of X on Y is also linear, and vice versa. The dependent variable is a function of the independent variables with their respective constant parameters, plus an error term which is itself a random variable. A regression model has the expression:

Y = f(X1, X2, …, Xn; β0, β1, β2, …, βn) + ϵ

Where Y is the dependent variable, X1, X2, …, Xn are the independent variables, β0, β1, β2, …, βn are the regression coefficients, and ϵ is the error term, normally distributed with mean 0 and variance σ². Because of this random error term, this type of regression model is a probabilistic (statistical) model rather than a purely deterministic one.

Example of regression

Regression in Supervised Machine Learning

An example of simple linear regression is regressing the weight of a group of people on their height. Here height and weight are the independent and dependent variables respectively, because a person's height is taken to influence his weight, not vice versa.

The blue line in the above figure is the regression line fitted with a supervised machine learning technique. This represents the best-fitted line obtained through a rigorous training process until a robust model with acceptable accuracy is achieved.
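A best-fitted line of this kind can be obtained with the ordinary least squares formulas. The sketch below fits weight on height in plain Python; the five height and weight values are invented for illustration and are not the data behind the figure.

```python
def fit_line(x, y):
    """Least squares estimates of the intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical heights (cm) and weights (kg)
heights = [150, 160, 170, 180, 190]
weights = [52, 57, 67, 73, 83]

intercept, slope = fit_line(heights, weights)
predicted = intercept + slope * 175  # predicted weight for a height of 175 cm
print(intercept, slope, predicted)
```

For this toy data the slope comes out at about 0.78 kg per cm, so the fitted line predicts a weight of roughly 70 kg for a person 175 cm tall.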

To perform regression a number of algorithms are used by researchers. The most frequently used ones are:

  • Simple linear regression
  • Multiple linear regression
  • Logistic regression (despite the name, mostly used for classification)
  • Polynomial regression etc.