Multiple Linear Regression with Python

Multiple linear regression

Multiple linear regression (MLR) is a kind of linear regression, but unlike simple linear regression it uses more than one independent variable. It is sometimes loosely called multivariate regression, although strictly speaking that term refers to models with more than one dependent variable. In real-world situations almost every dependent variable is explained by more than one variable, so MLR is one of the most widely used regression methods, and it can be implemented with machine learning libraries.

Mathematical equation for Multiple Linear Regression

An MLR model can be expressed as:

$$Y_n = a_0 + a_1 X_{n1} + a_2 X_{n2} + \cdots + a_i X_{ni} + \epsilon_n = f(X_{n1}, \ldots, X_{ni}) + \epsilon_n$$

In the above model, Yn represents the response for case n. It has a deterministic part, a0 + a1Xn1 + ⋯ + aiXni, and a stochastic part, the random error term. Here a0 is the intercept, i is the number of independent variables, a1, …, ai are the regression coefficients and Xn1, …, Xni are the values of the independent variables for case n. The variable index runs from 1 to i, while n indexes the cases.
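To make the two parts of the equation concrete, here is a minimal NumPy sketch. The coefficient values, the number of cases and the noise level are made up for illustration and have nothing to do with the tree dataset used later.

```python
import numpy as np

rng = np.random.default_rng(0)

n_cases = 5                      # number of cases (n), arbitrary
a0 = 2.0                         # intercept, hypothetical value
a = np.array([1.5, -0.7, 0.3])   # coefficients a1..ai, hypothetical values

# X holds one row per case and one column per independent variable
X = rng.uniform(0, 10, size=(n_cases, a.size))

deterministic = a0 + X @ a                    # a0 + a1*Xn1 + ... + ai*Xni
stochastic = rng.normal(0, 1, size=n_cases)   # random error term of each case
Y = deterministic + stochastic

print(Y)
```

Each element of Y is the sum of a part fully determined by the independent variables and a random part that the model cannot explain.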

The main purpose of this regression technique is to develop a model that explains as much of the variance in the response as possible using the independent variables. The ratio of the variance explained by the model to the total variance of the response is known as the coefficient of determination and is denoted by R2. We will discuss this statistic in detail later.

For now, note that it is an important statistic in regression modelling for judging how good the model is. For a least-squares model evaluated on its own training data, the value of R2 lies between 0 and 1. When fitting a model we may face three situations: an underfitted model, a good fit and an overfitted model.

Underfit model

This situation arises when the value of R2 is low. A low R2 value indicates that the proposed model does not explain the variation of the response adequately, so the model needs improvement.

Good-fit model

In this case we have a good R2 value, which suggests a good fit of the model, so it can be used for prediction.

Overfit model

Sometimes models become too complex, with lots of variables and parameters. Such complex models fit the training data too well and give a very high R2 value, close to 1.0, but they cannot predict well when tested with a different set of data. This is because a model that is too complex becomes too specific to the particular data it was trained on. Such models are called overfitted models.
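The difference between a simple and an overly complex model can be sketched with synthetic data. The example below is only an illustration under stated assumptions: the data are made up, polynomial degree stands in for "model complexity", and the helper r_squared is defined here rather than taken from any library.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - sse / sst

rng = np.random.default_rng(42)

# Synthetic data: a truly linear relationship plus noise (made-up numbers)
x_train = np.linspace(0, 1, 12)
y_train = 3.0 * x_train + 1.0 + rng.normal(0, 0.3, size=x_train.size)
x_test = np.linspace(0.04, 0.96, 12)
y_test = 3.0 * x_test + 1.0 + rng.normal(0, 0.3, size=x_test.size)

results = {}
for degree in (1, 8):
    # Fit a polynomial of the given degree to the training data only
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_r2 = r_squared(y_train, np.polyval(coeffs, x_train))
    test_r2 = r_squared(y_test, np.polyval(coeffs, x_test))
    results[degree] = (train_r2, test_r2)
    print(f"degree {degree}: train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

The more flexible model can never fit the training data worse than the straight line. Whether it also does well on the test data is the question that separates a good fit from an overfit, which is why both numbers are printed side by side.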

Dataset used

The dataset used here is the same one we used in the simple linear regression article, but in this case all the explanatory (independent) variables are considered for modelling. The dataset is an imaginary one, based on my experience of modelling tree data.

The dataset contains data on total above-ground tree biomass and several other physical tree parameters, such as commercial bole height, diameter, height, first forking height, diameter at breast height and basal area. Tree_biomass is the dependent variable here, which depends on all the other, independent variables.

Here is a glimpse of the database:

If you find the variables difficult to understand, don’t bother about their names. Treat them as two categories of variables: one is the dependent variable, which I have denoted by y here, and the others are independent variables 1, 2, 3 and so on. What matters is the relationship between these two categories. Whatever their names may be, you just need some understanding of how they relate.

Assumptions for multiple linear regression

We carry out regression under certain conditions. If these conditions do not hold, the results of the regression cannot be trusted. They are called the regression assumptions and are listed below:

Assumption of linearity:

There must be a linear relationship between the independent variables and the response variable. The variables in this imaginary dataset have a linear relationship between them. You can easily check this property by plotting the response variable against each of the explanatory variables. 

Assumption of Homoscedasticity:

The residuals (errors), that is, the differences between observed and estimated values, must have constant variance.

Assumption of multivariate normality:

The residuals should follow a normal distribution. We can prepare a normal quantile-quantile plot to check this assumption.
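The idea behind a normal quantile-quantile plot can be sketched without any plotting. In the minimal example below the residuals are synthetic stand-ins, and the plotting positions (k - 0.5) / n are one common convention among several, chosen here as an assumption.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
residuals = rng.normal(0, 2.0, size=200)   # stand-in for model residuals (synthetic)

# Sort and standardise the residuals: these are the sample quantiles
sample_q = np.sort((residuals - residuals.mean()) / residuals.std(ddof=1))

# Theoretical standard-normal quantiles at plotting positions (k - 0.5) / n
n = sample_q.size
probs = (np.arange(1, n + 1) - 0.5) / n
theory_q = np.array([NormalDist().inv_cdf(p) for p in probs])

# For normal residuals the points (theory_q, sample_q) lie close to a straight
# line, so their correlation is close to 1
qq_corr = np.corrcoef(theory_q, sample_q)[0, 1]
print(f"Q-Q correlation: {qq_corr:.4f}")
```

Plotting sample_q against theory_q, for example with plt.scatter, gives the quantile-quantile plot itself. Strong curvature away from a straight line hints that the residuals are not normally distributed.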

Assumption of absence of multicollinearity:

There should be no multicollinearity between the independent variables i.e. the independent variables should not be linearly related to each other.
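One common diagnostic for this assumption is the variance inflation factor (VIF), which is not used elsewhere in this article and is shown here only as a hedged sketch. The helper vif below is written for illustration with plain NumPy, and the three variables are synthetic.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns (with an intercept)
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1.0 - residual.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)                     # unrelated to x1
x3 = x1 + rng.normal(scale=0.05, size=100)    # almost a copy of x1

vif_two = vif(np.column_stack([x1, x2]))
vif_three = vif(np.column_stack([x1, x2, x3]))
print(vif_two)     # both values stay near 1
print(vif_three)   # x1 and x3 get very large values
```

A value near 1 means a variable is hardly explained by the others. Large values, for which a threshold of about 10 is a frequently quoted rule of thumb, point to multicollinearity.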

Application of Multiple Linear Regression using Python

The main purpose of this article is to apply multiple linear regression using Python. This is the most important and also the most interesting part, so let’s jump into writing some Python code. As with simple linear regression, the required libraries have to be imported first.

Calling the required libraries

We will be using four main libraries here: NumPy and pandas for handling arrays and data frames, matplotlib for creating plots and sklearn for computing metrics. These are among the most important libraries for data science applications.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics

Importing the dataset

To import the tree dataset mentioned earlier, we will use the read_csv function of the pandas library.

#***** Importing the dataset ***********
dataset=pd.read_csv('tree.csv')

Defining variables

Now the next important task is to tell Python about the dependent and independent variables of the dataset. By convention, we will store the dependent variable in y and the independent variables in x. As I have already explained above, the dataset contains one dependent variable and 7 independent variables.

So we will store the variables in two NumPy arrays. As x has to store 7 independent variables, it has to be a 2-dimensional array, whereas y, having only one column, can be one-dimensional. The Python code for this purpose is as below:

#***** Defining variables *************
x=dataset.iloc[:,: -1].values
y=dataset.iloc[:,-1].values

Here the first “:” selects all the rows. The column selector “:-1” takes every column except the last one, and “-1” takes the last column only. As the dataset contains the dependent variable, tree_biomass, as the rightmost column, Python indexes it with -1.
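The effect of these two selectors can be checked on a tiny made-up table. The column names and values below are hypothetical and only mimic the layout described above, with the response in the last column.

```python
import pandas as pd

# Tiny made-up table: two independent variables and the response in the last column
toy = pd.DataFrame({
    "height": [10.0, 12.5, 15.0],
    "diameter": [0.20, 0.25, 0.31],
    "biomass": [110.0, 150.0, 205.0],
})

x = toy.iloc[:, :-1].values   # all rows, every column except the last
y = toy.iloc[:, -1].values    # all rows, last column only

print(x.shape)   # (3, 2): two-dimensional
print(y.shape)   # (3,): one-dimensional
```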

Checking the assumption of the linear relationship between variables

For example, here I have plotted tree_height against the dependent variable tree_biomass. It is evident that biomass will increase as tree height increases, but a scatterplot is still a very handy visualization technique to double-check the property. You can prepare this plot very easily using the code below:

#********* Plotting dependent variable against any independent variable 
plt.scatter(x[:,2],y) # accessing the variable tree_height
plt.title("Checking linearity between dependent and independent variables")
plt.xlabel("Tree height")
plt.ylabel("Tree biomass")

I stored the variables in NumPy arrays earlier, so to access them we just have to mention the column index of the variable we intend to plot. For plotting we have used the scatter function of matplotlib's pyplot module, imported as plt.

And here is the plot:

The plot suggests almost a linear relationship between the variables.

Splitting the dataset in training and test data

For testing purposes, we need to set aside a part of the complete dataset which will not be used for model building. The rule of thumb is to use 80% of the data for modelling and keep the rest aside. It will work as an independent dataset once we have the model and need to test it.

#****** Dividing the dataset into training and testing dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.2, random_state=0)

Here this data splitting task has been performed with the help of model_selection module of sklearn library. This module has an inbuilt function called train_test_split which automatically divides the dataset into two parts. The argument test_size controls the proportion of the test data. Here it has been fixed to 0.2 so the test dataset will contain 20% of the complete data.
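A quick way to see what test_size=0.2 does is to split a small made-up array and look at the resulting shapes. The arrays below are placeholders, not the tree data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(20).reshape(10, 2)   # 10 made-up cases, 2 independent variables
y = np.arange(10)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)   # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)   # (8,) (2,)
```

Fixing random_state makes the shuffling reproducible, so the same cases end up in the test set every time the code is run.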

Application of multiple linear regression

Here comes the main part of this article, that is, using regression to model the response from the known values of more than one independent variable. In the above section we created the training dataset. The following code will use this training data for model building.

#********* Application of regression
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train, y_train)

As this is also a linear regression method, the linear_model module of the sklearn library contains the required class, LinearRegression. Here regressor is an instance of the LinearRegression class, and its fit method estimates the model from the training data.

Getting the regression coefficients for the regression equation

As the regression is done, we need the regression equation. This equation is actually the relation between the dependent and independent variables defined by some coefficients. Using these coefficients we can determine how a unit change in any of the independent variables is going to affect the dependent variable.

#******** Getting the coefficients stored in a dataframe
#*****************************************************************
# storing the column names of the independent variables
# (every column except the last one, matching the columns stored in x)
colnames=dataset.columns[:-1]
print(colnames)
# printing the intercept
print(regressor.intercept_)
# creating a dataframe storing the coefficients along with the independent variable names
coef_df=pd.DataFrame(regressor.coef_,index=colnames,columns=['Coefficients'])
print(coef_df)

In the above section of code, you can see that first the names of the independent variables are stored in a variable. Then the corresponding coefficients are fetched from regressor, the instance of the LinearRegression class from the linear_model module of sklearn. The coefficients are stored in regressor.coef_ and the intercept in regressor.intercept_.

Printing the regression coefficients

The regression equation

With the help of these coefficients we can now write down the multiple linear regression equation.

The multiple linear regression equation

So, this is the final equation for the multiple linear regression model.
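As a minimal sketch of how an intercept and a set of coefficients turn into an equation, the example below fits a model to made-up data that follow a known relationship and then prints the fitted equation as text. The data, the variable names x1 and x2 and the string formatting are assumptions for illustration, not the tree model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follow y = 1 + 2*x1 + 3*x2 exactly
x = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]])
y = 1.0 + 2.0 * x[:, 0] + 3.0 * x[:, 1]
names = ["x1", "x2"]   # hypothetical variable names

model = LinearRegression().fit(x, y)

# Join the intercept and the "coefficient * name" terms into one readable string
terms = [f"{coef:+.2f}*{name}" for coef, name in zip(model.coef_, names)]
equation = f"y = {model.intercept_:.2f} " + " ".join(terms)
print(equation)
```

Because the made-up data contain no noise, the fitted intercept and coefficients recover the values used to generate them.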

Using the model to predict using the test dataset

Now we have the model in our hand. But how can we test its efficiency? If the model is a good one then it should have the capability to predict with precision. And to test that we will need independent data which was not involved during model building.

Here comes the role of test dataset that we kept aside at the very beginning. We will predict the response using the test dataset and compare the prediction with the observations we already have in our hand. The following code will do the trick for us.
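As a minimal sketch of this step, the example below predicts the held-out data and builds the side-by-side table, using the same predict call and data frame layout as in the simple linear regression article. It uses synthetic stand-in data rather than tree.csv, and the name comparison is chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 50 cases, 3 independent variables (not the tree dataset)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 3))
y = 5.0 + x @ np.array([2.0, 0.5, 1.0]) + rng.normal(0, 0.5, size=50)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)
regressor = LinearRegression().fit(x_train, y_train)

# Predict the response for the held-out test data
y_predict = regressor.predict(x_test)

# Observed and predicted values side by side
comparison = pd.DataFrame({"Actual": y_test, "Predicted": y_predict})
print(comparison.head(15))
```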

And here is the comparison. I have created a dataframe with the observed and predicted values side by side for the ease of comparison.

Comparing the original and predicted values

In the above figure, I have shown only the first 15 values of the dataframe. But it is enough to show that the prediction is satisfactory. 

Goodness of fit of the model 

We have tested the model and got a good prediction. However, we have not quantified it yet: we do not have any number to tell us how good the model is. Here I will discuss the fit statistics that are very useful in this respect. If we have to compare multiple models, these numbers play a crucial role in finding the best of them.

The following code will deliver the fit statistics popularly used to judge the goodness of a statistical model. The first of them, the coefficient of determination, denoted as R2, is the proportion of the variance in the response variable that is explained by the proposed model. So the higher its value, the better the model.

Coefficient of determination (R2)

Suppose our test dataset has n cases, with independent-variable values (x1, x2, …, xn) and observed responses (y1, y2, …, yn). Using our developed model we obtain the predicted values (v1, v2, …, vn). Writing $\bar{y}$ for the mean of the observed responses, the total sum of squares is:

$$SS_{\text{total}} = \sum_{k=1}^{n} (y_k - \bar{y})^2$$

This is the total existing variation in the response variable.

The variation explained by the model we developed is the regression sum of squares, which can be calculated as:

$$SS_{\text{regression}} = \sum_{k=1}^{n} (v_k - \bar{y})^2$$

So, following the definition of the coefficient of determination, it can be calculated as:

$$R^2 = \frac{SS_{\text{regression}}}{SS_{\text{total}}}$$

It can be simplified further by writing the regression sum of squares as the total variance minus the unexplained variance. The unexplained variance is the variance the model is not able to explain. It is also known as the error or residual sum of squares and is calculated as:

$$SS_{\text{error}} = \sum_{k=1}^{n} (y_k - v_k)^2$$

So, now we can rewrite the equation of R2 as:

$$R^2 = 1 - \frac{SS_{\text{error}}}{SS_{\text{total}}}$$

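#****** Calculating fit statistics
# predictions for the held-out test data
y_predict=regressor.predict(x_test)
# R square on the training data (how well the model fits the data it was built on)
r_square=regressor.score(x_train, y_train)
print('Coefficient of determination(R square), training data:',r_square)
# R square on the test data (how well the model generalizes)
print('Coefficient of determination(R square), test data:',regressor.score(x_test, y_test))
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_predict))
print('Mean Squared Error:', metrics.mean_squared_error(y_test,y_predict))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_predict)))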
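The sums of squares behind R2 can also be computed by hand, which is a useful cross-check of the definitions. The observed and predicted values below are made up for illustration.

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # observed values (made up)
v = np.array([2.8, 5.3, 7.1, 8.7, 11.1])   # predicted values (made up)

sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
sse = np.sum((y - v) ** 2)          # error (residual) sum of squares
r2 = 1.0 - sse / sst                # coefficient of determination

print(sst, sse, r2)
```

Here the total sum of squares is 40 and the error sum of squares is 0.24, so R2 comes out at about 0.994.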

Mean Absolute Error(MAE)

This is another popular measure of model fit. As the name suggests, it is based on the simple differences between observed and predicted values. As we are only interested in the size of the deviations, we take the absolute value of each difference and then average them. So the expression will be:

$$MAE = \frac{1}{n} \sum_{k=1}^{n} |y_k - v_k|$$

As it measures the error of the estimated values, a lower MAE suggests a better model.

Mean Squared Error (MSE)

This is also a measure of the deviation of the model estimates from the original values. But instead of the absolute values, we take the squared values of the deviations and average them. It is often also called the Mean Squared Deviation (MSD) and is calculated as:

$$MSE = \frac{1}{n} \sum_{k=1}^{n} (y_k - v_k)^2$$

Root Mean Squared Error (RMSE)

As the name suggests, this measure of fit first calculates the difference between the observed and model-predicted values, squares each error, calculates the mean and finally takes the square root to get the RMSE. So its equation is:

$$RMSE = \sqrt{\frac{1}{n} \sum_{k=1}^{n} (y_k - v_k)^2}$$

Fit statistics
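The three error measures described above can be reproduced with a few lines of NumPy. The observed and predicted values are made up for illustration.

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # observed values (made up)
v = np.array([2.8, 5.3, 7.1, 8.7, 11.1])   # predicted values (made up)

errors = y - v
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error

print(mae, mse, rmse)
```

MAE and RMSE are in the same unit as the response, while MSE is in squared units, which is one reason RMSE is often reported alongside it.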

How can the fitting further be improved?

There is always scope for improving the model so that it gives more precise predictions. As we already know, the main purpose of multiple linear regression is to attribute as much of the variance of the response variable as possible to the independent variables.

Here lies the trick to improving the prediction of a multiple linear regression model. The response variable you are dealing with is affected by a number of explanatory variables. Some of them are immediately visible to us, and we can say with confidence that they are the main contributors to the response. All together they can give you a good explanation too.

But with good knowledge of the domain one can identify many other variables that are not directly recognizable as causal factors. For example, in an agricultural experiment, crop yield is determined by many direct and indirect factors, such as physiological and chemical variables, weather variables and soil conditions.

So, the skill and domain knowledge of the researcher play a vital role in choosing variables wisely in order to improve the model’s fit. Using too few variables will result in a poor R2, whereas using too many variables may produce a very complex model with a very high R2. In both of these scenarios the model’s performance will not be up to the mark.

Simple linear regression with Python

Simple Linear Regression

Simple linear regression is the most basic form of regression. It is the foundation of statistical and machine learning modelling techniques. All the advanced techniques you may use in the future build on the ideas and concepts of linear regression. It is the most basic skill for exploring your data and taking a first look into it.

Simple linear regression is a statistical model which studies the relationship between two variables, where one of them is dependent on the other. A simple example of two such variables is the height and weight of the human body. From our experience, we know that the body weight of a person is correlated with their height.

The body weight changes as the height changes. So here body weight and height are the dependent and independent variables, respectively. The task of simple linear regression is to quantify the change that happens in the dependent variable for a unit change in the independent variable.

Mathematical expression

We can express this relationship using a mathematical equation. If we express a person’s height and weight with X and Y respectively, then a simple linear regression equation will be:

Y=a.X+b

With this equation, we can estimate the dependent variable corresponding to any known value of the independent variable. Simple linear regression helps us to estimate the coefficients of this equation. Once a is known, we can say that for a one-unit change in X, Y changes by a units.

See the figure below: a in the equation is the slope of the line and b is the intercept on the Y-axis, that is, the value of Y when X is 0.

Simple linear regression
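For a single independent variable, the least-squares estimates of a and b have a simple closed form. The sketch below uses made-up heights and weights and is only meant to show where the two numbers come from. It is the same kind of fit that the LinearRegression class of sklearn performs later in this post.

```python
import numpy as np

# Made-up heights (X, in cm) and weights (Y, in kg)
X = np.array([150.0, 160.0, 165.0, 172.0, 180.0])
Y = np.array([52.0, 58.0, 63.0, 68.0, 75.0])

# Least-squares estimates of the slope a and the intercept b
a = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b = Y.mean() - a * X.mean()

print(f"Y = {a:.3f} * X + {b:.3f}")
```

The slope a tells how many kilograms the estimated weight changes for each additional centimetre of height, and b is the value of the fitted line at X = 0.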

As the primary focus of this post is to implement simple linear regression through Python, I will not go deeper into the theoretical part. Instead we will jump straight into the application.

Before we start coding with Python, we should know about the essential libraries we will need. The three basic libraries are NumPy, pandas and matplotlib. I will discuss these libraries briefly in a bit.

Application of Python for simple linear regression

I know you were waiting for this part. So, here is the main part of this post, that is, how we can implement simple linear regression using Python. For demonstration purposes I have selected an imaginary dataset which contains data on total above-ground tree biomass and several other physical tree parameters, such as commercial bole height, diameter, height, first forking height, diameter at breast height and basal area. Tree biomass is the dependent variable here, which depends on all the other, independent variables.

Here is a glimpse of the database:

Dataset for  regression

From this complete dataset, we will use only Tree_height_m and Tree_biomass (kg) for this present demonstration. So, here the dataset name is tree_height and has the look as below:

Dataset for Simple linear regression

Python code for simple linear regression

Importing required libraries

Before you start coding, the first task is to import the required libraries. Give them short names so that you can refer to them easily later in the code.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

These are the topmost important libraries for data science applications. These libraries contain several classes and functions which make performing data analysis tasks in Python super easy. 

For example, NumPy and pandas are the two libraries which provide the array, matrix and data frame operations. They allow users to perform the complex matrix operations required for machine learning and artificial intelligence research in a very intuitive manner. The name NumPy comes from “Numerical Python”.

Matplotlib, in turn, is a full-fledged plotting library that builds on NumPy. Its main function is to provide an object-oriented API for embedding useful graphs and plots in applications.

These libraries get installed automatically if you install Python through Anaconda, which is a free and open-source distribution of Python and R for data science computation. As the libraries are already installed, you just have to import them.

Importing dataset

dataset=pd.read_csv('tree_height.csv')
x=dataset.iloc[:,:-1].values
y=dataset.iloc[:, 1].values

Before you use this piece of code, make sure the .csv file you are about to import is located in the current working directory of your Python session. Otherwise, Python will not be able to find the file.

Then we have to create two variables to store the independent and dependent data. Here the use of matrix needs special mention. Please keep in mind that the dataset I have used has the dependent (Y) variable in the last column. So, while storing the independent variable in x, the last column is excluded and for dependent variable y, the location of the last column is considered.

Splitting the dataset in training and testing data

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=1/4, random_state=0)

This is of utmost importance when we are performing statistical modelling. Any model developed should be tested with an independent dataset which has not been used for model building. As we have only one dataset in hand, I have split it into two independent datasets in a 75:25 ratio, as set by test_size=1/4.

The training data consist of 75% of the data and are used for training the model, whereas the remaining 25% is kept aside for testing the model. Luckily, the sklearn library for Python already has a module called model_selection, which contains a function called train_test_split. We can easily get this data split done using this function.

Application of linear regression

from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train,y_train)

This is the main part, where the regression takes place using the LinearRegression class of the sklearn library.

Printing coefficients

#To retrieve the intercept:
print(regressor.intercept_)
#For retrieving the slope:
print(regressor.coef_)

Here we can get the expression of the linear regression equation with the slope  and intercept constant.

Validation plot to check homoscedasticity assumption

#***** Plotting residual errors in training data
plt.scatter(regressor.predict(x_train), (regressor.predict(x_train)-y_train),
            color='blue', s=10, label = 'Train data')
# ******Plotting residual errors in testing data
plt.scatter(regressor.predict(x_test),regressor.predict(x_test)-y_test,
            color='red',s=10,label = 'Test data')
#******Plotting reference line for zero residual error
plt.hlines(y=0,xmin=0,xmax=60)
plt.title('Residual Vs Predicted plot for train and test data set')
# predicted values are on the x-axis and residuals on the y-axis
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend()
plt.show()

For the data used here this part will create a plot like this:

This part checks an important assumption of linear regression, namely that the residuals are homoscedastic, which means that they have equal variance. If this assumption fails, the conclusions drawn from the regression are not reliable.

Predicting the test results

y_predict=regressor.predict(x_test)

The independent test dataset is now in use to predict the result using the newly developed model.

Printing actual and predicted values

new_dataset=pd.DataFrame({'Actual':y_test.flatten(), 'Predicted':y_predict.flatten()})
new_dataset

Creating scatterplot using the training set

plt.scatter(x_train, y_train, color='red')
plt.plot(x_train, regressor.predict(x_train), color='blue')
plt.title('Tree height vs tree weight')
plt.xlabel('Tree height (m)')
plt.ylabel('Tree weight (kg)')
plt.show()

Visualization of model’s performance using test set data

plt.scatter(x_test, y_test, color='red')
plt.plot(x_test, regressor.predict(x_test), color='blue')
plt.title('Tree height vs tree weight')
plt.xlabel('Tree height (m)')
plt.ylabel('Tree weight (kg)')
plt.show()

Calculating fit statistics for the model

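# R square on the training data (how well the model fits the data it was built on)
r_square=regressor.score(x_train, y_train)
print('Coefficient of determination(R square), training data:',r_square)
# R square on the test data (how well the model generalizes)
print('Coefficient of determination(R square), test data:',regressor.score(x_test, y_test))
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_predict))
print('Mean Squared Error:', metrics.mean_squared_error(y_test,y_predict))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_predict)))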

This is the final step: finding the goodness of fit of the model. This piece of code generates some statistics which tell you quantitatively how your model performs. Here the four most important and popular fit statistics are calculated. Except for the coefficient of determination, the lower the value of a statistic, the better the model.

Getting started with Python for Machine Learning: beginners guide

Getting started with Python for Machine Learning

If you are reading this article, then you are without any doubt a Machine Learning enthusiast. You must have already gone through the theoretical basics and be getting impatient to try your hand at your first Machine Learning application. Python is the most popular programming language for machine learning. I would suggest that if you want a career in data science, Python is the language you should bet on.

So, this article is for you. Here I will demonstrate how to set up Python and get started with your first simple program.

But first of all the question is….

Why Python for machine learning?

Why have I chosen Python for Machine Learning? There are lots of tools available, and some of them are very popular too. For example, R is a very reputed language and has been around for a long time.

People with a traditional statistical or mathematical background, especially, have a strong inclination towards R. One of the reasons behind this popularity is that R came into existence as a replacement for S, a purely statistical programming language developed on the C platform that was hugely popular amongst statisticians.

Python Vs R

R was developed in the early 1990s and has a specific edge for data analysis tasks. It lets you break a task down into a series of steps and procedures. Both R and Python are open source and freely available, and the online resources for both are huge.

R is mainly helpful for core statistical and data analytics purposes. The language was developed by statisticians, mainly with the needs of statisticians in mind. It has very powerful graphics and visualization packages such as ggplot2, ggvis and shiny. If you want to create eye-catching plots from your data, R should be your best friend.

On the other hand, Python came a little earlier: Guido van Rossum, a Dutch programmer, started developing it in 1989 and first released it in 1991. It grew slowly and steadily until 2010, but after that, with the start of the data explosion era, its popularity shot up quickly.

The main reason behind this rapid rise in popularity is its simplicity and versatility. Machine Learning and Artificial Intelligence involve many complex algorithms for performing complex tasks. But the beauty of Python is that it makes these tasks easy for both machine learning and AI with its vast collection of simple-to-use functions.

The use of Python in data science is just one of its capabilities. Being a general-purpose language, Python can be used for developing web applications, desktop software and mobile applications, and even for reading and modifying files or connecting to databases. This versatility has won the hearts of millions of people, whether they are data scientists or computer science enthusiasts.

If you are a beginner in data science, you can jump-start learning and applying Python even with little or no background in programming languages. It is also often considered a better performer than R when it comes to analyzing large datasets.

The following chart from Economist.com will help you to realize how popular Python has become recently surpassing all other big names like Java, R, C++ etc.

Source: steelkiwi.com, economist.com

In the data science world these two programming languages are close competitors. Both of them are very popular and have their own pluses and minuses. Ultimately, which platform you should use is purely your choice.

Having said this, I think the popularity and simplicity of Python in machine learning applications will keep it slightly ahead of R. And if you are looking to build a career as a data scientist, in my opinion, the future is brighter with Python skills.

Setting up Python in your computer

To start with Python, the first step is to install it on your computer. If your desktop or laptop is a new one, there is a chance that it has Python preinstalled. You can check your start menu for it. If you find it there, skip this step.

Download Python

If it is not installed already, then you have to download and install it from Python.org.

So, from here, choose the specific Python version that suits your computer and download it. At the time of writing, Python 3.8.2 is the latest version, so you can download that. If you have an old system running Windows XP, you have to download an older compatible version, lower than Python 3.5.

After you have downloaded the file, click it to start the installation. Just go with the recommended installation process. It is quick, and within minutes Python is installed on your system.

The following window will appear as the Python installation is finished.

Now you can check your computer's start menu. The Python folder with its associated applications will be there.

As I have installed it just now, it has a “New” tag on all its applications. Now that Python is installed, you can launch it directly and start coding.

In the above screenshot, you can see that the console shows the details of the installed Python version. I have also run some basic commands, such as print and a simple calculation.
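A first session of this kind could look like the following. The message and the numbers are arbitrary examples, not the exact commands from the screenshot.

```python
# A first command: print a message to the console
print("Hello, Python!")

# Simple arithmetic works directly; multiplication is done before addition
total = 2 + 3 * 4
print(total)   # 14
```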

But to write Python code comfortably, we need a good IDE that helps us with Python syntax in an intuitive way.

Selecting a Python IDE

A simple IDE called IDLE gets installed automatically with Python, but we prefer to use a more popular and advanced IDE called PyCharm. The reason is that getting familiar with an IDE takes significant time for any programming language, so we should choose a good IDE from the start and continue our work in it.

PyCharm is currently one of the most popular IDEs for Python. See the following table, which compares some popular Python IDEs. PyCharm also comes in a paid version, but you get a full-featured integrated environment in both of them.

Source: www.softwaretestinghelp.com

Apart from these IDEs, simple text editors like Notepad++ are also very popular amongst data scientists. The only issue with text editors is that you have to use additional plugins to run the code written in them. In this respect IDEs come in handy, because you can do the complete task, from writing code to running it, in one place.

Having said this, the final selection is completely your choice. Python being one of the most popular programming languages, users have the luxury of choosing an IDE from a vast collection. And honestly speaking, it is difficult to judge any single IDE to be the best one. Every one of them has its own strengths and weaknesses.

So though I have selected PyCharm here, you can select any other too. It will not take you much time to switch between IDEs once you make your basics strong.

So, let’s start installing PyCharm

PyCharm is a product of JetBrains. Open the download page by following this link:

https://www.jetbrains.com/pycharm/download/#section=windows

Step 1: 

Click the download button under Community. It is the free, open-source version of PyCharm.

Step 2:

Run the downloaded installer and click the Next button.

Step 3:

In the next step, if you want to change the program location, provide the path. Otherwise you can go with the default path assigned. I am going with the default folder here. Then click Next.

Python for machine learning

Step: 4

Next window will allow you to create a desktop icon of PyCharm and you can also update the path variable. To proceed click next.

Python for machine learning

Step: 5

Here you can change the start menu folder, then click “Install”.

Python for machine learning

Step: 6

The installation will start. It takes a few minutes. As the installation is done click next.

Step: 7

In the next window click “Finish” to complete the installation process.

Starting PyCharm for the first time

Now PyCharm is installed on your computer. Go to the Start menu and launch the program. The first window that appears shows the privacy policy. Tick the box to accept the terms and conditions and click Continue.

Next is the data sharing window. It’s completely your choice. Choose any of the options and proceed.

In the next window, you choose the appearance of your IDE. Pick the one you feel comfortable with and click “Skip Remaining and Set Defaults”. You can change all these options later at any time.

The next window is important: it lets you choose the location where your Python project will be created. I like to save all my important files in the cloud, so I have provided that path there. You can change it here or later.

So now you are all set to start your journey in Python programming with the PyCharm IDE.


Unsupervised Machine Learning: a detailed discussion

Unsupervised Machine Learning

Unsupervised Machine Learning is a kind of Machine Learning in which the algorithm identifies hidden patterns in the data on its own. It is used when no labeled data is available to train the algorithm.

Unlike in Supervised Machine Learning, the input dataset here is not tagged with known answers. In many cases we have to deal with situations that are completely new: the experimenter has no prior experience with the data in hand, and its distribution and parameters are unknown.

In such cases Supervised Learning is not feasible, so we have to go for Unsupervised Machine Learning. The main problem with this approach is that there is no test dataset labeled with the correct answers against which to check the result. That is why its accuracy is harder to verify and is generally lower than that of supervised learning.

Learning process of a baby

The application of unsupervised learning resembles the way babies learn. At first they learn by themselves; no one teaches them. They start identifying objects from their own experience.

Learning process of a baby is similar to the unsupervised learning
Photo by Guillaume de Germain on Unsplash

For example, from birth a baby sees humans, and no one teaches the baby what characterizes a human. But whenever the baby sees a new person, it matches the characteristics and recognizes the newcomer as a human being. This is a very basic example of unsupervised learning.

Application of Unsupervised Machine Learning

Although this approach is generally less accurate, it is useful for finding hidden patterns in the data.

Speech recognition

You might have used Google’s speech recognition tool. It is a handy tool for converting your speech into text. When you have to write a lot of text, you can certainly use it to your advantage. I also use it frequently while writing my articles in Google Docs.

The point is that unsupervised machine learning plays a part in the technology behind such tools. Annotating voice recordings with text is very costly, so large amounts of labeled data are often not available to train the algorithm.

Detection of anomaly

Unsupervised methods also come in handy for detecting extreme values in a dataset. Such values are generally outliers: erroneous observations caused by mechanical faults or mistakes during data collection, fraudulent transactions in a bank statement, and the like.
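As a rough illustration of the idea, here is a minimal sketch that flags extreme values by their z-score, that is, by how many standard deviations they lie from the mean. The transaction amounts and the threshold of 2.0 are assumptions made up for this example; real anomaly detection usually needs more robust methods.

```python
from statistics import mean, stdev

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts; one is clearly out of line.
amounts = [52.0, 48.5, 50.2, 49.9, 51.3, 47.8, 50.6, 500.0]
outliers = find_outliers(amounts)
print(outliers)
```

Note that no labels are needed here: the data alone decides which observation looks unusual.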

Clustering of data

Clustering is the grouping of data on the basis of some similarity. It reveals the structure of the data and helps in designing a classifier.

Finding hidden patterns and features of the data

Unsupervised learning uncovers hidden patterns and features of the data, which in turn helps in categorization.

Issues with unsupervised machine learning

The process has some inherent issues which you must consider before applying it.

  • Unsupervised learning results are generally less accurate than those of supervised learning, since there are no known answers to learn from.
  • Performing unsupervised learning is much more complicated than performing supervised learning.
  • Validating the model is difficult because there is no labeled data to compare the result against.

Types of unsupervised machine learning

Unsupervised machine learning can be further grouped into two broad categories which are clustering and association problems.

Clustering

Clustering is of great importance in unsupervised learning. This technique finds similarities in uncategorized data and groups the data points into different clusters. The clustering process is hugely beneficial for gathering basic information about the data in hand and for finding patterns and features of a dataset that is otherwise completely unknown to the researchers.

Clustering in unsupervised machine learning

We can decide how many clusters to create. The clusters are formed so that the within-cluster variance is lower than the between-cluster variance. In terms of similarity, members of the same cluster are similar to each other, whereas members of different clusters are dissimilar.

We perform this clustering through several approaches. 

Hierarchical clustering

Here every data point is considered an individual cluster to start with. Then, on the basis of similarity, the most similar data points are merged to form a single cluster. This process continues until the desired number of clusters is reached.

Probabilistic clustering

As the name suggests, here we do the clustering on the basis of a probability distribution. For example, suppose we have keywords like

“Boys’ school”

“Girls’ school”

“Girls’ college”

“Boys’ college”

Then the clusters can form along either of two groupings: “boy” and “girl”, or “school” and “college”.

Exclusive clustering

If the data points are exclusive to a particular category, we form the clusters directly according to that exclusivity. Here no data point can belong to more than one cluster.

Overlapping clustering

In contrast to exclusive clustering, in overlapping clustering one data point can belong to more than one cluster. To achieve such clustering, we use fuzzy sets.

Clustering algorithms

There are some popular algorithms to perform clustering. In this article, I will briefly discuss them. Each of them will have an elaborate discussion in separate articles.

K-means 

K-means clustering groups data points into k clusters. If the value of k is large, the clusters are small; if k is small, the clusters are bigger.

Every cluster has a central value called the centroid, which is the heart of the cluster. Each data point is assigned to the cluster whose centroid is nearest to it.
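The centroid idea can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming two-dimensional points, Euclidean distance and hand-picked starting centroids; the function name `kmeans` and the sample points are made up for this example, and library implementations choose the starting centroids more carefully.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Basic k-means: alternate an assignment step and a centroid update step."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has settled
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

Here k is simply the number of starting centroids passed in, so trying a different k means passing a different number of starting points.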

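K-Nearest Neighbors

Strictly speaking, K-nearest neighbors is a supervised classification method: it assigns a new point the majority label of its k closest labeled points. It is often mentioned alongside clustering algorithms because it relies on the same notion of distance between data points. It is a simple algorithm and performs well when there is a significant distance between the groups of sample points, but it takes considerable time when the dataset is large.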
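A minimal sketch of the nearest-neighbour idea follows. It assumes a small set of example points that already carry labels ("A" and "B" are hypothetical), Euclidean distance and a simple majority vote; the function name `knn_predict` is made up for this illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled points forming two well-separated groups.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.3), "A"),
         ((5.0, 5.2), "B"), ((5.5, 4.8), "B"), ((4.9, 5.1), "B")]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

Every prediction sorts the whole training set by distance, which is why the method slows down as the dataset grows.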

Principal Component Analysis

It is a variable reduction technique. The basic objective of PCA is to compute a smaller number of new variables that retain as much as possible of the variance explained by the original variables.
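To make this concrete, here is a minimal sketch for the simplest case of two variables, where the first principal component can be computed in closed form from the 2x2 covariance matrix. The function name `pca_2d` and the sample data are assumptions for illustration; for more variables one would normally rely on a linear algebra library.

```python
import math

def pca_2d(points):
    """Return the first principal direction and the share of variance it explains."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    largest = half_trace + root
    # Matching eigenvector, normalised to unit length.
    if sxy != 0:
        vx, vy = sxy, largest - sxx
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    length = math.hypot(vx, vy)
    direction = (vx / length, vy / length)
    explained = largest / (sxx + syy)  # assumes the data is not all one point
    return direction, explained

# Hypothetical data lying close to the line y = 2x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
direction, explained = pca_2d(data)
print(direction, explained)
```

In this example a single new variable, the position along `direction`, keeps almost all of the variance of the two original variables.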

Hierarchical clustering

This is a bottom-up technique. It is hierarchical in the sense that it starts by considering each data point as a cluster and then keeps merging the closest clusters. This process continues until only one cluster remains.
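The merging process can be sketched as follows. This is a minimal illustration that assumes single linkage (the distance between two clusters is the distance between their closest members) and stops at a chosen number of clusters; the function name `agglomerative` and the sample points are made up for this example.

```python
import math

def agglomerative(points, target_clusters):
    """Single-linkage agglomerative clustering down to `target_clusters` groups."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge the two closest clusters
        del clusters[j]
    return clusters

# Hypothetical points: a group of three, a pair, and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
result = agglomerative(points, target_clusters=3)
print(result)
```

Calling the function with `target_clusters=1` runs the merging all the way until a single cluster remains.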

Fuzzy K-means

This is a more generalized form of K-means clustering. Here too, clusters are formed around centroids. The difference is that in simple K-means a data point either belongs to a cluster or does not, with no in-between position, whereas the fuzzy K-means algorithm assigns each data point a degree of membership in every cluster, a value between 0 and 1 that depends on its distance from the centroid. K-means clustering is simply a special case of fuzzy K-means in which the membership is either 1 or 0.
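As a small sketch of how such weights can be computed, the function below uses the standard fuzzy c-means membership formula for fixed centroids. The fuzziness parameter `m = 2`, the centroids and the query point are assumptions chosen for illustration, and the sketch leaves out the step that updates the centroids.

```python
import math

def fuzzy_memberships(point, centroids, m=2.0):
    """Membership weights of one point in each cluster, for fixed centroids."""
    distances = [math.dist(point, c) for c in centroids]
    # A point sitting exactly on a centroid belongs fully to that cluster.
    if 0.0 in distances:
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
            for d_i in distances]

# Hypothetical centroids and a point lying closer to the first one.
centroids = [(0.0, 0.0), (10.0, 0.0)]
weights = fuzzy_memberships((2.0, 0.0), centroids)
print(weights)
```

The weights always add up to 1, and the closer a point lies to a centroid, the larger its weight for that cluster.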

Association

Association is also about identifying patterns or features in a large database. Unsupervised machine learning uses association rules to find interesting relationships between variables. For example, the students in a class can be analyzed with association rules based on their choice of subjects.
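Two common measures of an association rule are its support and its confidence. The sketch below computes both for the student example; the subject choices are hypothetical data, and the helper names `support` and `confidence` are made up for this illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated chance of `consequent` given `antecedent` in the data."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical subject choices of five students.
choices = [
    {"maths", "physics"},
    {"maths", "physics", "chemistry"},
    {"maths", "biology"},
    {"physics", "chemistry"},
    {"maths", "physics"},
]

print(support(choices, {"maths", "physics"}))
print(confidence(choices, {"maths"}, {"physics"}))
```

A rule such as "students who choose maths also choose physics" is considered interesting when both its support and its confidence are high.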

Summary

So, we can summarize some important points about unsupervised machine learning as follows:

  • Unsupervised machine learning is the type of machine learning where we don’t use any labeled data.
  • With no labeled data, there is no supervision of the result and validation is difficult.
  • It is generally less accurate than supervised machine learning.
  • Unsupervised learning is more complicated than supervised learning.
  • Unsupervised learning proves helpful when we have no idea about the data and its distribution and parameters are unknown.
  • Two main methods of conducting unsupervised machine learning are clustering and association.


Supervised Machine Learning: a beginner’s guide

Supervised Machine Learning

The most common type of Machine Learning is Supervised Machine Learning. The name comes from the fact that the learning process is supervised by results that are already known. Learning goes through several iterations and continues until the difference between the actual and estimated results falls to an acceptable level.
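The loop described above can be sketched with a very small example. The code below assumes a model of the form y = w * x, uses gradient descent on the mean squared error as the learning rule, and stops when the error drops below a tolerance; the data, the learning rate and the tolerance are all hypothetical choices for illustration.

```python
def train(xs, ys, learning_rate=0.01, tolerance=1e-6, max_iterations=10_000):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    for iteration in range(max_iterations):
        errors = [w * x - y for x, y in zip(xs, ys)]
        mse = sum(e * e for e in errors) / len(xs)
        if mse < tolerance:
            break  # the error has reached an acceptable level
        # Gradient of the mean squared error with respect to w.
        gradient = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
        w -= learning_rate * gradient
    return w, mse, iteration

# Hypothetical labelled data generated by y = 3x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]
w, mse, iterations = train(xs, ys)
print(w, mse, iterations)
```

The known answers in `ys` supervise each step: every iteration compares the estimate with the actual value and nudges `w` to shrink the difference.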

“Computers are able to see, hear and learn. Welcome to the future.”

~Dave Waters. Department of Earth Sciences, University of Oxford Associate Professor of Metamorphic Petrology (retired)

The data used in supervised machine learning is called “labelled data” because it is already tagged with the right answers. Once training is complete and a robust model is achieved, new inputs are provided. The task of the model is now to predict the labels of these unseen inputs based on what it learned from the labelled data.

In mathematical notation, the output variable Y is a function of the input variable X:

Y=f(X)

During the training phase of supervised machine learning, both X and Y are known; what is unknown is the function f. The algorithm tries to find the mapping function that predicts Y most precisely.

Example of Supervised machine learning

You must have come across the term pattern recognition, online or offline. It is something of a buzzword today and is used to make our lives more sophisticated and comfortable. From simple applications like your smartphone’s face recognition or handwriting recognition to advanced uses such as cancer cell detection, supervised learning is the essence of pattern recognition.

Its simple applications are already making our lives easier, be it your smartphone’s face lock, handwriting recognition or voice recognition. The self-driving car concept also depends heavily on supervised learning. Nowadays you can find it at work in every sector of industry.

An application in agriculture

To understand how this works, we will take an example from agriculture.

Application of supervised machine learning
Photo by Roman Synkevych on Unsplash

Predicting crop yield well before harvest is essential for proper policy planning. A precise prediction of the expected production helps the government to fix prices and arrange better storage of the produce, and helps farmers to plan their marketing channels.

Crop yield is determined by several factors. Some are physical parameters of the crop itself, like crop height and number of tillers; some are weather parameters, like rainfall, humidity and sunshine hours. Besides these, soil health factors such as carbon balance and organic matter, and several others, play an important role and contribute to the ultimate yield.

If we have a sufficient amount of labelled data, that is, a dataset containing all these independent variables affecting the yield along with the corresponding yield, we can train the algorithm with it. This is supervised learning: it is as if the learning process were supervised by a teacher.

The learning process stops only when a robust model is achieved and the prediction is of an acceptable level.

A real-world problem solved by Supervised Machine learning

Here I cite an example of supervised learning in modern research and how it is being used to address complex real-world problems.

A group of scientists took up a project to identify endangered species in the Mojave Desert of California. The main objective of the study was to locate two threatened species of the area, the Mohave ground squirrel and the desert tortoise, by analyzing images captured by smartphones.

The challenge faced by the biologists was to track and rescue these two endangered species, which are very tough to spot. Nature has given them such an ability to blend into the desert background and vegetation that it is almost impossible for the human eye to see them.

So the scientists used computer vision and developed a machine learning algorithm to identify the animals’ patterns, distinguish them from the desert backdrop and classify them according to their characteristics.

Types of supervised machine learning

There are two main categories of supervised machine learning.

  • Classification
  • Regression 
Supervised Machine Learning, its categories and popular algorithms

Classification:

It is applicable when the target variable is categorical and the objective is to assign each observation to a class. If the algorithm classifies into two classes, it is called binary classification; if the number of classes is more than two, it is called multiclass classification.

Classification in Supervised Machine Learning

The figure demonstrates a binary classification. Here a group of people has been classified by gender based on a dataset consisting of their heights and weights.

The task is done in the same way as discussed before. First, the algorithm is trained with a dataset in which each record has an assigned category. Then, based on this training, the algorithm categorizes new input data.

Example of classification

The most common example of a classification problem is identifying whether a new mail is spam or not. Identifying loan defaulters is also a classification problem.

The algorithm is provided with a dataset of mails and a corresponding column indicating whether each one is spam or not. Similarly, to train the algorithm, a list of customers is first provided, each labelled as a loan defaulter or not. The supervised learning model is then used to identify the type of customer in an independent input dataset.

There are a number of algorithms for classification. The most popular ones are

  • Naive Bayes classifier
  • Linear classifier
  • Support vector machine
  • Random forest
  • Decision tree
  • K-Nearest neighbour

Regression

Regression is a statistical process that tries to find the relationship between a dependent variable and one or more independent variables. The major difference from classification is that in regression the dependent variable is continuous.

If the regression equation relating the dependent variable to a single independent variable is linear, it is a simple linear regression equation. If the regression equation of Y on X is linear, it does not necessarily follow that the regression equation of X on Y is also linear, and vice versa. The dependent variable is a function of the independent variables with their respective constant parameters, plus an error term which is a random variable. A regression model has the expression:

$$Y = f(X_1, X_2, \dots, X_n;\ \beta_0, \beta_1, \beta_2, \dots, \beta_n) + \epsilon$$

where $Y$ is the dependent variable, $X_1, X_2, \dots, X_n$ are the independent variables, $\beta_0, \beta_1, \beta_2, \dots, \beta_n$ are the regression coefficients, and $\epsilon$ is the error term, normally distributed with mean 0 and variance $\sigma^2$. Here $f$ is the deterministic part of the model and $\epsilon$ is its random part.

Example of regression

Regression in Supervised Machine Learning

An example of simple linear regression is regressing the weight of a group of people on their height. Here height is the independent variable and weight is the dependent variable, because we treat a person’s height as influencing their weight and not the other way round.

The blue line in the figure is the regression line fitted with a supervised machine learning technique. It is the best-fitting line, obtained by training until a robust model with acceptable accuracy is achieved.
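A best-fitting line of this kind can be computed directly with the ordinary least squares formulas. The sketch below is a minimal illustration; the heights (in cm) and weights (in kg) are hypothetical numbers, not data from the figure, and the function name `fit_line` is made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical training data: heights in cm and weights in kg.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
weights = [52.0, 57.0, 67.0, 73.0, 81.0]

intercept, slope = fit_line(heights, weights)
predicted = intercept + slope * 175.0  # predicted weight for a height of 175 cm
print(intercept, slope, predicted)
```

The slope says how many kilograms the predicted weight changes for each extra centimetre of height.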

To perform regression a number of algorithms are used by researchers. The most frequently used ones are:

  • Simple linear regression
  • Multiple linear regression
  • Logistic regression
  • Polynomial regression etc.

Machine Learning: Some lesser known facts

Machine Learning

Machine Learning (ML) has become a buzzword in today’s world. Although references to it go back to the middle of the twentieth century, it has gained its popularity during the last few years, mainly because of its immense capability to explore large amounts of data without being explicitly programmed for each task, and hence its simplicity of use.

Machine Learning is still a new concept for many, and there are several doubts and misconceptions about it. In this article, I will explore some lesser-known facts about Machine Learning, along with very basic ideas such as what Machine Learning is and how it is making our lives better.

Let’s start with a famous joke about an interview to hire a Machine Learning expert. You may have read it before, but I like it so much, and it makes a good start to this article.

So as the interview starts, the interviewer starts asking questions to the candidate:

Interviewer: What is your specialization?

Candidate: Machine Learning

Interviewer: What is 23+34?

Candidate: It’s 10

Interviewer: No, wrong answer, it’s 57

Candidate: It’s 35

Interviewer: No, wrong answer again, it’s 57

Candidate: It’s 50

Interviewer: No, the answer is still 57

Candidate: It’s 57

Interviewer: You are hired !!!

Although it is a joke, to some extent it reflects how Machine Learning works. Machine Learning is all about learning from the data it is fed. Here is a famous quote which reflects the power of Machine Learning:

“Human can create one or two good models a week; Machine Learning can create thousands of good models a week”

Thomas H. Davenport, Analytics thought-leader from the Wall Street Journal

Importance of Machine Learning in the present context

Today we have a huge amount of data, popularly known as big data. It can be a gold mine of knowledge if explored properly. Data mining and Bayesian analysis are getting popular precisely because they, too, help to extract information from a big pile of data.

As the volume of data has increased, so has its complexity. The data comes from a variety of sources and consists of numerous fields. We need modelling techniques that can analyze such data quickly and with improved accuracy. This is where Machine Learning comes in.

So, what is Machine Learning?

Machine learning, in simple terms, is converting information into knowledge. We have a huge amount of data in our custody, generated over a period of more than 50 years. If it is not used to generate knowledge, this huge volume of data is of no use, and we are wasting a very valuable resource that can help solve many of humanity’s challenges.

It is a very vast field of data science and is closely linked to concepts from associated fields like Artificial Intelligence.

The beauty of Machine Learning is that it does not need to be explicitly programmed by humans for every task; rather, as the name suggests, it learns from the data it is fed. In this sense it is similar to a human, who also learns from past experience.

This learning comes through a rigorous process of observing the data and finding the pattern that minimizes the difference between the actual and estimated values.

Machine Learning has three main categories: supervised learning, unsupervised learning and reinforcement learning.

Applications of Machine Learning
Use of Machine Learning
Photo by Andy Kelly on Unsplash

Recent advances in Machine Learning enable computers to perform tasks that, until very recently, only humans could handle. In our daily lives we use applications built on this technique, and most of the time we don’t even know that it is Machine Learning that is making our lives easier.

In daily life

Take the simple example of personalised Google News. The application learns which type of news you are interested in by keeping track of the likes and dislikes you feed into Google’s database from time to time. Facebook uses the same technique to suggest groups or pages that you may like. Ever wondered how your email service identifies spam and separates it from important mail? Thanks to ML.

Online video streaming services like Netflix, Amazon Prime and Hotstar, and music streaming applications like Spotify, all have a nice feature that automatically populates your account with content you prefer. The essence here is Machine Learning: it analyzes your past choices and suggests content accordingly.

Image/speech recognition & medical research

Image recognition uses this technology to tell whether an animal is a cat or a dog, to identify people crossing the road, to recognize your handwriting and convert it into text, and much more.

Similarly, ML plays an important role in converting voice into text, which is used on several platforms, for example the speech-to-text tool in Google Docs.

In medical research, ML is a fast-growing technology. It helps in analyzing voluminous data and identifying trends and patterns, especially with the advent of wearable devices and sensors that keep track of the vital parameters of a patient’s health. The data generated by these devices is analyzed through ML, often in real time, enabling medical practitioners to detect trends and flag symptoms for better diagnosis.

Oil and gas sector

In this sector, ML is used to locate natural resources such as minerals under the ground, to point out risks in the performance of refinery sensors and their chance of failure, and to prepare optimized oil distribution plans that are more cost-effective and efficient.

Thus the use of Machine Learning is rapidly expanding in almost every sector of our society. Without Machine Learning, performing such resource-intensive and time-consuming tasks in traditional ways would not be feasible at all.

Futuristic applications

A few applications of Machine Learning that are still in the testing phase have long been popular topics of science fiction. We now frequently hear and read about the self-driving cars of Google or Tesla. This is already a reality now, but ten years ago such a concept was a subject of science fiction only. The basic concept behind this revolutionary invention is Machine Learning.

Almost every industry that deals with large amounts of data has realized the importance of Machine Learning. Be it banking and finance, automobiles, research or health care, ML enables them to work more efficiently and gain an edge over their competitors with the help of data insights, often in real time.

So, what is Artificial Intelligence (AI) then?

If you have read this far, this question has probably come to your mind, and it is bound to. Although we often use the terms AI and Machine Learning interchangeably, they are not the same. AI makes machines emulate human intelligence, whereas ML helps machines learn from data.


An Artificial Neural Network (ANN), as its name suggests, mimics the neural network of our brain, and hence it is artificial. The human brain has a highly complicated network of nerve cells that carry sensations to the designated sections of the brain. The nerve cells, or neurons, form a network and pass the sensation from one to another. Similarly, in an ANN a number of inputs pass through several layers of neuron-like units and ultimately produce an estimate.
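The passage of inputs through layers can be sketched in plain Python. This is only an illustration of the forward pass, assuming a sigmoid activation and hand-picked weights; the weights, the layer sizes and the function names are hypothetical, and no training is shown.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One dense layer: a weighted sum of the inputs per neuron, then sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
            for neuron_weights, b in zip(weights, biases)]

def forward(inputs, network):
    """Pass the inputs through each (weights, biases) layer in turn."""
    activations = inputs
    for weights, biases in network:
        activations = layer(activations, weights, biases)
    return activations

# Hypothetical, hand-picked weights: 2 inputs -> 2 hidden neurons -> 1 output.
network = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2]),
    ([[1.2, -0.7]], [0.05]),
]
output = forward([1.0, 2.0], network)
print(output)
```

Training an ANN means adjusting these weights and biases over many iterations so that the final estimate moves closer to the known answer.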

Schematic diagram of Artificial Neural Network

Machine Learning is a way to implement Artificial Intelligence. Machine Learning has been in use for decades, but in recent years, as Artificial Intelligence has come to the fore, Machine Learning, and more specifically Deep Learning, has become more popular.

ANN: a deep learning process

ANN is a deep learning process, the burning topic of data science. Deep learning is basically a subfield of Machine Learning, so it helps to have a working knowledge of the machine learning process first. Deep learning has recently found application in almost all ambitious projects. From basic pattern recognition, voice recognition and face recognition to self-driving cars and high-end projects in robotics and artificial intelligence, deep learning is revolutionizing modern applied science.


ANN is a very efficient and popular technique for pattern recognition, but it involves complex computations and many iterations. The advent of high-end computing devices and machine learning technologies has made the task much easier than ever. Users and researchers can now focus on their research problem without taking the pain of implementing a complex ANN algorithm.

The concept of Artificial Intelligence is not very new. The term was first used in the 1950s and referred to using a computer to perform activities that could otherwise be done only by human beings.

So in that sense, AI is a much broader concept and ML can be considered a subset of it. AI as a whole mimics human intelligence, and ML plays a very important role in achieving this by extracting information from data without being explicitly programmed.

Machine Learning Vs Deep Learning Vs Data Mining

These three concepts are often a little confusing, mainly because all of them have the same goal: to gain insight into the relationships or trends in the data in hand. But they differ in their execution and abilities.

Machine Learning

As we discussed, Machine Learning functions much like statistical modelling. Classical statistical models, however, rest on a mathematically proven theory about the distribution of the data and assume that the data fulfils certain assumptions. The advantage of Machine Learning is that even if we have no theoretical idea about the distribution of the data, it can learn from the data through several iterations until the best pattern is found. Hence, the ML process can also be easily automated.

Data Mining

It is a much broader concept with the same objective as ML, and it encompasses a variety of concepts to achieve it. Data mining uses traditional statistical theory, text analytics, time series algorithms, data manipulation techniques and even Machine Learning to identify an underlying pattern in the data.

Deep Learning

It is a more advanced concept than the above two. Deep learning combines state-of-the-art, high-end computing with neural networks to identify complex patterns in large amounts of data. Advanced technologies such as image recognition and recognizing words from sound, some of which are still in the testing stage, are all applications of deep learning.

Some facts on Machine Learning

At the very beginning I mentioned that, Machine Learning being a new concept, some popular ideas about it are not completely true. Here I will discuss these lesser-known facts about ML.

Fact 1: It is not a completely automated process, and human intervention is required

There is a misconception that ML is a 100% automated process. This is not completely true: human intervention is necessary to create and improve the algorithms. The system needs context and parameters to operate, which again are provided by human operators.

Fact 2: Advanced knowledge of Mathematics is not a prerequisite for simple applications of Machine Learning

You can start applying ML to analyze your data with some practice and guidance. There is a lot of content available on the internet; some of it is free, while some courses are premium.

To start practising ML you can choose any of the free courses. The main thing is that you have to practise a lot. I can suggest the free crash course on ML by Google Developers. It is developed by Google, so the quality is dependable.

The ML course on Coursera, a MOOC platform, is also a very good place to start learning.

Fact 3: Machine Learning and Artificial Intelligence are not the same

Some people have the notion that the two are the same; even I used to think so until I came across an article published in Forbes. It was a very good, comprehensive discussion of the differences between the two. Read it and many of your doubts about ML and AI will be cleared.

Fact 4: Even without very sound knowledge of a programming language you can learn to apply ML

Oh… it certainly helps: good knowledge of a few programming languages can jump-start your career in ML, but it is not at all essential. You just have to give it a little more time when you first write code for Machine Learning. Be it R or Python or any other language, you learn by making errors; that is the most effective way of learning any language.

So, in a nutshell, if you are interested in learning ML, just start now: take a small dataset and write a small piece of code. There will be errors in the beginning; don’t let them hold you back. Soon you will start enjoying its beauty, and it will get more and more interesting.

References