Machine learning vs. data science: how are they different?

Machine learning vs data science

Machine learning and data science are two major keywords of recent times that almost all fields of science depend on. If data science is indispensable for exploring the knowledge hidden in data, then machine learning is what brings evolution through feature engineering. But are they really that different? In this article, these two fields will be discussed point by point, covering where they differ and where they overlap.

The Venn Diagram of sciences

I got a good representation of how data science overlaps with the machine learning domain through a Venn diagram from this website. Drew Conway proposed this concept in 2010.

Venn diagram showing relation between data science and machine learning

Now, with this Venn diagram structure, the association between all these fields is pretty clear. The lowermost circle essentially indicates the domain knowledge of a particular field; for example, it may be agricultural crop production or population dynamics. A data scientist should know his particular domain besides the core knowledge of programming and statistics/mathematics.

Further, you can see that data science is common to all three domains, whereas machine learning lies at the intersection of statistics/mathematics knowledge and hacking skill. The major difference between the two lies here. Data science, being a broader concept, requires special subject knowledge for analysis, whereas machine learning is a more coding- and programming-oriented field.

Let's dive into a detailed discussion of these differences…

Domain differences

To start with, let's be clear about their domains. Data science is a much bigger term. It comprises multiple disciplines like information technology, modelling and business management, whereas machine learning is a comparatively specific term within data science where the algorithm learns from the data.

Unlike data science, machine learning is more practical than empirical. Data science has a much more extensive theoretical base and is amenable to mathematical analysis. Machine learning, on the other hand, is mainly computer-program based and needs coding skills.

Let's first discuss these two fields.

Machine Learning

As we have seen in the above Venn Diagram, data science and machine learning have common uses. Data science uses the tools of machine learning to study transactional data for useful prediction. Machine learning helps in pattern discovery from the data.

Machine learning is actually learning from the data. Historical data trains the machine learning algorithm to make accurate predictions. Such a learning process is called supervised learning.

There are situations where no such training data is available. Some machine learning algorithms work without training data; this type of machine learning is known as unsupervised machine learning. Obviously, the accuracy here is lower than in the supervised case, but the situations where it is applied are also different.
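
As an aside, the two learning modes can be illustrated with a few lines of scikit-learn. This is only a minimal sketch on scikit-learn's built-in iris data; the data set and the chosen models are illustrative assumptions, not something used elsewhere in this article.

# Minimal sketch: supervised vs unsupervised learning (illustrative only)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the algorithm learns from labelled (historical) data
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(x_tr, y_tr)
print("Supervised test accuracy:", clf.score(x_te, y_te))

# Unsupervised: no labels are used; the algorithm finds structure on its own
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster labels found without any training labels:", kmeans.labels_[:10])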

Another kind of machine learning is known as reinforcement learning. It is one of the most advanced and popular kinds of machine learning. Here too the training data is absent and the algorithm learns from its own experience.

Deep learning is again a special field of machine learning. Let's discuss it briefly too.

Deep learning

Deep learning is a subfield of machine learning, which is again a subfield of artificial intelligence. Deep learning deals with the data much as machine learning does; the difference lies in the learning process. Scalability is also a point where these two approaches differ from each other.

Deep learning is an especially superior method when the data in hand is very vast, as it is very efficient at taking advantage of large data sets. In machine learning models or other conventional regression models, the model's accuracy does not increase after a certain level; a deep learning algorithm, in contrast, goes on improving the model as it is trained with more and more data.

The deep learning process is a black-box method. That means we only see the inputs and the output; what is going on in between and how the network works remains obscure.

The name deep learning actually refers to the hidden layers of the training process. The backpropagation algorithm takes the feedback signal from the output to adjust the weights used in the hidden layers and refines the output in the next cycle. This process goes on until we get a satisfactory model.
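To make the idea of iterative weight adjustment concrete, here is a tiny, self-contained sketch of a single weight being refined by gradient descent on a made-up data set. The numbers and the learning rate are assumptions for illustration only and are not part of any model discussed later.

import numpy as np

# Toy illustration of the feedback loop described above:
# predict -> measure the error -> adjust the weight -> repeat
x = np.array([1.0, 2.0, 3.0])            # inputs (made-up values)
y = np.array([2.0, 4.0, 6.0])            # targets (true relation: y = 2x)
w = 0.0                                   # initial weight
lr = 0.05                                 # learning rate (assumed)

for epoch in range(100):
    y_hat = w * x                         # forward pass
    grad = np.mean(2 * (y_hat - y) * x)   # gradient of the mean squared error w.r.t. w
    w -= lr * grad                        # backward step: adjust the weight
print(round(w, 3))                        # approaches 2.0 as training proceeds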

Data science

We can consider data science a bridge between traditional statistical and mathematical science and its application to solving real-world problems. The theoretical knowledge of the basic sciences often remains unused; data science makes this knowledge applicable to practical problems.

Put lightly, we can say that a data scientist must have more programming skill than most scientists and more statistical skill than a programmer. No surprise that the mere mention of data science in a CV often makes a candidate eligible for an enhanced pay package.

Since almost all organizations are generating data at an exponential rate, they need data scientists to get meaningful insights out of it. Moreover, after the explosion of internet users, the data generated online is enormous. Data science applies data modelling and data warehousing to keep track of this ever-growing data.

Necessary skills to be a data scientist

A data scientist needs to be proficient in both theoretical concepts and programming languages like R and Python. Only a person with a good understanding of the underlying statistical concepts can develop a sound algorithm and implement it well.

But a data scientist's job does not end here. Knowledge of these two core subjects is no doubt essential, but to become a successful data scientist a person must provide a complete business solution. When an organisation appoints data scientists, they are expected to analyse the data to gain insights into potential business opportunities and provide a roadmap.

So, a data scientist should also possess knowledge of the particular business domain and communication skills. Without effective communication and result interpretation, even a good analytical report may lead to a disappointing outcome. So none of these four pillars of success is less important than the others.

The four pillars of data science

I got a good representation of these four pillars of data science through a Venn diagram from this website, originally created by the biotechnologist David Taylor in his article "Battle of Data Science Venn Diagrams".

Four pillars of data science

These four different streams are considered as the pillars of data science. But why so? Let’s take real-world examples to understand how data science plays an important role in our daily lives.

Example 1: online shopping

Think about your online shopping experience. Whenever you log in to your favourite online shopping platform you get deals on items you like, and items are organized according to your interests. Have you ever wondered how on earth the website does that?

Every time you visit the online retail website, search for things of your interest or purchase something, you generate data. The website stores the historical data of your interactions with the shopping platform. If anyone with data science skills analyses that data properly, he may come to know your purchase behaviour even better than you do.

Example 2: Indian Railways

Indian Railways is the fourth-largest railway network in the world. Every day thousands of trains are operated, through which crores of passengers travel across the country. It has a track length of over 70,000 km.

Quite naturally, such a vast network generates a huge amount of data every day. Ticket booking, train operation, biometrics, crew management, train scheduling: in every aspect the data generated is big data. And the historical data is no less than a gold mine of information on Indian passengers' travel trends over the years.

Applying data science to this big data reveals very important information, enabling the authority to take accurate decisions about the seasons in which there is a rush of passengers and additional trains need to run, which routes are profitable, when to run special trains, and many more.

So, in a nutshell, the main tasks of data science are as follows (a minimal code sketch of these steps follows the list):

  • Filtering the required data from big data
  • Cleaning the raw data to make it amenable to analysis
  • Data visualization
  • Data analysis
  • Interpretation and valid conclusion
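
The pandas workflow below is a minimal sketch of those five steps on a hypothetical sales.csv file; the file name and the column names are assumptions used only for illustration.

import pandas as pd
import matplotlib.pyplot as plt

raw = pd.read_csv("sales.csv")                   # hypothetical file

# 1. Filtering the required data from the big raw table
data = raw[["date", "region", "revenue"]]

# 2. Cleaning: drop the rows with missing values
data = data.dropna()

# 3. Data visualization
data.groupby("region")["revenue"].sum().plot(kind="bar")
plt.ylabel("Total revenue")
plt.show()

# 4. Data analysis
summary = data.groupby("region")["revenue"].describe()

# 5. Interpretation and valid conclusion: inspect the summary and report
print(summary)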

Differences

As we have discussed both of them at length, we can see that in spite of many similarities these two subjects have some differences in their application. So, now it's time to point out the specific differences between machine learning and data science. Here they are:

| Data science | Machine learning |
| --- | --- |
| Based on extensive theoretical concepts of statistics and mathematics | Knowledge of computer programming and computer science fundamentals is essential |
| Generally performs various data operations | It is a subset of artificial intelligence |
| Gives emphasis on data visualization | Data evaluation and modelling are required for feature engineering |
| It extracts insights from the data by cleaning, visualizing and interpreting it | It learns from the data and finds out the hidden patterns |
| Knowledge of programming languages like R, Python, SAS, Scala etc. is essential | Knowledge of probability and statistics is essential |
| A data scientist should have knowledge of machine learning | Requires in-depth programming skills |
| Popular tools used in data science are Tableau, Matlab, Apache Spark etc. | Popular tools used in machine learning are IBM Watson Studio, Microsoft Azure ML Studio etc. |
| Structured and unstructured data are the key ingredients | Here statistical models are the key players |
| It has applications in fraud detection, trend prediction, credit risk analysis etc. | Image classification, speech recognition and feature extraction are some popular applications of machine learning |
Difference between data science and machine learning

Conclusion

To end with, I would like to summarize the whole discussion by saying that data science is a comparatively newer field of science and is in great demand across organizations, mainly because of its immense power of providing insights by analyzing big data which otherwise has no meaning to the organisations.

On the other hand, machine learning is an approach which enables the computer to learn from data. A data scientist should have knowledge of machine learning in order to unravel its full potential. So, they do have some overlapping parts and complementary skills.

I hope the article contains sufficient discussion to make you understand the similarities as well as the differences between machine learning and data science. If you have any questions or doubts, please comment below. I would be happy to answer them.

Comparing machine learning models for a regression problem

Comparing regression models

Comparing different machine learning models for a regression problem is necessary to find out which model is the most efficient and provides the most accurate result. There are many test criteria to compare the models. In this article, we will take a regression problem, fit different popular regression models and select the best one of them.

We have already discussed how to compare different machine learning models when we have a classification problem in hand (the article is here). In such cases the response variable is a categorical one, and different popular classification algorithms are compared to come out with the best one.


Comparing regression models

So, what if the response variable is a continuous one and not categorical? This then becomes a regression problem and we have to use regression models to estimate the predicted values. In this case too there are several candidate regression models, and our task is to find the one which serves our purpose.

So, in this article, we take a regression problem of predicting the value of a continuous variable. We will fit several regression models and compare their performance by calculating the prediction accuracy and several goodness-of-fit statistics.

Here I have used the five most prominent and popular regression models and compared them according to their prediction accuracy. The supervised models used here are:

  • Multiple Linear Regression (MLR)
  • Decision Tree regression
  • Random Forest regression
  • Support Vector Regression (SVR)
  • Deep learning (an Artificial Neural Network built with Keras)

The models were compared using two very popular model comparison metrics, namely Mean Absolute Error (MAE) and Mean Square Error (MSE). The expressions for these two metrics are given below.

Mean Absolute Error(MAE)

Comparing different machine learning models for a regression problem involves an important step of comparing the original and estimated values. If \(y\) is the response variable, \(\widehat{y}\) is its estimate and there are \(n\) pairs of observed and predicted values, then MAE is calculated with this equation:

\[ MAE = \frac{1}{n}\sum_{i=1}^{n}\left| y_i - \widehat{y}_i \right| \]

MAE is a scale-dependent metric, which means it has the same unit as the original variable. So, it is not a very reliable statistic when comparing models applied to different series with different units. It measures the mean of the absolute error between the true and estimated values of the same variable.

Mean Square Error (MSE)

This metric of model comparison, as the name suggests, calculates the mean of the squares of the errors between the true and estimated values. So, the equation is as below:

\[ MSE = \frac{1}{n}\sum_{i=1}^{n}\left( y_i - \widehat{y}_i \right)^2 \]
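
As a quick illustration of both metrics with made-up numbers (not the car data used later), they can be computed by hand with numpy and cross-checked with sklearn.metrics:

import numpy as np
from sklearn import metrics

# Illustrative true and predicted values (assumed, not from the car data set)
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mae_manual = np.mean(np.abs(y_true - y_pred))   # (1/n) * sum of |y - y_hat|
mse_manual = np.mean((y_true - y_pred) ** 2)    # (1/n) * sum of (y - y_hat)^2

print(mae_manual, metrics.mean_absolute_error(y_true, y_pred))   # 0.625 0.625
print(mse_manual, metrics.mean_squared_error(y_true, y_pred))    # 0.4375 0.4375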

Python code for comparing the models

So, now the comparison between different machine learning models is conducted using python. We will see step by step application of all the models and how their performance can be compared.

Loading required libraries

All the required libraries are first loaded here.

import numpy as np # linear algebra
import pandas as pd # data processing
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn import metrics
from pandas import DataFrame,Series
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import matplotlib
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split,cross_val_score, cross_val_predict
import missingno as msno # plotting missing data
import seaborn as sns # plotting library
from sklearn import svm

The example data and its preprocessing

The data set used here is the car data set from Github and you can access the data file from this link. The data set has the following independent variables:

  • Age
  • Gender
  • Average miles driven per day
  • Personal debt and
  • Monthly income

Based on these independent variables we have to predict the potential sale value of a car. So, here the response variable is the sale value of the car and it is a continuous variable. That is why the problem in hand is a regression problem.

Importing the data

The piece of code below uses the pandas read_csv() function to import the data set into the working space. The describe() function gives a brief idea about the data.

dataset = pd.read_csv("cars.csv")
dataset.describe()

Displaying the last few columns of the data set to have a glimpse of the data and variables.

Last few columns of the data set

Check the data for missing values

The following code checks if there are any missing values in the data set. Missing values create problems in the analysis process, so we should filter them out in the data pre-processing stage. Here we will find out which columns contain missing values, and the corresponding rows will simply be dropped from the data set.

# Finding all the columns with NULL values
dataset.isna().sum()
# Drop the rows with missing values
dataset = dataset.dropna()

Creating basic plots with the data

Here we create pairwise joint distribution plots of the variables:

sns.pairplot(dataset[['age', 'miles', 'debt', 'income', 'sales']], diag_kind="kde")
Joint distribution plot of the independent variables

Splitting the data set

Data splitting is required to create training and testing data sets from the same car data. I have taken 80% of the whole data set as training data and the rest 20% of data as the test data set. The following python code is for this splitting purpose.

train_dataset = dataset.sample(frac=0.8,random_state=0)
test_dataset = dataset.drop(train_dataset.index)

Normalizing the training data set

First of all, we will see the summary statistics of all the variables using the describe() function of the pandas library.

# Calculating basic statistics with the train data
train_stats = train_dataset.describe()
train_stats.pop("sales") # excluding the dependent variable
train_stats = train_stats.transpose()
train_stats

From the stats below, we can see that the different variables in the data set have very large ranges and deviations, which may create problems during model fitting. So, before we use these variables in the model building process, we will normalize them.

Summary statistics of the training data set

Creating a function for normalization

Using the mean and standard deviation of each of the variables, we will convert them into standard normal variates. For that purpose, we create the function below.

# Creating the normalizing function with mean and standard deviation
def norm(x):
  return (x - train_stats['mean']) / train_stats['std']
# Normalize only the independent variables; 'sales' is excluded because
# train_stats no longer contains its mean and standard deviation
normed_train_data = norm(train_dataset.drop(columns="sales"))
normed_test_data = norm(test_dataset.drop(columns="sales"))

Separating the response variable and creating other variables

Now a most important step: storing the response variable in a separate variable.

train_labels = train_dataset.pop("sales") # using .pop function to store only the dependent variable
test_labels = test_dataset.pop("sales")
x_train=normed_train_data
x_test=normed_test_data
y_train=train_labels
y_test=test_labels

As we are now finished with the data pre-processing stage, we will start with the modelling steps. So, let’s start coding for all the five models I have mentioned to predict the car sale price.

Application of Multiple Linear Regression

First of all, Multiple Linear Regression (MLR). This is simple linear regression, except that all the independent variables are included to estimate the car sale price. The LinearRegression() function from the linear_model module of the sklearn library has been used here for this purpose.

lin_reg = LinearRegression()
lin_reg.fit(x_train,y_train)
#Prediction using test set 
y_pred = lin_reg.predict(x_test)
mae=metrics.mean_absolute_error(y_test, y_pred)
mse=metrics.mean_squared_error(y_test, y_pred)
# Printing the metrics
print('R2 square:',metrics.r2_score(y_test, y_pred))
print('MAE: ', mae)
print('MSE: ', mse)
Metrics for MLR

Application of Decision Tree regression

dt_regressor = DecisionTreeRegressor(random_state = 0)
dt_regressor.fit(x_train,y_train)
#Predicting using test set 
y_pred = dt_regressor.predict(x_test)
mae=metrics.mean_absolute_error(y_test, y_pred)
mse=metrics.mean_squared_error(y_test, y_pred)
# Printing the metrics
print('Decision Tree Regression Accuracy: ', dt_regressor.score(x_test,y_test))
print('R2 square:',metrics.r2_score(y_test, y_pred))
print('MAE: ', mae)
print('MSE: ', mse)
Metrics for Decision tree

Application of Random Forest Regression

rf_regressor = RandomForestRegressor(n_estimators = 300 ,  random_state = 0)
rf_regressor.fit(x_train,y_train)
#Predicting the SalePrices using test set 
y_pred = rf_regressor.predict(x_test)
mae=metrics.mean_absolute_error(y_test, y_pred)
mse=metrics.mean_squared_error(y_test, y_pred)
# Printing the metrics
print('Random Forest Regression Accuracy: ', rf_regressor.score(x_test,y_test))
print('R2 square:',metrics.r2_score(y_test, y_pred))
print('MAE: ', mae)
print('MSE: ', mse)
Metrics for Random Forest regression

Application of Support Vector Regression

from sklearn.svm import SVR
regressor= SVR(kernel='rbf')
regressor.fit(x_train,y_train)
y_pred_svm=regressor.predict(x_test)
#y_pred_svm = cross_val_predict(regressor, x, y)
mae=metrics.mean_absolute_error(y_test, y_pred_svm)
mse=metrics.mean_squared_error(y_test, y_pred_svm)
# Printing the metrics
print('Support Vector Regression Accuracy: ', regressor.score(x_test,y_test))
print('R2 square:',metrics.r2_score(y_test, y_pred_svm))
print('MAE: ', mae)
print('MSE: ', mse)
Metrics for Support Vector Regression

Application of Deep Learning using Keras library

Here is the deep learning model mentioned above. A sequential Keras model has been used. The model has been wrapped in a function named build_model so that we can call it whenever it is required in the process. The model has two fully connected hidden layers with the Rectified Linear Unit (relu) activation function and an output layer with a linear activation.

The hidden layers have 12 and 8 neurons respectively and take all the input variables. Mean Squared Error is the loss function here, as it is the most common loss function for regression problems.

# Imports needed for the deep learning part (not included in the earlier
# library listing); the tensorflow_docs helper package is installed separately
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_docs as tfdocs
import tensorflow_docs.modeling
import tensorflow_docs.plots

def build_model():
  model = keras.Sequential([
    layers.Dense(12,kernel_initializer='normal', activation='relu', input_shape=[len(train_dataset.keys())]),
    layers.Dense(8, activation='relu'),
    layers.Dense(1, activation='linear')
  ])

  optimizer = tf.keras.optimizers.RMSprop(0.001)

  model.compile(loss='mse',
                optimizer=optimizer,
                metrics=['mae', 'mse'])
  return model
model = build_model()

Displaying the model summary

This part of the code shows the summary of the model we built. All the specifications mentioned above are shown in the screenshot of the output below.

model.summary()
Deep learning model summary

Training the model

We have used 10 rows of the training data set to check that the model produces output of the expected shape. As the result seems satisfactory, we will proceed with the same model.

example_batch = normed_train_data[:10]
example_result = model.predict(example_batch)
example_result

Fitting the model

Now we will fit the model with 1000 epochs and store the model training and validation accuracy in the object named history.

EPOCHS = 1000

history = model.fit(
  normed_train_data, train_labels,
  epochs=EPOCHS, validation_split = 0.2, verbose=0,
  callbacks=[tfdocs.modeling.EpochDots()])
History of the model fit

Here we will produce a glimpse of the history stats to understand how the training process progresses.

hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

Plotting the MAE score during the training process

As we are training the model for 1000 epochs, there are 1000 forward and backward passes over the training data. We expect that with each pass the loss will decrease and the model's prediction accuracy will increase as the training process progresses.

plotter = tfdocs.plots.HistoryPlotter(smoothing_std=2)
plotter.plot({'Basic': history}, metric = "mae")
plt.ylim([0, 10000])
plt.ylabel('MAE [sales]')

In the above plot, we can see that both the training and validation loss decrease roughly exponentially with the increase in the number of epochs.

test_predictions = model.predict(normed_test_data).flatten()
a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [sales]')
plt.ylabel('Predictions [sales]')
lims = [0, 40000]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims, lims)

Plotting the result

Here we have plotted the predicted sale prices against the true sale prices. And from the plot it is clear that the estimate is quite close to the original one.

Original vs predicted values of the sale price of cars

Plotting the error

error = test_predictions - test_labels
plt.hist(error, bins = 125)
plt.xlabel("Prediction Error [Average_Fruit_fly_population]")
_ = plt.ylabel("Count")

Here we have plotted the errors. Although the distribution of the error is not a true Gaussian, as the sample size increases we can expect it to tend towards a Gaussian distribution.

mae=metrics.mean_absolute_error(test_labels, test_predictions)
mse=metrics.mean_squared_error(test_labels, test_predictions)
# Printing the metrics
print('R2 square:',metrics.r2_score(test_labels, test_predictions))
print('MAE: ', mae)
print('MSE: ', mse)
Metrics of Deep learning models

Conclusion

So, here we can compare the performance of all the models using the metrics calculated. Let’s see all the models used to predict the car sale price together along with the metrics for the ease of comparison.

| Model type | MAE | R Square |
| --- | --- | --- |
| MLR | 2821 | 0.80 |
| Decision Tree | 2211 | 0.84 |
| Random Forest | 1817 | 0.88 |
| Support Vector Machine | 7232 | 0 |
| Deep learning/ANN | 2786 | 0.8 |
Comparison table for all the models used

From the table above, it is clear that for the present problem the best performing model is Random Forest, with the highest R square (coefficient of determination) and the least MAE. But we have to keep in mind that deep learning is also not far behind with respect to these metrics. And the beauty of deep learning is that as the training sample size increases, the accuracy of the model also increases.

The other models, in contrast, attain a plateau in prediction accuracy after a certain phase, and even increasing the training sample size cannot further improve their performance. So, although deep learning occupies the third position in the present situation, it has the potential to improve further if the availability of training data is not a constraint.

If the data set is small and we need a good prediction for the response variable, as is the case here, it is a good idea to go for models like Random Forest or Decision Tree, as they are capable of generating good predictions with less training or labelled data.

So, finally it is the call of the researcher or modeller to select the best-suited model judging his situation and field of knowledge, as different fields of science generate experimental data of distinct nature and a very good model in one field may fail completely in another.


Naive Bayes classifier application using python

Naive Bayes classifier application using python

The Naive Bayes classifier is a very straightforward, easy and fast machine learning technique. It is one of the most popular supervised machine learning techniques for classifying data sets with high dimensionality. In this article, you will get a thorough idea of how this algorithm works, along with a step-by-step implementation with Python. Naive Bayes is actually a simplified form of Bayes' theorem, so we will cover that too.

"Under Bayes' theorem, no theory is perfect. Rather, it is a work in progress, always subject to further refinement and testing." ~ Nate Silver

In real life, applications of classification problems are everywhere. We take different decisions in our daily life by judging the probabilities of several other factors, either consciously or unconsciously. When we need to analyse large data and take a decision on its basis, we need some tool. The Naive Bayes classifier is the simplest and a very fast supervised learning algorithm which is also accurate enough. So, it can make our life far easier in taking vital decisions.

The concept of Bayes’ theorem

To understand the Naive Bayes classification concept, we have to understand Bayes' theorem first. Bayes' theorem describes the relationship between the conditional probabilities of different events. It calculates the probability of a hypothesis given the information from an observed event.
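
In symbols, if A is the hypothesis and B is the observed event, Bayes' theorem reads:

P(A|B) = P(B|A) × P(A) / P(B)

Here P(A|B) is the posterior probability of the hypothesis given the event, P(B|A) is the likelihood, P(A) is the prior probability of the hypothesis and P(B) is the probability of the event.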

For example, we cricket lovers try to guess whether we will be able to play today depending on the weather variables; a banker tries to assess whether a customer is too risky to be given credit depending on his financial transaction history; or a businessman tries to judge whether his newly launched product is going to be a hit or a flop among customers depending on their buying behaviour.

Models dealing with conditional probabilities in this way are called generative models. They are generative because they actually specify the hypothetical random process of data generation. But training such generative models for each event is a really difficult task.

So how do we tackle this issue? Here comes the concept of the Naive Bayes classifier. The name "naive" comes from the fact that it assumes some very simple things about the Bayes model, like the presence of any feature in a class not depending on any other feature. It simply overlooks the relationships between the features and considers that all the features contribute independently towards the target variable.
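
Concretely, for features x1, x2, …, xn and a target class y, this naive independence assumption lets the classifier factorize the likelihood as

P(x1, x2, …, xn | y) = P(x1|y) × P(x2|y) × … × P(xn|y)

so each conditional probability can be estimated separately from the training data, which is what makes the method so simple and fast.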


Let us take an example data set in which the feature variable is a test report having the values positive and negative, whereas the binomial target variable is "sick" with the values "yes" or "no". Let us assume the data set has 20 cases of test results, which are as below:

The data set

Creating a frequency table of the attributes of the data set

So if we create the frequency table for the above data set it will look like this

The frequency table

With the help of this frequency table, we can now prepare the likelihood table to calculate prior and posterior probabilities. See the below figure.

Calculating probabilities using likelihood tables
Calculating probabilities using likelihood tables

With the help of this above table, we can now calculate what is the probability that a person is really suffering from a disease when his test report was also positive.

So the probability we want to compute is 

P(yes|positive) = P(positive|yes) × P(yes) / P(positive)

We have already calculated the probabilities, so, we can directly put the values in the above equation and get the probability we want to calculate.

P(yes|positive) = (0.73 × 0.55) / 0.55 = 0.73

In the same fashion, we can also calculate the probability of a person not having the disease in spite of the test report being positive.

P(no|positive) = P(positive|no) × P(no) / P(positive) = (0.33 × 0.45) / 0.55 = 0.27
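
The same numbers can be reproduced with a few lines of Python. The counts below (11 sick persons of whom 8 tested positive, and 9 healthy persons of whom 3 tested positive) are assumed so that they match the probabilities used above; they stand in for the original frequency table, which appears only as an image.

# Assumed counts consistent with the probabilities above:
# 20 cases in total, 11 sick (8 of them positive), 9 not sick (3 of them positive)
total = 20
sick, not_sick = 11, 9
pos_and_sick, pos_and_not_sick = 8, 3

p_yes = sick / total                                  # P(yes) = 0.55
p_no = not_sick / total                               # P(no) = 0.45
p_pos = (pos_and_sick + pos_and_not_sick) / total     # P(positive) = 0.55
p_pos_given_yes = pos_and_sick / sick                 # P(positive|yes) ~ 0.73
p_pos_given_no = pos_and_not_sick / not_sick          # P(positive|no) ~ 0.33

p_yes_given_pos = p_pos_given_yes * p_yes / p_pos
p_no_given_pos = p_pos_given_no * p_no / p_pos

print(round(p_yes_given_pos, 2))   # 0.73
print(round(p_no_given_pos, 2))    # 0.27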

Application of Naive Bayes’ classification with python

Now the most interesting part of the article is here. We will implement Naive Bayes’ classification using python. To do that we will use the popular scikit-learn library and its functions. 

About the data

We will take the same diabetes data we have used earlier in other classification problems.

The purpose of using the same data for all classification problems is to enable you to compare the different algorithms. You can judge each algorithm by its accuracy in classifying the same data.

So, here the target variable has two classes, that is, whether the person has diabetes or not. On the other hand, we have 8 independent or feature variables influencing the target variable.

Importing required libraries

The first step to start coding is to import all the libraries we are going to use. The basic libraries for any kind of data science project are pandas, numpy, matplotlib etc. These libraries are discussed in detail in the article on simple linear regression with Python.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.naive_bayes import GaussianNB
import seaborn as sns 

About the data

The example dataset I have used here for demonstration purposes is from kaggle.com. The data, collected by the "National Institute of Diabetes and Digestive and Kidney Diseases", contains vital parameters of diabetes patients belonging to the Pima Indian heritage.

Here is a glimpse of the first ten rows of the data set:

Diabetes data set

The data set has as independent variables several physiological parameters of a diabetes patient. The dependent variable indicates whether the patient is suffering from diabetes or not: the dependent column contains the binary value 1, indicating the person is suffering from diabetes, and 0, indicating he is not.

dataset=pd.read_csv('diabetes.csv')
dataset.shape
dataset.head()
# Printing data details
dataset.info()          # for a quick view of the data
print(dataset.head())   # printing first few rows of the data
dataset.tail()          # to show last few rows of the data
dataset.sample(10)      # display a sample of 10 rows from the data
dataset.describe()      # printing summary statistics of the data
pd.isnull(dataset)      # check for any null values in the data
Checking if the dataset has any null value

Creating variables

As we can see, the data frame contains nine variables in nine columns. The first eight columns contain the independent variables: some physiological variables having a correlation with diabetes symptoms. The ninth column shows if the patient is diabetic or not. So, here x stores the independent variables and y stores the dependent diabetes outcome.

x=dataset.iloc[:,: -1]
y=dataset.iloc[:,-1]

Splitting the data for training and testing

Here we will split the data set into training and testing sets with an 80:20 ratio. We will use the train_test_split function of the scikit-learn library. The test_size mentioned in the code decides what proportion of data will be kept aside to test the trained model. The test data will remain unused in the training process and will act as independent data during testing.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y, test_size=0.2, random_state=0)

Fitting the Naive Bayes’ model

Here we fit the model with the training set.

model=GaussianNB()
model.fit(x_train,y_train)

Using the Naive Bayes’ model for prediction

Now as the model has been fitted using the training set, we will use the test data to make prediction.

y_pred=model.predict(x_test)

Checking the accuracy of the fitted model

As we already have the observations corresponding to the test data set, so, we can compare that with the prediction to check how accurate the model’s prediction is. Scikit-learn’s metrics module has the function called accuracy_score which we will use here.

from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Conclusion

So, we have completed the whole process of applying Naive Bayes classification using Python and have also gone through its basic concepts. It may be a little confusing at first, but as you solve more practical problems with this technique you will become more confident.

This particular classification technique is actually based on the Bayesian classification method. It gets the name "naive" due to its oversimplification of the original Bayes theorem: the Naive Bayes classifier assumes that each pair of features is conditionally independent given the value of the target variable.

The Naive Bayes classifier can be a good choice for all types of classification problem, be it binomial or multinomial. Being an extremely fast and straightforward technique, it can help us take a quick decision. If the result of this classifier is accurate enough (which is the most common case) then it's fine; otherwise we can always take the help of other classifiers like decision tree or random forest.

So, I hope this article will help you gain an in-depth knowledge about Naive Bayes’ theory and its application to solve real-world problems. In case of any doubt or queries please let me know through comments below.


Comparing the performance of different machine learning algorithms

Comparing machine learning algorithms

Comparing Machine Learning Algorithms (MLAs) is important to come out with the best-suited algorithm for a particular problem. This post discusses comparing different machine learning algorithms and how we can do this using the scikit-learn package of Python. You will learn how to compare multiple MLAs at a time using more than one fit statistic provided by scikit-learn, and also how to create plots to visualize the differences.

Machine Learning Algorithms (MLAs) are very popular for solving different computational problems. Especially when the data set is huge and complex with no known parameters, MLAs are like a blessing to data scientists. The algorithms quickly analyze the data to learn the dependencies and relations between the variables and produce estimates with a lot more accuracy than conventional regression models.

The most common and frequently used machine learning models are supervised models. These models learn about the data from experience; it is as if the labelled data acts as a teacher training them to be perfect. As the training data size increases, the model estimation gets more accurate.



Why we should compare machine learning algorithms

Other types of MLAs are the unsupervised and semi-supervised ones, which are helpful when training data is not available and we still have to make some estimation. As these models are naturally not trained using a labelled data set, they are not as accurate as supervised ones. But still, they have their own advantages.

All these MLAs are useful depending on the situation and the data type, in order to obtain the best estimation. That is why selecting a particular MLA is essential to come up with a good estimate. There are several parameters which we need to compare to judge the best model. After that, the best model found needs to be tested on an independent data set for its performance. Visualization of the performance is also a good way to compare the models quickly.

So, here we will compare most of the MLAs using resampling methods like the cross-validation technique with the scikit-learn package of Python. Then model fit statistics like accuracy, precision, recall etc. will be calculated for comparison. The ROC (Receiver Operating Characteristic) curve is also an easy-to-understand tool for MLA comparison; so finally all the ROC curves will be put in a single figure for ease of model comparison.

Data set used

The same data set is used here for the application of all the MLAs. The example dataset I have used for demonstration purposes is from kaggle.com. The data, collected by the "National Institute of Diabetes and Digestive and Kidney Diseases", contains vital parameters of diabetes patients belonging to the Pima Indian heritage.

Here is a glimpse of the first ten rows of the data set:

Diabetes data set

The data set has as independent variables several physiological parameters of a diabetes patient. The dependent variable indicates whether the patient is suffering from diabetes or not: the dependent column contains the binary value 1, indicating the person is suffering from diabetes, and 0, indicating he is not.

Code for comparing different machine learning algorithms

Let's jump to the coding part. It is going to be a fairly lengthy piece of code and a lot of MLAs will be compared, so I have broken the complete code into segments. You can directly copy and paste the code and make small changes to suit your data.

Importing required packages

The first part is to load all the packages needed in this comparison. Besides the basic packages like pandas, numpy, matplotlib we will import some of the scikit-learn packages for application of the MLAs and their comparison.

#Importing basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Importing sklearn modules
from sklearn.metrics import mean_squared_error,confusion_matrix, precision_score, recall_score, auc,roc_curve
from sklearn import ensemble, linear_model, neighbors, svm, tree, neural_network
from sklearn.linear_model import Ridge
from sklearn.linear_model import LogisticRegression  # needed for the models list below
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn import svm,model_selection, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Importing the data set and checking if there is any NULL values

This part of code will load the diabetes data set and check for any null values in the data frame.

#Loading the data and checking for missing values
dataset=pd.read_csv('diabetes.csv')
dataset.isnull().sum()

Checking the data set for any NULL values is very essential, as MLAs cannot handle NULL values. We have to either eliminate the records with NULL values or replace them with the mean/median of the other values. We can see that each of the variables is printed with its number of null values. This data set has no null values, so all are zero here.

No NULL values in the data set
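
Had there been missing values, they could be handled as described above, either by dropping the affected rows or by imputing them, for example with the column median. A minimal sketch (the 'Glucose' column of this data set is used purely as an example):

# Two common ways of handling missing values, had any been present
dataset_dropped = dataset.dropna()       # 1. drop the rows containing NULLs
dataset_imputed = dataset.copy()         # 2. impute a column with its median
dataset_imputed['Glucose'] = dataset_imputed['Glucose'].fillna(dataset_imputed['Glucose'].median())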

Storing the independent and dependent variables

As we can see, the data frame contains nine variables in nine columns. The first eight columns contain the independent variables: some physiological variables having a correlation with diabetes symptoms. The ninth column shows if the patient is diabetic or not. So, here x stores the independent variables and y stores the dependent diabetes outcome.

# Creating variables for analysis
x=dataset.iloc[:,: -1]
y=dataset.iloc[:,-1]

Splitting the data set

Here the data set has been divided into train and test data set. The test data set size is 20% of the total records. This test data will not be used in model training and work as an independent test data.

# Splitting train and split data
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.2, random_state=0)

Storing machine learning algorithms (MLA) in a variable

Some very popular MLAs have been selected here for comparison and stored in a variable, so that they can be used in a later part of the process. The MLAs first taken up for comparison are Logistic Regression, Linear Discriminant Analysis, K-Nearest Neighbour classifier, Decision Tree classifier, Naive Bayes classifier and Support Vector Machine.

# Application of all Machine Learning methods
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

Creating a box plot to compare their accuracy

This part of code creates a box plot for all the models against their cross validation score.

# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
seed = 0  # random seed for reproducibility (value assumed; not defined in the original)
for name, model in models:
	kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
	cv_results = model_selection.cross_val_score(model, x_train, y_train, cv=kfold, scoring=scoring)
	results.append(cv_results)
	names.append(name)
	msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
	print(msg)
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Comparison between different MLAs')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

The cross-validation scores are printed below, clearly suggesting Logistic Regression and Linear Discriminant Analysis to be the two most accurate MLAs.

Below is a box-whisker plot to visualize the same result.

Comparison between different MLAs

Comparing all machine learning algorithms

# Application of all Machine Learning methods
MLA = [
    #GLM
    linear_model.LogisticRegressionCV(),
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),
    
    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    #Gaussian Processes
    gaussian_process.GaussianProcessClassifier(),
    
    #SVM
    svm.SVC(probability=True),
    svm.NuSVC(probability=True),
    svm.LinearSVC(),
    
    #Trees    
    tree.DecisionTreeClassifier(),
  
    #Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),
    
    #Nearest Neighbor
    neighbors.KNeighborsClassifier(),
    ]
MLA_columns = []
MLA_compare = pd.DataFrame(columns = MLA_columns)

row_index = 0
for alg in MLA:  
    
    predicted = alg.fit(x_train, y_train).predict(x_test)
    fp, tp, th = roc_curve(y_test, predicted)
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index,'MLA used'] = MLA_name
    MLA_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(x_train, y_train), 4)
    MLA_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(x_test, y_test), 4)
    MLA_compare.loc[row_index, 'Precision'] = precision_score(y_test, predicted)
    MLA_compare.loc[row_index, 'Recall'] = recall_score(y_test, predicted)
    MLA_compare.loc[row_index, 'AUC'] = auc(fp, tp)

    row_index+=1
    
MLA_compare.sort_values(by = ['Test Accuracy'], ascending = False, inplace = True)    
MLA_compare
Comparison of all machine learning algorithms

# Creating plot to show the train accuracy
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Train Accuracy",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA Train Accuracy Comparison')
plt.show()
MLA train accuracy comparison
# Creating plot to show the test accuracy
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Test Accuracy",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('Accuracy of different machine learning models')
plt.show()
Accuracy of different machine learning algorithms
# Creating plots to compare precision of the MLAs
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Precision",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('Comparing different Machine Learning Models')
plt.show()
Comparing different machine learning algorithms
# Creating plots for MLA recall comparison
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Recall values",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA Recall Comparison')
plt.show()
Recall comparison of all Machine learning algorithms
# Creating plot for MLA AUC comparison
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="AUC values",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA AUC Comparison')
plt.show()
MLA AUC comparison

Creating ROC for all the models applied

The Receiver Operating Characteristic (ROC) curve is a very important tool for diagnosing the performance of MLAs by plotting the true positive rates against the false positive rates at different threshold levels. The area under the ROC curve, often called AUC, is also a good measure of the predictability of the machine learning algorithms. A higher AUC is an indication of more accurate prediction.

# Creating plot to show the ROC for all MLA
index = 1
for alg in MLA:
    
    
    predicted = alg.fit(x_train, y_train).predict(x_test)
    fp, tp, th = roc_curve(y_test, predicted)
    roc_auc_mla = auc(fp, tp)
    MLA_name = alg.__class__.__name__
    plt.plot(fp, tp, lw=2, alpha=0.3, label='ROC %s (AUC = %0.2f)'  % (MLA_name, roc_auc_mla))
   
    index+=1

plt.title('ROC Curve comparison')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.plot([0,1],[0,1],'r--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')    
plt.show()
ROC curve comparison

Conclusion

This post presented a detailed discussion of how we can compare several machine learning algorithms at a time to find out the best one. The comparison task has been completed using different functions of the scikit-learn package of Python. We took the help of some popular fit statistics to draw a comparison between the models. Additionally, the Receiver Operating Characteristic (ROC) curve is also a good measure for comparing several MLAs.

I hope this guide will help you to conclude the problem in hand and to proceed with the best MLA chosen through a rigorous comparison method. Please feel free to try the Python code given here: copy and paste the code into your Python interpreter, run it and apply it to your data. In case of any problem faced in executing the comparison process, write to me in the comments below.


Decision tree for classification and regression using Python

Decision tree

Decision tree classification is a popular supervised machine learning algorithm, frequently used to classify categorical data as well as to regress continuous data. In this article, we will learn how we can implement decision tree classification using the scikit-learn package of Python.

Decision tree classification helps to take vital decisions in the banking and finance sectors, like whether a credit/loan should be given to a customer or not depending on his risk-bearing credentials; in medical settings, like whether a new medicine should be tried on a patient depending on his/her medical history; and in many more fields.

The above two cases are ones where the target variable is binary, i.e. with only two categories of response. There can be cases where the target variable has more than two categories; the decision tree can be applied in such multinomial cases too. The decision tree can also handle both numerical and categorical data. So, no doubt a decision tree gives a lot of liberty to its users.


Introduction to decision tree

Decision tree problems generally consist of some existing conditions which determine the categorical response. If we arrange the conditions and the decisions depending on those conditions, with some of those decisions resulting in further decisions, the whole structure of decision making resembles a tree. Hence the name decision tree.

The first and topmost condition, which initiates the decision-making process, is called the root condition. The nodes branching from the root node are called either leaf nodes or decision nodes, depending on whether they take part in further decision making. In this way, a recursive process continues until all the elements are grouped into particular categories and the final nodes are all leaf nodes.

An example of decision tree

Here we can take an example from the recent COVID-19 epidemic, related to the testing of positive cases. We all know that the main problem with this disease is that it is very infectious, so identifying COVID-positive patients and isolating them is essential to stop its further spread. This needs rigorous testing. But COVID testing is a time-consuming and resource-intensive process. It becomes even more of a challenge in the case of countries like India, with a population of 1.3 billion.

So, if we can categorize which persons actually need testing, it can save a lot of time and resources, and we can straight away downsize the testing population significantly. It is a kind of divide-and-conquer policy. See the decision tree below for classifying persons who need to be tested.

An example of decision tree
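
The exact conditions inside the figure are not reproduced here, but as an informal illustration the same tree structure maps directly onto nested if/else rules. The conditions below are made up purely to show the idea of a root condition, decision nodes and leaf nodes.

# Purely illustrative rules (made-up conditions) showing how a decision tree's
# root condition, decision nodes and leaf nodes map onto nested if/else logic
def needs_test(person):
    if person["has_symptoms"]:                   # root condition
        if person["had_contact_with_patient"]:   # decision node
            return "test immediately"            # leaf node
        return "home quarantine and observe"     # leaf node
    return "no test needed"                      # leaf node

print(needs_test({"has_symptoms": True, "had_contact_with_patient": False}))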

The whole classification process is much like how a human being judges a situation and makes a decision. That's why this machine learning technique is simple to understand and easy to implement. Further, being a non-parametric approach, this algorithm is applicable to any kind of data, even when the distribution is not known.

The distinct character of a decision tree which makes it special among all other machine learning algorithms is that, unlike them, it is a white-box technique. That means the logic used in the classification process is visible to us. Due to its simple logic, the training time for this algorithm is far less, even when the data size is huge with high dimensionality. Moreover, it is the decision tree which forms the foundation of advanced machine learning techniques like random forest, bagging and gradient boosting.

Advantages of decision tree

  • The decision tree has the great advantage of being capable of handling both numerical and categorical variables. Many other modelling techniques can handle only one kind of variable.
  • No data preprocessing is required. Except for handling missing values, no other data processing steps like data standardization or the use of dummy variables for categorical data are required for a decision tree, which saves a lot of the user's time.
  • The assumptions are not too rigid and the model can slightly deviate from them.
  • The decision tree model can be validated through statistical tests and its reliability established easily.
  • As it is a white-box model, the logic behind it is visible to us and we can easily interpret the result, unlike black-box models such as an artificial neural network.

Now, no technique can be without any flaws; there are always some flip sides, and the decision tree is no exception.

Disadvantages of Decision tree

  • A very serious problem with a decision tree is that it is very prone to overfitting. That means the model often becomes too complex and too specific to the training data, so its predictions do not generalize well.
  • The classification by a decision tree generally uses an algorithm which tends to find a locally optimal result at each node. As this process is followed recursively for every node, the whole process ultimately ends up with a locally optimal rather than a globally optimal decision tree.
  • The result obtained from a decision tree is very unstable: a little variation in the data can lead to a completely different classification/regression result. That is why the concept of the random forest/ensemble technique came about; it combines the results of a number of trees instead of relying on a single one.

Classification and Regression Tree (CART)

The decision tree has two main categories: the classification tree and the regression tree. Together these two are called CART. The term was first coined in 1984 by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone.

Classification

When the response is categorical in nature, the decision tree performs classification. Like the examples I gave before: whether a person is sick or not, or whether a product passes or fails a quality test. In all these cases the problem in hand is to assign the target variable to a group.

The target variable can be binomial, that is, with only two categories like yes-no, male-female, sick-not sick etc., or it can be multinomial, that is, with more than two categories. An example of a multinomial variable can be the economic status of people: it can have categories like very rich, rich, middle class, lower-middle class, poor, very poor etc. The benefit of the decision tree is that it is capable of handling both binomial and multinomial variables.

Regression

On the other hand, the decision tree has its application in regression problem when the target variable is of continuous nature. For example, predicting the rainfall of a future date depending on other weather parameters. Here the target variable is a continuous one. So, it is a problem of regression. 

Application of Decision tree with Python

Here we will use the scikit-learn package to implement the decision tree. The package has a function called DecisionTreeClassifier() which is capable of classifying both binomial (target variable with only two classes) and multinomial (target variable having more than two classes) variables.

Performing classification using decision tree

Importing required libraries

The first step of coding is to import all the libraries we are going to use. The basic libraries for any data science project are pandas, numpy, matplotlib etc. Their purpose is discussed in detail in the article simple linear regression with python.

# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

About the data

The example dataset I have used here for demonstration purposes is from kaggle.com. The data, collected by the “National Institute of Diabetes and Digestive and Kidney Diseases”, contains vital parameters of diabetes patients belonging to Pima Indian heritage.

Here is a glimpse of the first ten rows of the data set:

Diabetes data set (first ten rows)

The data set has several physiological parameters of diabetes patients as independent variables. The dependent variable indicates whether the patient is suffering from diabetes: the column contains a binary variable where 1 means the person is diabetic and 0 means they are not.

dataset=pd.read_csv('diabetes.csv')
dataset.head()
# Printing data details
dataset.info()              # for a quick view of the data
print(dataset.head())       # printing first few rows of the data
dataset.tail()              # to show last few rows of the data
dataset.sample(10)          # display a sample of 10 rows from the data
print(dataset.describe())   # printing summary statistics of the data
print(dataset.isnull().sum())  # check for any null values in the data
Checking if the dataset has any null value

Creating variables

As we can see, the data frame contains nine variables in nine columns. The first eight columns contain the independent variables, which are physiological variables having a correlation with diabetes symptoms. The ninth column shows whether the patient is diabetic or not. So, here x stores the independent variables and y stores the dependent variable, the diabetes status.

x=dataset.iloc[:,:-1].values
y=dataset.iloc[:,-1].values

Performing the classification

To do the classification we need to import DecisionTreeClassifier() from sklearn. This classifier can handle binary variables, i.e. variables with only two classes, as well as multiclass variables.

# Use of the classifier
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x, y)

Plotting the tree

Now that the model is ready, we can create the tree. The below line will create the tree.

tree.plot_tree(clf)

Generally the plot thus created is of very low resolution and gets distorted when used as an image. One solution to this problem is to save it in PDF format, so that the resolution is maintained.

# The decision tree creation
tree.plot_tree(clf) 
plt.savefig('DT.pdf')

Another way to print a high-resolution, quality image of the tree is to use the Graphviz format by importing export_graphviz() from tree.

# Creating better graph
import graphviz 
dot_data = tree.export_graphviz(clf, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("diabetes") 
Decision tree created using Graphviz

The tree represents the logic of classification in a very simple way. We can easily understand how the data has been classified and the steps to achieve that.

Performing regression using decision tree

About the data set

The dataset I have used here for demonstration purposes is from https://www.kaggle.com. The dataset contains the height and weight of persons along with a column for their genders. The original dataset has several thousand rows, but for this regression exercise I have used only the first 50 rows, containing data on 25 males and 25 females.

Importing libraries

In addition to the basic libraries we imported in the classification problem, here we will need to import DecisionTreeRegressor() from sklearn.

# Import the necessary modules and libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

Reading the dataset

I have already mentioned the dataset used here for demonstration purposes. The code below imports the data and stores it in a dataframe called dataset.

dataset=pd.read_csv('weight-height.csv')
print(dataset)

Here is a glimpse of the dataset

Dataset for decision tree regression

Creating variables

As we can see, the dataframe contains three variables in three columns. Only the last two columns are of interest to us: we want to regress a person’s weight on their height. So, here the independent variable height is x and the dependent variable weight is y.

x=dataset.iloc[:,1:2].values
y=dataset.iloc[:,-1].values

Splitting the dataset

It is common practice to split the whole dataset into training and testing sets. Here we have set test_size to 20%, which means the training set will consist of 80% of the total data. The test set works as an independent dataset used to test the model after it has been trained with the training data.

# Splitting the data for training and testing
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y, test_size=0.20, random_state=0)

Fitting the decision tree regression

We have fitted the decision tree regression with two different depth values to draw a comparison between them.

# Creating regression models with two different depths
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(x_train, y_train)
regr_2.fit(x_train, y_train)

Prediction

The lines of code below give predictions from both regression models, with their two different depth values, using a new set of independent variable values X_test.

# Making prediction
X_test = np.arange(50,75, 0.5)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)

Visualizing prediction performance

The lines of code below generate a height vs weight scatter plot along with the two prediction lines created from the two regression models.

# Plot the results
plt.figure()
plt.scatter(x, y, s=20, edgecolor="black",
            c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue",
         label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("Height")
plt.ylabel("Weight")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
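
To put a number on the comparison between the two depth settings, a small addition (my own, not part of the original walkthrough) is to compute the mean squared error of each model on the held-out x_test and y_test created in the splitting step above.

# Comparing the two fitted trees on the held-out test split
from sklearn.metrics import mean_squared_error

print("max_depth=2 test MSE:", mean_squared_error(y_test, regr_1.predict(x_test)))
print("max_depth=5 test MSE:", mean_squared_error(y_test, regr_2.predict(x_test)))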

Conclusion

In this post, you have learned about the decision tree and how it can be applied to classification as well as regression problems using scikit-learn in Python.

The decision tree is a popular supervised machine learning algorithm frequently used by data scientists. Its simple logic and easy implementation are the main reasons behind its popularity. Being a white box type of algorithm, we can clearly understand how it does its work.

The DecisionTreeClassifier() and DecisionTreeRegressor() classes of scikit-learn are two very useful tools for applying the decision tree, and I hope you are confident about their use after reading this article.

If you have any questions regarding this article or any confusion about its application in Python, post them in the comments below and I will try my best to answer them.


Artificial Neural Network with Python using Keras library

Artificial Neural Network

An Artificial Neural Network (ANN), as its name suggests, mimics the neural network of our brain; hence it is artificial. The human brain has a highly complicated network of nerve cells to carry sensation to its designated section of the brain. The nerve cells, or neurons, form a network and transfer the sensation from one to another. Similarly, in an ANN a number of inputs pass through several layers, similar to neurons, and ultimately produce an estimation.

Schematic diagram of an Artificial Neural Network

Perceptron: the simplest Artificial Neural Network

When an ANN consists of only one neuron it is called a perceptron. A perceptron takes one or more inputs and produces a single output. It is analogous to a neuron in our brain, with its dendrites and axon.

Depending on your problem, there can be more than one neuron and even layers of neurons. In that situation it is called a multi-layer perceptron. In the above figure, we can see that there are two hidden layers. Generally an ANN is used with 2-3 hidden layers, but theoretically there is no limit.
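
To make the idea concrete, here is a toy perceptron written with numpy: a weighted sum of the inputs passed through a simple step threshold. The input values, weights and bias below are made up purely for illustration.

# A toy perceptron: weighted sum of inputs passed through a step threshold
import numpy as np

def perceptron(inputs, weights, bias):
    weighted_sum = np.dot(inputs, weights) + bias
    return 1 if weighted_sum > 0 else 0          # step activation: fire only above the threshold

x = np.array([0.5, 0.3, 0.2])                    # hypothetical inputs
w = np.array([0.4, -0.6, 0.9])                   # hypothetical weights
print(perceptron(x, w, bias=0.1))                # prints 1, as the weighted sum (0.3) is above 0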

Layers of an Artificial Neural Network

In the above figure you can see that the complete network consists of several layers. Before you start with the application of an ANN, understanding these layers is essential. So, here is a brief idea about the layers an ANN has.

Input layer

The independent variables, having real values, are the components of the input layer. There can be more than one input variable, discrete or continuous. They may need standardization before being fed into the ANN if they have very diverse scales.

Hidden layer

The layers between the input and output are called hidden layers. Here the inputs get associated with weights, and the weighted sum of all these values is calculated.

The information passed from one layer of neurons acts as input for the next layer. The inputs propagate through the neural network and its activation functions, a cost function measures the error, and the network finally yields the output.

Activation function

The weighted sum is then passed through an activation function, which has a very important role in an ANN. This function controls the threshold for the output. Similar to a biological neuron, which fires only when the impulse exceeds a particular threshold, the ANN gives a meaningful output only when the weighted sum crosses a threshold value.
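
For a rough feel of what an activation function does to the weighted sum, here is a small numpy sketch of two common choices, the sigmoid and the relu; this is only an illustration and is not tied to the Keras code that follows.

# Two common activation functions applied to a range of weighted sums
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))    # squashes any value into the (0, 1) range

def relu(z):
    return np.maximum(0, z)        # passes positive values, blocks negative ones

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))
print(relu(z))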

The output

This is the output of ANN. The activation function yields this output from the weighted sum of the inputs.

ANN: a deep learning process

ANN is a deep learning process, a burning topic of data science. Deep learning is basically a subfield of machine learning. You may be familiar with the machine learning process; if not, you can refer to this article for a quick working knowledge of it. Talking about deep learning, in recent times it has found application in almost all ambitious projects. From basic pattern recognition, voice recognition and face recognition to self-driving cars and high-end projects in robotics and artificial intelligence, deep learning is revolutionizing modern applied science.

Read about supervised machine learning here

ANN is a very efficient and popular process for pattern recognition. But the process involves complex computations and several iterations. The advent of high-end computing devices and machine learning frameworks has made our task much easier than ever. Users and researchers can now focus on their research problem without taking the pain of implementing a complex ANN algorithm themselves.

As time passes, easier-to-use modules in various languages are being developed, encapsulating the complexity of such computation processes. Keras is such a framework in Python; built on popular frameworks like TensorFlow and Theano, it has made deep learning and artificial intelligence a common man’s interest.

Here is an exhaustive article on python and how to use it

We are going to use this high-level API, Keras, to apply ANN.

Application of ANN using Keras library

Importing the libraries

The first step of coding is to import all the libraries we are going to use. The basic libraries for any data science project are pandas, numpy, matplotlib etc. Their purpose is discussed in the article simple linear regression with python.

# first neural network with keras tutorial
import pandas as pd
from numpy import loadtxt
from keras.models import Sequential
from keras.layers import Dense

About the data

The example dataset I have used here for demonstration purposes has been downloaded from kaggle.com. The data, collected by the “National Institute of Diabetes and Digestive and Kidney Diseases”, contains vital parameters of diabetes patients belonging to Pima Indian heritage.

Here is a glimpse of the first ten rows of the data set:

Diabetes data set (first ten rows)

The data set has several physiological parameters of diabetes patients as independent variables. The dependent variable indicates whether the patient is suffering from diabetes: the column contains a binary variable where 1 means the person is diabetic and 0 means they are not.

dataset=pd.read_csv('diabetes.csv')
dataset.head()
# Printing data details
dataset.info()              # for a quick view of the data
print(dataset.head())       # printing first few rows of the data
dataset.tail()              # to show last few rows of the data
dataset.sample(10)          # display a sample of 10 rows from the data
print(dataset.describe())   # printing summary statistics of the data
print(dataset.isnull().sum())  # check for any null values in the data
Checking if the dataset has any null value

Creating variables

As we can see, the data frame contains nine variables in nine columns. The first eight columns contain the independent variables, which are physiological variables correlated with diabetes symptoms. The ninth column shows whether the patient is diabetic or not. So, here the independent variables are stored in x and the dependent variable, the diabetes status, is stored in y.

x=dataset.iloc[:,:-1].values
y=dataset.iloc[:,-1].values
print(x)
print(y)

Preprocessing the data

This is standard practice before starting the analysis of any dataset, especially if the dataset has variables with different scales. In this data too we have variables with completely different scales: some of them are fractions whereas others are large whole numbers.

To do away with such differences between the variables, data standardization is very effective. The preprocessing module of the sklearn package has a class called StandardScaler() which does the work for us.

#Normalizing the data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = sc.fit_transform(x)

Create a heat map

Before we proceed to the analysis, we should have a thorough idea about the variables under study and their interrelationships. A very handy way to get quick knowledge about the variables is to create a heat map.

The following code makes a heat map. The seaborn package has the required function to do this.

# Creating heat map for correlation study
import seaborn as sns
import matplotlib.pyplot as plt
corr = dataset.corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.show()
Heat map for correlation study among the variables

The heat map is a very good visualization technique to easily grasp the relation between variables. The colour shades indicate correlation here: lighter shades depict a higher correlation and, as the shades get darker, the correlation decreases.

The diagonal elements of a heat map are always one, as they are the correlation of a variable with itself. As expected, we can find some variables here with higher correlations that were not easy to identify from the raw data, for example pregnancies and age, or insulin, glucose and skin thickness.

Splitting the dataset in training and test data

For testing purposes, we need to set aside a part of the complete dataset which will not be used for model building. The rule of thumb is to use 80% of the data for modelling and keep aside the rest. It will work as an independent dataset, and we then test the fitted model’s performance on it.

# Splitting the data for training and testing
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y, test_size=0.20, random_state=0)

Here this data splitting task has been performed with the help of model_selection module of sklearn library. This module has an inbuilt function called train_test_split which automatically divides the dataset into two parts. The argument test_size controls the proportion of the test data. Here the test size is 0.2 so the test dataset will contain 20% of the complete data.

Modelling the data

So we have completed all the prerequisite steps before modelling the data. Here the response variable is a binary variable taking the values 0 and 1. A multilayer perceptron ANN is well suited to model such data. In this type of ANN, each layer is fully connected to the next one and works as the input for the immediately following layer of neurons.

For a multilayer perceptron, the Keras Sequential model is the easiest way to start. To use it we create model = Sequential(). The activation function here is the common relu function, frequently used when implementing neural networks with Keras.

# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

Compiling the model

Now that the model is defined, we compile it with the adam optimizer and the loss function binary_crossentropy. While the training process continues over several iterations, we can track the model’s accuracy through the ['accuracy'] value passed to the metrics argument.

# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

While compiling the model, the two arguments loss and optimizer play an important role. The loss function generally depends on the particular problem you are addressing with the ANN. For example, if you have a regression problem then the loss function you will typically use is Mean Squared Error (MSE).

In this case as we are dealing with a binary response variable so the loss function here is binary_crossentropy. If the response variable consists of more than two classes then the loss function should be categorical_crossentropy.

In a similar way, the optimization algorithm used here is adam. There are several others, like RMSprop and Stochastic Gradient Descent (SGD), and their selection affects how the model’s learning rate and momentum are tuned.
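
For comparison, here is what the model definition and compile step might look like for a hypothetical response with three classes instead of two: a softmax output layer with one node per class and the categorical_crossentropy loss. This is only a sketch of the alternative configuration, not part of the diabetes example.

# Hypothetical compile step for a three-class response
from keras.models import Sequential
from keras.layers import Dense

multi_model = Sequential()
multi_model.add(Dense(12, input_dim=8, activation='relu'))
multi_model.add(Dense(3, activation='softmax'))            # one output node per class
multi_model.compile(loss='categorical_crossentropy',       # loss for multi-class targets
                    optimizer='adam', metrics=['accuracy'])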

Fitting the model

Fitting the model again has two crucial parameters. Initializing them with suitable values largely determines the model’s efficiency and performance. Here epochs decides how many passes will be made through the training set.

The batch_size, as the name suggests, is the batch of input samples passed at a time through the ANN. It increases the efficiency of the training as the model does not have to process the whole input at once.

# fit the keras model on the training set, keeping the test set aside as validation data
train=model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=100, batch_size=10)

Here the batch_size is 10, so ten samples pass through the network at a time, and training runs for 100 epochs in total. The test set is passed as validation data so that the validation accuracy and loss are recorded for the plots later. In the training output, the model’s accuracy is reported at every epoch.

Evaluating the model

Once the model is trained and compiled we can check its accuracy. Keras has the model.evaluate() function, which here gives an accuracy value of 68.24%. But keep in mind that this accuracy can vary and may change each time the ANN is run.

# evaluate the keras model
_,accuracy = model.evaluate(x_train, y_train)
print('Accuracy: %.2f' % (accuracy*100))

Prediction using the model

Now the model is ready for making predictions. The values of x_test are provided as the ANN inputs.

# make probability predictions with the model
predictions = model.predict(x_test)
# round predictions to the nearest class label
rounded = [round(x[0]) for x in predictions]
print(rounded[:10])
print(y_test[:10])
print(y_test[:10])

I have printed here both the predicted results and the original y_test values (first 10 values only), and the prediction is correct for all of them.

Comparing the predicted values and the original values of the test set (first 10 values only)

Visualizing the model's performance

# Visualizing training process with validation and accuracies
import matplotlib.pyplot as plt
plt.plot(train.history['accuracy'])
plt.plot(train.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
plt.plot(train.history['loss']) 
plt.plot(train.history['val_loss']) 
plt.title('Model loss') 
plt.ylabel('Loss') 
plt.xlabel('Epoch') 
plt.legend(['Train', 'Test'], loc='upper left') 
plt.show()

Conclusion

So we have just completed our first deep learning model to solve a real-world problem. This was a very simple problem with a small dataset, just for demonstration purposes. But the basic principle of fitting an ANN is the same everywhere, irrespective of data complexity and size. What is important is that you know how it works.

Future scope

We have obtained an ANN accuracy of 68.24%, which leaves a lot of scope for improvement. So we need to put in further effort to improve the model. You can start by tweaking the number of layers the network has, the optimization and loss functions used in the model definition, and also the epochs and batch_size. Changing these parameters may result in higher accuracy.

For example, in this particular case, if we increase the number of epochs from 100 to 200 the accuracy increases to 77%, which is quite a jump in model efficiency. Likewise, simple changes to other parameters can also be very helpful.
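
A minimal sketch of that tweak is given below; it simply continues training the model defined above for more epochs (you could equally redefine the model from scratch first). The exact accuracy you get will vary from run to run.

# Re-training with more epochs and checking the accuracy again
train = model.fit(x_train, y_train, validation_data=(x_test, y_test),
                  epochs=200, batch_size=10)
_, accuracy = model.evaluate(x_test, y_test)
print('Accuracy: %.2f' % (accuracy * 100))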

If there is scope, using more sample data to train the model is also an effective way of increasing its prediction efficiency. So, once you have a defined model in your hands, there is always ample scope to improve it.

I hope this article helps you take a big step forward into the vast, dynamic and very interesting world of deep learning and AI.


Logistic regression: classify with python

Logistic regression

Logistic regression is a very common and popular supervised classification process. When we have categorical data in hand and need to make a prediction, we tend to apply logistic regression. Classification is a very popular prediction technique: almost 70% of real-world prediction problems are said to involve a categorical variable and are hence amenable to classification.

Read about supervised machine learning here

This article covers the basic idea of logistic regression and its implementation with Python. The reason for choosing Python to apply logistic regression is simply that Python is the most preferred language among data scientists, and in the near future it is likely to keep ruling the world of data science.

Here is an exhaustive article on python and how to use it

Why logistic regression not “classification”?

So why is it called “regression” when it performs classification? It is a very natural question to ask. The answer is that it is basically a regression process which becomes a classification process when a decision threshold is applied to the prediction. Deciding the threshold for the classification process is very important and a tricky task too.

We need to decide the decision threshold depending on the particular case in hand. There can be four types of responses in a classification problem: “true positive”, “true negative”, “false positive” and “false negative” (we will discuss them in a bit while discussing the confusion matrix). We have to fix the probability of one type of occurrence while reducing another, depending on its severity.

For example, take the case of a severe crime where it is to be decided whether the person is guilty or not. It is a problem of binary classification with two outputs: guilty or not guilty. Here a true positive is the person found guilty when he has actually committed the crime. On the other hand, a false positive is the person found guilty when he has not committed the crime.

So, no doubt the false positive case here is of a very serious type and should be avoided at any cost. Hence, while fixing the decision threshold, you should try to reduce the probability of false positives while keeping the probability of true positives high. A small sketch of applying such a custom threshold is given below.
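
Here is a minimal sketch of how such a threshold is applied in practice: the model outputs probabilities, and we turn them into class labels by comparing against a cut-off of our choosing. The probability values below are made up to stand in for the output of predict_proba().

# Turning predicted probabilities into class labels with a custom threshold
import numpy as np

probabilities = np.array([0.10, 0.35, 0.55, 0.80])   # stand-in for model.predict_proba(x)[:, 1]
threshold = 0.7                                       # stricter than the default 0.5
predictions = (probabilities >= threshold).astype(int)
print(predictions)                                    # [0 0 0 1]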

Here is an exhaustive article on machine learning with python

Logistic regression the basic idea

Though this process is used for classification, it is basically a regression performed on discrete data. Unlike linear regression, which predicts the response of a continuous variable, in logistic regression we predict the probability of the positive outcome of a binary response variable.

Unlike linear regression, which follows a linear function, logistic regression follows a sigmoid (logistic) function:

p = 1 / (1 + e^-(a0 + a1x))

Here p is the predicted probability of the positive outcome, a0 is the intercept and a1 is the coefficient of the predictor x.
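
A quick numerical sketch of the formula above, with a made-up intercept and coefficient, shows how the linear part is squeezed into a probability between 0 and 1:

# The linear predictor passed through the sigmoid gives a probability between 0 and 1
import numpy as np

a0, a1 = -4.0, 0.8                 # hypothetical intercept and coefficient
x = np.array([2.0, 5.0, 8.0])      # hypothetical predictor values
linear_part = a0 + a1 * x
probability = 1 / (1 + np.exp(-linear_part))
print(probability)                 # roughly [0.08 0.5 0.92], rising towards 1 as the linear part grows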

Classification types in logistic regression

Binary/binomial classification

In binary classification, the response under study can generally be classified into two groups. Examples of binary classification problems are almost everywhere in the real world.

Be it a medical test result to identify whether a patient is suffering from a disease, a clinical test to declare whether a product passes or fails industrial quality control, or simply predicting whether it will rain or not, all of these are problems of binary classification. The response can be of only two types, positive (1) or negative (0), corresponding to every duality like “yes-no”, “pass-fail”, “male-female”, “win-loss” etc.

Multinomial classification

Here the response variable has more than two categories and they have no order. For example, the category of employees can be Group A, Group B and Group C; they cannot be arranged in any ascending or descending order.

A good example of such data is the very famous iris dataset of Sir Ronald A. Fisher, regarded as the father of statistics for his remarkable contributions. It is a very popular multivariate dataset and has long been used as an example dataset for all kinds of pattern recognition problems.

The data set contains information on 3 species of iris plant with 50 instances about each species. The dependent variable here is the three species of iris plant without any order.

Ordinal classification

In this case, like the multinomial variable, the response variable has more than two classes, but here the classes can be ranked in some order, like the financial status of a citizen: “very poor”, “poor”, “lower middle class”, “middle class”, “rich”, “very rich”.

Any prediction problem may be a problem of classification or regression. Which prediction tool you will use depends on the type of the response variable. If the response variable is categorical with a binary response, then binary classification is the solution. On the other hand, if the response is a continuous variable, then we have to use regression for prediction.

For example, predicting the price of any product depending on its different specifications is a regression problem. But when we have to determine whether a customer will buy the product or not then it is certainly a problem of binary classification. Because here the response is discrete having only two types of responses possible “buy” and “not buy”.

Learn about supervised machine learning here

Application of logistic regression with python

So, I hope the theoretical part of logistic regression is already clear to you. Now it is time to apply this regression process using python.

So, lets start coding…

About the data

We already know that logistic regression is suitable for categorical data. The example dataset I have used here for demonstration purposes has been downloaded from kaggle.com. The data, collected by the “National Institute of Diabetes and Digestive and Kidney Diseases”, contains vital parameters of diabetes patients belonging to Pima Indian heritage.

Here is a glimpse of the first ten rows of the data set:

Diabetes data set (first ten rows)

The data set has several physiological parameters of diabetes patients as independent variables. The dependent variable indicates whether the patient is suffering from diabetes: the column contains a binary variable where 1 means the person is diabetic and 0 means they are not.

So, our task is to classify using logistic regression, and to predict as accurately as possible whether a person is a diabetes patient from their other vital parameters.

Importing the libraries

The first step of coding is to import all the libraries we are going to use. The basic libraries for any data science project are pandas, numpy, matplotlib etc. Their purpose is discussed in the article simple linear regression with python.

# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Reading the dataset

I have already mentioned the dataset used here for demonstration purposes. The code below imports the data and stores it in a dataframe called dataset.

dataset=pd.read_csv('diabetes.csv')
print(dataset)

Here is a glimpse of the dataset

Diabetes data frame in Python

Creating variables

As we can see, the data frame contains nine variables in nine columns. The first eight columns contain the independent variables, which are physiological variables correlated with diabetes symptoms. The ninth column shows whether the patient is diabetic or not. So, here the independent variables are stored in x and the dependent variable, the diabetes status, is stored in y.

x=dataset.iloc[:,:-1].values
y=dataset.iloc[:,-1].values
print(x)
print(y)

Splitting the dataset in training and test data

For testing purposes, we need to set aside a part of the complete dataset which will not be used for model building. The rule of thumb is to use 80% of the data for modelling and keep aside the rest. It will work as an independent dataset, and we then test the fitted model’s performance on it.

#****** Dividing the dataset into training and testing dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.2, random_state=0)

Here this data splitting task has been performed with the help of model_selection module of sklearn library. This module has an inbuilt function called train_test_split which automatically divides the dataset into two parts. The argument test_size controls the proportion of the test data. Here the test size is 0.2 so the test dataset will contain 20% of the complete data.

Application of logistic regression

Here we will be using the LogisticRegression class from sci-kit learn.

# Importing the logistic regression class and fitting the model
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(x_train, y_train)

After importing LogisticRegression, we will create an instance of the class and then use it to fit the logistic regression on the training dataset.

Predicting using the test data

# Using the fitted model to predict using the test data
y_pred=model.predict(x_test)

As the model has been trained on the training dataset, we will use it to make predictions for the test dataset. The fitted model generates a predicted set called y_pred from x_test. We already know the original values corresponding to x_test, which are in y_test, so we can compare how accurate the prediction is.

Calculating fit statistics

# Calculation different statistics to evaluate model fit
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision score:", metrics.precision_score(y_test, y_pred))
print("Recall score:", metrics.recall_score(y_test, y_pred ))

Scikit-learn also has a module called metrics which has some useful functions to calculate fit statistics like the accuracy score, precision score and recall score.

Model validation statistics

Here we have all three statistics calculated. The accuracy score of 0.82 suggests a good classification: out of every 10 observations, the model classifies about 8 correctly.

The precision and recall scores are also good measures of the classification process. The precision score measures the percentage of positive predictions that are correct. In this case it indicates that, when the logistic regression uses all the physical parameters of a person and predicts that he/she is going to suffer from diabetes, there is a 76% chance that the prediction is correct.

The recall score of 61% says that, of the diabetes patients actually present in the test dataset, the classification process identifies 61% of them.

You can further generate a more detailed report on the classification performance using classification_report() function from sci-kit learn. See below…

# Detailed classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
Detailed classification report

Creating confusion matrix

Creating a confusion matrix is also an effective way to judge the model. In this case, a 2×2 matrix holds the true negative, false positive, false negative and true positive counts in its four quadrants.

A confusion matrix example

The below code creates the confusion matrix using the metrics module of the scikit-learn library.

# Creating confusion matrix to check the accuracy of prediction
# import the metrics class
conf_matrix = metrics.confusion_matrix(y_test, y_pred)
conf_matrix
Confusion matrix

So, here is the desired confusion matrix. Comparing it with the example matrix above, we can say that the logistic regression has produced 98 true negatives, 9 false positives, 18 false negatives and 29 true positives.

Now, what do they mean? The terms are somewhat technical, so let me explain them with respect to this result. Here a true negative means a correct 0 prediction, and there are 98 of them. Likewise, in 29 instances the 1 predictions are correct, so these are the true positives. The number of false positives is 9, that is, 9 predictions of 1 are wrong; and lastly, 18 predictions of 0 are wrong, and these are the false negatives. The small sketch below unpacks these four counts directly from the matrix.
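
If you want these four counts as separate numbers, scikit-learn's 2×2 confusion matrix can simply be unpacked; the sketch below reuses the conf_matrix computed above.

# Unpacking the four quadrants of the 2x2 confusion matrix
tn, fp, fn, tp = conf_matrix.ravel()
print("True negatives:", tn)
print("False positives:", fp)
print("False negatives:", fn)
print("True positives:", tp)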

#Creating a heatmap for the confusion matrix
cm=conf_matrix
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='red')
plt.show()

Creating a ROC curve

A Receiver Operating Characteristic (ROC) curve is a good visualization technique for judging the efficiency of a classification. The curve plots the true positive rate against the false positive rate and hence helps in balancing sensitivity against specificity.

# Creating Reciever Operating Characteristic (ROC) curve
y_pred_proba = model.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
ROC curve for logistic regression

Here you can see an AUC score of 0.87, which suggests a good classification. The score varies between 0 and 1: a score of 1 indicates perfect classification, whereas a score around or below 0.5 suggests a poor classifier.

Conclusion

Logistic regression is a very uncomplicated classification technique based on very simple logic, so the computational resources it requires are comparatively small. Another big plus of this technique is that it does not require feature scaling. So, it is no surprise that logistic regression has always been a favourite choice among data scientists for classification problems.

But as a flip side of such simplicity, logistic regression is not very efficient when there are many classes among the variables; it is also prone to overfitting and cannot handle data of a strongly non-linear nature. There are other machine learning techniques like Naive Bayes, support vector machines, random forest and decision tree which are much more capable than logistic regression in handling complex data.


Random forest regression and classification using Python

Random forest regression and classification

As you all know, in today’s world of data explosion machine learning plays a very crucial role in analysing such huge amounts of data. There are several machine learning algorithms making our lives easier when handling large databases. The random forest algorithm is one of them and can be regarded as among the most important and efficient supervised machine learning techniques.

Random forest is an ensemble learning technique which makes a more accurate prediction by using more than one model at a time instead of only one machine learning method.

The speciality of the random forest is that it is applicable to both regression and classification problems. When the response is categorical, it is a classification problem; on the other hand, if the response is continuous, we should use random forest regression.

Random forest and decision tree

A random forest is a collection of decision trees where each decision tree is trained with a different random sample of the data. The more decision trees a random forest model includes, the more robust and accurate its result becomes, just as we consider a forest robust when it has many trees.

Random forest with n number of decision trees

Random forest actually makes a final prediction from the prediction obtained from each of the decision tree models to overcome the weakness of a single decision tree model. In this sense, the random forest is a bagging type of ensemble technique. 

Now to understand what is bagging we need to know a little about the ensemble method.

Ensemble method 

The random forest provides a much more precise result mainly because it is a kind of ensemble method, which uses more than one machine learning model at a time to improve the accuracy of the prediction.

A schematic diagram of ensemble method

Bagging

The name bagging comes from Bootstrap Aggregation. It is essentially a random sampling technique with replacement: once a sample unit is selected, it is placed back and remains available for future selection. This method works best with algorithms which tend to have high variance, like the decision tree algorithm.

The bagging method runs the different models separately and, for the final prediction, aggregates each model’s output without giving preference to any one model.
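
A tiny numpy sketch makes the bootstrap idea visible: rows are drawn with replacement, so some appear more than once and some not at all. The ten observations here are just placeholders.

# Bootstrap sampling: drawing rows with replacement
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                                   # ten placeholder observations
bootstrap_sample = rng.choice(data, size=data.size, replace=True)
print(bootstrap_sample)                                # duplicates and omissions are expected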

The other ensemble modelling technique is:

Boosting

As an ensemble learning method, boosting also combines a number of models for prediction. It builds the models sequentially and gives more weight to the observations that earlier models got wrong, so that a collection of weak learners becomes a strong one and the overall model performance improves.
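
As one concrete example of boosting (not used elsewhere in this article), scikit-learn's AdaBoostClassifier builds many shallow trees one after another, each focusing on the mistakes of the previous ones. The sketch below scores it with cross-validation on the iris data bundled with scikit-learn.

# A boosting example: AdaBoost combining many shallow trees sequentially
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(booster, X, y, cv=5).mean())     # average cross-validated accuracy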

In the case of decision tree, the main problem is that the prediction is hugely dependent on the training dataset. As soon as the training data changes, the prediction result also differs. And many a time the decision tree also suffers from the problem of overfitting.  

Advantages of random forest

Different modelling approaches have their own merits and demerits. The beauty of this approach is that it is very efficient at handling tabular data of both numerical and categorical nature, with the condition that a categorical variable does not have more than around a hundred categories.

It is a single algorithm which is capable of performing both classification and regression tasks depending on the nature of the data. 

Besides as it combines a no. of decision trees in its process, the prediction becomes much more accurate. If we imagine a decision tree as a single tree then the random forest is literally a forest comprising many decision trees, hence the name random forest.

Random forest is capable of handling large databases and thousands of input variables.

This machine learning method also has a very efficient way of handling missing observations in the dataset.

Application of random forest for regression using Python

This is what you must have been waiting for: using Python libraries to apply random forest to your data. So let’s start coding. We will start with random forest regression on continuous data and then take an example of categorical data and apply the random forest classification technique.

The random forest regression algorithm of the scikit-learn library is a very popular ensemble modelling technique. We will use the RandomForestRegressor() class here to perform the regression.

About the data set

The dataset I have used here for demonstration purposes is downloaded from https://www.kaggle.com. The dataset contains the height and weight of persons along with a column for their genders. The original dataset has several thousand rows, but for this regression exercise I have used only the first 50 rows, containing data on 25 males and 25 females.

So, let’s jump to the most fun part of the article, that is coding with python:

Importing libraries

The first step of coding is to import all the libraries we are going to use. The basic libraries for any data science project are pandas, numpy, matplotlib etc. Their purpose is discussed in the article simple linear regression with python.

# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Reading the dataset

I have already mentioned the dataset used here for demonstration purposes. The code below imports the data and stores it in a dataframe called dataset.

dataset=pd.read_csv('weight-height.csv')
print(dataset)

Here is a glimpse of the dataset

Dataset for random forest regression

Creating variables

As we can see, the dataframe contains three variables in three columns. We are interested only in the last two columns: we want to regress a person’s weight on their height. So, here the independent variable height is stored in x and the dependent variable weight is stored in y.

x=dataset.iloc[:,1:2].values
y=dataset.iloc[:,-1].values
print(x)
print(y)

Fitting random forest regression

The code below uses the RandomForestRegressor() class of sklearn to regress weight on height. Once the fit is ready, I have used it to make a prediction for a value not used in the fitting process; the predicted weight for a height of 45.8 is 100.50.

# Application of random forest regression  
from sklearn.ensemble import RandomForestRegressor # this is the required algorithm for the task 
regressor = RandomForestRegressor(n_estimators = 100, random_state = 0) 
  
# fitting the random forest regression with the data
regressor.fit(x, y)  
#predicting the output
Y_pred = regressor.predict(np.array([45.8]).reshape(1, 1))  
Y_pred
The predicted value for height 45.8

Creating a fit plot with the predicted values

The following code is to visualize the prediction result against the original values. This is a way through which we can visualize how good the regression is performing.

# Creating a plot with the predicted result
X_grid = np.arange(min(x), max(x), 0.01)  
  
# Making the one dimensional X_grid a two dimensional variable                  
X_grid = X_grid.reshape((len(X_grid), 1)) 
  
# Create a scatter plot with the original variables
plt.scatter(x, y, color = 'blue')   
  
# Creating a line with the predicted data
plt.plot(X_grid, regressor.predict(X_grid),  
         color = 'blue')  
plt.title('Random Forest Regression') 
plt.xlabel('Height') 
plt.ylabel('Weight') 
plt.show()

So, here is the regression fit plot.

Fit plot for random forest regression
Fit plot for random forest regression

Application of random forest for classification using Python

So, we learned about random forest regression and how to implement it with Python. Now it is time to implement random forest classification. The same scikit-learn library we used for regression also has a very efficient algorithm for this classification process. Here we will apply the RandomForestClassifier() class of the library.

So, let’s start coding to perform classification using random forest algorithm.

About the data set

The dataset used here is the very famous iris dataset of Sir Ronald A. Fisher, regarded as the father of statistics for his remarkable contributions. It is a very popular multivariate dataset and has long been used as an example dataset for all kinds of pattern recognition problems.

The dataset contains information on 3 species of iris plant with 50 instances of each species. All three classes are linearly separable from each other. The dependent variable here is the species of iris plant, and the four independent variables are sepal length, sepal width, petal length and petal width, measured in cm.

The idea behind the dataset is that the particular species of an iris plant can be identified from these four variables describing the flower characteristics. Here we are going to use the random forest classification algorithm to classify the data and then use the fitted classification model to predict the species of an unknown iris plant from the independent variables.

So, lets start coding…

Importing libraries

The first step of coding is to import all the libraries we are going to use. The basic libraries for any data science project are pandas, numpy, matplotlib etc., together with the sklearn library for the random forest classification algorithm.

Know the functions of all these libraries here.

# importing libraries
import pandas as pd # for dataframe operations
import numpy as np # for matrix operations
from sklearn.model_selection import train_test_split # for splitting the dataset for training and testing dataset
from sklearn import datasets #importing the sklearn library for the iris dataset
from sklearn.ensemble import RandomForestClassifier # for applying random forest classification

Loading the dataset

The iris dataset, being a popular example dataset, is already provided with the sklearn library. We need to load it into our workspace before we can use it. I am storing it under the name dataset.

# loading the iris dataset
dataset=datasets.load_iris() 

Now, to check the dataset, we need to look at the target and the features, i.e. the dependent and independent variable classes of the data. Here we will print this information to check them.

print(dataset.target_names) #printing the target names
print(dataset.feature_names)#printing the feature names
the output view

Storing the data into a dataframe

The data is loaded into the workspace, but until it is in the form of a dataframe we cannot apply the usual data analysis functions to it. So let’s store the data in a dataframe named test.

# creating a dataframe from the dataset
test=pd.DataFrame({'sepal length':dataset.data[:,0],
                  'sepal width':dataset.data[:,1],
                  'petal length':dataset.data[:,2],
                  'petal width':dataset.data[:,3],
                   'species':dataset.target})
test

Below is a view of few rows of the newly created dataframe of dimension 150X5.

Dataset for random forest classification
View of the dataframe containing the iris dataset

Creating dependent and independent variables

To apply a classification algorithm, first of all we need the dependent and independent variables. So here we will create these variables by fetching data from the dataframe.

Now, as we have created the two variables x and y storing the independent and dependent values respectively, we need to split them into training and testing datasets with a proportion of 80% and 20% of the total data respectively.

# Dividing the data for training and testing 
x=test[['sepal length','sepal width','petal length','petal width']]
y=test['species']
x_train, x_test,y_train, y_test=train_test_split(x,y,test_size=0.2, random_state=0)

Application of Random Forest Classification

The code below does the main task of classifying the data using the RandomForestClassifier() of the sklearn library. A variable pred is then created to store the predicted values obtained by applying the fitted classifier to the test dataset.

# applying RandomForest classification algorithm
classify=RandomForestClassifier()
classify.fit(x_train, y_train)
pred=classify.predict(x_test)

Checking the accuracy of the classification fit

The sklearn library also has a function called accuracy_score() which tells how accurate the classification is. Here the accuracy value we get is 0.93, which is quite satisfactory.

# testing the accuracy of the result
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, pred))


Support Vector Regression using Python

Support Vector Regression using Python

Support vector regression (SVR) is a kind of supervised machine learning technique. Though this technique is mainly popular for classification problems, where it is known as the Support Vector Machine, it is well capable of performing regression analysis too. The main emphasis of this article will be on implementing support vector regression with Python.

Python has been selected for the application because Python is the future of data science: it is already the most popular general-purpose programming language among data scientists. Python is not a new language; it came into existence during the 90s, but it took a couple of decades for data science enthusiasts to pick it up as their favourite tool, and around 2010 it started gaining popularity very rapidly.

You can get details about python and its most popular IDE pycharm here.

When we use a support vector machine for a classification problem, it finds a hyperplane to separate the different classes existing in the data. On the other hand, if it is a regression problem, the hyperplane is instead a continuous surface predicting the response from some known predictors.

Support Vector Regression and hyperplane

See the above figure: here the two classes of observations, the red and blue classes, are separated by a hyperplane. It looks very easy, doesn’t it? But sometimes a simple straight line is not enough to classify them. See the below figure.

In this case, no straight line can completely separate all the points. So here we have to create a third dimension.

As a new third axis has been introduced, we can now see that the classes can be easily separated. Now how will it look if the figure is converted back to its two-dimensional version? See it below.

So, a curved hyperplane has now separated the classes very effectively. This is what a support vector machine does: it finds a hyperplane to separate the points, and then any new point is assigned its class depending on which side of the hyperplane it falls.

How SVR is different from traditional regression?

It is a very basic question: why should one go for support vector regression? How is it different from the traditional way of doing regression, i.e. the OLS (Ordinary Least Squares) method?

In the OLS method our purpose is to minimize the error as much as possible: we try to find a line which has the least total distance from all the points. In mathematical notation, the line should minimize the following quantity:

minimize Σ (yi − ŷi)²

where yi is the observed response and ŷi is the predicted response. So, the line should produce the minimum value for the sum of squares of the differences between these two values.

But in the case of support vector regression, the user selects a range within which the error will be tolerated, so the hyperplane/line lies within this range set by the researcher. This range is enclosed by two decision boundaries.

So, in the above figure the green line in the middle is the hyperplane, and the two black lines at equal distances from it limit the error of the prediction. The task of support vector regression is to find the hyperplane with the maximum number of points between these two decision boundaries.
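
In scikit-learn's SVR this tolerance band is controlled by the epsilon argument. The sketch below, on made-up sine-wave data, simply shows that a tighter band keeps more support vectors than a wide one; it is an illustration, not part of the tree biomass example that follows.

# The epsilon argument of SVR sets the width of the error tube around the hyperplane
import numpy as np
from sklearn.svm import SVR

x = np.linspace(0, 10, 50).reshape(-1, 1)        # made-up predictor
y = np.sin(x).ravel()                            # made-up response

narrow_tube = SVR(kernel='rbf', epsilon=0.01).fit(x, y)    # tight error tube
wide_tube = SVR(kernel='rbf', epsilon=0.5).fit(x, y)       # wide error tube
print(len(narrow_tube.support_), len(wide_tube.support_))  # the tighter tube needs more support vectors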

I think the theoretical idea discussed above will give you a clear enough idea about what is support vector regression and what purpose it serves. With this knowledge we will now dive into its implementation part.

Application of Support Vector Regression using Python

So let’s start our main business, that is, the application of Support Vector Regression using Python. To start coding we have to call the same libraries as we used in Simple Linear Regression and Multiple Linear Regression before.

Calling the libraries

We have to import pandas, numpy, matplotlib and seaborn.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

Importing the dataset

Here I have used an imaginary database which contains data on tree total biomass above the ground and several other tree physical parameters like tree commercial bole height,  diameter, height, first forking height, diameter at breast height, basal area. Tree biomass is the dependent variable here which depends on all other independent variables.

Here is a glimpse of the database:

Dataset for  regression

So here the dependent variable is total_biomass(kg) and we will regress it using all the independent variables here.

dataset=pd.read_csv('tree.csv')
dataset
Glimpse of the database

Describing the dataset

To get a first-hand idea about the data in hand, the describe() function of pandas is very useful: all the basic descriptive statistics help us to know the data well.

# take a look of the dataset
dataset.describe()
Descriptive statistics of the dataset

Removing the rows with missing values

This is an important step before you start analysing your data. A raw dataset often contains rows with missing values, and these missing values are a big problem during analysis. Thankfully pandas has a very useful function called dropna() which removes all the rows with missing values.

dataset.columns
#printing the number of rows and columns of the dataset
print(dataset.shape)
# removing the rows with missing values
dataset.dropna(inplace=True)
# again print the row and columns to see there is any change in number of rows
print(dataset.shape)
print(dataset.head(5))

In the output below, the number of rows and columns of the dataset is displayed twice, and the values are the same in both cases. This is because this dataset does not have any missing values, so before and after applying dropna() the number of rows is the same.

If there had been missing values, the row count in the latter case would have been lower.

Producing a heatmap

A heatmap is a very good way to get an idea about the relationships between the variables. The seaborn library has a function for producing heatmaps, where the colour varies from a darker shade to a lighter one as the correlation between the variables gets stronger.

# producing a heatmap to show the correlation between the variables
fig, ax = plt.subplots(figsize=(10,10))
sn.heatmap(dataset.corr(), annot=True, fmt='.1f', cmap='Greens')
Heat map of the variables showing the correlation between them

Creating variables from the dataset

If you check the data set, the last column is for dependent variable and rest are all for independent variables. So, here I have stored all the independent variables in variable x and the dependent in y.

So, here x is a two-dimensional array whereas y is one-dimensional. To make the variables amenable to further analysis, both need to be two-dimensional, so the reshape() function has been used to make y a two-dimensional array.

x=dataset.iloc[:,: -1].values
y=dataset.iloc[:,-1].values
# to convert the one dimensional array to a two dimensional array
y=y.reshape(-1,1)

Feature scaling of the variables

Before using the variables in support vector regression, they need to be feature scaled, because SVR is sensitive to the scale of its inputs. The following code transforms both variables.

# Feature scaling
from sklearn.preprocessing import StandardScaler
std_x=StandardScaler()
std_y=StandardScaler()
x2=std_x.fit_transform(x)
y2=std_y.fit_transform(y)

Fitting the Support Vector Regression

Here comes the most important part of the coding, where we will perform Support Vector Regression using the SVR() class from the svm module of the sklearn library.

# fitting SVR 
from sklearn.svm import SVR
regressor= SVR(kernel='rbf')
# SVR expects a one-dimensional target, so the scaled column vector is flattened with ravel()
regressor.fit(x2, y2.ravel())

Visualizing the prediction result of SVR

Now that we have the model, the next step is to use it for prediction.

# visualizing the model performance
plt.scatter(x2[:,0],y2,color='red')                        # observed values
plt.scatter(x2[:,0],regressor.predict(x2),color='blue')    # predicted values
plt.title('Prediction result of SVR')
plt.xlabel('Tree CBH')
plt.ylabel('Tree Biomass')

For plotting the predicted output I have selected the variable tree CBH on the x-axis. In the scatter diagram, the red points represent the observed values and the blue ones the predicted values. The predicted values plotted against the independent variable clearly show a close match with the observed values. So, we can conclude that the model performs well enough for predicting tree biomass based on different tree physical parameters.
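Note that both axes in the plot above are in the standardized scale. If you want the predictions back in the original units of biomass (kg), the scaler fitted on y can be inverted. Here is a minimal sketch, assuming the std_y, x2 and regressor objects created above:

# converting the scaled predictions back to the original biomass units (kg)
pred_scaled=regressor.predict(x2).reshape(-1,1)    # predictions in the standardized scale
pred_original=std_y.inverse_transform(pred_scaled) # back-transform with the scaler fitted on y
print(pred_original[:5])                           # first five predictions in the original units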


Multiple Linear Regression with Python

Multiple linear regression

Multiple linear regression (MLR) is also a kind of linear regression, but unlike simple linear regression it involves more than one independent variable. It is sometimes loosely called multivariate regression, although strictly that term refers to regression with more than one response variable. As in real-world situations almost all dependent variables are explained by more than one variable, MLR is the most prevalent regression method and can be implemented through machine learning.

Mathematical equation for Multiple Linear Regression

An MLR model can be expressed as:

Y_n = a_0 + a_1 X_n1 + a_2 X_n2 + ⋯ + a_i X_ni + ε_n

In the above model, the variable Y_n represents the response for case n and has a deterministic part and a stochastic part; a_0 is the intercept, ε_n is the random error term, a_i and X_ni are the i-th regression coefficient and the value of the i-th independent variable for case n, respectively, and i runs from 1 to the number of independent variables.

The main purpose of applying this regression technique is to develop a model which can explain as much of the variance in the response as possible using the independent variables. The ratio of the variance explained by the model to the total variance of the response is known as the coefficient of determination, denoted R². We will discuss this statistic in detail later.

It is an important parameter in regression modelling for ascertaining how good the model is. The value of R² varies between 0 and 1. Regarding the fit of the model, we may face three situations: an underfitted model, a good fit and an overfitted model.

Underfit model

This situation arises when the value of R² is low. A low R² value indicates that the proposed model is not explaining the variation of the response adequately, so the model needs improvement.

Good-fit model

In this case we have a good R² value, which suggests a good fit of the model, and it can be used for prediction.

Overfit model

Sometimes models become too complex, with lots of variables and parameters. Such complex models fit the training data too well and give a very high R² value, almost close to 1.0, but they cannot predict well when tested with a different set of data. This is because the model, being too complex, becomes too specific to a particular dataset. Such models are called overfitted models.

Dataset used

The dataset used here is the same one we used in Simple Linear Regression, but in this case all the explanatory/independent variables were considered for modelling purposes. The dataset is an imaginary one, based on my experience of modelling tree data.

The dataset contains data on total tree biomass above the ground and several other tree physical parameters like commercial bole height, diameter, height, first forking height, diameter at breast height and basal area. Tree_biomass is the dependent variable here; it depends on all the other (independent) variables.

Here is a glimpse of the database:

If you find the variables difficult to understand, don’t bother about their names. Take them as two categories of variables: one is the dependent variable, which I have denoted with y here, and the others are independent variables 1, 2, 3, etc. What matters is the relationship between these two categories of variables, whatever their names may be.

Assumptions for multiple linear regression

We conduct the regression process assuming some conditions. Unless these conditions hold, the results of the regression cannot be relied upon. These are called regression assumptions and they are as below:

Assumption of linearity:

There must be a linear relationship between the independent variables and the response variable. The variables in this imaginary dataset have a linear relationship between them. You can easily check this property by plotting the response variable against each of the explanatory variables. 

Assumption of Homoscedasticity:

The residuals or errors, that is, the differences between observed and estimated values, must have constant variance. A quick visual check of this assumption (and the next one) is sketched after these assumptions.

Assumption of multivariate normality:

The residuals should follow a normal distribution. We can prepare a normal quantile-quantile plot to check this assumption, as shown in the sketch after these assumptions.

Assumption of absence of multicollinearity:

There should be no multicollinearity between the independent variables i.e. the independent variables should not be linearly related to each other.
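The homoscedasticity and normality assumptions are checked on the residuals, so they can only be verified after a model has been fitted. The code below is a minimal sketch of two such visual checks; it assumes the fitted regressor and the test arrays x_test and y_test created later in this article, and uses scipy for the quantile-quantile plot.

# residual diagnostics (to be run after the model has been fitted)
import matplotlib.pyplot as plt
from scipy import stats

residuals=y_test-regressor.predict(x_test)      # observed minus predicted values

# normal quantile-quantile plot to check the normality assumption
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Normal Q-Q plot of residuals")
plt.show()

# residuals against fitted values to check homoscedasticity (the spread should stay roughly constant)
plt.scatter(regressor.predict(x_test), residuals)
plt.axhline(0, color='red')
plt.title("Residuals vs fitted values")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

For the multicollinearity assumption, a correlation heatmap of the independent variables (like the one shown in the SVR section above) or variance inflation factors can be used.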

Application of Multiple Linear Regression using Python

The main purpose of this article is to apply multiple linear regression using Python. This is the most important and also the most interesting part, so let’s jump into writing some Python code. As with simple linear regression, the required libraries have to be imported first.

Calling the required libraries

We will be using four main libraries here: NumPy and pandas for handling arrays and data frames, matplotlib for creating plots and sklearn for metrics operations. These are the most important libraries for data science applications.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics

Importing the dataset

To import the tree dataset mentioned earlier, we will use the read_csv() function of the pandas library.

#***** Importing the dataset ***********
dataset=pd.read_csv('tree.csv')

Defining variables

Now the next important task is to tell Python about the dependent and independent variables of the dataset. As per convention, we will store the dependent variable in y and the independent variables in x. As I have already explained above, the dataset contains one dependent variable and 7 independent variables.

So we will store the variables in two NumPy arrays. As x has to store 7 independent variables, it has to be a 2-dimensional array, whereas y, being a variable with only one column, can do with one dimension. So, the Python code for this purpose is as below:

#***** Defining variables *************
x=dataset.iloc[:,: -1].values
y=dataset.iloc[:,-1].values

Here the “:” selects all the rows. As the dependent variable, i.e. tree_biomass, is the extreme right column of the dataset, Python indexes it with -1.

Checking the assumption of the linear relationship between variables

For example, here I have plotted tree_height against the dependent variable tree_biomass. Although it is evident that with the increase of tree height the biomass will certainly increase, a scatterplot is still a very handy visualization technique to double-check the property. You can prepare this plot very easily using the code below:

#********* Plotting dependent variable against any independent variable 
plt.scatter(x[:,2],y) # accessing the variable tree_height
plt.title("Checking linearity between dependent and independent variables")
plt.xlabel("Tree height")
plt.ylabel("Tree biomass")

I have stored the variables in NumPy arrays earlier, so to access them we just have to mention which variable we intend to plot. For plotting we have used the pyplot interface (plt) of the matplotlib library.

And here is the plot:

The plot suggests almost a linear relationship between the variables.

Splitting the dataset in training and test data

For testing purposes, we need to set aside a part of the complete dataset which will not be used for model building. The rule of thumb is to use 80% of the data for modelling and keep the rest aside. It will work as an independent dataset once we come up with the model and need to test it.

#****** Dividing the dataset into training and testing dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.2, random_state=0)

Here this data splitting task has been performed with the help of model_selection module of sklearn library. This module has an inbuilt function called train_test_split which automatically divides the dataset into two parts. The argument test_size controls the proportion of the test data. Here it has been fixed to 0.2 so the test dataset will contain 20% of the complete data.

Application of multiple linear regression

Here comes the main part of this article, that is, using regression to estimate the response from the known values of more than one independent variable. As we have already created the training dataset in the above section, the following code will use it for model building.

#********* Application of regression
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train, y_train)

As this is also a linear regression method, the linear_model module of the sklearn library is the one containing the required class, LinearRegression. Here regressor is an instance created to apply the LinearRegression class.

Getting the regression coefficients for the regression equation

As the regression is done, we need the regression equation. This equation is actually the relation between the dependent and independent variables defined by some coefficients. Using these coefficients we can determine how a unit change in any of the independent variables is going to affect the dependent variable.

#******** Getting the coefficients stored in a dataframe
#*****************************************************************
# storing the column names of the independent variables (all columns except the last one, which is the response)
colnames=dataset.columns[:-1]
print(colnames)
# printing the intercept of the fitted model
print(regressor.intercept_)
# creating a dataframe storing the coefficients along with the independent variable names
coef_df=pd.DataFrame(regressor.coef_, index=colnames, columns=['Coefficients'])
coef_df

In the above section of code, you can see that first the names of the independent variable columns are stored in a variable, and then a dataframe is created pairing them with the corresponding coefficients fetched from the instance regressor created from the LinearRegression class of the linear_model module of sklearn. The coefficients come from regressor.coef_ and the intercept from regressor.intercept_.

Printing the regression coefficients

The regression equation

With the help of these coefficients we can now write down the multiple linear regression equation.

The multiple linear regression equation

So, this is the final equation for the multiple linear regression model.
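The exact coefficient values depend on the (imaginary) data, so instead of typing the equation by hand you can also let Python assemble it from the fitted model. Here is a small sketch, using the regressor and colnames objects created above:

#******** Printing the fitted regression equation
terms=" + ".join("({:.3f} * {})".format(c, name) for c, name in zip(regressor.coef_, colnames))
print("tree_biomass =", "{:.3f}".format(regressor.intercept_), "+", terms)

Each printed coefficient tells you by how much the predicted biomass changes for a unit change in that variable, holding the other variables constant.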

Using the model to predict using the test dataset

Now we have the model in our hand. But how can we test its efficiency? If the model is a good one then it should have the capability to predict with precision. And to test that we will need independent data which was not involved during model building.

Here comes the role of the test dataset that we kept aside at the very beginning. We will predict the response using the test dataset and compare the predictions with the observations we already have in hand. The following code will do the trick for us.
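A minimal sketch of that step is given below; the dataframe column labels are my own choice, but regressor, x_test and y_test are the objects created earlier.

#****** Predicting the response for the test dataset
y_predict=regressor.predict(x_test)
# placing the observed and predicted values side by side in a dataframe
comparison_df=pd.DataFrame({'Observed': y_test, 'Predicted': y_predict})
print(comparison_df.head(15))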

And here is the comparison. I have created a dataframe with the observed and predicted values side by side for ease of comparison.

Comparing the original and predicted values

In the above figure, I have shown only the first 15 values of the dataframe. But it is enough to show that the prediction is satisfactory. 

Goodness of fit of the model 

We have tested the data and got a good prediction using the model. However, we have not quantified it yet; we do not have any number to ascertain how good the model is. Here I will discuss some fit statistics that are very useful in this respect. If we have to compare multiple models, these numbers play a crucial role in finding the best among them.

The code given further below delivers the fit statistics popularly used to judge the goodness of a statistical model. The first of them, the coefficient of determination, denoted R², is the proportion of the variance in the response variable that is explained by the proposed model; the higher its value, the better the model.

Coefficient of determination (R²)

Suppose our test dataset has n pairs of independent and dependent values, i.e. (x1, x2, …, xn) and (y1, y2, …, yn) respectively. Now, using our developed model, we obtain the predicted values (v1, v2, …, vn). So, the total sum of squares will be:
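Written out explicitly (with ȳ denoting the mean of the observed values):

SS_tot = Σ_{i=1}^{n} (y_i − ȳ)²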

This is the total existing variation in the response variable.

Now the variation explained by the model we developed is the regression sum of squares and can be calculated as:
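That is, the variation of the predicted values around the same mean:

SS_reg = Σ_{i=1}^{n} (v_i − ȳ)²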

So as the definition of the coefficient of determination goes, it can be calculated as:
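R² = SS_reg / SS_tot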

Again, this can be further simplified by breaking down the regression sum of squares: the explained variance is the total variance minus the unexplained variance. The unexplained variance is the variance the model is not able to explain; it is also known as the error or residual sum of squares and is calculated as:
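SS_res = Σ_{i=1}^{n} (y_i − v_i)²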

So, now we can rewrite the equation of R² as:
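R² = 1 − SS_res / SS_tot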

#****** Calculating fit statistics
# R square of the fitted model, computed here on the training data
r_square=regressor.score(x_train, y_train)
print('Coefficient of determination (R square):', r_square)
# error metrics computed on the test data using the predictions obtained above
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_predict))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_predict))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_predict)))

Mean Absolute Error(MAE)

This is another popular measure of model fit. As the name suggests, it is based on the simple differences between observed and predicted values. As we are only interested in the size of the deviations, we take the absolute value of the differences. So the expression will be:
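Using the same notation as above (observed y_i, predicted v_i):

MAE = (1/n) Σ_{i=1}^{n} |y_i − v_i|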

As it measures the error of the estimated values, a lower MAE suggests a better model.

Mean Squared Error (MSE)

This is also a measure of the deviation of the model estimates from the original values, but instead of the absolute values we take the squares of the deviations. It is therefore often also called the Mean Squared Deviation (MSD) and is calculated as:
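MSE = (1/n) Σ_{i=1}^{n} (y_i − v_i)²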

Root Mean Squared Error (RMSE)

As the name suggests, this measure of fit first calculates the difference between the observed and model-predicted values, takes the square of each error, then calculates the mean and finally takes the square root to get the RMSE. So its equation is:
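RMSE = √[ (1/n) Σ_{i=1}^{n} (y_i − v_i)² ]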

Fit statistics

How can the fitting further be improved?

There is always scope for improving the model so that it can give more precise predictions. As we already know, the main purpose of multiple linear regression is to ascribe as much of the variance of the response variable as possible to the independent variables.

Now here lies the trick of improving the prediction of a multiple linear regression model. The response variable you are dealing with is affected by a number of explanatory variables. Some of them are straightaway visible to us, and we can say with confidence that they are the main contributors towards the response; taken together, they can already give a good explanation.

But with good knowledge of the domain one can identify many other variables that are not directly recognizable as causal effects. For example, in any agricultural experiment, crop yield is determined by many direct and indirect factors: physiological and chemical variables, weather variables, soil condition and so on.

So, the skill and domain knowledge of the researcher play a vital role in choosing variables wisely in order to improve the model’s fit. Using too few variables will result in a poor R², whereas using too many variables may produce a very complex model with a very high R². In both of these scenarios the model’s performance will not be up to the mark.
