Support Vector Regression using Python


Support vector regression (SVR) is a supervised machine learning technique. Although the underlying method is best known for classification problems, where it is called the Support Vector Machine (SVM), it is equally capable of performing regression analysis. The main emphasis of this article is on implementing support vector regression using Python.

Python has been chosen for the implementation because it is already the most popular general-purpose programming language among data scientists. Python is not a new language; it came into existence in the early 1990s, but it took a couple of decades for data science enthusiasts to adopt it as a favourite tool, and around 2010 its popularity started to grow very rapidly.

You can get details about Python and its popular IDE PyCharm here.

When we use a support vector machine for a classification problem, it finds a hyperplane that separates the different classes present in the data. For a regression problem, on the other hand, the hyperplane is instead a continuous line (or surface) that predicts the response for known predictor values.

Support Vector Regression and hyperplane

In the figure above, the two classes of observations, shown in red and blue, are separated by a hyperplane. It looks very easy, doesn't it? But sometimes a simple straight line is not enough to separate them. See the figure below.

In this case, no straight line can completely separate all the points. So here we have to introduce a third dimension.

With a new third axis introduced, we can see that the classes can now be separated easily. Now, how will it look if the figure is converted back to its two-dimensional version? See it below.

So, a curved hyperplane has now separated the classes very effectively. This is essentially what a support vector machine does: it finds a hyperplane that separates the points, and any new point is then assigned a class depending on which side of the hyperplane it lies.
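To make this idea of lifting the data into a higher dimension more concrete, here is a minimal, self-contained sketch; the toy points and the extra feature z = x1^2 + x2^2 are purely illustrative and have nothing to do with the tree dataset used later in this article.

import numpy as np

rng = np.random.default_rng(0)

# inner class clustered around the origin, outer class on a ring of radius 3
inner = rng.normal(0, 0.5, size=(50, 2))
angle = rng.uniform(0, 2 * np.pi, 50)
outer = np.column_stack([3 * np.cos(angle), 3 * np.sin(angle)])
points = np.vstack([inner, outer])

# the new third axis: squared distance from the origin
z = (points ** 2).sum(axis=1)

# the outer ring now sits at z = 9 while the inner cluster stays near 0,
# so a flat plane between the two groups of z values separates the classes;
# projected back to two dimensions, that plane becomes a curved boundary
print(z[:50].max(), z[50:].min())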

How is SVR different from traditional regression?

It is a very basic question: why should one go for support vector regression at all? How is it different from the traditional way of doing regression, i.e. the OLS (Ordinary Least Squares) method?

In the OLS method, the aim is to minimize the error as much as possible: we try to find the line that has the least overall distance from all the points. In mathematical notation, the fitted line should satisfy the following condition:

minimize Σ (y_i − ŷ_i)²

where y_i is the observed response and ŷ_i is the predicted response. So, the line should produce the minimum value for the sum of squared differences between these two values.

In support vector regression, on the other hand, the user can select a range within which the error is allowed to lie. The hyperplane/line is then found within this range set by the researcher, and the range is enclosed by two decision boundaries.

So, in the above figure the green line in the middle is the hyperplane, and the two black lines at equal distances from it are the decision boundaries that limit the error of the prediction. The task of support vector regression is to find the hyperplane that keeps the maximum number of points between these two decision boundaries.
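Before moving on to the real dataset, the following small sketch (with assumed, purely illustrative toy data) shows how the epsilon parameter of scikit-learn's SVR sets the width of this error-insensitive band around the fitted line:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(0, 1.0, 100)   # noisy straight-line data

epsilon = 1.5                                   # chosen width of the band
model = SVR(kernel='linear', C=10.0, epsilon=epsilon)
model.fit(X, y)

# points whose prediction error is within +/- epsilon lie inside the band
# and do not contribute to the loss
inside = np.abs(y - model.predict(X)) <= epsilon
print(inside.sum(), "of", len(y), "points lie inside the epsilon band")

Widening epsilon lets more points fall inside the band and produces a flatter, more tolerant fit; shrinking it forces the line to follow the points more closely.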

I think the theoretical discussion above gives a clear enough idea of what support vector regression is and what purpose it serves. With this knowledge, we can now dive into the implementation part.

Application of Support Vector Regression using Python

So let's start our main task: the application of Support Vector Regression using Python. To start coding, we have to import the same libraries that we used for Simple Linear Regression and Multiple Linear Regression before.

Importing the libraries

We have to import pandas, NumPy, Matplotlib and seaborn.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

Importing the dataset

Here I have used an imaginary dataset which contains data on total tree biomass above the ground and several other tree physical parameters, such as commercial bole height, diameter, height, first forking height, diameter at breast height and basal area. Tree biomass is the dependent variable here, which depends on all the other, independent variables.

Here is a glimpse of the dataset:

Dataset for regression

So here the dependent variable is total_biomass(kg), and we will regress it on all the independent variables.

dataset=pd.read_csv('tree.csv')
dataset
Glimpse of the dataset

Describing the dataset

To get a first-hand idea of the data in hand, the describe() function of pandas is very useful. The basic descriptive statistics it produces help us get to know the data well.

# take a look of the dataset
dataset.describe()
Descriptive statistics of the dataset

Removing the rows with missing values

This is an important step before you start analysing your data. A raw dataset may contain rows with missing values, and these missing values are a big problem during analysis. Thankfully, pandas has a very useful function called dropna() which removes all the rows that contain missing values.

dataset.columns
# printing the number of rows and columns of the dataset
print(dataset.shape)
# removing the rows with missing values
dataset.dropna(inplace=True)
# printing the rows and columns again to see whether the number of rows has changed
print(dataset.shape)
print(dataset.head(5))

In the output below we can see that the number of rows and columns of the dataset has been printed twice, and the values are the same in both cases. This is because the dataset does not contain any missing values, so the number of rows is the same before and after applying dropna().

If there had been missing values, the row count in the second print would have been smaller.

Producing a heatmap

A heatmap is a very good way to get an idea about the relationships between the variables. The seaborn library provides a heatmap function in which the colour shifts from a darker to a lighter shade as the correlation between two variables gets stronger.

# producing a heatmap to show the correlation between the variables
fig, ax = plt.subplots(figsize=(10, 10))
sn.heatmap(dataset.corr(), annot=True, fmt='.1f', cmap='Greens', ax=ax)
plt.show()
Heat map of the variables showing the correlation between them

Creating variables from the dataset

If you check the dataset, the last column holds the dependent variable and the rest are the independent variables. So, here I have stored all the independent variables in the variable x and the dependent variable in y.

Here x is a two-dimensional array, whereas y is one-dimensional. Since the feature-scaling step that follows expects two-dimensional input, the reshape() function has been used to turn y into a two-dimensional array with a single column.

x = dataset.iloc[:, :-1].values   # all columns except the last: independent variables
y = dataset.iloc[:, -1].values    # last column: the dependent variable
# to convert the one-dimensional array into a two-dimensional column array
y = y.reshape(-1, 1)

Feature scaling of the variables

Before being used in support vector regression, the variables need to be feature scaled, since SVR (especially with the RBF kernel) is sensitive to the scale of its inputs. The following code transforms both the predictors and the response.

# Feature scaling
from sklearn.preprocessing import StandardScaler
std_x = StandardScaler()   # scaler for the independent variables
std_y = StandardScaler()   # scaler for the dependent variable
x2 = std_x.fit_transform(x)
y2 = std_y.fit_transform(y)

Fitting the Support Vector Regression

Here comes the most important part of the coding, where we perform support vector regression using the SVR class from the svm module of the scikit-learn library.

# fitting SVR with the radial basis function (RBF) kernel
from sklearn.svm import SVR
regressor = SVR(kernel='rbf')
# ravel() flattens y2 into the one-dimensional array that fit() expects
regressor.fit(x2, y2.ravel())
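Because both the predictors and the response were standardized, a prediction for a new tree has to pass through the same scalers. The sketch below uses purely hypothetical measurement values (one placeholder per independent variable, in the same column order as the dataset) just to show the workflow:

# hypothetical measurements for a new tree; replace with real values,
# one per predictor column of the dataset
new_tree = np.array([[5.2, 23.5, 12.0, 2.1, 25.0, 0.05]])

# scale the inputs with std_x, predict, then bring the prediction
# back to the original biomass scale (kg) with std_y
scaled_pred = regressor.predict(std_x.transform(new_tree))
biomass_kg = std_y.inverse_transform(scaled_pred.reshape(-1, 1))
print(biomass_kg)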

Visualizing the prediction result of SVR

Once we have the model, the next step is to use it for prediction.

# visualizing the model performance
plt.scatter(x2[:, 0], y2, color='red', label='observed')
plt.scatter(x2[:, 0], regressor.predict(x2), color='blue', label='predicted')
plt.title('Prediction result of SVR')
plt.xlabel('Tree CBH')
plt.ylabel('Tree Biomass')
plt.legend()
plt.show()

For plotting the predicted output I have chosen the variable tree CBH for the horizontal axis. In the scatter diagram, the red points represent the observed values and the blue ones the predicted values. The predicted values plotted against this independent variable show a close match with the observed values, so we can conclude that the model performs well enough to predict tree biomass from the different tree physical parameters.
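To put a number on this visual impression, one simple option is the R² score on the (scaled) training data; note that this is an in-sample check rather than a proper test-set validation:

# in-sample R squared of the fitted SVR model
from sklearn.metrics import r2_score

print(r2_score(y2.ravel(), regressor.predict(x2)))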
