In today's world of data explosion, machine learning plays a crucial role in analyzing huge amounts of data. Several machine learning algorithms make it easier to handle large databases. The random forest algorithm is one of them and can be regarded as one of the most important and efficient supervised machine learning techniques.
Random forest is an ensemble learning method: it makes a more accurate prediction by combining several models instead of relying on a single one.
A special feature of random forest is that it applies to both regression and classification problems. When the target variable is categorical, the problem is one of classification; when the target variable is continuous, we should use random forest regression.
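The choice between the two variants can be sketched in a few lines. This is only an illustration: choose_forest is a hypothetical helper written for this example, the sample values are made up, and type_of_target is scikit-learn's utility for inspecting a target variable.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.utils.multiclass import type_of_target


def choose_forest(y):
    """Hypothetical helper: pick a forest model from the type of the target y."""
    # type_of_target returns strings such as 'continuous', 'binary' or 'multiclass'
    if type_of_target(y).startswith("continuous"):
        return RandomForestRegressor(n_estimators=100, random_state=0)
    return RandomForestClassifier(n_estimators=100, random_state=0)


weights = [61.5, 72.3, 80.1, 55.8]            # continuous target (made-up values)
species = ["setosa", "virginica", "setosa"]   # categorical target (made-up values)

print(type(choose_forest(weights)).__name__)
print(type(choose_forest(species)).__name__)
```

The continuous target leads to a regressor and the categorical target to a classifier, which is the distinction described above.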
Random forest and decision tree
Random forest is a collection of decision trees in which each tree is trained on a different sample of the dataset. Adding more trees generally makes the result more robust and stable, although the gain levels off beyond a certain number of trees. It is like considering a forest robust when it has many trees.
Random forest makes its final prediction by combining the predictions obtained from each of the decision tree models, which overcomes the weakness of a single decision tree model. In this sense, the random forest is a bagging type of ensemble technique.
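To make the idea of combining tree predictions concrete, here is a minimal sketch on synthetic data (the data and the setting n_estimators=25 are assumptions made only for this illustration). It averages the predictions of the individual trees by hand and compares the result with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, made up purely for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(80, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=80)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

X_new = np.array([[2.5], [7.0]])
# Prediction of every individual tree, shape (n_trees, n_samples)
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
# Averaging the trees by hand reproduces the forest's own prediction
manual_average = per_tree.mean(axis=0)

print(manual_average)
print(forest.predict(X_new))
```

For regression the forest's prediction is the mean of the tree predictions. For classification, scikit-learn combines the trees by averaging their predicted class probabilities rather than by a hard majority vote.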
To understand what bagging is, we need to know a little about ensemble methods.
The random forest gives more precise results mainly because it is an ensemble method, which uses more than one model at a time to improve the accuracy of the prediction.
Bagging is short for Bootstrap Aggregation. It is essentially random sampling with replacement: once a sample unit is selected, it is put back and can be selected again. This method works best with algorithms that tend to have high variance and low bias, like the decision tree algorithm.
The bagging method trains the different models separately and, for the final prediction, aggregates each model's estimate, giving every model equal weight.
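The two halves of bagging, bootstrap sampling and aggregation, can be sketched with numpy alone. The ten sample units and the use of the mean as the "model" are assumptions chosen to keep the illustration small.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # ten sample units, labelled 0..9

# Draw three bootstrap samples: same size as the data, with replacement
samples = [rng.choice(data, size=data.size, replace=True) for _ in range(3)]
for s in samples:
    print(np.sort(s), "distinct units:", np.unique(s).size)

# Aggregation step: combine the per-sample estimates with equal weight
estimates = [s.mean() for s in samples]
bagged_estimate = np.mean(estimates)
print(bagged_estimate)
```

Because units are put back after each draw, a bootstrap sample usually contains some units more than once and misses others, so every model in the ensemble sees a slightly different dataset.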
The other ensemble modelling technique is boosting.
As an ensemble learning method, boosting also combines a number of models for prediction. It assigns weights so that weak learners are combined into a stronger one, thus improving the prediction. The models are trained in sequence, each learning from the errors of the previous ones to boost overall performance.
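As a rough illustration of boosting, the sketch below uses scikit-learn's AdaBoostClassifier on the iris data. AdaBoost is one particular boosting algorithm, picked here because it works with weights as described above; the setting n_estimators=50 is an assumption, and the score is computed on the training data only to show that the combined model fits.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

X, y = load_iris(return_X_y=True)

# Weak learners (shallow trees by default) are fitted one after another;
# samples misclassified so far receive more weight in the next round
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(X, y)

# Each fitted weak learner gets its own weight in the final vote
print(booster.estimator_weights_[:5])
# Accuracy of the combined model on the training data
print(booster.score(X, y))
```

Unlike bagging, the models here are not independent of each other and do not count equally in the final prediction.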
In the case of a decision tree, the main problem is that the prediction depends heavily on the training dataset. As soon as the training data changes, the prediction also changes. Decision trees also frequently suffer from overfitting.
Advantages of random forest
Different modelling approaches have their own merits and demerits. The beauty of this approach is that it is very efficient with tabular data of both numerical and categorical nature, on the condition that a categorical variable has no more than one hundred categories.
It is a single algorithm capable of performing both classification and regression tasks, depending on the nature of the target variable.
Besides, as it combines a number of decision trees, the prediction becomes much more accurate. If we imagine a decision tree as a single tree, then the random forest is literally a forest comprising many decision trees, hence the name random forest.
Random forest is capable of handling large databases with thousands of input variables.
This machine learning method also includes an efficient way of handling missing observations in the dataset.
Application of random forest for regression using Python
This is what you must have been waiting for: using Python libraries to apply random forest to your data. So let's start coding. We will start with random forest regression on continuous data, and then take an example of categorical data and apply the random forest classification technique.
The random forest regression algorithm of the scikit-learn library is a very popular ensemble modelling technique. We will use the RandomForestRegressor() class here to perform the regression.
About the data set
The dataset I have used here for demonstration was downloaded from https://www.kaggle.com. It contains the height and weight of persons and a column with their gender. The original dataset has thousands of rows, but for this regression I have used only the first 50 rows, containing data on 25 males and 25 females.
So, let's jump to the most fun part of the article: coding with Python.
The first step is to import all the libraries we are going to use. The basic libraries for any kind of data science project are pandas, numpy, matplotlib etc. The purpose of these libraries was discussed earlier in the article simple linear regression with python.
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Reading the dataset
I have already described the dataset used here for demonstration. The data is imported and stored in a dataframe called dataset.
Here is a glimpse of the dataset.
As we can see, the dataframe contains three variables in three columns. We are interested in only the last two columns. We want to regress a person's weight on his or her height. So here the independent variable, height, is stored in x and the dependent variable, weight, is stored in y.
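One way to pull x and y out of such a dataframe is sketched below. The rows are made up to stand in for the real file, and the column names Gender, Height and Weight are assumptions; only the three-column layout comes from the description above.

```python
import pandas as pd

# Made-up rows standing in for the real file; the column names are assumptions
dataset = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Height": [73.8, 68.8, 58.9, 65.1],
    "Weight": [241.9, 162.3, 102.1, 165.7],
})

# Independent variable: the slice 1:2 keeps x two dimensional, shape (n, 1),
# which is the layout scikit-learn expects for the feature matrix
x = dataset.iloc[:, 1:2].values
# Dependent variable: a plain one dimensional array, shape (n,)
y = dataset.iloc[:, 2].values

print(x.shape, y.shape)
```

Keeping x two dimensional matters because scikit-learn estimators expect the features as a matrix with one row per observation, even when there is only one feature.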
Fitting random forest regression
The code below uses the RandomForestRegressor() class of sklearn to regress weight on height. Once the fit is ready, I have used it to predict for a value not used in the fitting process. The predicted weight of a person with height 45.8 is 100.50.
# Application of random forest regression
from sklearn.ensemble import RandomForestRegressor # this is the required algorithm for the task
regressor = RandomForestRegressor(n_estimators = 100, random_state = 0)
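# fitting the random forest regression with the data
# (x is assumed to be a two dimensional array of heights, y the matching weights)
regressor.fit(x, y)
# predicting the output for a single new height; reshape turns it into a 1 x 1 array
Y_pred = regressor.predict(np.array([45.8]).reshape(1, 1))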
Creating a fit plot with the predicted values
The following code visualizes the prediction against the original values. This is one way to see how well the regression is performing.
# Creating a plot with the predicted result
X_grid = np.arange(min(x), max(x), 0.01)
# Making the one dimensional X_grid a two dimensional variable
X_grid = X_grid.reshape((len(X_grid), 1))
# Create a scatter plot with the original variables
plt.scatter(x, y, color = 'blue')
# Creating a line with the predicted data
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Random Forest Regression')
plt.show()
So, here is the regression fit plot.
Application of random forest for classification using Python
So, we have learned about random forest regression and how to implement it with Python. Now it is time to implement random forest classification. The same scikit-learn library we used for regression also has a very efficient algorithm for classification. Here we will apply the RandomForestClassifier() class of this library.
So, let's start coding to perform classification using the random forest algorithm.
About the data set
The data set used here is the famous iris data set of Sir Ronald A. Fisher, regarded as the father of statistics for his remarkable contributions. It is a very popular multivariate dataset and has long been used as an example data set for pattern recognition problems.
The data set contains information on 3 species of iris plant, with 50 instances of each species. One class is linearly separable from the other two, while the latter two are not linearly separable from each other. The dependent variable here is the species of iris plant, and the four independent variables are sepal length, sepal width, petal length and petal width, measured in cm.
The idea behind the data set is that the species of any iris plant can be identified from these four variables describing the flower's characteristics. Here we are going to use the random forest classification algorithm to classify the data, and then use the fitted classification model to predict the species of an unknown iris plant from the independent variables.
So, let's start coding…
The first step is to import all the libraries we are going to use. The basic libraries for any kind of data science project are pandas, numpy, matplotlib etc., and with them the sklearn library for the random forest classification algorithm.
# importing libraries
import pandas as pd # for dataframe operations
import numpy as np # for matrix operations
from sklearn.model_selection import train_test_split # for splitting the dataset for training and testing dataset
from sklearn import datasets #importing the sklearn library for the iris dataset
from sklearn.ensemble import RandomForestClassifier # for applying random forest classification
Loading the dataset
The iris dataset, being a popular example dataset, ships with the sklearn library. We need to load it into our workspace before using it. I am storing it under the name dataset.
# loading the iris dataset
dataset = datasets.load_iris()
Now, to check the dataset, we look at the target and the features, i.e. the dependent and independent variables of the data. Here we print this information.
print(dataset.target_names) #printing the target names
print(dataset.feature_names)#printing the feature names
Storing the data into a dataframe
The data is loaded into the workspace, but until it is in the form of a dataframe we cannot apply other data analysis functions. So let's store the data in a dataframe named test.
# creating a dataframe from the dataset
Below is a view of a few rows of the newly created dataframe, of dimension 150 x 5.
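A dataframe of this shape can be built as in the following sketch. The column names 'sepal length', 'sepal width' and 'petal width' are the ones used later in the article; 'petal length' and 'species' are assumed names for the remaining two columns.

```python
import pandas as pd
from sklearn import datasets

dataset = datasets.load_iris()

# Four measurement columns plus one column for the species code.
# 'petal length' and 'species' are assumed names; the others appear in the article.
test = pd.DataFrame({
    'sepal length': dataset.data[:, 0],
    'sepal width': dataset.data[:, 1],
    'petal length': dataset.data[:, 2],
    'petal width': dataset.data[:, 3],
    'species': dataset.target,
})

print(test.head())
print(test.shape)
```

The species column holds the integer codes 0, 1 and 2, which correspond to the entries of dataset.target_names in that order.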
Creating dependent and independent variables
To apply a classification algorithm, we first need the dependent and independent variables. So here we will store these variables, fetching the data from dataset.
Now that we have created two variables, x and y, storing the independent and dependent values respectively, we need to split them into training and testing datasets, holding 80% and 20% of the total data respectively.
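# Dividing the data for training and testing
x=test[['sepal length','sepal width', 'petal width']] # independent variables
y=dataset.target # dependent variable: the species codes from the loaded iris data
x_train, x_test,y_train, y_test=train_test_split(x,y,test_size=0.2, random_state=0)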
Application of Random Forest Classification
This step does the main task of classifying the data using RandomForestClassifier() from the sklearn library. A variable pred is then created to store the values predicted by the fitted classifier on the test dataset.
# applying RandomForest classification algorithm
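This step can be sketched as follows. The block is self-contained so that it runs on its own: it takes the three feature columns selected above directly from the loaded iris data, and n_estimators=100 is an assumed setting rather than one taken from the article.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

dataset = datasets.load_iris()
# Columns 0, 1 and 3: sepal length, sepal width and petal width
x = dataset.data[:, [0, 1, 3]]
y = dataset.target
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

# n_estimators=100 is an assumed setting, not taken from the article
classifier = RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(x_train, y_train)

# Predicted species codes for the 30 held-out flowers
pred = classifier.predict(x_test)
print(pred[:10])
```

Fixing random_state in both the split and the classifier makes the result reproducible from run to run.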
Checking the accuracy of the classification fit
The sklearn library also has a function called accuracy_score(), which tells how accurate the classification is. Here the accuracy we get is 0.93, which is quite satisfactory.
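# testing the accuracy of the result
from sklearn import metrics
# fraction of test samples whose predicted species matches the true one
print(metrics.accuracy_score(y_test, pred))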