How to create your first machine learning project: a comprehensive guide

Your first machine learning project

This article will help you start your first machine learning project. Machine learning projects are very important if you are serious about a career as a data scientist: you need to build your portfolio with a number of them, as they are evidence of your proficiency and skill in the field.

The projects do not have to tackle complex problems; they can be very basic. What matters is that you complete them. Ideally, in the beginning, you should take on a small project and finish it. Completing it will boost your confidence, and you will learn many new things along the way.

So, to start with, I have selected a very basic problem: classification of the Iris data set. You can compare it with the "Hello world" program that every programmer writes as a beginner. The data set is small, so it is easy to load on your computer, and it has only a few features, which makes implementing any ML algorithm easier.

I have used Google Colab to execute the Python code, but you can use any IDE you prefer. Feel free to copy the code given here and execute it. The first step is to run the existing code without any error. Afterwards, make small changes to see how the output changes or where errors appear. This is the most effective way to learn a new language as well as its application in machine learning.

The steps for your first machine learning project

So, without further ado, let's jump into the project. You first need to chalk out the steps for implementing it:

  • Importing the Python libraries
  • Importing and loading the data set
  • Exploring the data set to have a preliminary idea about the variables
  • Identifying the target and feature variables and the independent-dependent relationship between them
  • Creating training and testing data set
  • Model building and fitting
  • Testing the model on the test set
  • Checking model performance with comparison metrics

This is the ideal sequence in which to proceed with the project. As you gain experience you will not have to consciously remember these steps, but since this is your first machine learning project I felt it necessary to list them for further reference.

Importing the required libraries

# Importing required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np

About the data

The Iris data set was created by Dr R. A. Fisher and is widely distributed, for example through the UCI machine learning repository. It contains three Iris species, "Setosa", "Versicolor" and "Virginica", and four flower features, namely petal length, petal width, sepal length and sepal width, all in cm. Each species represents a class with 50 samples in the data set, so the Iris data has 150 samples in total.

This is perhaps the most popular and basic data set used in pattern recognition to date. Note that the copy bundled with scikit-learn is taken from Fisher's paper and matches the version found in R; it differs slightly from the copy in the UCI machine learning repository, which has two erroneous data points (see the description below).

The following line of code will load the data set in your working environment.

# Loading the data set
dataset = load_iris()

The following code will print a detailed description of the data set.

# Printing the data set description
print(dataset.DESCR)

Description of Iris data

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

Checking the data type

We can check the data type before proceeding to the analytical steps. Use the following code to check it:

# Checking the data type 
print(type(dataset))

Now here is a small issue with the data type. Check the output below: it says the object is a scikit-learn Bunch, not a regular data frame.

Data type

The data type we are most used to, however, is the pandas dataframe. Also, the target and the features are stored separately in this object. You can print them using the following lines.

# Printing the components of Iris data
print(dataset.target_names)
print(dataset.target)
print(dataset.feature_names)

See the print output below. The target classes are the three Iris species "Setosa", "Versicolor" and "Virginica", coded as 0, 1 and 2 respectively, and the feature names are stored separately as well.

Components of the Iris data set

The feature values themselves are stored separately under data. Here are the first few rows:

# Printing the feature data
print(dataset.data)
View of the feature data

Converting the data type

To ease the modelling process further on, we will convert the sklearn Bunch into the more common pandas dataframe. We also need to join the separate data and target arrays, with the column names taken from feature_names plus a target column. NumPy's np.c_ does the column-wise concatenation.
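If np.c_ is new to you, here is a tiny, self-contained sketch (with made-up arrays, not the Iris data) showing what it does:

# A tiny illustration of np.c_: it stacks arrays as columns
import numpy as np
a = np.array([[1, 2], [3, 4]])   # a 2 x 2 array
b = np.array([9, 9])             # one value per row
print(np.c_[a, b])
# [[1 2 9]
#  [3 4 9]]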

# Converting the scikit-learn dataset to a pandas dataframe
import pandas as pd
df = pd.DataFrame(data=np.c_[dataset['data'], dataset['target']],
                  columns=dataset['feature_names'] + ['target'])
df.head()

See below the first few rows of the combined dataframe. With this new dataframe we are ready to proceed to the next step.

The new pandas dataframe

Check the shape of the newly created dataframe as I have done below. The output confirms that the dataframe is now complete with 150 samples and 5 columns.

# Printing the shape of the newly created dataframe
print(df.shape)

Creating target and feature variables

Next, we need to create variables storing the dependent and independent variables. Here the target, the Iris species, is the dependent variable, while the flower properties, i.e. petal width, petal length, sepal length and sepal width, are the independent (feature) variables.

In the data set printed above, you can see that the first four columns hold the independent variables and the last one holds the dependent variable. So, in the lines of code below, variable x stores the values of the first four columns and y stores the target variable.

# Creating target and feature variables
x=df.iloc[:,0:4].values
y=df.iloc[:,4].values
print(x.shape)
print(y.shape)

The shape of x and y is as below.

Shape of x and y

Splitting the data set

We need to split the data set before applying the machine learning algorithms. The train_test_split() function of sklearn is used here for the task, with the test size set to 20% of the data.

# Splitting the data set into train and test set
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.2,random_state=0)
print(x_train.shape)
print(x_test.shape)

Accordingly, the training set contains 120 samples whereas the test set has 30 samples.
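Incidentally, train_test_split() also accepts a stratify argument that keeps the 1:1:1 class proportions of the Iris data in both sets. It is optional for a balanced data set like this one, but here is a sketch of how it would look (the *_s names are just placeholders):

# Optional: a stratified split keeps the class proportions equal in both sets
x_train_s, x_test_s, y_train_s, y_test_s = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y)
print(x_train_s.shape, x_test_s.shape)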

Application of the Decision Tree algorithm

So, we have finished the data processing steps and are ready to apply a machine learning algorithm. For this first machine learning project I have chosen a very popular classification algorithm, the Decision Tree.

If this algorithm is new to you, you can refer to this article to learn the details and how it can be applied with Python. The speciality of this algorithm is that its logic is very simple and the process is not a black box like many other ML algorithms, which means we can see and understand how the decision-making happens.

So let's apply this ML model to the training set of the Iris data. The DecisionTreeClassifier() of sklearn, which we imported at the beginning, is the function used here.

# Application of Decision Tree classification algorithm
dt=DecisionTreeClassifier()
# Fitting the dt model
dt.fit(x_train, y_train)

The model is thus fitted on the training set. In the screenshot of my Colab notebook below you can see that the classifier has several parameters specifying how the decision tree is formed. At this stage you don't need to bother about all these settings; we can discuss each of them and what they do in another article.

Fitting the Decision Tree Classification model
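Because the decision tree is not a black box, we can also print the rules the fitted model has learned. Here is a small sketch using the export_text() helper from sklearn.tree (run it after fitting dt above):

# Printing the decision rules learned by the fitted tree
from sklearn.tree import export_text
print(export_text(dt, feature_names=dataset.feature_names))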

Prediction using the trained model

To test the model we will first create a new data point. As this data has not been used in building the model, the prediction will not be biased.

# Creating a new feature set, i.e. a new flower's properties
x_new = np.array([[4.9, 3.0, 1.4, 0.2]])
# Predicting for the new data using the trained model
prediction = dt.predict(x_new)
print("Prediction:", prediction)

See the prediction result using the trained Decision Tree classifier. It gives 0 as the result, which represents the Iris species "Setosa". As discussed before, the Iris species are represented in the dataframe by the digits 0, 1 and 2.

Prediction for the new data
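If you would rather see the species name than the numeric code, you can look the prediction up in dataset.target_names. A small sketch (the cast to int is needed because the target became a float column when we built the pandas dataframe):

# Translating the numeric prediction back into the species name
predicted_species = dataset.target_names[prediction.astype(int)]
print("Predicted species:", predicted_species)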

Let's now predict on the test set, the 20% of the data kept aside during model training. We will also compute two metrics indicating the goodness of fit of the model.

# Predicting on the test set
y_pred = dt.predict(x_test)
print("Predictions for the test set:", y_pred)
# Metrics for goodness of fit
print("np.mean:", np.mean(y_pred == y_test))
print("dt.score:", dt.score(x_test, y_test))

And the output of the above piece of code is as below.

Prediction using the test set

You can see that the test accuracy score is 1.0! A perfect score on such a small test set should make us cautious rather than confident. Decision trees are particularly prone to overfitting, i.e. fitting the training data so closely that the model does not generalise well to new data. So, ideally, we should try other machine learning models and compare their performance.
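One quick way to check for overfitting is to compare the accuracy on the training set with the accuracy on the test set; a big gap, with the training score much higher, is the classic warning sign. A short sketch:

# Comparing train and test accuracy of the fitted decision tree
print("Train accuracy:", dt.score(x_train, y_train))
print("Test accuracy :", dt.score(x_test, y_test))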

So, in the next section we will not rely on a single ML algorithm; rather we will take a bunch of ML algorithms and test their performance side by side to choose the best-performing one.

Application of more than one ML model simultaneously

In this section, we will fit multiple ML algorithms at once to classify the Iris data and see which of them is the most accurate. The algorithms used here are Linear Discriminant Analysis, the Naive Bayes classifier, Logistic Regression, the Support Vector Machine, the K-Nearest Neighbour classifier, and also the Decision Tree classifier we have already applied, included again just for comparison with the others.

Along with these models I am also going to introduce another category known as ensemble models. The speciality of this approach is that an ensemble model combines more than one machine learning model at a time to achieve a more accurate estimate. See the figure below to understand the process.

A schematic diagram of an ensemble model

There are two main kinds of ensemble models, Bagging and Boosting, and I have incorporated both kinds here to compare them with the other machine learning algorithms. Here is a brief idea of each.

Bagging

The name is short for Bootstrap Aggregation. It is based on random sampling with replacement: once a sample unit is selected, it is put back and can be selected again. This method works best with algorithms that tend to have high variance and low bias, like the decision tree.

The bagging method trains a separate model on each bootstrap sample and, for the final prediction, aggregates the individual estimates, giving no preference to any single model.
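To make the idea of sampling with replacement concrete, here is a tiny sketch that draws one bootstrap sample from the 120 training rows; notice that some rows are picked more than once while others are left out entirely:

# Drawing one bootstrap sample (random sampling with replacement) from the training set
rng = np.random.default_rng(0)
boot_idx = rng.choice(len(x_train), size=len(x_train), replace=True)
x_boot, y_boot = x_train[boot_idx], y_train[boot_idx]
print("Distinct rows in this bootstrap sample:", len(np.unique(boot_idx)), "out of", len(x_train))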

The other ensemble modelling technique is:

Boosting

As an ensemble learning method, boosting also combines a number of models for prediction. It trains weak learners sequentially and re-weights the data so that each new learner focuses on the examples the previous ones got wrong, gradually turning a weak learner into a strong one and boosting the overall model performance.
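As a rough illustration of how boosting builds up its prediction from many weak learners, you can fit AdaBoost with an increasing number of estimators and watch the test accuracy (a sketch; on a data set as easy as Iris the score saturates almost immediately):

# Fitting AdaBoost with an increasing number of weak learners
from sklearn.ensemble import AdaBoostClassifier
for n in (1, 5, 25):
    ada = AdaBoostClassifier(n_estimators=n, random_state=0)
    ada.fit(x_train, y_train)
    print(n, "estimators -> test accuracy:", round(ada.score(x_test, y_test), 3))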

The ensemble models we are going to use here are AdaBoostClassifier(), BaggingClassifier(), ExtraTreesClassifier(), GradientBoostingClassifier() and RandomForestClassifier(). All are from sklearn library.

Importing required libraries

# Importing libraries
from sklearn.model_selection import cross_val_score
from sklearn import ensemble
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
import seaborn as sns

Application of all the models

Use the following lines of code to build, train and evaluate all the models, the six standalone classifiers plus the five ensemble models. The code also creates a dataframe named ml_compare, which stores all the comparison metrics calculated here.

# Application of all the ML algorithms at a time
ml = []
ml.append(('LDA', LinearDiscriminantAnalysis()))
ml.append(('DTC', DecisionTreeClassifier()))
ml.append(('GNB', GaussianNB()))
ml.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
ml.append(('SVM', SVC(gamma='auto')))
ml.append(('KNN', KNeighborsClassifier()))
ml.append(("Ensemble_AdaBoost", ensemble.AdaBoostClassifier()))
ml.append(("Ensemble_Bagging", ensemble.BaggingClassifier()))
ml.append(("Ensemble_Extratree", ensemble.ExtraTreesClassifier()))
ml.append(("Ensemble_GradientBoosting", ensemble.GradientBoostingClassifier()))
ml.append(("Ensemble_RandomForest", ensemble.RandomForestClassifier()))

ml_cols=[]
ml_compare=pd.DataFrame(columns=ml_cols)
row_index=0
# Model evaluation
for name, model in ml:
  model.fit(x_train,y_train)
  predicted=model.predict(x_test)
  kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
  cv_results = cross_val_score(model, x_train, y_train, cv=kfold, scoring='accuracy')
  ml_compare.loc[row_index, 'Model used']=name
  ml_compare.loc[row_index,"Cross Validation Score" ]=round(cv_results.mean(),4)
  ml_compare.loc[row_index,"Cross Value SD" ]=round(cv_results.std(),4)
  ml_compare.loc[row_index,'Train Accuracy'] = round(model.score(x_train, y_train), 4)
  ml_compare.loc[row_index,"Test accuracy" ]=round(model.score(x_test, y_test),4)

  row_index+=1

ml_compare

As each model gets trained on the training set, it is also cross-validated on the training data and evaluated on the test data. The goodness-of-fit statistics are stored in ml_compare, so let's see what it tells us. The output is as below.

Comparative table of cross validation score of all the models

Visual comparison of the models

Although the models can be compared from the table above, it is always easier if there is a way to visualize the differences. So, let's create a bar chart from the scores we have calculated, starting with the training accuracy. Use the following lines of code to create the bar chart with the help of the matplotlib and seaborn libraries.

# Creating plot to show the train accuracy
plt.subplots(figsize=(13,5))
sns.barplot(x="Model used", y="Train Accuracy",data=ml_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('Model Train Accuracy Comparison')
plt.show()

As the above code executes, the following bar chart is created showing the training accuracy of all the ML algorithms.
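The same kind of bar chart can be drawn for the cross-validation scores stored in ml_compare, if you prefer to compare the models on that metric; a sketch in the same plotting style:

# Creating a similar plot for the cross-validation scores
plt.subplots(figsize=(13,5))
sns.barplot(x="Model used", y="Cross Validation Score", data=ml_compare, palette='hot', edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('Model Cross Validation Score Comparison')
plt.show()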

The verdict

So, we have classified the Iris data using several types of machine learning and ensemble models, and the results show that all of them identify the Iris species more or less accurately. If we still need to pick one as the best, we can do so based on the comparative table above as well as the graph.

In this run, Linear Discriminant Analysis and the Support Vector Machine performed slightly better than the others. But the ranking can vary with the size of the data, and the scores do change between executions. Check your own results, see which model you find best, and let me know in the comments below.

Conclusion

So, congratulations, you have successfully completed your very first machine learning project with Python. You have used a popular, classic data set to apply several machine learning algorithms, and being a multiclass data set it is a good example of a real-world classification problem.

To find the best-performing model, we applied six popular machine learning algorithms along with several ensemble models. To start the model-building process, the data set was first divided into training and testing sets.

The training set is used to build and train the model, while the test set is an independent portion kept aside during model building to evaluate the model's performance. This is an empirical way of validating a model when collecting independent data is not possible. For this project, we used an 80:20 ratio for the train and test sets.

Finally, a number of comparison metrics were used to find the model with the highest accuracy. These are essentially the ideal steps of any machine learning project. As this is your first machine learning project, I have shown every step in detail; as you gain experience you may skip some of them as you see fit.

So, please let me know your experience with the article. If you faced any problem while executing the code, or have any other queries, post them in the comment section below; I would love to answer them.
