This article will help you get started with your first machine learning project. Machine learning projects matter if you are serious about a career as a data scientist: a portfolio of completed projects is evidence of your proficiency and skill in this field.
The projects do not have to tackle complex problems. They can be basic ones built around simple problems; what matters is completing them. Ideally, you should begin with a small project and finish it. Completing it will boost your confidence, and you will learn many new things along the way.
So, to start with, I have also selected a very basic problem: the classification of the Iris data set. You can compare it with the “Hello world” program that every programmer writes as a beginner. The data set is small, so it loads easily on your computer, and it has only a few features, so implementing any ML algorithm on it is straightforward.
I have used Google Colab here to execute the Python code, but you can use any IDE you normally work with. Feel free to copy the code given here and execute it. The first step is to run the existing code without any errors. Afterwards, make small changes to see how the output changes or which errors appear. This is the most effective way to learn a new language as well as its application in machine learning.
The steps for first machine learning project
So, without further ado, let’s jump into the project. You first need to chalk out the steps for implementing the project.
- Importing the Python libraries
- Importing and loading the data set
- Exploring the data set to have a preliminary idea about the variables
- Identifying the target and feature variables and the independent-dependent relationship between them
- Creating training and testing data set
- Model building and fitting
- Testing the data set
- Checking model performance with comparison metrics
This is the ideal sequence in which to proceed with the project. As you gain experience, you will no longer need to look the steps up. Since this is your first machine learning project, I felt it necessary to list them for later reference.
Importing the required libraries
    # Importing required libraries
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    import numpy as np
About the data
The Iris data set was created by Dr R. A. Fisher and is also distributed through the UCI machine learning repository. It contains three Iris species, namely “Setosa”, “Versicolor” and “Virginica”, and four flower features: petal length, petal width, sepal length and sepal width, all in cm. Each species represents a class with 50 samples, so the Iris data has 150 samples in total.
This is perhaps the most popular and basic data set used in pattern recognition to date. Note that, according to the data set description printed further below, the copy shipped with scikit-learn is the same as the Iris data found in R and differs slightly from the UCI version, which has two wrong data points.
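As a quick check of the sample counts described above, the following minimal sketch counts the samples per class. The variable names iris and class_counts are my own choices for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Count how many samples belong to each class code (0, 1, 2)
class_counts = np.bincount(iris.target)

for name, count in zip(iris.target_names, class_counts):
    print(f"{name}: {count} samples")

print("Total samples:", iris.data.shape[0])
print("Number of features:", iris.data.shape[1])
```

If the counts are balanced, every species should report the same number of samples.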
The following line of code will load the data set in your working environment.
    # Loading the data set
    dataset = load_iris()
The following code prints a detailed description of the data set.
    # Printing the data set description
    print(dataset.DESCR)
Description of Iris data
    .. _iris_dataset:

    Iris plants dataset
    --------------------

    **Data Set Characteristics:**

        :Number of Instances: 150 (50 in each of three classes)
        :Number of Attributes: 4 numeric, predictive attributes and the class
        :Attribute Information:
            - sepal length in cm
            - sepal width in cm
            - petal length in cm
            - petal width in cm
            - class:
                    - Iris-Setosa
                    - Iris-Versicolour
                    - Iris-Virginica

        :Summary Statistics:

        ============== ==== ==== ======= ===== ====================
                        Min  Max   Mean    SD   Class Correlation
        ============== ==== ==== ======= ===== ====================
        sepal length:   4.3  7.9   5.84   0.83    0.7826
        sepal width:    2.0  4.4   3.05   0.43   -0.4194
        petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
        petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
        ============== ==== ==== ======= ===== ====================

        :Missing Attribute Values: None
        :Class Distribution: 33.3% for each of 3 classes.
        :Creator: R.A. Fisher
        :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
        :Date: July, 1988

    The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
    from Fisher's paper. Note that it's the same as in R, but not as in the UCI
    Machine Learning Repository, which has two wrong data points.

    This is perhaps the best known database to be found in the
    pattern recognition literature. Fisher's paper is a classic in the field and
    is referenced frequently to this day. (See Duda & Hart, for example.) The
    data set contains 3 classes of 50 instances each, where each class refers to a
    type of iris plant. One class is linearly separable from the other 2; the
    latter are NOT linearly separable from each other.

    .. topic:: References

       - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
         Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
         Mathematical Statistics" (John Wiley, NY, 1950).
       - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
         (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
       - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
         Structure and Classification Rule for Recognition in Partially Exposed
         Environments". IEEE Transactions on Pattern Analysis and Machine
         Intelligence, Vol. PAMI-2, No. 1, 67-71.
       - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
         on Information Theory, May 1972, 431-433.
       - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
         conceptual clustering system finds 3 classes in the data.
       - Many, many more ...
Checking the data type
We can check the data type before proceeding to the analytical steps. Use the following code to check it:
    # Checking the data type
    print(type(dataset))
Now, here is a problem with the data type. The output shows that the data set is a scikit-learn Bunch object, a dictionary-like container.
The data type we are most used to, however, is the Pandas dataframe. In addition, the target and the features are stored separately here. You can print them separately using the following lines.
    # Printing the components of Iris data
    print(dataset.target_names)
    print(dataset.target)
    print(dataset.feature_names)
See the print output below. The target classes are the three Iris species “Setosa”, “Versicolor” and “Virginica”, which are coded as 0, 1 and 2 respectively. The feature names are stored separately as well.
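To make the coding of the species concrete, here is a small sketch of how the integer codes can be turned back into species names. It assumes only the standard attributes of the object returned by load_iris(); the names iris, same_object and species are illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch supports both dictionary-style and attribute-style access
print(sorted(iris.keys()))
same_object = iris["data"] is iris.data

# Indexing the array of names with the integer codes gives one label per sample
species = iris.target_names[iris.target]
print(species[:3], species[-3:])
```

Indexing one NumPy array with another like this is a compact alternative to writing a loop or a lookup dictionary.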
The feature values themselves are stored separately as data. The following line prints them.
    # Printing the feature data
    print(dataset.data)
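Before converting anything, the raw feature array can already be explored with NumPy. The sketch below recomputes the minimum, maximum and mean of each feature so that they can be compared with the summary statistics in the data set description; the variable names are my own and the rounding is only for display.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
features = iris.data  # NumPy array with one row per flower

# Column-wise summary statistics, one value per feature
col_min = features.min(axis=0)
col_max = features.max(axis=0)
col_mean = features.mean(axis=0)

for name, lo, hi, mean in zip(iris.feature_names, col_min, col_max, col_mean):
    print(f"{name}: min={lo:.1f}, max={hi:.1f}, mean={mean:.2f}")
```

Small differences from the printed description are possible because the description rounds its figures.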
Converting the data type
To make the rest of the modelling process easier, we need to convert the data from the scikit-learn container to the more common Pandas dataframe. We also need to concatenate the separate data and target arrays, using feature_names plus target as the column names. NumPy's np.c_ object joins the arrays column-wise.
    # Converting scikit learn dataset to a pandas dataframe
    import pandas as pd
    df = pd.DataFrame(data=np.c_[dataset['data'], dataset['target']],
                      columns=dataset['feature_names'] + ['target'])
    df.head()
The first few rows of the combined dataframe are shown below. With this new dataframe we are ready to proceed to the next step.
Check the shape of the newly created dataframe as I have done below. The output confirms that the dataframe is now complete, with 150 samples and 5 columns.
    # Printing the shape of the newly created dataframe
    print(df.shape)
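As an alternative to the manual concatenation, recent scikit-learn versions can return the data as Pandas objects directly. This is a sketch under the assumption that your installed version supports the as_frame argument of load_iris(); the name df_alt is hypothetical and chosen so it does not clash with the df used in this article.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects
iris = load_iris(as_frame=True)

# .frame holds the four feature columns plus a 'target' column
df_alt = iris.frame
print(df_alt.shape)
print(df_alt.dtypes)
```

One difference from the np.c_ approach is that the target column keeps its integer type here, whereas concatenating it with the floating-point features turns the class codes into floats.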
Creating target and feature variables
Next, we need to create variables that store the dependent and independent variables. Here the target variable, the Iris species, depends on the feature variables, so the flower properties (petal width, petal length, sepal length and sepal width) are the independent variables.
In the data set printed above, you can see that the first four columns are the independent variables and the last one holds the dependent variable. So, in the lines of code below, the variable x stores the values of the first four columns and y stores the target variable.
    # Creating target and feature variables
    x = df.iloc[:, 0:4].values
    y = df.iloc[:, 4].values
    print(x.shape)
    print(y.shape)
The shapes of x and y are (150, 4) and (150,) respectively.
Splitting the data set
We need to split the data set before applying machine learning algorithms. The train_test_split() function of sklearn is used here to do the task. The test data size is set to 20% of the data.
    # Splitting the data set into train and test set
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
    print(x_train.shape)
    print(x_test.shape)
Accordingly, the train data set contains 120 samples whereas the test data set has 30 samples.
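A plain random split does not guarantee that the three species appear in equal proportions in the train and test sets. The sketch below shows one optional variation, not used in the rest of this article: passing stratify to train_test_split() so that each subset keeps the class balance of the full data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both subsets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y
)

print("Train class counts:", np.bincount(y_train))
print("Test class counts:", np.bincount(y_test))
```

With a data set as small as Iris, stratification makes the test score a little less dependent on the luck of the split.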
Application of Decision tree algorithm
So, we have finished the data processing steps and are ready to apply the machine learning algorithm. For this first machine learning project I have chosen a very popular classification algorithm, the Decision Tree.
If this algorithm is new to you, you can refer to this article to learn the details and how it can be applied with Python. The speciality of this ML algorithm is that its logic is very simple and, unlike many other ML algorithms, the process is not a black box. This means that we can see and understand how the decision-making process works.
So let’s apply this ML model to the training set of the Iris data. We use the DecisionTreeClassifier() class of sklearn, which we imported at the beginning.
    # Application of Decision Tree classification algorithm
    dt = DecisionTreeClassifier()
    # Fitting the dt model
    dt.fit(x_train, y_train)
The model is thus fitted to the training set. In the screenshot of my Colab notebook below, you can see that the classifier has several parameters that control how the decision tree is built. At this stage you don’t need to worry about all these specifications. We can discuss each of them and what it does in another article.
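Since a decision tree is not a black box, its learned rules can be printed. The following sketch uses export_text() from sklearn.tree, assuming a scikit-learn version that provides it; the fixed random_state and the variable names tree and rules are my own additions for reproducibility and illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(x_train, y_train)

# export_text renders the learned if/else rules as plain text
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
print("Tree depth:", tree.get_depth())
```

Each indented line in the printout is one threshold test on a flower measurement, and each leaf ends with the predicted class code.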
Prediction using the trained model
To test the model, we will first create a single new observation, that is, one set of flower measurements, and ask the trained model to classify it. Note that these particular measurements also occur in the Iris data itself, so treat this as a quick sanity check rather than an unbiased test.
    # Creating a new feature set i.e. a new flower properties
    x_new = np.array([[4.9, 3.0, 1.4, 0.2]])
    # Predicting for the new data using the trained model
    prediction = dt.predict(x_new)
    print("Prediction:", prediction)
See the prediction result from the trained Decision Tree classifier. It gives the result 0, which represents the iris species “Setosa”. As discussed before, the Iris species are represented in the data frame by the digits 0, 1 and 2.
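A numeric class code is not very readable, so it can help to translate it back into the species name. This is a minimal sketch that refits a tree so the block runs on its own; species_name and proba are illustrative names, and predict_proba() is shown only to give a feel for how confident the tree is.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)
dt = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)

x_new = np.array([[4.9, 3.0, 1.4, 0.2]])

# Convert the predicted class code into the species name
code = int(dt.predict(x_new)[0])
species_name = iris.target_names[code]
print("Predicted species:", species_name)

# Class membership estimates, one column per species
proba = dt.predict_proba(x_new)
print("Class probabilities:", proba)
```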
Let’s now predict the results for the test set, the 20% of the data kept aside during model training. We will also use two metrics that indicate the goodness of fit of the model.
    y_pred = dt.predict(x_test)
    print("Predictions for the test set:", y_pred)
    # Metrics for goodness of fit
    print("np.mean: ", np.mean(y_pred == y_test))
    print("dt.score:", dt.score(x_test, y_test))
And the output of the above piece of code is as below.
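You can see that the testing accuracy score is 1.0! A perfect score should make us cautious. Decision trees are prone to overfitting, which means the model fits one particular data set too closely and does not generalise well. A high test score does not prove overfitting on its own (the usual warning sign is a training accuracy that is clearly higher than the test accuracy), but the test set here has only 30 samples, so a single split can give an overly optimistic picture. Ideally, we should try other machine learning models as well and check their performance.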
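One hedged way to look into the overfitting concern is to compare the training accuracy with the test accuracy and with a cross-validated estimate. The sketch below is my own illustration, not part of the original notebook; the five folds and the fixed random_state are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

dt = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)

# Accuracy on the data the tree has seen versus data it has not seen
train_acc = dt.score(x_train, y_train)
test_acc = dt.score(x_test, y_test)

# Cross-validation averages the accuracy over several different splits
cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), iris.data, iris.target, cv=5
)

print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
print("CV accuracy: mean=%.3f, std=%.3f" % (cv_scores.mean(), cv_scores.std()))
```

A training accuracy well above the cross-validated mean would point to overfitting; scores that sit close together suggest the problem is simply easy.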
So, in the next section we will not take up a single ML algorithm; instead we will take a bunch of ML algorithms and test their performance side by side to choose the best performing one.
Application of more than one ML models simultaneously
In this section, we will fit multiple ML algorithms at a time to classify the Iris data and see which one of them is the most accurate. The ML algorithms we will use here are Linear Discriminant Analysis, the Naive Bayes classifier, Logistic Regression, the Support Vector Machine, the K-Nearest Neighbour classifier and also the Decision Tree classifier, which we have already applied. I am including it here too, just to compare it with the others.
Along with these ML models, I am going to introduce another family known as ensemble models. The speciality of this method is that an ensemble model combines more than one machine learning model at a time to achieve a more accurate estimate. See the figure below to understand the process.
There are two kinds of ensemble models: Bagging and Boosting. I have incorporated both kinds here to compare them with the other machine learning algorithms. Here is a brief idea of the Bagging and Boosting ensemble techniques.
Bagging
The name is short for Bootstrap Aggregation. It is essentially a random sampling technique with replacement: once a sample unit is selected, it is put back and can be selected again. This method works best with algorithms that tend to have high variance, like the decision tree algorithm.
The bagging method trains a separate model on each bootstrap sample and, for the final prediction, aggregates the models’ estimates, giving every model equal weight.
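To make the idea tangible, here is a hand-rolled sketch of bagging with decision trees: bootstrap samples drawn with replacement, one tree per sample, and a majority vote at the end. It is an illustration under my own assumptions (25 trees, fixed seeds, hypothetical variable names) and is not how BaggingClassifier() is implemented internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

rng = np.random.default_rng(0)
n_models = 25  # illustrative choice
n_train = len(x_train)
all_preds = []

for _ in range(n_models):
    # Bootstrap sample: draw n_train row indices with replacement
    idx = rng.integers(0, n_train, size=n_train)
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(x_train[idx], y_train[idx])
    all_preds.append(tree.predict(x_test))

all_preds = np.array(all_preds)  # shape: (n_models, n_test)

# Aggregate by majority vote, giving every tree an equal say
votes = np.apply_along_axis(
    lambda col: np.bincount(col, minlength=3).argmax(), axis=0, arr=all_preds
)
bagged_acc = np.mean(votes == y_test)
print("Bagged test accuracy:", bagged_acc)
```

Because each tree sees a slightly different sample, their individual errors tend to differ, and the vote averages part of that variance away.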
The other ensemble modelling technique is:
Boosting
As an ensemble learning method, boosting also combines a number of models for prediction. It trains weak learners one after another and assigns weights so that each new learner concentrates on the samples the earlier ones got wrong, which turns a weak learning algorithm into a stronger one and improves the prediction.
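The sequential nature of boosting can be observed with AdaBoostClassifier(), which exposes the accuracy after each boosting stage through staged_score(). This is a rough sketch with an arbitrary number of estimators and my own variable names; depending on your scikit-learn version it may also print a warning about the default algorithm.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(x_train, y_train)

# Test accuracy after each successive weak learner is added
staged = list(booster.staged_score(x_test, y_test))

print("Number of boosting stages:", len(staged))
print("Accuracy after the first stage:", staged[0])
print("Accuracy after the last stage:", staged[-1])
```

Comparing the first and last values gives a feel for how much the later learners add on top of the first weak one.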
The ensemble models we are going to use here are AdaBoostClassifier(), BaggingClassifier(), ExtraTreesClassifier(), GradientBoostingClassifier() and RandomForestClassifier(). All of them come from the sklearn library.
Importing required libraries
    # Importing libraries
    from sklearn.model_selection import cross_val_score
    from sklearn import ensemble
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import classification_report
    from sklearn.metrics import accuracy_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    # KNeighborsClassifier is used below, so it has to be imported as well
    from sklearn.neighbors import KNeighborsClassifier
    import matplotlib.pyplot as plt
    import seaborn as sns
Application of all the models
Use the following lines of code to build, train and evaluate all the models: the six individual algorithms and the five ensemble models. The code also creates a dataframe named ml_compare, which stores all the comparison metrics calculated here.
    # Application of all the ML algorithms at a time
    ml = []
    ml.append(('LDA', LinearDiscriminantAnalysis()))
    ml.append(('DTC', DecisionTreeClassifier()))
    ml.append(('GNB', GaussianNB()))
    # Newer scikit-learn releases deprecate multi_class; remove it if you see a warning or error
    ml.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
    ml.append(('SVM', SVC(gamma='auto')))
    ml.append(('KNN', KNeighborsClassifier()))
    ml.append(("Ensemble_AdaBoost", ensemble.AdaBoostClassifier()))
    ml.append(("Ensemble_Bagging", ensemble.BaggingClassifier()))
    ml.append(("Ensemble_Extratree", ensemble.ExtraTreesClassifier()))
    ml.append(("Ensemble_GradientBoosting", ensemble.GradientBoostingClassifier()))
    ml.append(("Ensemble_RandomForest", ensemble.RandomForestClassifier()))

    # Empty dataframe; the metric columns are created as the rows are filled in
    ml_cols = []
    ml_compare = pd.DataFrame(columns=ml_cols)
    row_index = 0

    # Model evaluation
    for name, model in ml:
        model.fit(x_train, y_train)
        predicted = model.predict(x_test)
        kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
        cv_results = cross_val_score(model, x_train, y_train, cv=kfold, scoring='accuracy')
        ml_compare.loc[row_index, 'Model used'] = name
        ml_compare.loc[row_index, "Cross Validation Score"] = round(cv_results.mean(), 4)
        ml_compare.loc[row_index, "Cross Value SD"] = round(cv_results.std(), 4)
        ml_compare.loc[row_index, 'Train Accuracy'] = round(model.score(x_train, y_train), 4)
        ml_compare.loc[row_index, "Test accuracy"] = round(model.score(x_test, y_test), 4)
        row_index += 1

    ml_compare
As all the models get trained with the train set, they are also tested with the test data. The goodness of fit statistics get stored in ml_compare. So, let’s now see what ml_compare tells us. The output is as below.
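The cross-validation score in the table relies on StratifiedKFold, which splits the data into folds that preserve the class proportions. The sketch below shows what such folds look like; for simplicity it uses the full Iris data rather than the training subset used in this article, and fold_counts is a name I made up for the illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

x, y = load_iris(return_X_y=True)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

fold_counts = []
for train_idx, val_idx in kfold.split(x, y):
    # Class counts inside the held-out part of this fold
    fold_counts.append(np.bincount(y[val_idx], minlength=3).tolist())

for i, counts in enumerate(fold_counts, start=1):
    print(f"Fold {i}: validation class counts = {counts}")
```

Each model is trained ten times, every time holding out a different fold, and the mean and standard deviation of those ten accuracies are what end up in ml_compare.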
Visual comparison of the models
Although the models can be compared from the above table, it is always easier if there is a way to visualize the difference. So, let’s create a bar chart of the training accuracy calculated above. Use the following lines of code to create the bar chart with the help of the matplotlib and seaborn plotting libraries.
    # Creating plot to show the train accuracy
    plt.subplots(figsize=(13, 5))
    sns.barplot(x="Model used", y="Train Accuracy", data=ml_compare,
                palette='hot', edgecolor=sns.color_palette('dark', 7))
    plt.xticks(rotation=90)
    plt.title('Model Train Accuracy Comparison')
    plt.show()
As the above code executes, the following bar chart is created, showing the training accuracy of all the ML algorithms. To plot the cross-validation score instead, pass the "Cross Validation Score" column as y.
So, we have classified the Iris data using different types of machine learning and ensemble models. The result shows that all of them are more or less accurate in identifying the Iris species. If we still need to pick one of them as the best, we can do that based on the above comparative table as well as the graph.
In this instance, Linear Discriminant Analysis and the Support Vector Machine perform slightly better than the others. But this can vary with the size of the data, and ML scores do change between executions. Check your own result as well, see which model you find best, and let me know through the comments below.
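To get a feel for how much the scores move between executions, one option is to repeat the train-test split with several seeds and look at the spread. The sketch below does this for Linear Discriminant Analysis only; the ten seeds are an arbitrary choice of mine, and the same loop could be wrapped around any of the other models.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)

scores = []
for seed in range(10):
    # A different seed gives a different 80:20 split each time
    x_tr, x_te, y_tr, y_te = train_test_split(
        x, y, test_size=0.2, random_state=seed
    )
    model = LinearDiscriminantAnalysis().fit(x_tr, y_tr)
    scores.append(model.score(x_te, y_te))

print("Test accuracy per split:", np.round(scores, 3))
print("Mean: %.3f, SD: %.3f" % (np.mean(scores), np.std(scores)))
```

If two models differ by less than this spread, the ranking between them should not be taken too seriously.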
So, congratulations, you have successfully completed your very first machine learning project with Python. You have used a popular, classic data set to apply several machine learning algorithms. Being a multiclass data set, it is a good example of a real-world classification problem.
To find the best performing model, we applied six of the most popular machine learning algorithms along with several ensemble models. To start the model building process, we first divided the data set into training and testing sets.
The training set is used to build and train the model. The test data set is an independent data set kept aside while building the model, so that it can be used to test the model’s performance. This is an empirical way of validating a model when collecting independent data is not possible. For this project, we used an 80:20 ratio for the train and test data sets.
Finally, a number of comparison metrics were used to find the model with the highest accuracy. These are essentially the ideal steps of any machine learning project. As this is your first machine learning project, I have shown every step in full detail. As you gain experience, you may skip some of them as you see fit.
So, please let me know about your experience with the article. If you faced any problem while executing the code, or have any other queries, post them in the comment section below. I will love to answer them.