Machine Learning means learning from data: identifying the trends, patterns and insights hidden in the data at hand. It is rapidly gaining popularity across industry sectors. Machine learning skills are essential nowadays for analyzing large amounts of data and building robust models that enable quick decision making and policy optimization.
This article discusses two very easy fixes for a problem faced by almost all Jupyter notebook users while doing data science projects: changing the default working folder of Jupyter notebook, the most preferred IDE of data scientists. I have faced this issue myself.
Although at first it did not seem like a big problem, once you start using Jupyter on a daily basis you want it to start from your directory of choice. It helps you stay organized, with all your data science files in one place.
I searched the internet thoroughly and got many suggestions, but very few of them were really helpful, and it took quite a lot of time to figure out a process that actually works. I thought I would write it down as a blog post so that in future neither I nor my readers have to waste time fixing this issue again.
So, without any further ado, let's jump to the solutions.
NB: Being a non-native English speaker, I always take extra care to proofread my articles with Grammarly. It is the best grammar and spell checker available online. Read my review here of using Grammarly for more than two years.
The first and quickest solution is to run your Jupyter notebook right from the Anaconda PowerShell. You just need to change the directory to the desired one there and run Jupyter notebook. It is that simple. See the image below.
Here you can see that the default working folder of Jupyter notebook was c:\user\Dibyendu, as shown in the PowerShell. I have changed the directory to E: and simply run the command jupyter notebook. Consequently, PowerShell has launched the Jupyter notebook with the start folder as mentioned.
This is very effective and changes the start folder for Jupyter notebook very easily. But the problem is that this change is temporary, and you have to go through the process every time you open the notebook.
To fix this problem, one solution is to create a batch file with these commands and just run this batch file whenever you need to work in Jupyter notebook.
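As a minimal sketch, such a batch file could contain just two commands; the folder path below is only a placeholder for your own working directory.

@echo off
rem Change to the desired working folder (replace this path with your own)
cd /d E:\data-science-projects
rem Launch Jupyter notebook from that folder
jupyter notebook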
Creating a shortcut with the target as the working folder of Jupyter notebook
This solution is my favourite and I personally follow this procedure. Here the steps are explained with screenshots from my system.
You first need to locate the Jupyter notebook app on your computer by right-clicking the application in your menu, as shown in the image below.
Now navigate to the file location and select the application file as in the image below. Copy the file to your desktop or to any location where you want a shortcut of the application.
Now right-click the shortcut and go to the Shortcut tab. The target you can see here is mentioned as "%USERPROFILE%", which points to your user profile folder. That's why it is the default start folder for the notebook.
Now you need to replace the “%USERPROFILE%” part with the exact location of your desired directory.
In the above image you can see that I have replaced "%USERPROFILE%" with the data science folder which contains all of my data science projects. Now just click Apply and then OK. To open Jupyter notebook, click the shortcut and Jupyter will open with your chosen directory as the start folder, as in the image below.
So, the problem is solved. You can use this trick to create multiple shortcuts with different folders as the start folder of Jupyter notebook.
Web scraping, also known as web harvesting, screen scraping or web data extraction, is a way of collecting a large amount of data from the internet. In data science, especially in machine learning, the accuracy of a model largely depends on the amount of data you have. A large amount of data helps to train the model and make it more accurate.
Across all business domains, data plays a crucial role in deciding strategies: competitor price monitoring, consumer sentiment analysis, extracting financial statements and so on. Be it a small business owner or a business tycoon, market data and analytics are something they always need to keep tabs on to survive the cutthroat competition. Every single decision they take towards business expansion is driven by data.
Web-scraped data collected from diverse sources enables real-time analytics, i.e. the data gets analyzed right after it becomes available. There are instances where a delayed data analysis report is of no use. For example, stock price data analysis needs to be real-time for trading. Customer Relationship Management (CRM) is another example of real-time data analytics.
Source of data
So, what is the source of such a large amount of data? Obviously, the internet. There is a lot of open-source data, and there are also websites catering to specialised data. Generally, we visit such sites one at a time and search for the information we are looking for. We put in our query and the required information is fetched from the server.
This process is fine until we need data for a data science project. The amount of data required for a satisfactory machine learning model is huge, and a single website cannot cater to much of it.
Data science projects involve tasks like Natural Language Processing (NLP), image recognition etc., which have revolutionized artificial intelligence applications towards solving our day-to-day needs and even critical, path-breaking scientific problems. In these cases, web scraping is the most favoured and frequently used tool of data scientists.
Web scraping in data science can be defined as the construction of a computer programme which automatically downloads, parses and organizes data from the internet (source: https://www.kdnuggets.com).
Points to remember before you go for web scraping in data science
Now, before you scrape data from any website, you must double-check whether the site allows web scraping. If the website is open-source or categorically mentions that it caters data for private use, there is no issue. Otherwise, you can check the robots.txt file of the site. Sometimes the site clearly mentions whether it has issues with web scraping.
For example, see the robots.txt file of Facebook. You can check it by navigating to the link https://www.facebook.com/robots.txt. There you can see a few lines at the very beginning of the file which categorically mention that "collection of data on Facebook through automated means is prohibited unless you have express written permission from Facebook".
So, checking robots.txt is an effective way of finding out whether data scraping is allowed at all by the website you want to scrape.
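If you prefer to do this check from Python itself, a quick sketch with the requests library could look like the following; it simply downloads the file and prints the first few hundred characters.

# Fetching a site's robots.txt with requests (a quick sketch)
import requests

robots = requests.get("https://www.facebook.com/robots.txt")
print(robots.text[:500])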
Web scraping can be accomplished using web APIs or with tools like BeautifulSoup. BeautifulSoup is a class specially made for web scraping, available in the bs4 package. It is a very helpful package and saves programmers a lot of time. It helps to collect data from HTML and XML files.
Let's try a very basic web scraping code using the BeautifulSoup class of the bs4 package of Python.
A practical example of data scraping
Let's take a practical example where we scrape a data table from a web page. We will take the URL of this page itself and try to scrape the table below. It is an imaginary example table containing the name, gender, age, height and weight of some random persons.
Name       Gender   Age   Height   Weight
Ramesh     Male     18    5.6      59
Dinesh     Male     23    5.0      55
Sam        Male     22    5.5      54
Dipak      Male     15    4.5      49
Rahul      Male     18    5.9      60
Rohit      Male     20    6.0      69
Debesh     Male     25    6.1      70
Deb        Male     21    5.9      56
Debarati   Female   29    5.4      54
Dipankar   Male     22    5.7      56
Smita      Female   25    5.5      60
Dilip      Male     30    4.9      49
Amit       Male     14    4.8      43
Mukesh     Male     26    5.1      50
Aasha      Female   27    4.7      51
Dibakar    Male     22    5.3      55
Manoj      Male     33    6.2      75
Vinod      Male     27    5.2      54
Jyoti      Female   22    5.9      65
An example table for data scraping
Suppose we want to use this table in our data science project. How can we bring the data into a usable format? This table is just an example; usually you will find tables with thousands of rows, spread over a number of web pages. But the process of scraping the data will be the same.
Let's try to scrape this small table using the bs4 library of Python. bs4 stands for BeautifulSoup version 4. The BeautifulSoup class defines the basic interface called by the tree builders.
Importing required libraries
The two special libraries we will need here are BeautifulSoup and requests for scraping information and grabbing the URL content.
# Importing required libraries
import requests
from bs4 import BeautifulSoup as bs # defines the basic interface called by the tree builders
In this section we are importing other basic important libraries like pandas, numpy, matplotlib, seaborn etc.
# Importing other important libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Accessing the web pages and scraping the content
To open this particular page, I used the urlopen function of the urllib library through its request module, and then passed the HTML content to the BeautifulSoup function. Don't bother about the 'lxml' part right now; you will get to know about it later on.
# Opening the particular url using the urlopen function
from urllib.request import urlopen
url='https://dibyendudeb.com/what-is-web-scraping-and-why-it-is-so-important-in-data-science/'
html= urlopen(url)
soup=bs(html,'lxml')
Now, BeautifulSoup has a very useful function called find_all to find all the HTML content with a particular tag. You can explore these tags from the Inspect option when you right-click on any part of a web page.
See the images below to understand how you can identify the particular HTML tag of any specific web page content.
The records of this table are within tr and td tags, which clearly indicates that we need to apply the find_all function with these two tags.
So let's apply find_all to get all the values with these two tags from this web page and then create string objects from them. The resulting content is stored in a list.
Creating list of web scraped content
# Printing all rows of the table
records=soup.find_all('tr')
# Creating a list with the text
text_list=[]
for row in records:
    row_store=row.find_all('td')               # all cells of the current row
    text_store=str(row_store)                  # creating a string object from the tag list
    onlytext=bs(text_store,'lxml').get_text()  # extracting only the text with get_text()
    text_list.append(onlytext)
In the next step, we need to create a data frame from this list to make the data ready for further analysis. Print the data frame to see the records of the table within the HTML tags we mentioned in the code.
But this data needs to be split into separate records according to the comma-separated values. The following lines of code create a properly shaped data structure with multiple columns.
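The exact code for this step is not reproduced here, so the following is only a sketch of how df1 could be built from text_list by splitting each row on the commas; the variable names simply match the ones used in the later steps.

# A sketch: building df1 by splitting the comma-separated row strings
df = pd.DataFrame(text_list)
df1 = df[0].str.split(',', expand=True)  # one column per table cell
df1.head(10)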
Some more refinement is needed here. You can notice some unwanted brackets present in the records. The following code will fix these issues.
# Removing the opening bracket from column 0
df1[0] = df1[0].str.strip('[')
# Removing the closing bracket from column 4
df1[4] = df1[4].str.strip(']')
df1.head(10)
Creating table header
The table currently has default numeric column names, which need to be corrected; we want the table header values as the column names. Let's make this change. The following few sections first extract the header row into a data frame and then concatenate it with the data to create the final data frame with the desired column names, step by step.
# Storing the table headers in a variable
headers = soup.find_all('strong')
# Using BeautifulSoup again to arrange the header tags
header_list = []# creating a list of the header values
col_headers = str(headers)
header_only = bs(col_headers, "lxml").get_text()
header_list.append(header_only)
print(header_list)
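The code that combines the header row with the scraped data into df4 is not shown in the original, so here is a minimal sketch of how it could be done; the intermediate names df2 and df3 are just illustrative.

# A sketch: placing the header row on top of the data rows to form df4
df2 = pd.DataFrame(header_list)                 # a single string holding all the headers
df3 = df2[0].str.split(',', expand=True)        # split the headers into separate columns
df4 = pd.concat([df3, df1], ignore_index=True)  # header row followed by the data rows
df4.head()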
After the above step, we now have an almost complete table with us, though it needs some more refinement. Let's start with the first step: we need the header values as the column names of the table. So, here we rename the columns of the data frame.
# Assigning the first row as table header
df4=df4.rename(columns=df4.iloc[0])
df4.head(10)
You can see that the table header has been replicated as the first record in the table, so we need to correct this problem. Let's drop the repeated first row from the data frame.
# Dropping the repeated row from the data frame
df5 = df4.drop(df4.index[0])
df5.head()
So, as we now have almost the final table, let's explore the basic information about the data in our hand.
df5.info()
df5.shape
Checking for missing values
Although the table on this web page does not have any missing values, it is still good practice to check for and eliminate any row with missing values. Here the dropna function does the trick for us.
# Eliminating rows with any missing value
df5 = df5.dropna(axis=0, how='any')
df5.head()
If you print the columns separately, you can notice some unwanted spaces and brackets in the column names. Let's get rid of them and we are done refining the data frame. The blank spaces in the column names are clear when we print them.
df5.columns
These white spaces may cause problems when we refer to the columns during the analysis process. So, we need to remove them with the help of the following code.
# Some more data refinement: cleaning up the column names
df5.rename(columns={'[Name': 'Name'},inplace=True)
df5.rename(columns={' Weight]': 'Weight'},inplace=True)
df5.rename(columns={' Height': 'Height'},inplace=True)
df5.rename(columns={' Age': 'Age'},inplace=True)
df5.rename(columns={' Gender': 'Gender'},inplace=True)
print(df5.head())
So, here is the final table with the required information.
Exploring the web scraped data
Here we will explore basic statistics about the data. The data has two main numeric variables, "Weight" and "Height"; let's get their description.
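The code for this step is not shown in the original; a minimal sketch is given below. Note that the scraped values are still strings at this point, so they are converted to numbers first.

# A sketch: converting the scraped strings to numbers and describing them
df5['Height'] = df5['Height'].astype(float)
df5['Weight'] = df5['Weight'].astype(float)
print(df5[['Height', 'Weight']].describe())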
A histogram is also a good data exploration technique, describing the distribution of a variable. We can check whether the data is normal or deviates from normality.
#histogram
sns.distplot(df5['Height']);
Next, the relationship between the two main variables: we will plot a scatter diagram between height and weight to see how they are correlated.
# Relationship between height and weight using a scatterplot technique
df5.plot.scatter(x='Height', y='Weight',ylim=(0.800000))
Thus we have completed web scraping a table from a web page. The technique demonstrated above is applicable to all similar cases of data scraping, whatever the size of the data. The data gets stored in a Python data frame and can be used for any kind of analysis.
Web scraping in data science from multiple web pages
Now for a more complex and rather practical example of web scraping. Many times the particular information we are looking for is scattered across more than one page. In this case, some additional skill is required to scrape data from these pages.
Here we will take such an example from this website itself and try to scrape the titles of all the articles written. This is only one parameter that we want to collect, but the information is spread across multiple pages.
Taking one parameter keeps the code less complex and easy to understand, yet it is equally instructive, as the process of scraping one parameter and multiple parameters is the same.
The index page with URL https://dibyendudeb.com has a total of five pages containing the list of all the articles the website contains. So, we will navigate through all these pages, grab the article titles in a for loop and scrape them using the BeautifulSoup method.
Importing libraries
To start with the coding, the first step is, as usual, importing the required libraries. Besides the regular libraries like pandas, NumPy, matplotlib and seaborn, we need to import the specialized libraries for web scraping: BeautifulSoup and requests for grabbing the content of web pages.
# Importing required libraries
from bs4 import BeautifulSoup as bs # defines the basic interface called by the tree builders
from requests import get
# Importing other important libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Use of BeautifulSoup for web scraping in data science
Here comes the main scraping part. The most important task is to identify the particular HTML content tag. This is a somewhat experience-based skill: the more you scan the HTML content of web pages, the more comfortable you become in identifying the right tags.
As mentioned before, we want to store the article titles only. So our goal is to identify the particular HTML tag which wraps the title of an article. Let's take a look at the home page of this website, https://dibyendudeb.com. Select any of the article titles, right-click and select Inspect from the options. You will get a view as below.
Here you can see that the exact HTML part of the selected web page gets highlighted. This makes it very easy to identify the particular HTML tag we need to find in the web page. Let's scan the HTML more closely to understand its structure.
Scanning the HTML code for web scraping
The above image presents the nested HTML code with the title section. You can identify here the particular division class containing the title of the article. So, we now know which HTML tag and which unique class we need to find in the scraped HTML content.
The lines of code below grab the particular URLs and store all the required information from all five pages in raw form. We will then walk through this code line by line and explain what each particular piece does.
# Opening the pages using the urlopen function
from urllib.request import urlopen
titles=[]
pages = np.arange(1, 6, 1)  # page numbers 1 to 5 (np.arange excludes the stop value)
for page in pages:
    # Read and open the url of each page to scrape
    page = "https://dibyendudeb.com/page/" + str(page)
    html= urlopen(page)
    soup=bs(html, 'html.parser')
    topics = soup.find_all('div', attrs={'class': 'obfx-grid-col-content'})
    for container in topics:
        title=container.h1.a.text
        titles.append(title)
Creating an empty list
titles=[] declares an empty list to store the titles. Then we create a variable called pages to hold the page numbers 1, 2, 3, 4 and 5. np.arange takes the start value, the stop value (which itself is excluded, hence the 6) and the step we want between them.
Use of for loop
After that, we enter a for loop which iterates through the pages to get the content from all the web pages. When we store the URL in the page variable, we need to pass a part that changes from one web page to the next.
Scanning multiple webpages
If you check the URLs of different pages of the website, you can notice that the page number gets appended to the URL. For example, the URL of page 2 is https://dibyendudeb.com/page/2/. Likewise, pages 3 and 4 get their page numbers in the URL.
Accessing the URL
So, we simply need to concatenate the page number with the URL within the for loop, which has been done with the code "https://dibyendudeb.com/page/" + str(page). We need to convert the page number into a string as it becomes part of the URL.
html= urlopen(page): with this piece of code the page content is stored in the variable html. Next, in soup=bs(html, 'html.parser'), the BeautifulSoup function parses the HTML content into a variable called soup.
As we now have the parsed HTML content of all the pages, we need to find only the article titles. The process of finding the article title has already been explained above.
Use of find_all and nested for loop to populate the list
topics = soup.find_all('div', attrs={'class': 'obfx-grid-col-content'}): here the HTML elements with the required attributes are stored in the variable named topics.
for container in topics: title=container.h1.a.text; titles.append(title): this piece of code is the nested for loop which scans through the content of topics, stores the exact title in the variable title, and finally appends the title to the titles list.
With this part, we have completed the scraping. What remains is just refining the data and creating a data frame from the list. Let's first check what we have scraped from the website. We are going to print the length and the content of the variable titles.
print(len(titles))
print(titles)
The scraped content
The length is 40, which is exactly the number of articles the website contains. So we are satisfied that the code has done what we expected from it, and the content confirms it. Here are a few starting lines from the output of the print command.
40 ['\n\t\t\t\t\tDeploy machine learning models: things you should know\t\t\t\t', '\n\t\t\t\t\tHow to create your first machine learning project: a comprehensive guide\t\t\t\t', '\n\t\t\t\t\tData exploration is now super easy with D-tale\t\t\t\t', '\n\t\t\t\t\tHow to set up your deep learning workstation: the most comprehensive guide\t\t\t\t', '\n\t\t\t\t\tWhy Ubuntu is the best for Deep Learning Framework?\t\t\t\t', '\n\t\t\t\t\tAn introduction to Keras: the most popular Deep Learning framework\t\t\t\t', '\n\t\t\t\t\tA detailed discussion on tensors, why it is so important in deep learning?\t\t\t\t', '\n\t\t\t\t\t ............................................................................................................... More lines
Creating data frame from the list
Web scraping in data science is incomplete unless we have a data frame of the content. So, here we create a data frame from the list titles. We name the column "Blog title". Some basic information and the first 10 rows of the data frame are displayed.
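The original code for this step is not reproduced here; a minimal sketch of creating the data frame from the list could look like this.

# A sketch: creating a data frame from the titles list
df1 = pd.DataFrame(titles, columns=["Blog title"])
df1.info()
df1.head(10)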
The web scraped data is now in a data frame, but there are some unwanted whitespace characters with the titles. Let's clean them up to have a more refined data set.
# Replacing unwanted whitespace characters in the Blog title column
df1["Blog title"] = df1["Blog title"].str.replace('\n\t\t\t\t\t', '')
df1["Blog title"] = df1["Blog title"].str.replace('\t\t\t\t', '')
df1.head(10)
So, here is the final data set containing the titles of all the articles the website contains. The first 10 article titles are displayed here. Congratulations!!! You have successfully completed a whole exercise of web scraping in data science. Below is a look at the final data frame.
The piece of code below creates a comma-separated file to store the data frame. You can open it in Excel for further analysis.
df1.to_csv('titles.csv')
Final words
So, we have gone through the main techniques of web scraping in data science. The article presents the basic logic behind scraping any kind of data from a particular web page and also from multiple pages.
These two are the most prevalent scenarios when we want to web scrape for data science. I have used the same website for both examples, but the logic is the same for any kind of source. Personally, I have found this logic very useful and have applied it in my own analyses quite a lot.
The article will help you in many ways to collect your data of interest and take informed decisions. As the use of the internet grows astronomically, businesses become more dependent on data. In this era data is the ultimate power, and you need to use it wisely to survive the competition. In this context, web scraping of data can give you a significant competitive advantage.
So, I am hopeful that the article will also help you in your own web scraping tasks for data science. It is a very useful as well as an interesting trick. Don't just read it: copy the code and run it on your machine to see the result. Apply the same to any other source and see how it works.
Finally, let me know if you find the article helpful. Post any questions you have in the comments. I will love to answer them 🙂
To deploy machine learning (ML) models means to take a machine learning model from development to production. You have built an ML model and validated and tested its performance. But what is its use if it is not utilised to solve real-world problems? Deploying a model means making an ML model production-ready. Here in this blog, we will discuss the steps of this process.
The deployment process takes a model out of the laboratory, or from the data scientist's desk, and puts it to appropriate use. There are lots of models across all sectors of industry, and most of them are never used. Every time I develop a model, the obvious question I face from my peers is, "how do you make people use the models you develop?"
Why do we need to deploy a model?
Actually, I think this is the primary question that should appear in the mind of data scientists even before model building. Maurits Kaptein of Tilburg University has rightly pointed out in his article that "the majority of the models trained … never make it to practice". This is so true. In his article, he illustrates how brilliant models developed in medical research die silently in the notebook, as they never reach the other health care professionals who could reap their benefit.
This is where the importance of model deployment lies. In agricultural terms, we sometimes call this the "Lab to Land" process: the technologies scientists develop in the lab should reach the land for practical implementation.
Now, the life cycle of ML model development differs from that of software development. In software development, requirement analysis of the client is an integral part. But data scientists are often less concerned about the model's implementation process and highly focused on building a satisfactory model with high precision.
Let's first discuss a machine learning model's life cycle to understand why the deployment process is so important.
Model development life cycle
A machine learning model development process has several steps, and the model needs to be kept up to date: it needs to be trained with fresh and relevant data in order to stay relevant. See the diagram below of a machine learning model life cycle. Notice that the last stage of development of an ML model involves iterative steps of updating the model with new data.
A model once developed and then forgotten is no longer relevant to the rapidly changing scenario. The target-feature relationship changes, features evolve and new features get added. So it is a continuous and dynamic process. Ideally, the model development and production teams should remain in continuous touch to maintain this update process.
This is also known as the end-to-end data science workflow. It includes data preparation, exploratory data analysis, data visualization, model training, testing, prediction and performance evaluation. Once the model performance is satisfactory, the model is ready for deployment.
What is model deployment?
Suppose a data scientist has developed a model on his computer using an interactive notebook. Now he wants to encapsulate it in such a way that its prediction or analysis capability can be straightaway utilised by the end-users. To do that, the data scientist can adopt a number of ways to deploy the project. Let's discuss them one by one.
Creating libraries and packages
This refers to the process of encapsulating your code in a library. The programming language can be any one of your choice. For an ML model created in R, Python, Scala, etc., a library encapsulates all the functionalities of the ML model, ready to be used by any other data scientist on their own data.
A library or package created to deploy a data science model needs to be updated at regular intervals. For this purpose, it should also maintain its version in a repository. This helps you keep track of versions and gives you the flexibility to use any particular library version.
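As a rough illustration (the module and class names below are purely hypothetical, not part of the original article), such a library could be as small as a single module that wraps the training and prediction logic so that others can reuse it on their own data.

# iris_model.py -- a hypothetical, minimal module encapsulating an ML model
from sklearn.tree import DecisionTreeClassifier

class SimpleClassifier:
    """Wraps training and prediction so other users can reuse the model on their own data."""
    def __init__(self):
        self.model = DecisionTreeClassifier()

    def train(self, X, y):
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)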
Hosted or static notebook
Using a Jupyter notebook is the most popular way of creating ML models. It is an interactive IDE which allows you to write code, visualize data and write text all in one place.
When you have finished with the model development part, you can host the same notebook on GitHub, Jupyter nbviewer or Anaconda Cloud, either as a static notebook or as a rendered notebook service. You only need to take care of the basics of deployment. Other nitty-gritty like security, scalability and compatibility issues are taken care of by the hosting service itself.
You can give version numbers to your notebook so that, as it gets updated with new data, tracking the version changes is possible. This form of deployment is very attractive for any business analyst ready with an updated business report.
It also enables end-users with no access to resources, either data or computing, to benefit from the data exploration and visualization. On the other hand, the trade-off is that, being a static report, it limits interaction and gives a poor real-time experience.
Use of REST APIs
This is another way of deploying your machine learning models. In this case, a data scientist, once done with the model building task, deploys it as a REST (Representational State Transfer) API. Other production engineers then provide the required layers of visualization, dashboards or web applications, and the end-users make use of the machine learning model through the REST API endpoints.
An ideal example of the use of such APIs is ML models built in the Python language. Python has a full and exhaustive set of modules which can take care of all the steps, from data handling and model building to model deployment.
It has data handling libraries like pandas and NumPy, model building libraries like scikit-learn, TensorFlow and Keras, and a vast range of frameworks like Django and Flask to deploy the model built. So learning a single language can make you self-sufficient from building a model to deploying it.
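As a minimal sketch of this idea (the file name model.pkl, the route and the input format are assumptions for illustration, not a prescribed design), a Flask endpoint serving a pickled model could look like this:

# A minimal Flask REST API sketch serving a previously trained, pickled model
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
model = pickle.load(open("model.pkl", "rb"))  # assumed: a model saved earlier with pickle

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # e.g. [5.1, 3.5, 1.4, 0.2]
    prediction = model.predict([features])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(port=5000)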
Interactive applications
This is also a popular form of deploying machine learning models. It provides the end-users an easy interactive interface to explore, analyze, try any of the ML algorithms and export the data for further analysis.
Interactive applications do have a server-side component, so they are not static like hosted or static notebooks. They allow users to interact with the data and give a real-time data handling experience.
For example, a Bokeh application is such a powerful interactive application. Users can play with the data using the many widgets provided in the interface, like sliders, drop-down menus, text fields etc.
Dashboards
This is a very popular production technique where the user can perform exploratory analysis and understand the deployed project. Here a large number of users can explore the results at the same time.
The Jupyter notebook is, as of now, the most preferred way to deploy ML projects as dashboards. Like the interactive notebook, the dashboard also has components for interactively designing the report layouts. You can arrange them in grid-like or report-like formats.
Issues in deploying machine learning models
So a model needs to get deployed as an interactive application. But many a time it has been observed that the deployment part takes months to become fully functional. The problem is that after such a gap the ML model becomes obsolete: the data it was trained with needs to be updated, as does the training process.
It becomes more of a problem when the data scientist hands the model over to the engineers involved in the deployment. Changes to the model then again require involving the data scientists, which is not always possible. Even after it has been deployed, the model needs to be updated from time to time. So, the development team and production team need to work in unison.
The gravity of the problem can be easily understood if we consider a practical application of machine learning models. Let's take the example where credit card companies use predictive modelling techniques to detect fraudulent credit card transactions.
Suppose we have developed an ML model which predicts the probability of a credit card transaction being fraudulent. Now the model needs to deliver the result the moment a credit card transaction happens, in real time.
If the model takes longer than 5 minutes, then what is the use of such a model? The credit card company needs to make a decision the moment a fraudulent transaction is taking place and flag it. Prediction accuracy is also of utmost importance: if it predicts fraud with an accuracy of less than 50%, then it is no more efficient than tossing a coin.
Serverless deployment
So what is the solution? How can a model be kept always up to date? A model which is based on old data and is not accurate has no industrial value. Serverless deployment can be a good solution to the issues mentioned above. Serverless deployment is like the next level of cloud computing.
All the intricacies of the deployment process are taken care of by the deployment platform. The platform completely separates the server and application parts in the serverless deployment process. Thus, data scientists can pay full attention to developing efficient machine learning models.
Here is a very good article on the serverless deployment of data science models. To apply the process successfully you need to have some knowledge of cloud computing, cloud functions and, obviously, machine learning.
Ways to deploy machine learning models
Suppose a product manager of a company has found a solution for a customer-centric problem in his business and it involves the use of machine learning. So, he contacts data scientists to develop the machine learning part of the total production process.
But a machine learning model life cycle and a software development life cycle differ. In most cases the developers of the model have no clue how the model can ultimately be taken to the production stage. So the product manager needs to clearly state his requirements to the whole team to meet the end goal.
Now, the deployment of a machine learning model majorly depends on the end-user type: how quickly the client needs the prediction result and what interface they need. Depending on these criteria, the product manager needs to decide what the final product should look like. Let's take a few examples of such real-world machine learning deployment cases.
Example 1: ML model result gets stored in database only
This is a situation where the client has some knowledge of SQL and can fetch the required data from the database. So here the production team only needs to store the ML output in a designated database and the task is complete.
The use of a lead scoring model can be a good example of such a situation. Lead scoring is a technique generally followed by marketing and sales companies. They are interested to know the market interest in their products, and there are different parameters which indicate the market readiness of a product.
A lead scoring model analyses all these parameters, like the number of visits to the product page, lead generation, checking the price, number of clicks etc., to predict the lead score. Finally, the lead score gets stored in a database and revised on a daily basis or as per the client's requirement.
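For illustration only (the table name, column names and scores below are made up), writing such output from Python into a database could look roughly like this:

# A sketch: storing model output in a database table (names and values are hypothetical)
import sqlite3
import pandas as pd

lead_scores = pd.DataFrame({"lead_id": [101, 102, 103],
                            "lead_score": [0.82, 0.35, 0.67]})
conn = sqlite3.connect("leads.db")
lead_scores.to_sql("lead_scores", conn, if_exists="replace", index=False)
conn.close()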
Example 2: the data needs to be displayed on the interface
Here the situation is that the marketing executive does not know SQL and is unable to fetch the required data from the database. In this case, the product manager needs to instruct his engineers to go one step further than the earlier example. They now need to display the data through a Customer Relationship Management (CRM) platform, which needs Extract-Transform-Load (ETL) operations to integrate the data from different sources.
Example 3: interactive interface
In this case the user interface is interactive. The ML model operates on the end-user's input and returns the required result. This can be a web application or a mobile app. For example, several decision support systems exist where users input their particular conditions and the application guides them with proper recommendations.
Mobile apps like Plantix (see the clip below from the developers) help users identify plant diseases accurately. The user needs to click pictures of the disease-affected part of the plant, and the image recognition algorithm of the app determines the disease from its stored image libraries. Additionally, the app guides the user on how to get rid of the problem.
Conclusion
Any good machine learning model that is not deployed has no practical use. And this is the most common case across the industry: efficient models are developed but they never see the light of day and remain forever in the notebook. This is mainly because of a lack of harmony between the development phase and the production phase, and some technical constraints like:
Portability of the model
A data science model developed on a local computer works fine until it changes its place of execution. The local computer environment is ideally set up for the model and the data. So, to make it deployable, either the new data has to reach the model or the model has to reach the data. From a portability point of view, the latter option is more feasible.
Data refinement
During model development, the data scientists procure data from multiple sources and preprocess it. Thus the raw data takes good shape and is in an ideal form to feed the ML algorithms. But as soon as the model goes to the production phase, it has to deal with the client's raw data without any filtering or processing. So the model's robustness is a concern for deployment.
The latency period
While the model is in the development phase, it has to deal with a huge data set. With model training, validation, testing and finally prediction, the time taken in this process is, quite obviously, long. But in production, the prediction process may receive only a few example cases and must deliver the prediction, so the expected latency is far less. Also, the industry's requirement is real-time prediction most of the time.
So, a data scientist needs to take all the above factors into account during model development. There are several approaches, like using containers, following good memory management practices and serverless deployment, which help overcome the technical constraints to a great extent. However, ML model development and deployment is a continuous process, and refinement and training of the model go on even after a successful deployment.
I have tried to present all the aspects of deploying machine learning models. It is a vast topic and a single article is not enough to discuss it in detail. Still, I have covered most of the important factors briefly here so that you can get a basic idea in one place.
In coming articles I will try to cover the details with in-depth articles, taking one aspect at a time. Please let me know your opinion about the article, and post any queries regarding the topic by commenting below. Also mention if there is any particular topic you want to read about next.
This article contains a brief discussion of Python functions. In any programming language, be it Python, R, Scala or anything else, functions play a very important role. Data science projects require some repetitive tasks to be performed every time, for filtering the raw data and during data preprocessing. In this case, functions are the best friend of a data scientist: they save us from doing the same task every time by simply calling the relevant function.
Functions, both built-in and user-defined, are a very basic yet critical component of any programming language, and Python is no exception. Here is a brief idea about them, so that you can start using the benefits they provide.
Why use Python for data science? Python is the most favourite language among data enthusiasts. One of the reasons is that Python is very easy to understand and code in compared to many other languages.
Besides, there are lots of third-party libraries which make data science tasks a lot easier. Libraries like pandas, NumPy, scikit-learn, Matplotlib and seaborn contain numerous modules catering to almost every kind of function you would wish to perform in data science. Libraries like TensorFlow and Keras are specially designed for deep learning applications.
If you are a beginner, or you have some basic idea about coding in other programming languages, this article will help you get into Python functions as well as creating new ones. I will discuss here some important Python functions, writing your own functions for repetitive tasks, and handling Pandas data structures, with easy examples.
Like other Python objects such as integers, strings and other data types, functions are also considered first-class citizens in Python. They can be dynamically created, destroyed, defined inside other functions, passed as arguments to other functions, returned as values etc.
Particularly in the field of data science, we need to perform several mathematical operations and pass calculated values on. So, the role of Python functions is very crucial in data science: performing a particular repetitive calculation, serving as a nested function, being used as an argument of another function etc.
So without much ado, let's jump into the details and some really interesting uses of functions with examples.
Use of Python functions for data science
Using functions is of utmost importance, not only in Python but in any programming language. Be it built-in functions or user-defined functions, you should have a clear idea of how to use them. Functions are very powerful in making your code well structured and reusable.
Some functions come with Python itself; we just need to call these built-in functions to perform the assigned tasks. Most of the basic tasks we need to do frequently in data operations are well covered by them. To start with, I will discuss some of these important built-in Python functions.
Built-in Python functions
Let's start with some important built-in functions of Python. These are already included and make your coding experience much smoother. The only condition is that you have to be aware of them and use them frequently. The first function we will discuss is help().
So take help()
Python functions take care of most of the tasks we want to perform through coding. But the common question that comes into any beginner's mind is: how will he or she know about all these functions? The answer is to take help.
The help function is there in Python to tell you every detail about any function you need to know in order to use it. You just need to mention the function inside help. See the example below.
# Using help
help(print)
Here I want to know about the print function, so I mentioned it within help. Now see how the help output describes everything you need to know to apply the function: the function header with the optional arguments you need to pass and their roles. It also contains a brief description, in plain English, of what the function does.
Interestingly, you can learn all about the help() function using the help function itself :). It is great to see the output; please type it to see for yourself.
# Using help() for help
help(help)
Again, help has produced all the necessary details about itself. It says that the help() function is actually a wrapper around pydoc.help that provides a helpful message when the user types "help" in the Python interactive prompt.
List() function
A list is a collection of objects of the same or different data types. It is used very frequently for storing data which is later used for operations in data science. See the code below to create a list with different data types.
# Defining the list item
list_example=["Python", 10, [1,2], {4,5,6}]
# Printing the data type
print(type(list_example))
# Printing the list
print(list_example)
# Using append function to add items
list_example.append(["new item added"])
print(list_example)
The above code creates a list containing a string, an integer, another list and a set. The type function prints the data type, and at last the append() function is used to add an extra item to the list. Let's see the output.
So, the data type is list. All the list items are printed, and an item is appended to the list with the append() function. Note this function, as it is very handy while performing data analysis. You can also create a complete list from scratch using only the append() function; see the example below.
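As the original example is not reproduced here, the snippet below is a simple sketch of that idea: an empty list filled step by step with append().

# Building a list from scratch using only append()
squares = []
for i in range(5):
    squares.append(i**2)
print(squares)  # prints [0, 1, 4, 9, 16]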
sorted() function
This is also an important function we need frequently while doing numeric computation. For example, a very basic use of sorted() is while calculating the median of a sample: to find the median, we need to sort the data first. By default, sorted() arranges the data in ascending order, but you can reverse this by using the reverse argument. See the example below.
# Example of sorted function
list_new=[5,2,1,6,7,4,9]
# Sorting and printing the list
print("Sorting in an ascending order:",sorted(list_new))
# Sorting the list in descending order and printing
print("Sorting in an descending order:",sorted(list_new,reverse=True))
And the output of the above code is as below:
round() function
This function is useful for giving you numbers with the desired number of decimal places. The required decimal places are passed as an argument, and this argument has some unique properties. See the example below and try to guess what the output will be; it is really interesting.
# Example of round() function
print(round(37234.154))
print(round(37234.154,2))
print(round(37234.154,1))
print(round(37234.154,-2))
print(round(37234.154,-3))
Can you guess the output? Notice that the second argument can be negative too! Let's see the output and then explain what the function does to a number.
When the round() function has no second argument, it simply rounds to the nearest integer, dropping the decimal digits. It keeps up to two decimals if the argument is 2 and one decimal when it is 1. Now, when the second argument is -2 or -3, it returns the closest multiple of 100 or 1000 respectively.
If you are wondering where on earth such a feature is useful, then let me tell you that there are occasions, like mentioning a big amount (money, distance, population etc.), where we don't need an exact figure; a rounded, close number can do the job. In such cases, to make the figure easier to remember, the round() function with a negative argument is used.
Now, there are a lot more built-in functions; we will touch on them in other articles. Here I have covered a few of them as examples. Let's move on to the next section on user-defined functions, which give you the freedom to create your own functions.
User defined functions
After built-in functions, here we will learn about user-defined functions. If you are learning Python as your first programming language, then I should tell you that functions in any programming language are the most effective as well as the most interesting part.
Any coder's expertise depends on how skilled he is in creating functions to automate repetitive tasks. Instead of writing code for the same tasks again and again, a skilled programmer writes functions for those tasks and just calls them when the need arises.
Below is an example of how you can create a function for adding two numbers.
# An example of a user-defined function
def add(x, y):
    ''' This is a function to add two numbers'''
    total = x + y
    print("The sum of x and y is:", total)
The above is an example of creating a function which adds two numbers and then prints the output. Let's call the function to add two numbers and see the result.
I have called the function, passed two numbers as arguments, and the user-defined function printed the result of adding them. Now, any time I need to add two numbers, I can just call this function instead of writing those few lines again and again.
Now, if we use help on this function, what will it return? Let's see.
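The call itself is just the following:

# Using help on our user-defined function
help(add)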
See, the help() function has returned the text I put within the triple-quoted string. It is called the docstring. A docstring allows us to describe the use of the function. It is very helpful, as complex programmes require a lot of user-defined functions. The function name should indicate its use, but many a time that may not be enough. In such cases, a brief docstring is very helpful to quickly remind you what the function does.
Optional arguments in user-defined function
Sometimes providing an optional argument along with a default value saves us writing additional lines. See the following example.
Can you guess the output of the following function calls? Just for fun, try it without looking at the output below. While trying, notice that the function has been called once with an optional argument.
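The original example is not reproduced in the text here, so the code below is only a sketch of the idea described: a function with a default argument, called once without an argument, once with the optional argument "python", and once again without it. The function name and the default value are assumptions.

# A sketch of a function with a default (optional) argument
def favourite(language="R"):
    print("My favourite language is", language)

favourite()          # prints the default value
favourite("python")  # the optional argument overrides the default
favourite()          # the default again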
Here is the output.
See, for the first call of the function, it has printed the default argument. But when we passed "python" as an optional argument, it overrode the default. Again, in the third case, without any optional argument, the default gets printed. You should try any other combinations that come to your mind; it is complete fun and it will also make the concept clear.
Nested functions
Nested functions are functions defined inside another function. This is also one of the very important Python basics for data science. Below is an example of a very simple nested function. Try it yourself to check the output.
# Example of nested functions
def outer_function(msg):
    # This is the outer function
    def inner_function():
        print(msg)
    # Calling the inner function
    inner_function()

# Calling the outer function
outer_function("Hello world")
Functions passed as argument of another function
Functions can also be passed as arguments to another function. It may sound a little confusing at first, but it is a really powerful property among the Python basics for data science. Let's take an example to discuss it. See the piece of code below to check this property.
# Calling functions within a function
def add(x):
    return 5 + x

def call(fn, arg):
    return fn(arg)

def call_twice(fn, arg):
    return fn(fn(arg))

print(
    call(add, 5),
    call_twice(add, 5),
    sep="\n"
)
Again, try to understand the logic and guess the output. Copy the code and make little changes to see the change, or the error, it produces. The output I got from this code is below.
Did you guess it right? Here we have created three functions, namely add(), call() and call_twice(), and then passed the add() function into the other two functions. The call() function has returned the add function applied to the argument 5, so the output is 10.
In a similar fashion, the call_twice() function has returned 15, because its return statement applies the function twice to the argument. I know it is confusing to some extent; that is because the logic has not come from a real purpose. When you create such functions to actually solve a problem, the concept will become clear. So, do some practice with the code given here.
This article will help you to start your first machine learning project. Machine learning projects are very important if you are serious about your career as a data scientist. You need to build your profile with a number of machine learning projects; these projects are evidence of your proficiency and skill in this field.
The projects do not necessarily have to be complex problems. They can be very basic, with simple problems. What is important is to complete them. Ideally, in the beginning, you should take a small project and finish it. It will boost your confidence, as you will have successfully completed it, and you will get to learn many new things.
So, to start with, I have selected a very basic problem: classification of the Iris data set. You can compare it with the very basic "Hello world" program that every programmer writes as a beginner. The data set is small, which makes it easy to load on your computer, and it consists of only a few features, so implementation of any ML algorithm is easier.
I have used Google Colab here to execute the Python code. You can try any IDE you generally use. Feel free to copy the code given here and execute it. The first step is to run the existing code without any error. Afterwards, make little changes to see how the output gets affected or gives errors. This is the most effective way to learn a new language as well as its application in machine learning.
The steps for first machine learning project
So, without much ado, let's jump to the project. You first need to chalk out the steps of implementing the project:
Importing the python libraries
Importing and loading the data set
Exploring the data set to have a preliminary idea about the variables
Identifying the target and feature variables and the independent-dependent relationship between them
Creating training and testing data set
Model building and fitting
Testing the data set
Checking model performance with comparison metrics
This is the ideal sequence in which you should proceed with the project. As you gain experience, you will not have to remember these steps. As this is the first machine learning project, I felt it necessary to mention them for further reference.
Importing the required libraries
# Importing required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np
About the data
The data comes from the famous Iris data set created by Dr R. A. Fisher and available, among other places, in the UCI machine learning repository. It contains three Iris species, viz. "Setosa", "Versicolor" and "Virginica", and four flower features, namely petal length, petal width, sepal length and sepal width in cm. Each species represents a class and has 50 samples in the data set, so the Iris data has 150 samples in total.
This is the most popular and basic data set used in pattern recognition to date. The copy bundled with scikit-learn, which we load below, is, as its description notes, the same as the Iris data found in R but a little different from the version in the UCI machine learning repository, which has two erroneous data points.
The following line of code will load the data set in your working environment.
# Loading the data set
dataset = load_iris()
The following code will generate a detailed description of the data set.
# Printing some data features
dataset.DESCR
Description of Iris data
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
Checking the data type
We can check the data type before proceeding for analytical steps. Use the following code for checking the data type:
# Checking the data type
print(type(dataset))
Now, here is a catch with the data type. Check the output below: it says the data is a scikit-learn Bunch object rather than the Pandas dataframe we are most used to. Also, the target and the features are stored separately here. You can print them separately using the following lines.
# Printing the components of Iris data
print(dataset.target_names)
print(dataset.target)
print(dataset.feature_names)
See the print output below. The target variable holds the three Iris species “Setosa”, “Versicolor” and “Virginica”, which are coded as 0, 1 and 2 respectively, and the feature names are stored as a separate list. The feature values themselves are stored separately as data. Here are the first few rows of the data.
# Printing the feature data
print(dataset.data)
Converting the data type
For the ease of the further modelling process, we need to convert the data from the scikit-learn Bunch to the more common Pandas dataframe. We also need to concatenate the separate data and target arrays, with the column names taken from feature_names plus a target column. The np.c_ object from NumPy concatenates the arrays column-wise.
# Converting the scikit-learn dataset to a pandas dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.c_[dataset['data'], dataset['target']],
                  columns=dataset['feature_names'] + ['target'])
df.head()
See below a few lines of the combined dataframe. With this new dataframe we are now ready to proceed to the next step. Check the shape of the newly created dataframe as I have done below; the output confirms that the dataframe is now complete with 150 samples and 5 columns.
# Printing the shape of the newly created dataframe
print(df.shape)
Creating target and feature variables
Next, we need to create variables storing the dependent and independent variables. Here the target variable, the Iris species, is dependent on the features, so the flower properties, i.e. petal width, petal length, sepal length and sepal width, are the independent variables. From the data set printed above, you can see that the first four columns hold the independent variables and the last one holds the dependent variable. So, in the lines of code below, the variable x stores the values of the first four columns and y stores the target variable.
# Creating target and feature variables
x=df.iloc[:,0:4].values
y=df.iloc[:,4].values
print(x.shape)
print(y.shape)
The shape of x and y is as below.
Splitting the data set
We need to split the data set before applying Machine learning algorithms. The train_test_split() function of sklearn has been used here to do the task. The test data size is set as 20% of the data.
# Splitting the data set into train and test set
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.2,random_state=0)
print(x_train.shape)
print(x_test.shape)
Accordingly, the train data set contains 120 sample data whereas the test data set has 30 sample data.
Application of Decision tree algorithm
So, we have finished the data processing steps and are ready to apply a Machine Learning algorithm. For this first machine learning project I have chosen a very popular classification algorithm, the Decision Tree algorithm.
If this algorithm is new to you, you can refer to this article to learn its details and how it can be applied with Python. The speciality of this ML algorithm is that its logic is very simple and the process is not a black box like most other ML algorithms, which means we can see and understand how the decision-making process works.
So let’s apply this ML model to the training set of Iris data. The DecisionTreeClassifier() of sklearn is the function used here, which we imported in the beginning.
# Application of Decision Tree classification algorithm
dt=DecisionTreeClassifier()
# Fitting the dt model
dt.fit(x_train, y_train)
The model is thus fitted on the training set. In the screenshot of my Colab notebook below you can see that the classifier has several parameters specifying the decision tree formation. At this stage you don’t need to bother about all these specifications; we can discuss each of them and their functions in another article.
Prediction using the trained model
To test the model we will first create a new data point. As this data has not been used in model building, the prediction will not be biased.
# Creating a new feature set i.e. a new flower properties
x_new = np.array([[4.9, 3.0, 1.4, 0.2]])
# Predicting for the new data using the trained model
prediction = dt.predict(x_new)
print("Prediction:",prediction)
See the prediction result using the trained Decision Tree classifier. It gives the result 0, which represents the Iris species “Setosa”; as discussed before, the Iris species are represented in the dataframe by the digits 0, 1 and 2.
Let’s now predict the results for the test set, the 20% of the data kept independent during model training. We will also use two metrics suggesting the goodness of fit of the model.
y_pred = dt.predict(x_test)
print("Predictions for the test set:",y_pred)
# Metrics for goodness of fit
print("np.mean: ",np.mean (y_pred == y_test))
print("dt.score:", dt.score(x_test, y_test))
And the output of the above piece of code is as below.
You can see that the testing accuracy score is 1.0! Such a perfect score hints at a problem: overfitting, which is very common with Decision Tree classification. Overfitting suggests that the model fits this particular data set too closely, which is not desirable, and ideally we should try other machine learning models to check their performance.
So, in the next section we will not take up a single ML algorithm; rather, we will take up a bunch of ML algorithms, test their performance side by side and choose the best performing one.
Application of more than one ML models simultaneously
Along with these ML models, another class of models I am going to introduce is known as ensemble models. The specialty of this approach is that an ensemble model uses more than one machine learning model at a time to achieve a more accurate estimation. See the figure below to understand the process.
There are two main kinds of ensemble models, Bagging and Boosting, and I have incorporated both kinds here to compare them with the other machine learning algorithms. Here is a brief idea about the Bagging and Boosting ensemble techniques.
Bagging
The name is short for Bootstrap Aggregation. It is essentially random sampling with replacement: once a sample unit is selected, it is placed back and remains available for future selections. This method works best with algorithms that tend to have high variance, like the decision tree algorithm.
Bagging trains each model separately on its own bootstrap sample and, for the final prediction, aggregates every model’s estimate without favouring any single model.
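As a quick illustration, here is a minimal sketch of bagging applied to decision trees with scikit-learn’s BaggingClassifier; the number of estimators and the random state are arbitrary choices for this example, and x_train, y_train, x_test, y_test are the splits created above.
# Minimal bagging sketch: many decision trees, each trained on a bootstrap sample
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bag.fit(x_train, y_train)
print("Bagging test accuracy:", bag.score(x_test, y_test))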
The other ensemble modelling technique is:
Boosting
As an ensemble learning method, boosting also combines a number of modelling algorithms for prediction. It works sequentially: higher weights are assigned to the observations the previous weak learners got wrong, so the weak learners gradually combine into a stronger one and the overall prediction improves.
The ensemble models we are going to use here are AdaBoostClassifier(), BaggingClassifier(), ExtraTreesClassifier(), GradientBoostingClassifier() and RandomForestClassifier(). All are from sklearn library.
Importing required libraries
# Importing libraries
from sklearn.model_selection import cross_val_score
from sklearn import ensemble
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
import seaborn as sns
Application of all the models
Use the following lines of code to build, train and execute all the models. The code also creates a dataframe named ml_compare, which stores all the comparison metrics calculated here.
# Application of all the ML algorithms at a time
ml = []
ml.append(('LDA', LinearDiscriminantAnalysis()))
ml.append(('DTC', DecisionTreeClassifier()))
ml.append(('GNB', GaussianNB()))
ml.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
ml.append(('SVM', SVC(gamma='auto')))
ml.append(('KNN', KNeighborsClassifier()))
ml.append(("Ensemble_AdaBoost", ensemble.AdaBoostClassifier()))
ml.append(("Ensemble_Bagging", ensemble.BaggingClassifier()))
ml.append(("Ensemble_Extratree", ensemble.ExtraTreesClassifier()))
ml.append(("Ensemble_GradientBoosting", ensemble.GradientBoostingClassifier()))
ml.append(("Ensemble_RandomForest", ensemble.RandomForestClassifier()))
ml_cols = []
ml_compare = pd.DataFrame(columns=ml_cols)
row_index = 0
# Model evaluation
for name, model in ml:
    model.fit(x_train, y_train)
    predicted = model.predict(x_test)
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, x_train, y_train, cv=kfold, scoring='accuracy')
    ml_compare.loc[row_index, 'Model used'] = name
    ml_compare.loc[row_index, "Cross Validation Score"] = round(cv_results.mean(), 4)
    ml_compare.loc[row_index, "Cross Value SD"] = round(cv_results.std(), 4)
    ml_compare.loc[row_index, 'Train Accuracy'] = round(model.score(x_train, y_train), 4)
    ml_compare.loc[row_index, "Test accuracy"] = round(model.score(x_test, y_test), 4)
    row_index += 1
ml_compare
All the models get trained with the training set and are then tested with the test data, and the goodness-of-fit statistics get stored in ml_compare. So, let’s see what ml_compare tells us. The output is as below.
Visual comparison of the models
Although the models can be compared from the table above, it is always easier if there is a way to visualize the difference. So, let’s create a bar chart using the training accuracy we have calculated above. Use the following lines of code to create the bar chart with the help of the matplotlib and seaborn libraries.
# Creating plot to show the train accuracy
plt.subplots(figsize=(13,5))
sns.barplot(x="Model used", y="Train Accuracy",data=ml_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('Model Train Accuracy Comparison')
plt.show()
As the above code executes, the following bar chart is created showing the training accuracy of all the ML algorithms.
The verdict
So, we have classified the Iris data using different types of machine learning and ensemble models, and the result shows that they are all more or less accurate in identifying the Iris species. But if we still need to pick one of them as the best, we can do that based on the comparative table as well as the graph above.
For this instance, Linear Discriminant Analysis and Support Vector Machine perform slightly better than the others, but this can vary with the size of the data, and ML scores do change between executions. Check your own result, see which one you find best, and let me know through the comments below.
Conclusion
So, congratulations, you have successfully completed your very first machine learning project with Python. You have used a popular and classic data set to apply several machine learning algorithms. Being a multiclass data set, it is an ideal example of a real-world classification problem.
To find out the best performing model, we applied six of the most popular Machine Learning algorithms along with several ensemble models. To start the model building process, the data set was first divided into training and testing sets.
The training set is used to build and train the model. The test set is an independent data set kept aside while building the model, to test the model’s performance. This is an empirical process of model validation when independent data collection is not possible. For this project, we have taken an 80:20 ratio for the train and test data sets.
Finally, a number of comparison metrics were used to find the model with the highest accuracy. These are essentially the typical steps of any machine learning project. As this is your first machine learning project, I have shown every step in detail; as you gain experience you may skip some of them as per your convenience.
So, please let me know your experience with the article. If you faced any problem while executing the code or have any other queries, post them in the comment section below; I will be happy to answer them.
This article introduces a really easy data exploration tool for Python. You just have to install and import this simple module; it integrates with whichever Python IDE you are using, and D-tale is ready with all its data exploration features and a very easy user interface.
Data exploration is a very basic yet very important step of data analysis. You need to understand the data and the relationships between the variables before you dive deep into advanced analysis. Basic data exploration techniques like visual interpretation, calculating summary statistics, identifying outliers, mathematical operations on variables etc. are very effective for gaining a quick idea about your data.
These data exploration steps are necessary for any data science project. Even in machine learning and deep learning projects we filter our data through these data exploration techniques, and they involve writing a few lines of Python code which are usually repetitive in nature.
This is a completely mechanical task, and writing reusable code helps a bit, but you still need to tweak the code every time a new data set is in use. Every time we write “dataset.head()” we wish there were a user interface to do these basic tasks; it would be a big time saver.
So here comes D-tale to rescue us. D-tale is actually a lightweight web client developed over the Pandas data structure. It provides an easy user interface to perform several data exploration tasks without the need to write any code.
What is D-tale?
D-tale is an open-source tool, born out of a SAS-to-Python conversion effort, for visualizing data held in a Pandas data frame. It encapsulates all the coding needed to run Pandas operations in the backend, so you don’t need to bother writing the same code repeatedly.
It started as a port of the earlier SAS Insight functionality, originally wrapped in a Perl script, and eventually evolved into D-tale. D-tale also integrates easily with Python terminals and IPython notebooks: you just need to install it in Python and then import it.
You can refer to this link for further knowledge about this tool; it is from the developers and also contains some useful resources. Here is a good video resource from the developer of D-tale, Andrew Schonfeld, from FlaskCon 2020.
I have been using it for some time and really like it. It has made some of my regular repetitive data exploration tasks very easy and saves a lot of my time.
Here I will discuss in detail how it can be installed and used, with screenshots taken from my computer while I installed it.
Installation
The installation part is also a breeze. Within seconds you can install it and start using it. Just open your Anaconda PowerShell Prompt from the Windows start menu. See the image below.
Now type the following command in the Anaconda PowerShell Prompt to install D-tale on your Windows machine.
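For reference, these are the usual install commands (either one is enough; the conda variant assumes the conda-forge channel, which is where D-tale is published):
# Install D-tale with conda (from the conda-forge channel)
conda install dtale -c conda-forge
# Or install D-tale with pip
pip install dtale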
Below is the screenshot of my computer’s Anaconda shell. Both the conda and pip commands have been executed. As you know, both of these commands function in a similar way; the only difference is that pip installs from the Python Package Index whereas conda installs packages from the Anaconda repository.
Now you are ready to use the D-tale. Open your Jupyter notebook and execute the following codes.
# To import Pandas
import pandas as pd
# To import D-tale
import dtale
Example data set
The example dataset I have used here for demonstration purposes has been downloaded from kaggle.com. The data, collected by the “National Institute of Diabetes and Digestive and Kidney Diseases”, contains vital parameters of diabetes patients belonging to the Pima Indian heritage.
Here is a glimpse of the first ten rows of the data set. I have imported the data set in CSV format using the usual pd.read_csv() command, and to show the table I use dtale.show().
The data set has several physiological parameters of diabetes patients as independent variables. The dependent variable indicates whether the patient is suffering from diabetes: 1 means the person has diabetes and 0 means they do not.
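A minimal sketch of these two steps; the file name 'diabetes.csv' is an assumption based on how the same data set is read later in this document, so adjust it to wherever you saved the download.
# Loading the diabetes data and opening it in D-tale
dataset = pd.read_csv('diabetes.csv')   # file name assumed; use your downloaded CSV
dtale.show(dataset)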
Data exploration with D-tale
Now you have the Jupyter notebook displaying the data. You can click the arrow button in the top left-hand corner to open all the data manipulation tools. As in the image below, the left pane has several options like describe, build column, correlations, charts etc.
Descriptive statistics for variables
This option describes variables by showing some descriptive or summary statistics. It does the same task as Pandas’ df.describe(), but D-tale gets you the same result without writing any code: just click “Describe” in the left panel.
In the image below you can see that the descriptive statistics of the variable “Pregnancies” are displayed along with a box-and-whisker plot. Select any other variable from the left menu and the summary statistics of that particular variable will be displayed.
Calculation of correlation among the variables
Here is an example of calculating the correlations among the variables. You can see that, just by clicking “Correlations”, D-tale has created a beautiful correlation table for all the variables. The depth of the colours is very useful for spotting correlated variables at a glance; here the darker shade indicates a higher correlation.
Preparing charts
Chart creation is a very basic yet very useful data exploration technique. Using D-tale you can create different types of charts like bar, line, scatter, pie, heatmap etc. Through its interface, D-tale does away with writing several lines of code. Below is an example of creating a scatter plot with this tool.
As you select the chart option from the left panel of D-tale, a new tab opens in the browser with the following options. You need to select the variables for the scatter plot: there are options to choose the X and Y variables, and you can also use the group-by option if there is a categorical variable.
If you wish, select any of the aggregation options available there, or simply go for the scatter option above. A scatter plot between the two variables will be displayed; below is the scatter plot with all the options for your reference.
The scatter plot comes with some tool options, as shown in the image above. These tools help you dig further into the plot’s details: you can investigate any particular point of interest with options like box select or lasso select, change the axes settings, see the data on hover, and so on.
Other very helpful options for the chart created here are shown in the figure: popping the chart up in another tab to compare it with another, a link you can simply copy and share, exporting the chart as static HTML which can be attached to an e-mail, exporting the data as CSV, and finally copying the Python code to make further customizations.
Highlighting the outliers
Another very good and useful feature of D-tale is highlighting the outliers of each variable in a single click. See the image below.
Creating a pie chart
Here is an example of a pie chart created with D-tale. The pie chart is a very popular format for showing the proportional distribution of different components. Creating a pie chart follows the same simple process: just choose the pie chart and then select the variables you want to display.
Bar plot
Another popular chart format is the bar plot. It reveals many important properties of the variables and the relations between them. For example, here I have created a bar plot of the mean blood pressure against the age of different individuals. It is a very effective way to see how blood pressure varies with a person’s age, which is otherwise not easily identifiable from the raw data.
Creating the bar plot works the same way and is very easy. Here too, different aggregation options are available; for example, I have chosen the mean to display the blood pressure along the Y axis.
Code export
This is a very useful option D-tale provides. You can get the code for the particular data exploration technique you used, and then make any desired change or simply study it to learn how to write the standard code yourself.
Here is the code snippet below used for creating the bar plot above.
Conclusion
This article presents a very helpful data exploration tool which can make your regular data analysis tasks much easier and quicker. It is a light application and uses the Pandas data manipulation libraries underneath.
Its simple and neat user interface integrates easily with any IDE you use. Data analysts need a quick idea about the data in hand so that they can plan their advanced analytical tasks, so D-tale can be a tool of choice for them, saving the considerable time required for writing regular repetitive lines of code.
I hope the article will be helpful. I have tried to provide as much information as possible so that you can install and apply it straight away. Do share your experience: how do you find it, is it helpful? Let me know your opinion and any further queries or doubts by commenting below.
Machine learning and data science are two major keywords of recent times on which almost all fields of science depend. If data science is inevitable for exploring the knowledge hidden in data, then machine learning is something bringing evolution through feature engineering. But the question is: are they very different? In this article, these two fields will be discussed point by point, covering where they differ and whether there are any similarities.
The Venn Diagram of sciences
I came across a good representation of how data science overlaps with the machine learning domain through a Venn diagram from this website; Drew Conway proposed this concept in 2010.
With this Venn diagram structure, the association between all these fields is pretty clear. The lowermost circle essentially indicates the domain knowledge of a particular field; for example, it may be agricultural crop production or population dynamics etc. A data scientist should know his particular domain besides having core knowledge of programming and statistics/mathematics.
Further, you can see that data science is common to all three domains, whereas machine learning lies in the intersection of statistics/mathematics knowledge and hacking skill. The major difference between the two lies here: data science, being a broader concept, requires special subject knowledge for the analysis, whereas machine learning is a more coding- and programming-oriented field.
Let’s dive into a detailed discussion of these differences…
Domain differences
To start with let’s be clear about their domains. Data science is a much bigger term. It comprises multiple disciplines like information technology, modelling and business management. Whereas machine learning is comparatively a specific terms common in data science where the algorithm learns from the data.
Unlike data science, machine learning is more practical than empirical. Data science has much more extensive theoretical base and amenable to mathematical analysis. machine learning, on the other hand, is mainly a computer program based needs coding skills.
As we have seen in the above Venn Diagram, data science and machine learning have common uses. Data science uses the tools of machine learning to study transactional data for useful prediction. Machine learning helps in pattern discovery from the data.
Machine learning is actually learning from the data. The historical data trains the machine learning algorithms to make an accurate prediction. Such a learning process is called supervised learning.
There are situations where no such training data is available; then there are machine learning algorithms which work without training. This type of machine learning is known as unsupervised machine learning. Obviously, the accuracy here is lower than with the supervised kind, but the situation it addresses is also different.
Another kind of machine learning is known as reinforcement learning. This one is the most advanced and a very popular kind of machine learning; here too the training data is absent and the algorithm learns from its own experience.
Deep learning is again a special field of machine learning. Let’s discuss it briefly too.
Deep learning is a subfield of machine learning, which is in turn a subfield of Artificial Intelligence. Deep learning deals with data just as machine learning does; the difference lies in the learning process. Scalability is also a point where these two approaches differ from each other.
Deep learning is an especially superior method when the data in hand is very vast, as it is very efficient at taking advantage of large data sets. Contrary to machine learning models or other conventional regression models, whose accuracy does not increase after a certain level, a deep learning algorithm goes on improving the model by training it with more and more data.
The deep learning process is a black-box method: we only see the inputs and the output, while what goes on in between and how the network works remains obscure.
The name deep learning actually refers to the hidden layers of the training process. The backpropagation algorithm takes the feedback signal from the output to adjust the weights used in the hidden layers and refines the output in the next cycle. This process goes on until we get a satisfactory model.
Data science
We can consider data science as a bridge between the traditional statistical and mathematical science and their application to solving real-world problems. The theoretical knowledge of basic sciences many times remains unused. Data science makes this knowledge applicable to solve practical problems.
More lightly, we can say that a data scientist must have more programming skill than most scientists and more statistical skill than a programmer. No surprise that the mere mention of data science in anyone’s CV makes them eligible for an enhanced pay package.
Since almost all organizations are generating data at an exponential rate, they need data scientists to extract meaningful insights from it. Moreover, after the explosion of internet users, the data generated online is enormous. Data science applies data modelling and data warehousing to keep track of this ever-growing data.
Necessary skills to be a data scientist
A data scientist needs to be proficient in both the theoretical concepts and programming languages like R and Python. Only a person with a good understanding of the underlying statistical concepts can develop a sound algorithm for their implementation.
But a data scientist’s job does not end there. Knowledge of these two core subjects is no doubt essential, but to become a successful data scientist a person must provide a complete business solution. When an organisation appoints data scientists, they are expected to analyse the data to gain insights about potential business opportunities and provide the roadmap.
So, a data scientist should also possess knowledge of the particular business domain and communication skills. Without effective communication and result interpretation, even a good analytical report may lead to a disappointing outcome. So none of these four pillars of success is less important than the others.
The four pillars of data science
I found a good representation of these four pillars of data science through a Venn diagram from this website, originally created by David Taylor, a biotechnologist, in his article “Battle of Data Science Venn Diagrams”.
These four different streams are considered the pillars of data science. But why so? Let’s take real-world examples to understand how data science plays an important role in our daily lives.
Example 1: online shopping
Think about your online shopping experience. Whenever you log in to your favourite online shopping platform you get deals on items you like, and the items are organized according to your interests. Have you ever wondered how on earth the website does that?
Every time you visit the online retail website, search for things of your interest or purchase something, you generate data. The website stores the historical data of your interactions with the shopping platform. If anyone with data science skills analyses that data properly, he may know your purchase behaviour even better than you do.
Example 2: Indian Railways
Indian Railways is the fourth-largest network in the world. Every day thousands of trains are operated, through which crores of passengers travel across the country, over a track length of more than 70,000 km.
Such a vast network generates a huge amount of data every day. The ticket booking system, train operation, biometrics, crew management, train schedules: in every aspect the data generated is big data. And the historical data is no less than a gold mine of information on Indian passengers’ travel trends over the years.
Application of data science to this big data reveals very important information, enabling the authority to take accurate decisions about which seasons see a rush of passengers and need additional trains, which routes are profitable, where to run special trains and many more.
So in a nutshell, the main tasks of data science are:
Filtering the required data from big data
Cleaning the raw data to make it amenable to analysis
Data visualization
Data analysis
Interpretation and valid conclusion
Differences
As we have discussed both of them at length, we now know that in spite of many similarities these two subjects have some differences in their application. So, now it’s time to point out the specific differences between machine learning and data science. Here they are:
Data science | Machine Learning
Based on extensive theoretical concepts of statistics and mathematics | Knowledge of computer programming and computer science fundamentals is essential
Generally performs various data operations | It is a subset of Artificial Intelligence
Gives emphasis to data visualization | Data evaluation and modelling are required for feature engineering
It extracts insights from the data by cleaning, visualizing and interpreting data | It learns from data and finds out the hidden patterns
Knowledge of programming languages like R, Python, SAS, Scala etc. is essential | Knowledge of probability and statistics is essential
A data scientist should have knowledge of machine learning | Requires in-depth programming skills
Popular tools used in data science are Tableau, Matlab, Apache Spark etc. | Popular tools used in machine learning are IBM Watson Studio, Microsoft Azure ML Studio etc.
Structured and unstructured data are the key ingredients | Here statistical models are the key players
It has applications in fraud detection, trend prediction, credit risk analysis etc. | Image classification, speech recognition and feature extraction are some popular applications
Difference between data science and machine learning
Conclusion
To conclude, I would like to summarize the whole discussion by saying that data science is a comparatively newer field of science and in great demand across organizations, mainly because of its immense power of providing insights by analyzing big data which otherwise has no meaning to the organisations.
On the other hand, machine learning is an approach which enables the computer to learn from the data. A data scientist should have knowledge of machine learning in order to unravel its full potential. So, they do have some overlapping parts and complementary skills.
I hope the article contains sufficient discussion to make you understand the similarities as well as the differences between machine learning and data science. If you have any questions or doubts, please comment below; I would be happy to answer them.
Comparing different machine learning models for a regression problem is necessary to find out which model is the most efficient and provides the most accurate result. There are many test criteria to compare the models. In this article, we will take a regression problem, fit different popular regression models and select the best one of them.
We have discussed how to compare different machine learning models when we have a classification problem in hand (the article is here), that is, when the response variable is a categorical one and different popular classification algorithms are compared to come out with the best algorithm.
So, what if the response variable is a continuous one and not categorical? This is then a regression problem and we have to use regression models to estimate the predicted values. In this case too there are several candidate regression models, and our task is to find the one which serves our purpose.
So, in this article, we are taking a regression problem of predicting the value of a continuous variable. We will fit several regression models and compare their performance by calculating the prediction accuracy and several goodness-of-fit statistics.
Here I have used the five most prominent and popular regression models and compared them according to their prediction accuracy. The supervised models used here are Multiple Linear Regression (MLR), Decision Tree regression, Random Forest regression, Support Vector Machine regression and a Deep Learning (ANN) model.
The models were compared using two very popular model comparison metrics namely Mean Absolute Error(MAE) and Mean Square Error (MSE). The expressions for these two metrics are as below:
Mean Absolute Error(MAE)
Comparing different machine learning models for a regression problem involves the important step of comparing the original and estimated values. If \(y_i\) is the observed value of the response variable and \(\hat{y}_i\) is its estimate, then the MAE is the error between these pairs of values, calculated as
\[ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \]
MAE is a scale-dependent metric that means it has the same unit as the original variables. So, this is not a very reliable statistic when comparing models applied to different series with different units. It measures the mean of the absolute error between the true and estimated values of the same variable.
Mean Square Error (MSE)
This metric of model comparison, as the name suggests, calculates the mean of the squares of the errors between the true and estimated values. So, the equation is as below:
\[ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \]
Python code for comparing the models
So, now the comparison between the different machine learning models will be conducted using Python. We will see the step-by-step application of all the models and how their performance can be compared.
Loading required libraries
All the required libraries are first loaded here.
import numpy as np # linear algebra
import pandas as pd # data processing
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn import metrics
from pandas import DataFrame,Series
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import matplotlib
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split,cross_val_score, cross_val_predict
import missingno as msno # plotting missing data
import seaborn as sns # plotting library
from sklearn import svm
The example data and its preprocessing
The data set used here is the car data set from Github and you can access the data file from this link. The data set has the following independent variables:
Age
Gender
Average miles driven per day
Personal debt and
Monthly income
Based on these independent variables we have to predict the potential sale value of a car. So, here the response variable is the sale value of the car and it is a continuous variable. That is why the problem in hand is a regression problem.
Importing the data
The piece of code below uses the pandas read_csv() function to import the data set into the workspace, and the describe() function gives a brief idea about the data.
We then display the last few rows of the data set to have a glimpse of the data and variables.
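A minimal sketch of this step; the file name 'cars.csv' is an assumption, so point read_csv() at wherever you saved the GitHub data file.
# Importing the car data set (file name assumed)
dataset = pd.read_csv('cars.csv')
print(dataset.describe())   # summary statistics for a quick overview
print(dataset.tail())       # last few rows of the data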
Check the data for missing values
The following code checks if there are any missing values in the data set. Missing values create problems in the analysis process, so we should filter them out in the data pre-processing stage. Here we will find out which columns contain missing values, and the corresponding rows will simply be dropped from the data set.
# Finding all the columns with NULL values
dataset.isna().sum()
# Drop the rows with missing values
dataset = dataset.dropna()
Creating basic plots with the data
Here we create the joint distribution plots of the independent variables; a sketch of one way to do this follows.
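A minimal sketch using seaborn's pairplot, which draws the pairwise joint distributions of the numeric columns (the exact plotting choices are assumptions):
# Pairwise joint distribution plots of the variables
sns.pairplot(dataset, diag_kind='kde')
plt.show()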
Data splitting is required to create training and testing data sets from the same car data. I have taken 80% of the whole data set as the training data and the remaining 20% as the test data set. The Python code below does this splitting.
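A minimal sketch of an 80:20 split that produces the train_dataset and test_dataset frames used in the following steps; splitting with DataFrame.sample() is an assumption consistent with those variable names.
# Splitting the car data into 80% training and 20% testing sets
train_dataset = dataset.sample(frac=0.8, random_state=0)   # 80% for training
test_dataset = dataset.drop(train_dataset.index)           # remaining 20% for testing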
First of all we will see the summary statistics of all the variables using the describe() function of the Pandas library.
# Calculating basic statistics with the train data
train_stats = train_dataset.describe()
train_stats.pop("sales") # excluding the dependent variable
train_stats = train_stats.transpose()
train_stats
From the stats below, we can see that different variables in the data set have very large ranges and deviations, which may create problems during model fitting. So, before we use these variables in the model building process, we will normalize them.
Creating a function for normalization
Using the mean and standard deviation of each of the variables, we will convert them into standard normal variates. For that purpose, we create the function below.
# Creating the normalizing function with mean and standard deviation
def norm(x):
return (x - train_stats['mean']) / train_stats['std']
normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)
Separating the response variable and creating other variables
Now a most important step: storing the response variable in a separate variable.
train_labels = train_dataset.pop("sales") # using .pop function to store only the dependent variable
test_labels = test_dataset.pop("sales")
x_train=normed_train_data
x_test=normed_test_data
y_train=train_labels
y_test=test_labels
As we are now finished with the data pre-processing stage, we will start with the modelling steps. So, let’s start coding all five models mentioned above to predict the car sale price.
First of all, Multiple Linear Regression (MLR). This is simple linear regression, except that we include all the independent variables to estimate the car sale price. The LinearRegression() function from the linear_model module of the sklearn library has been used here for the purpose.
lin_reg = LinearRegression()
lin_reg.fit(x_train,y_train)
#Prediction using test set
y_pred = lin_reg.predict(x_test)
mae=metrics.mean_absolute_error(y_test, y_pred)
mse=metrics.mean_squared_error(y_test, y_pred)
# Printing the metrics
print('R2 square:',metrics.r2_score(y_test, y_pred))
print('MAE: ', mae)
print('MSE: ', mse)
Here is the deep learning model mentioned above. A sequential model has been used, created inside a function named build_model so that we can call it whenever it is required in the process. The model has two fully connected hidden layers with a Rectified Linear Unit (relu) activation and an output layer with a linear activation.
The hidden layers have 12 and 8 neurons respectively, taking all the 8 input variables. Mean Squared Error is the loss function, as it is the most common loss function for regression problems.
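A minimal sketch of what build_model could look like under the description above; Keras from TensorFlow is assumed here, the optimizer choice is an assumption, and the input dimension is taken from the training data rather than hard-coded.
# A sketch of the sequential model described above
from tensorflow.keras import layers, models

def build_model():
    model = models.Sequential([
        layers.Dense(12, activation='relu', input_shape=[x_train.shape[1]]),  # first hidden layer
        layers.Dense(8, activation='relu'),                                   # second hidden layer
        layers.Dense(1)                                                       # linear output for regression
    ])
    model.compile(loss='mse', optimizer='adam', metrics=['mae', 'mse'])
    return model

model = build_model()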
This part of the code shows the summary of the model we built. All the specifications mentioned above are shown in the screenshot of the output below.
model.summary()
Training the model
We have used 10 rows of the training data set to check the model performance. As the result seems satisfactory so, we will proceed with the same model.
Plotting the MAE score during the training process
We are using 1000 epochs to train the model, which means there are 1000 forward and backward passes while the model is trained. We expect that with each pass the loss will decrease and the model’s prediction accuracy will increase as the training progresses.
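A minimal sketch of this training step and of plotting the MAE history; the validation split and the metric key names ('mae', 'val_mae') follow standard Keras conventions and are assumptions about the exact setup used here.
# Training for 1000 epochs and plotting the MAE recorded at each epoch
history = model.fit(x_train, y_train, epochs=1000, validation_split=0.2, verbose=0)

hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch

plt.plot(hist['epoch'], hist['mae'], label='Training MAE')
plt.plot(hist['epoch'], hist['val_mae'], label='Validation MAE')
plt.xlabel('Epoch')
plt.ylabel('Mean Absolute Error')
plt.legend()
plt.show()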
Here we have plotted the predicted sale prices against the true sale prices, and from the plot it is clear that the estimates are quite close to the original values.
Here we have plotted the errors. Although the distribution of the error is not truly Gaussian, as the sample size increases we can expect it to tend towards a Gaussian distribution.
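A minimal sketch of how these two diagnostic plots can be drawn from the test-set predictions; the variable names follow the ones used earlier and the histogram bin count is an arbitrary choice.
# Predicted vs true sale prices, and the distribution of prediction errors
test_predictions = model.predict(x_test).flatten()

plt.scatter(y_test, test_predictions)
plt.xlabel('True sale price')
plt.ylabel('Predicted sale price')
plt.show()

errors = test_predictions - y_test
plt.hist(errors, bins=25)
plt.xlabel('Prediction error')
plt.ylabel('Count')
plt.show()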
So, here we can compare the performance of all the models using the metrics calculated. Let’s see all the models used to predict the car sale price together along with the metrics for the ease of comparison.
Model type             | MAE  | R Square
MLR                    | 2821 | 0.80
Decision Tree          | 2211 | 0.84
Random Forest          | 1817 | 0.88
Support Vector Machine | 7232 | 0
Deep learning/ANN      | 2786 | 0.8
Comparison table for all the models used
From the table above, it is clear that for the present problem the best performing model is Random Forest, with the highest R square (coefficient of determination) and the least MAE. But we have to keep in mind that deep learning is also not far behind with respect to these metrics, and the beauty of deep learning is that its accuracy keeps increasing as the training sample size increases.
The other models, in contrast, reach a plateau in prediction accuracy after a certain phase; even increasing the training sample size cannot further improve their performance. So, although deep learning occupies the third position in the present situation, it has the potential to improve further if the availability of training data is not a constraint.
If the data set is small and we need a good prediction for the response variable, as is the case here, it is a good idea to go for models like Random Forest or Decision Tree, as they are capable of generating good predictions with less training or labelled data.
So, finally, it is the call of the researcher or modeller to select the best-suited model, judging the situation and the field of knowledge, since different fields of science generate experimental data of distinct natures and a very good model from one field may fail completely in another.
The Naive Bayes classifier is a very straightforward, easy and fast machine learning technique. It is one of the most popular supervised machine learning techniques for classifying data sets with high dimensionality. In this article, you will get a thorough idea about how this algorithm works, along with a step-by-step implementation with Python. Naive Bayes is actually a simplified application of Bayes’ theorem, so we will cover that too.
“Under Bayes’ theorem, no theory is perfect, Rather, it is a work in progress, always subject to further refinement and testing.” ~ Nate Silver
In real life, classification problems are everywhere. We take different decisions in our daily life by judging the probabilities of several factors, either consciously or unconsciously. When we need to analyse large data and take a decision on its basis, we need a tool. The Naive Bayes classifier is a simple and very fast supervised learning algorithm which is also accurate enough, so it can make our life far easier when taking vital decisions.
The concept of Bayes’ theorem
To understand the Naive Bayes classification concept we have to understand Bayes’ theorem first. Bayes’ theorem describes the relationship between the conditional probabilities of different events: it calculates the probability of a hypothesis given the information from an observed event.
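In symbols, if \(H\) is the hypothesis and \(E\) is the observed evidence, the theorem can be stated as the standard formula
\[ P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)} \]
where \(P(H)\) is the prior probability of the hypothesis and \(P(H \mid E)\) is its posterior probability after the evidence is observed.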
For example, we cricket lovers try to guess whether we will be able to play today depending on the weather variables; a banker tries to judge whether a customer is too risky to be given credit based on his financial transaction history; and a businessman tries to judge whether his newly launched product is going to be a hit or a flop depending on the customers’ buying behaviour.
Models dealing with such conditional probabilities are called generative models, because they actually specify the hypothetical random process that generates the data. But training such a generative model for each event is a really difficult task.
So how do we tackle this issue? Here comes the concept of the Naive Bayes classifier. The name “Naive” comes from the fact that it assumes some very simple things about the Bayes model: the presence of any feature in a class does not depend on any other feature. It simply overlooks the relationships between the features and considers that all the features contribute independently towards the target variable.
Let us take a small example data set where the feature variable is a test report with values “positive” and “negative”, and the binomial target variable is “sick” with values “yes” or “no”. Let us assume the data set has 20 cases of test results, as shown below:
Creating a frequency table of the attributes of the data set
So if we create the frequency table for the above data set it will look like this
With the help of this frequency table, we can now prepare the likelihood table to calculate prior and posterior probabilities. See the below figure.
With the help of this above table, we can now calculate what is the probability that a person is really suffering from a disease when his test report was also positive.
So the probability we want to compute is \(P(\text{sick}=\text{yes} \mid \text{test}=\text{positive})\), which by Bayes’ theorem equals \(P(\text{test}=\text{positive} \mid \text{sick}=\text{yes}) \times P(\text{sick}=\text{yes}) / P(\text{test}=\text{positive})\).
We have already calculated the probabilities, so, we can directly put the values in the above equation and get the probability we want to calculate.
In the same fashion, we can also calculate the probability of a person of not having a disease in spite of the test report being positive.
Application of Naive Bayes’ classification with python
Now the most interesting part of the article is here. We will implement Naive Bayes’ classification using python. To do that we will use the popular scikit-learn library and its functions.
About the data
We will take the same diabetes data we have used earlier in other classification problems.
The purpose of using the same data for all classification problems is to enable you to compare the different algorithms: you can judge each algorithm by its accuracy in classifying the same data.
So, here the target variable has two classes, i.e. whether the person has diabetes or not, and we have eight independent or feature variables influencing the target variable.
Importing required libraries
The first step in coding is to import all the libraries we are going to use. The basic libraries for any data science project are pandas, numpy, matplotlib etc.; their purpose is discussed in detail in the article on simple linear regression with Python.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.naive_bayes import GaussianNB
import seaborn as sns
About the data
The example dataset used here for demonstration purposes is from kaggle.com. The data, collected by the “National Institute of Diabetes and Digestive and Kidney Diseases”, contains vital parameters of diabetes patients belonging to the Pima Indian heritage.
Here is a glimpse of the first ten rows of the data set:
The data set has several physiological parameters of diabetes patients as independent variables. The dependent variable indicates whether the patient is suffering from diabetes: 1 means the person has diabetes and 0 means they do not.
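Before running the inspection commands below, the CSV needs to be read into a dataframe; a minimal sketch, with the file name 'diabetes.csv' matching the way the same data set is loaded later in this document:
# Loading the Pima Indians diabetes data
dataset = pd.read_csv('diabetes.csv')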
# Printing data details
dataset.info()               # a quick view of the data structure
print(dataset.head())        # printing the first few rows of the data
print(dataset.tail())        # the last few rows of the data
print(dataset.sample(10))    # a random sample of 10 rows from the data
print(dataset.describe())    # summary statistics of the data
print(pd.isnull(dataset).sum())  # checking for any null values in the data
Creating variables
As we can see, the data frame contains nine variables in nine columns. The first eight columns contain the independent variables: some physiological variables having a correlation with diabetes symptoms. The ninth column shows whether the patient is diabetic or not. So, here x stores the independent variables and y stores the dependent variable, the diabetes outcome.
x=dataset.iloc[:,: -1]
y=dataset.iloc[:,-1]
Splitting the data for training and testing
Here we will split the data set into training and testing sets with an 80:20 ratio. We will use the train_test_split function of the scikit-learn library. The test_size mentioned in the code decides what proportion of data will be kept aside to test the trained model. The test data will remain unused in the training process and will act as independent data during testing.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y, test_size=0.2, random_state=0)
Fitting the Naive Bayes’ model
Here we fit the model with the training set.
model=GaussianNB()
model.fit(x_train,y_train)
Using the Naive Bayes’ model for prediction
Now as the model has been fitted using the training set, we will use the test data to make prediction.
y_pred=model.predict(x_test)
Checking the accuracy of the fitted model
As we already have the observations corresponding to the test data set, we can compare them with the predictions to check how accurate the model’s predictions are. Scikit-learn’s metrics module has a function called accuracy_score which we will use here.
from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Conclusion
So, we have completed the whole process of applying Naive Bayes classification using Python, and we are now also through its basic concepts. It may be a little confusing at first, but as you solve more practical problems with this technique you will become more confident.
This particular classification technique is based on the Bayesian classification method. It gets the name “Naive” due to its oversimplification of the original Bayes theorem: the Naive Bayes classifier assumes that every pair of features is conditionally independent given the value of the target variable.
The Naive Bayes classifier can be a good choice for all types of classification problems, be they binomial or multinomial. Being an extremely fast and straightforward technique, it can help us take a quick decision. If the result of this classifier is accurate enough (which is the most common case) then it’s fine; otherwise we can always take the help of other classifiers like decision trees or random forests.
So, I hope this article will help you gain an in-depth knowledge about Naive Bayes’ theory and its application to solve real-world problems. In case of any doubt or queries please let me know through comments below.
Comparing Machine Learning Algorithms (MLAs) is important to come out with the best-suited algorithm for a particular problem. This post discusses comparing different machine learning algorithms and how we can do this using the scikit-learn package of Python. You will learn how to compare multiple MLAs at a time using more than one fit statistic provided by scikit-learn, and also how to create plots to visualize the differences.
Machine Learning Algorithms (MLAs) are very popular for solving different computational problems. Especially when the data set is huge and complex with no known parameters, MLAs are like a blessing to data scientists. The algorithms quickly analyze the data to learn the dependencies and relations between the variables and produce estimates with a lot more accuracy than conventional regression models.
The most common and frequently used machine learning models are supervised models. These models learn about the data from experience; it is like the labelled data acts as a teacher training them to be perfect. As the training data size increases, the model estimation gets more accurate.
Here are some recommended articles to know the basics of machine learning
Other types of MLAs are the unsupervised and semi-supervised ones, which are helpful when training data is not available and we still have to make some estimation. As these models are not trained with a labelled data set, they are naturally not as accurate as supervised ones, but they still have their own advantages.
All these MLAs are useful in different situations and for different data types, so selecting a particular MLA is essential for a good estimation. There are several parameters we need to compare to judge the best model, and after that the best model found needs to be tested on an independent data set for its performance. Visualization of the performance is also a good way to compare the models quickly.
So, here we will compare most of the MLAs using resampling methods like the cross-validation technique with the scikit-learn package of Python. Then model fit statistics like accuracy, precision and recall will be calculated for comparison. The ROC (Receiver Operating Characteristic) curve is also an easy-to-understand tool for MLA comparison, so finally all the ROC curves will be put into a single figure for the ease of model comparison.
Data set used
The same data set is used here for the application of all the MLAs. The example dataset used for demonstration purposes is from kaggle.com. The data, collected by the “National Institute of Diabetes and Digestive and Kidney Diseases”, contains vital parameters of diabetes patients belonging to the Pima Indian heritage.
Here is a glimpse of the first ten rows of the data set:
The data set has several physiological parameters of diabetes patients as independent variables. The dependent variable indicates whether the patient is suffering from diabetes: 1 means the person has diabetes and 0 means they do not.
Code for comparing different machine learning algorithms
Let’s jump to the coding part. It is going to be a somewhat lengthy piece of code and a lot of MLAs will be compared, so I have broken the complete code down into segments. You can directly copy and paste the code and make small changes to suit your data.
Importing required packages
The first part is to load all the packages needed in this comparison. Besides the basic packages like pandas, numpy and matplotlib, we will import some of the scikit-learn modules for the application of the MLAs and their comparison.
#Importing basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#Importing sklearn modules
from sklearn.metrics import mean_squared_error,confusion_matrix, precision_score, recall_score, auc,roc_curve
from sklearn import ensemble, linear_model, neighbors, svm, tree, neural_network
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn import svm,model_selection, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
Importing the data set and checking if there are any NULL values
This part of code will load the diabetes data set and check for any null values in the data frame.
#Loading the data and checking for missing values
dataset=pd.read_csv('diabetes.csv')
dataset.isnull().sum()
Checking the data set for any NULL values is very essential, as MLAs cannot handle NULL values. We have to either eliminate the records with NULL values or replace them with the mean/median of the other values. In the output we can see each variable printed with its number of null values; this data set has no null values, so all the counts are zero.
Storing the independent and dependent variables
As we can see, the data frame contains nine variables in nine columns. The first eight columns contain the independent variables: some physiological variables having a correlation with diabetes symptoms. The ninth column shows whether the patient is diabetic or not. So, here x stores the independent variables and y stores the dependent variable, the diabetes outcome.
# Creating variables for analysis
x=dataset.iloc[:,: -1]
y=dataset.iloc[:,-1]
Splitting the data set
Here the data set is divided into train and test sets. The test set size is 20% of the total records. The test data will not be used in model training and works as an independent test set.
# Splitting train and split data
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.2, random_state=0)
Storing machine learning algorithms (MLA) in a variable
Some very popular MLAs have been selected here for comparison and stored in a variable so that they can be used in a later part of the process. The MLAs we first take up for comparison are Logistic Regression, Linear Discriminant Analysis, K-nearest neighbour classifier, Decision tree classifier, Naive-Bayes classifier and Support Vector Machine, as shown in the sketch below.
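The code block that stores these models is reconstructed below as a minimal sketch; the (name, model) structure and the seed value are assumptions, chosen only so that the evaluation loop in the next segment runs as written, and you may of course tune the hyper-parameters of each classifier.
# Storing the selected MLAs as (name, model) pairs
seed = 7    # assumed seed value, reused by KFold below
models = []
models.append(('LR', linear_model.LogisticRegression(max_iter=1000)))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))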
This part of the code evaluates each model with 10-fold cross validation and then creates a box plot of the models' cross validation scores.
# evaluate each model in turn with 10-fold cross validation
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    # shuffle=True lets the fixed random_state take effect in KFold
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, x_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Comparison between different MLAs')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
The cross validation scores are printed below, and they clearly suggest Logistic Regression and Linear Discriminant Analysis as the two most accurate MLAs.
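The bar plots and the ROC loop below rely on a list of fitted models (MLA) and a comparison table (MLA_compare) whose construction is not shown above, so here is a reconstruction sketch. The column names match the plotting code that follows, but the exact way the original table was assembled is an assumption.
# Fitting each MLA and building the comparison table used by the plots below
MLA = [linear_model.LogisticRegression(max_iter=1000),
       LinearDiscriminantAnalysis(),
       KNeighborsClassifier(),
       DecisionTreeClassifier(),
       GaussianNB(),
       SVC()]
rows = []
for alg in MLA:
    alg.fit(x_train, y_train)
    y_pred = alg.predict(x_test)
    rows.append({'MLA used': alg.__class__.__name__,
                 'Train Accuracy': alg.score(x_train, y_train),
                 'Test Accuracy': alg.score(x_test, y_test),
                 'Precision': precision_score(y_test, y_pred),
                 'Recall': recall_score(y_test, y_pred)})
MLA_compare = pd.DataFrame(rows)
print(MLA_compare)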
# Creating plot to show the train accuracy
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Train Accuracy",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA Train Accuracy Comparison')
plt.show()
# Creating plot to show the test accuracy
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Test Accuracy",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('Accuracy of different machine learning models')
plt.show()
# Creating plots to compare precision of the MLAs
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Precision",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('Comparing different Machine Learning Models')
plt.show()
The Receiver Operating Characteristic (ROC) curve is a very important tool to diagnose the performance of MLAs by plotting the true positive rate against the false positive rate at different threshold levels. The area under the ROC curve, often called the AUC, is also a good measure of the predictive power of a machine learning algorithm: a higher AUC indicates more accurate prediction.
# Creating plot to show the ROC curve for all MLAs
for alg in MLA:
    # Fit on the training data and predict class labels for the test data
    # (hard class labels give a coarse ROC; probability scores would give a smoother curve)
    predicted = alg.fit(x_train, y_train).predict(x_test)
    fp, tp, th = roc_curve(y_test, predicted)
    roc_auc_mla = auc(fp, tp)
    MLA_name = alg.__class__.__name__
    plt.plot(fp, tp, lw=2, alpha=0.3, label='ROC %s (AUC = %0.2f)' % (MLA_name, roc_auc_mla))
plt.title('ROC Curve comparison')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.plot([0,1],[0,1],'r--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
Conclusion
This post presents a detailed discussion on how we can compare several machine learning algorithms at a time to find out the best one. The comparison task has been completed using different functions of the scikit-learn package of Python. We took the help of some popular fit statistics to draw a comparison between the models. Additionally, the Receiver Operating Characteristic (ROC) curve is also a good measure for comparing several MLAs.
I hope this guide will help you tackle the problem at hand and proceed with the best MLA chosen through a rigorous comparison. Please feel free to try the Python code given here: copy and paste it into your Python environment, run it and apply it to your own data. In case you face any problem while executing the comparison, write to me in the comments below.