What is web scraping in data science?
Web scraping, also known as web harvesting, screen scraping or web data extraction, is a way of collecting large amounts of data from the internet. In data science, especially in machine learning, the accuracy of a model depends largely on the amount of data you have. A large amount of data helps to train the model and make it more accurate.
Read this article to know the difference between data science and machine learning.
Across all business domains, data plays a crucial role in deciding strategies: competitor price monitoring, consumer sentiment analysis, extracting financial statements and so on. Whether they are small business owners or business tycoons, market data and analytics are something they always need to keep tabs on to survive cutthroat competition. Every single decision they take towards business expansion is driven by data.
Web-scraped data collected from diverse sources enables real-time analytics, i.e. the data gets analysed as soon as it becomes available. There are instances where a delayed analysis report is of no use. For example, stock price analysis needs to be real-time for trading. Customer Relationship Management (CRM) is another example of real-time data analytics.
Source of data
So, what is the source of such a large amount of data? Obviously, the internet. There is a lot of open-source data, as well as websites catering to specialised data. Generally, we visit such sites one at a time and search for the information we need. We submit our query and the required information is fetched from the server.
This process is fine until we need data for a data science project. The amount of data required for a satisfactory machine learning model is huge, and a single website can rarely provide enough of it.
Data science projects involve tasks like Natural Language Processing (NLP), image recognition etc., which have revolutionised the application of artificial intelligence to our day-to-day needs and even to critical, path-breaking scientific achievements. In these cases, web scraping is the tool data scientists reach for most often.
Web scraping in data science can be defined as the construction of a computer programme that automatically downloads, parses and organises data from the internet (source: https://www.kdnuggets.com).
Points to remember before you go for web scraping in data science
Now, before you scrape data from any website, you must double-check whether the site allows web scraping. If the website is open-source or categorically mentions that it provides data for private use, there is no issue. Otherwise, you can check the site's robots.txt file. Sometimes the site clearly mentions there whether it has issues with web scraping.
For example, see the robots.txt file of Facebook. You can check it by navigating to https://www.facebook.com/robots.txt. A few lines at the very beginning of the file categorically mention that “collection of data on Facebook through automated means is prohibited unless you have express written permission from Facebook”.
So, checking robots.txt is an effective way of finding out whether data scraping is allowed at all by the website you want to scrape.
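If you prefer to check this programmatically, a minimal sketch like the one below (using Python's standard urllib and the Facebook robots.txt mentioned above) fetches the file and prints its opening lines:

# Fetch a site's robots.txt and print its first few hundred characters (illustrative sketch)
from urllib.request import urlopen

robots = urlopen('https://www.facebook.com/robots.txt').read().decode('utf-8')
print(robots[:500])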
Web scraping can be accomplished either through web APIs or with tools like BeautifulSoup. BeautifulSoup is a class made specifically for web scraping and is available in the bs4 package. It is a very helpful package that saves programmers a lot of time: it helps to collect data from HTML and XML files.
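As a first taste, the basic pattern looks like the sketch below; the URL is only a placeholder, and the snippet uses the requests library mentioned here, while the walkthrough that follows uses urllib instead:

# A minimal web scraping pattern with requests and BeautifulSoup (illustrative sketch)
import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')        # download the page
soup = BeautifulSoup(response.text, 'html.parser')    # parse the HTML
print(soup.title.text)                                # print the page title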
Now let's work through a basic web scraping example using the BeautifulSoup class from the bs4 package in Python.
A practical example of data scraping
Let's take a practical example where we scrape a data table from a web page. We will take the URL of this page itself and try to scrape the table below. It is an imaginary example table containing the name, gender, age, height and weight of a group of random persons.
Name | Gender | Age | Height | Weight |
Ramesh | Male | 18 | 5.6 | 59 |
Dinesh | Male | 23 | 5.0 | 55 |
Sam | Male | 22 | 5.5 | 54 |
Dipak | Male | 15 | 4.5 | 49 |
Rahul | Male | 18 | 5.9 | 60 |
Rohit | Male | 20 | 6.0 | 69 |
Debesh | Male | 25 | 6.1 | 70 |
Deb | Male | 21 | 5.9 | 56 |
Debarati | Female | 29 | 5.4 | 54 |
Dipankar | Male | 22 | 5.7 | 56 |
Smita | Female | 25 | 5.5 | 60 |
Dilip | Male | 30 | 4.9 | 49 |
Amit | Male | 14 | 4.8 | 43 |
Mukesh | Male | 26 | 5.1 | 50 |
Aasha | Female | 27 | 4.7 | 51 |
Dibakar | Male | 22 | 5.3 | 55 |
Manoj | Male | 33 | 6.2 | 75 |
Vinod | Male | 27 | 5.2 | 54 |
Jyoti | Female | 22 | 5.9 | 65 |
Suppose we want to use this table in our data science project. How can we bring the data into a usable format? This table is just an example; usually you will find tables with thousands of rows, spread over a number of web pages. But the process of scraping the data remains the same.
Let's try to scrape this small table using the bs4 library of Python. The name stands for BeautifulSoup version 4; the BeautifulSoup class defines the basic interface called by the tree builders.
Importing required libraries
The two special libraries we will need here are BeautifulSoup and requests for scraping information and grabbing the URL content.
# Importing required libraries
import requests
from bs4 import BeautifulSoup as bs # defines the basic interface called by the tree builders
In this section we import other basic libraries such as pandas, numpy, matplotlib and seaborn.
# Importing other important libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Accessing the web pages and scraping the content
To open this particular page, I used the urlopen function from the urllib.request module and then passed the HTML content to the BeautifulSoup class. The 'lxml' argument simply tells BeautifulSoup which parser to use to interpret the HTML.
# Opening the particular url using the urlopen function
from urllib.request import urlopen
url='https://dibyendudeb.com/what-is-web-scraping-and-why-it-is-so-important-in-data-science/'
html= urlopen(url)
soup=bs(html,'lxml')
Now, BeautifulSoup has a very useful function called find_all to find all the HTML content with a particular tag. You can explore these tags from the Inspect option when you right-click on any part of a web page.
See the images below to understand how you can identify the HTML tag of any specific piece of web page content.
The records of this table sit inside tr and td tags, which clearly indicates that we need to apply the find_all function with these two tags.
So let's apply find_all to get all the values with these two tags from this web page and then create string objects from them. The resulting content is stored in a list.
Creating list of web scraped content
# Find all the rows (tr tags) of the table
records=soup.find_all('tr')

# Creating a list with the text of each row
text_list=[]
for row in records:
    row_store=row.find_all('td')                  # all td cells of the current row
    text_store=str(row_store)                     # creating a string object from the list of cells
    onlytext=bs(text_store,'lxml').get_text()     # parsing again and keeping only the text with get_text()
    text_list.append(onlytext)
In the next step, we create a data frame from this list to make the data ready for further analysis. Printing the data frame and the raw records shows the contents of the table along with the HTML tags we mentioned in the code.
df=pd.DataFrame(text_list)
df.head(10)
print(records)
[<tr><td><strong>Name</strong></td><td><strong>Gender</strong></td><td><strong>Age</strong></td><td><strong>Height</strong></td><td><strong>Weight</strong></td></tr>, <tr><td>Ramesh</td><td>Male</td><td>18</td><td>5.6</td><td>59</td></tr>, <tr><td>Dinesh</td><td>Male</td><td>23</td><td>5.0</td><td>55</td></tr>, <tr><td>Sam</td><td>Male</td><td>22</td><td>5.5</td><td>54</td></tr>, <tr><td>Dipak</td><td>Male</td><td>15</td><td>4.5</td><td>49</td></tr>, <tr><td>Rahul</td><td>Male</td><td>18</td><td>5.9</td><td>60</td></tr>, <tr><td>Rohit</td><td>Male</td><td>20</td><td>6.0</td><td>69</td></tr>, <tr><td>Debesh</td><td>Male</td><td>25</td><td>6.1</td><td>70</td></tr>, <tr><td>Deb</td><td>Male</td><td>21</td><td>5.9</td><td>56</td></tr>, <tr><td>Debarati</td><td>Female</td><td>29</td><td>5.4</td><td>54</td></tr>, <tr><td>Dipankar</td><td>Male</td><td>22</td><td>5.7</td><td>56</td></tr>, <tr><td>Smita</td><td>Female</td><td>25</td><td>5.5</td><td>60</td></tr>, <tr><td>Dilip</td><td>Male</td><td>30</td><td>4.9</td><td>49</td></tr>, <tr><td>Amit</td><td>Male</td><td>14</td><td>4.8</td><td>43</td></tr>, <tr><td>Mukesh</td><td>Male</td><td>26</td><td>5.1</td><td>50</td></tr>, <tr><td>Aasha</td><td>Female</td><td>27</td><td>4.7</td><td>51</td></tr>, <tr><td>Dibakar</td><td>Male</td><td>22</td><td>5.3</td><td>55</td></tr>, <tr><td>Manoj</td><td>Male</td><td>33</td><td>6.2</td><td>75</td></tr>, <tr><td>Vinod</td><td>Male</td><td>27</td><td>5.2</td><td>54</td></tr>, <tr><td>Jyoti</td><td>Female</td><td>22</td><td>5.9</td><td>65</td></tr>]
But this data needs to be split into separate columns according to the comma-separated values. The following line of code creates a properly shaped data structure with multiple columns.
df1 = df[0].str.split(',', expand=True)
df1.head(10)
Data refinement
Some more refinement is needed here. You can notice some unwanted brackets in the records. The following code fixes these issues.
# Removing the opening bracket from the column 0
df1[0] = df1[0].str.strip('[')
# Removing the closing bracket from column 4
df1[4] = df1[4].str.strip(']')
df1.head(10)
Creating table header
The table currently has plain digits as column headers, which needs to be corrected: we want the header values as the column names. The following few sections first extract the header values into a separate data frame and then concatenate it with the data, step by step creating the final data frame with the desired column names.
# Storing the table headers in a variable
headers = soup.find_all('strong')
# Using BeautifulSoup again to arrange the header tags
header_list = []  # creating a list of the header values
col_headers = str(headers)
header_only = bs(col_headers, "lxml").get_text()
header_list.append(header_only)
print(header_list)
df2 = pd.DataFrame(header_list)
df2.head()
df3 = df2[0].str.split(',', expand=True)
df3.head()
concatenate = [df3, df1]
df4 = pd.concat(concatenate)
df4.head(10)
After the above step, we have an almost complete table, though it needs some more refinement. Let's start with the first step: we want the header values as the column names of the table, so we rename the columns of the data frame accordingly.
# Assigning the first row as table header
df4=df4.rename(columns=df4.iloc[0])
df4.head(10)
You can see that the table header has been replicated as the first record in the table, so we need to correct this. Let's drop the repeated row from the data frame.
# Dropping the repeated row from the data frame
df5 = df4.drop(df4.index[0])
df5.head()
Now that we have almost the final table, let's explore some basic information about the data in our hands.
df5.info()
df5.shape
Check for missing value
Although the table on this web page does not have any missing values, it is still good practice to check for them and eliminate any row with a missing value. Here the dropna function does the trick for us.
# Eliminating rows with any missing value
df5 = df5.dropna(axis=0, how='any')
df5.head()
If you print the columns separately, you can notice some unwanted spaces and brackets in the column names. Let's get rid of them and we are done refining the data frame. The blank spaces in the column names become obvious when we print them.
df5.columns
These white spaces may cause problems when we refer to the columns during analysis. So, we remove them with the help of the following code.
# Some more data refinement to clean up the column names
df5.rename(columns={'[Name': 'Name'},inplace=True)
df5.rename(columns={' Weight]': 'Weight'},inplace=True)
df5.rename(columns={' Height': 'Height'},inplace=True)
df5.rename(columns={' Age': 'Age'},inplace=True)
df5.rename(columns={' Gender': 'Gender'},inplace=True)
print(df5.head())
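As an aside, the same cleanup can be done in a single line by stripping stray spaces and brackets from every column name at once (a small shortcut, not part of the original walkthrough):

# Strip leading/trailing spaces and brackets from all column names in one go
df5.columns = df5.columns.str.strip(' []')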
So, here is the final table with the required information.
Exploring the web scraped data
Here we explore basic statistics about the data. The data has two main variables, “Weight” and “Height”; let's get their descriptions.
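One caveat before we start (a hedged note, not part of the original walkthrough): everything scraped from HTML arrives as text, so the Age, Height and Weight columns are strings at this point. For the numeric summaries and plots below to behave as expected, you may first need to convert them, for example:

# Convert the scraped text columns to proper numbers (illustrative; column names as created above)
for col in ['Age', 'Height', 'Weight']:
    df5[col] = pd.to_numeric(df5[col], errors='coerce')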
#descriptive statistics summary
df5['Weight'].describe()
#descriptive statistics summary
df5['Height'].describe()
A histogram is also a good data exploration technique for describing the distribution of a variable. We can check whether the data is roughly normal or deviates from normality.
#histogram
sns.distplot(df5['Height']);
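Note that distplot is deprecated in recent seaborn releases; if your version warns about it, sns.histplot is the closest replacement, for example:

# Histogram with the newer seaborn API (equivalent plot, assuming Height is numeric)
sns.histplot(df5['Height'], kde=True)
plt.show()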
Next, the relationship between the two main variables: we plot a scatter diagram of height against weight to see how they are correlated.
# Relationship between height and weight using a scatterplot technique
df5.plot.scatter(x='Height', y='Weight')
Thus we have completed web scraping a table from a web page. The technique demonstrated above is applicable to all similar data scraping tasks, whatever the size of the data. The data is stored in a pandas data frame and can be used for any kind of analysis.
Web scraping in data science from multiple web pages
Now for a more complex, and rather more practical, example of web scraping. Many times the information we are looking for is scattered over more than one page. In this case, some additional skill is required to scrape data from these pages.
Here we take such an example from this website itself and try to scrape the titles of all the articles it contains. This is only one parameter that we want to collect, but the information is spread over multiple pages.
Taking one parameter keeps the code less complex and easy to understand, but it is equally instructive, as the process of scraping one parameter or several parameters is the same.
The article index at https://dibyendudeb.com spans a total of five pages listing all the articles the website contains. So, we will navigate through these index pages, grab the article titles in a for loop and scrape the titles using BeautifulSoup.
Importing libraries
To start coding, the first step, as usual, is importing the required libraries. Besides the regular libraries like pandas, NumPy, matplotlib and seaborn, we need the specialised web scraping libraries: BeautifulSoup for parsing the pages and requests (or urllib, which we use below) for fetching their content.
# Importing required libraries
from bs4 import BeautifulSoup as bs # defines the basic interface called by the tree builders
from requests import get
# Importing other important libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Use of BeautifulSoup for web scraping in data science
Here we get to the main scraping part. The most important task is to identify the HTML tag that wraps the content we want. This is a somewhat experience-based skill: the more HTML you scan, the more comfortable you become at identifying the right tags.
As mentioned before, we want to store the article titles only, so our goal is to identify the particular HTML tag that wraps the title of an article. Take a look at the home page of this website, https://dibyendudeb.com. Select any of the article titles, right-click and choose Inspect from the menu. You will get a view like the one below.
Here you can see that the exact HTML of the selected part of the page gets highlighted, which makes it very easy to identify the particular HTML tag we need to find in the scraped content. Let's scan this HTML more closely to understand its structure.
Scanning the HTML code for web scraping
The above image presents the nested HTML code around the title section. You can identify the particular div class containing the title of the article. So, we now know which HTML tag, and which unique class attribute, we need to find in the scraped HTML content.
The lines of code below grab the URLs of all the index pages and store the required information in raw form. We will then go through this code line by line and understand what each piece does.
# Opening each index page using the urlopen function
from urllib.request import urlopen

titles = []
pages = np.arange(1, 5, 1)   # page numbers to loop over (the stop value of arange is excluded)

for page in pages:
    # Build and open the URL of the current index page
    page = "https://dibyendudeb.com/page/" + str(page)
    html = urlopen(page)
    soup = bs(html, 'html.parser')
    # Find every div that wraps an article title
    topics = soup.find_all('div', attrs={'class': 'obfx-grid-col-content'})
    for container in topics:
        title = container.h1.a.text
        titles.append(title)
Creating an empty list
titles=[] declares an empty list to store the titles. Then a variable called pages holds the page numbers to loop over. np.arange takes a start value, a stop value and a step; note that the stop value itself is excluded, so np.arange(1, 5, 1) generates 1, 2, 3 and 4 (to include a fifth index page you would write np.arange(1, 6, 1)).
Use of for loop
After that, we enter a for loop which iterates through the pages to fetch the content of each one. While storing the URLs in the page variable, we need to pass a value that changes from one page to the next.
Scanning multiple webpages
If we check the URLs of the different pages of the website, we can notice that the page number is appended to the URL. For example, the URL of page 2 is https://dibyendudeb.com/page/2/. Likewise, pages 3 and 4 get their page numbers appended to the URL.
Accessing the URL
So, we simply concatenate the page number with the URL inside the for loop, which is done with the code "https://dibyendudeb.com/page/" + str(page). The page number has to be converted into a string because it becomes part of the URL.
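As a quick sanity check (a small illustrative snippet, not part of the original code), you can print the URLs the loop will visit:

# Preview the index-page URLs generated inside the loop
for p in np.arange(1, 5, 1):
    print("https://dibyendudeb.com/page/" + str(p))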
html = urlopen(page): this stores the contents of the current page in the variable html. Next, soup = bs(html, 'html.parser') parses the HTML content with BeautifulSoup and stores the result in the variable soup.
Now that we have the parsed HTML content of all the pages, we need to pick out only the article titles. The process of finding the title tag has already been explained above.
Use of find_all and nested for loop to populate the list
topics = soup.find_all('div', attrs={'class': 'obfx-grid-col-content'}): here the HTML elements carrying the required div class are stored in the variable topics.
for container in topics: this is a nested for loop that scans through each element stored in topics. Inside it, title=container.h1.a.text extracts the text of the article title into the variable title, and finally titles.append(title) appends that title to the titles list.
With this part, the scraping is complete. What remains is refining the data and creating a data frame from the list. Let's first check what we have scraped from the website by printing the length and the content of the variable titles.
print(len(titles))
print(titles)
The scraped content
The length is 40, which is exactly the number of articles the website contains, so we can be satisfied that the code has done what we expected. The content also confirms it. Here are a few starting lines from the output of the print command.
40 ['\n\t\t\t\t\tDeploy machine learning models: things you should know\t\t\t\t', '\n\t\t\t\t\tHow to create your first machine learning project: a comprehensive guide\t\t\t\t', '\n\t\t\t\t\tData exploration is now super easy with D-tale\t\t\t\t', '\n\t\t\t\t\tHow to set up your deep learning workstation: the most comprehensive guide\t\t\t\t', '\n\t\t\t\t\tWhy Ubuntu is the best for Deep Learning Framework?\t\t\t\t', '\n\t\t\t\t\tAn introduction to Keras: the most popular Deep Learning framework\t\t\t\t', '\n\t\t\t\t\tA detailed discussion on tensors, why it is so important in deep learning?\t\t\t\t', '\n\t\t\t\t\t ............................................................................................................... More lines
Creating data frame from the list
Web scraping in data science is incomplete unless we have a data frame of the content. So, here we create a data frame from the list titles and name the column “Blog title”. Some basic information and the first 10 rows of the data frame are displayed.
import pandas as pd
df1 = pd.DataFrame({'Blog title': titles})
print(df1.info())
df1.head(10)
The web-scraped data is now in a data frame, but the titles still carry some unwanted whitespace characters. Let's clean them up to have a more refined data set.
# Removing the unwanted newline and tab characters from the blog titles
df1["Blog title"] = df1["Blog title"].str.replace('\n\t\t\t\t\t', '')
df1["Blog title"] = df1["Blog title"].str.replace('\t\t\t\t', '')
df1.head(10)
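A more general alternative (a small aside) is pandas' str.strip(), which removes any leading and trailing whitespace regardless of the exact tab and newline pattern:

# Strip any leading/trailing whitespace from the titles in one step
df1["Blog title"] = df1["Blog title"].str.strip()
df1.head(10)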
So, here is the final data set containing the titles of all the articles the website contains; the first 10 titles are displayed here. Congratulations!!! You have successfully completed a whole exercise of web scraping in data science. Below is a look at the final data frame.
The piece of code below writes the data frame to a comma-separated file. You can open it in Excel for further analysis.
df1.to_csv('titles.csv')
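If you do not want the data frame index stored as an extra column in the file, you can pass index=False:

# Write the CSV without the data frame index
df1.to_csv('titles.csv', index=False)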
Final words
So, we have gone through the essential techniques of web scraping in data science. The article presents the basic logic behind scraping data from a single web page as well as from multiple pages.
These two are the most prevalent scenarios when we web scrape for data science. I have used the same website for both examples, but the logic is the same for any kind of source. Personally, I have found this logic very useful and have applied it in my analyses quite a lot.
The article should help you in many ways to collect the data you are interested in and take informed decisions. As the use of the internet grows astronomically, businesses become more dependent on data. In this era, data is the ultimate power, and you need to use it wisely to survive the competition; in this context, web scraping can give you a significant competitive advantage.
So, I am hopeful that this article will also help you in your own web scraping tasks. It is a very useful as well as an interesting skill. Don't just read it: copy the code, run it on your machine and see the result. Then apply the same approach to any other source and see how it works.
Finally, let me know if you find the article helpful. Post any questions in the comments; I will love to answer them 🙂