Python functions for data science: a quick brush up

Python functions

This article contains a brief discussion of Python functions. In any programming language, be it Python, R, Scala or anything else, functions play a very important role. Data science projects involve repetitive tasks, such as filtering raw data and preprocessing it, that have to be performed again and again. Here functions are a data scientist's best friend: instead of rewriting the same code every time, you simply call the relevant function.

Functions, both built-in and user-defined, are a very basic yet critical component of any programming language, and Python is no exception. Here is a brief introduction to them, so that you can start using the benefits they provide.

Why use Python for data science? Python is the favourite language among data enthusiasts. One of the reasons is that Python is very easy to understand and code with compared to most other languages.

Besides, there are lots of third-party libraries which make data science tasks a lot easier. Libraries like Pandas, NumPy, Scikit-Learn, Matplotlib and Seaborn contain numerous modules catering to almost every function you may wish to perform in data science. Libraries like TensorFlow and Keras are specially designed for deep learning applications.

Please read these articles about the use of Python in Machine Learning and Deep Learning to know more about the use of Python in data science.

If you are a beginner, or you have some basic idea of coding in other programming languages, this article will help you get into Python functions as well as create your own. I will discuss some important Python functions, writing your own functions for repetitive tasks, and handling Pandas data structures, with easy examples.

Like integers, strings and other data types, functions are also first-class citizens in Python. They can be dynamically created and destroyed, defined inside other functions, passed as arguments to other functions, returned as values and so on.

In data science in particular, we need to perform many mathematical operations and pass calculated values onward. So Python functions are crucial for performing repetitive calculations, for building nested functions, for being passed as arguments to other functions and more.

So without much ado, let's jump into the details and some really interesting uses of functions with examples.

Use of Python functions for data science

Using functions is of utmost importance not only in Python but in any programming language. Be it a built-in function or a user-defined one, you should have a clear idea of how to use it. Functions are very powerful for keeping your code well structured and increasing its reusability.

Python ships with many functions; we just need to call these built-in functions to perform the assigned tasks. Most of the basic tasks we perform frequently in data operations are well covered by them. To start with, I will discuss some of these important built-in Python functions.

Built-in Python functions

Let’s start with some important built-in functions of Python. These are already included and make your coding experience much smoother. The only condition is that you have to be aware of them and use them frequently. The first function we will discuss is help().

So take help()

Python functions take care of most of the tasks we want to perform through coding. But the common question in any beginner’s mind is: how will he or she know about all these functions? The answer is to take help.

The help() function is there in Python to tell you everything you need to know about any function in order to use it. You just need to pass the function to help(). See the example below.

# Using help
help(print)

Here I want to know about the print function, so I passed it to help(). The output describes everything you need to know to apply the function: the function header with the optional arguments you can pass and their roles, along with a brief plain-English description of what the function does.

Interestingly you can know all about the help() function using the help function itself :). It is great to see the output. Please type to see it yourself.

# Using help() for help
help(help)

Again here help has produced all necessary details about itself. It says that help() function is actually a wrapper around pydoc.help that provides a helpful message for the user when he types “help” in the Python interactive prompt.

List() function

A list is a collection of objects of the same or different data types. It is used very frequently to store data for later operations in data science. See the code below, which creates a list with different data types.

# Defining the list item
list_example=["Python", 10, [1,2], {4,5,6}]
# Printing the data type
print(type(list_example))
# Printing the list
print(list_example)
# Using append function to add items
list_example.append(["new item added"])
print(list_example)

The above code creates a list containing a string, an integer, a nested list and a set. The type() function prints the type of the object, and finally the append() function adds an extra item to the list. Let’s see the output.

So, the data type is list. All the list items are printed, and an item is appended to the list with append(). Note this function, as it is very handy while performing data analysis. You can also build a complete list from scratch using only the append() function; see the example below.
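Here is a minimal sketch of that idea (the variable name squares is just for illustration and was not part of the original post): start from an empty list and grow it one item at a time.

# Building a list from scratch with append()
squares = []
for n in range(1, 6):
    squares.append(n * n)
print(squares)   # prints [1, 4, 9, 16, 25]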

sorted() function

This is another important function we need frequently while doing numeric computation. For example, a very basic use of sorted() is while calculating the median of a sample: to find the median, we need to sort the data first. By default sorted() arranges the data in ascending order, but you can reverse this by using the reverse argument. See the example below.

# Example of sorted function
list_new=[5,2,1,6,7,4,9]
# Sorting and printing the list
print("Sorting in an ascending order:",sorted(list_new))
# Sorting the list in descending order and printing
print("Sorting in an descending order:",sorted(list_new,reverse=True))

And the output of the above code is as below:

Use of function sorted()

round() function

This function is useful when you want numbers with a desired number of decimal places, which is passed as the second argument. The argument has some interesting behaviour. See the example below and try to guess what the output will be.

# Example of round() function
print(round(37234.154))
print(round(37234.154,2))
print(round(37234.154,1))
print(round(37234.154,-2))
print(round(37234.154,-3))

Can you guess the output? Notice that the second argument can be negative too! Let's see the output and then explain what the function does to a number.
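For reference, running the snippet in Python 3 prints the following values:

37234
37234.15
37234.2
37200.0
37000.0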

With no second argument, round() returns the nearest integer, so the decimal part disappears. It keeps two decimal places when the argument is 2 and one decimal place when it is 1. When the second argument is -2 or -3, it rounds to the nearest multiple of 100 or 1000 respectively.

If you are wondering where on earth such a feature is useful, consider quoting a big figure (money, distance, population etc.) where we don't need the exact value; a rounded number does the job and is easier to remember. In such cases the round() function with a negative argument is used.

There are many more built-in functions; we will touch on them in other articles. Here I have covered a few of them as examples. Let's move on to the next section, user-defined functions, which give you the freedom to create your own functions.

User defined functions

After the built-in functions, we will now learn about user-defined functions. If you are learning Python as your first programming language, you should know that functions are among the most effective and interesting parts of any language.

A large part of a coder's expertise lies in creating functions to automate repetitive tasks. Instead of writing code for the same task again and again, a skilled programmer writes a function for it and just calls it whenever the need arises.

Below is an example of how you can create a function that adds two numbers.

# An example of user defined function
def add (x,y):
  ''' This is a function to add two numbers'''
  total=x+y
  print("The sum of x and y is:", total)

The above creates a function which adds two numbers and then prints the result. Let's call the function with two numbers and see the output.
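The original post showed the call and its result as a screenshot; a plain-text sketch of the same call (the inputs 4 and 5 are just example values) looks like this:

# Calling the user-defined function with two example numbers
add(4, 5)
# The sum of x and y is: 9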

I have called the function, passed two numbers as arguments, and the user-defined function printed their sum. Now whenever I need to add two numbers I can just call this function instead of writing those lines again and again.

Now, if we use help for this function, what will it return? Let's see.
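A small sketch of the call (the output, which shows the function signature followed by its docstring, appeared as an image in the original post):

# Asking help() about our own function prints its signature and docstring
help(add)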

See, help() has returned the text I put inside the triple-quoted string. It is called the docstring. A docstring allows us to describe the use of the function. This is very helpful, as complex programs require a lot of user-defined functions. The function name should indicate its purpose, but often that alone is not enough; a brief docstring quickly reminds you what the function does.

Optional arguments in user-defined function

Sometimes providing a default value for an argument saves us writing additional lines. See the following example:

# Defining functions
def hi(Hello="World"):
  print ("Hello",Hello)

hi()
hi("Python")
hi()

Can you guess the output of the three function calls above? Just for fun, try before looking at the output below. Notice that only one of the calls passes an argument.

Here is the output.
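(The original showed it as a screenshot; in plain text the three calls print the following.)

Hello World
Hello Python
Hello World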

For the first call, the function printed the default argument. When we passed "Python" explicitly, it overrode the default. In the third call, again without an argument, the default got printed. Try any other combinations that come to mind; it is fun and will make the concept clear.

Nested functions

Nested functions are functions defined inside another function. This is also one of the very important Python basics for data science. Below is an example of a very simple nested function. Try it yourself to check the output.

# Example of nested functions
def outer_function(msg):
  # This is the outer function
  def inner_function():
    print(msg)
  # Calling the inner function
  inner_function()
# Calling the outer function
outer_function("Hello world")

Functions passed as argument of another function

Functions can also be passed as arguments to other functions. It may sound a little confusing at first, but it is a really powerful property among the Python basics for data science. Let's take an example. See the piece of code below.

# Calling functions in a function
def add(x):
  return 5+x

def call(fn, arg):
  return fn(arg)

def call_twice(fn, arg):
  return fn(fn(arg))
print(
    call(add, 5),
    call_twice(add, 5),
    sep="\n"
)

Again, try to understand the logic and guess the output. Copy the code and make little changes to see how the output changes or what errors appear. The output I got from this code is below.
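(The original showed the output as an image; the two printed values are:)

10
15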

Did you guess it right? We have created three functions, add(), call() and call_twice(), and then passed the add() function into the other two. The call() function applied add to the argument 5, so the output is 10.

In a similar fashion, call_twice() returned 15 because it applies the passed function twice: add(add(5)) is add(10), which is 15. I know it can be confusing at first, mostly because the logic here has not come from a concrete purpose. When you create such functions to solve a real problem, the concept will become clear. So do some practice with the code given here.

How to create your first machine learning project: a comprehensive guide

Your first machine learning project

This article is to help you to start with your first machine learning project. Machine learning projects are very important if you are serious about your career as a data scientist. You need to build your profile with a number of machine learning projects. These projects are evidence of your proficiency and skill in this field.

The projects are not necessarily only complex problems. They can be very basic with simple problems. What is important is to complete them. Ideally, in the beginning, you should take a small project and finish it. It will boost your confidence as you have successfully completed it as well as you will get to learn many new things.

So, to start with, I have selected a very basic problem: classification of the Iris data set. You can compare it with the very basic "Hello world" program that every programmer writes as a beginner. The data set is small, so it is easy to load on your computer, and it has only a few features, which makes implementing any ML algorithm easier.

I have used here Google Colab to execute the Python code. You can try any IDE you generally use. Feel free to copy the code given here and execute them. The first step is to use the existing code without any error. Afterwards, make little changes to see how the output gets affected or gives errors. This is the most effective way to know a new language as well as its application in Machine Learning.

The steps for first machine learning project

So, without much ado, let's jump into the project. You first need to chalk out the steps for implementing it.

  • Importing the python libraries
  • Importing and loading the data set
  • Exploring the data set to have a preliminary idea about the variables
  • Identifying the target and feature variables and the independent-dependent relationship between them
  • Creating training and testing data set
  • Model building and fitting
  • Testing the data set
  • Checking model performance with comparison metrics

This is the ideal sequence in which to proceed with the project. As you gain experience you will not have to memorize it. Since this is your first machine learning project, I felt it necessary to list the steps for further reference.

Importing the required libraries

# Importing required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np

About the data

The data comes from the UCI machine learning repository's Iris data set, created by Dr R. A. Fisher. It contains three Iris species, "Setosa", "Versicolor" and "Virginica", and four flower features, namely petal length, petal width, sepal length and sepal width in cm. Each species represents a class and has 50 samples in the data set, so the Iris data has 150 samples in total.

This is one of the most popular and basic data sets used in pattern recognition to date. Note that the copy bundled with scikit-learn is taken from Fisher's paper; as the description below mentions, it matches the version found in R but differs slightly from the UCI repository copy, which has two erroneous data points.

The following line of code will load the data set in your working environment.

# Loading the data set
dataset = load_iris()

The following code will print a detailed description of the data set.

# Printing the data set description
print(dataset.DESCR)

Description of Iris data

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

Checking the data type

We can check the data type before proceeding for analytical steps. Use the following code for checking the data type:

# Checking the data type 
print(type(dataset))

Now, here is a catch with the data type. Check the output below; it says the object is a scikit-learn Bunch rather than a regular dataframe.

Data type
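(In plain text, the printed type is:)

<class 'sklearn.utils.Bunch'>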

However, the data type we are most used to working with is the Pandas dataframe. Also, the target and features are stored here separately. You can print them separately using the following lines.

# Printing the components of Iris data
print(dataset.target_names)
print(dataset.target)
print(dataset.feature_names)

See the print output below. The target variables are the three Iris species “Setosa”, “Versicolor” and “Virginica” which are coded as 0,1 and 2 respectively. And the features are also stored separately.

Components of Iris data set

The feature values themselves are stored separately as data. Here are the first few rows.

# Printing the feature data
print(dataset.data)
First machine learning project data set view

Converting the data type

For ease of the further modelling process, we will convert the data from the scikit-learn Bunch to the more common Pandas dataframe. We also need to concatenate the separate data and target arrays, using the feature names plus target as column names. np.c_ concatenates the feature array and the target column-wise.

# Converting scikit learn dataset to a pandas dataframe
import pandas as pd
df = pd.DataFrame(data=np.c_[dataset['data'], dataset['target']],
                  columns=dataset['feature_names'] + ['target'])
df.head()

Below are the first few lines of the combined dataframe. With this new dataframe we are ready to proceed to the next step.

The new Pandas dataframe

Check the shape of the newly created dataframe as I have done below. The output confirms that the dataframe is now complete with 150 samples and 5 columns.

# Printing the shape of the newly created dataframe
print(df.shape)

Creating target and feature variables

Next, we need to create variables storing the dependent and independent variables. Here the target variable, the Iris species, depends on the features, so the flower properties, i.e. petal width, petal length, sepal length and sepal width, are the independent variables.

In the data set printed above, you can see that the first four columns hold the independent variables and the last one the dependent variable. So, in the lines of code below, variable x stores the values of the first four columns and y the target variable.

# Creating target and feature variables
x=df.iloc[:,0:4].values
y=df.iloc[:,4].values
print(x.shape)
print(y.shape)

The shape of x and y is as below.

Shape of x and y
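(In plain text, the two printed shapes are:)

(150, 4)
(150,)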

Splitting the data set

We need to split the data set before applying Machine learning algorithms. The train_test_split() function of sklearn has been used here to do the task. The test data size is set as 20% of the data.

# Splitting the data set into train and test set
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.2,random_state=0)
print(x_train.shape)
print(x_test.shape)

Accordingly, the training set contains 120 samples whereas the test set has 30 samples.

Application of Decision tree algorithm

So, we have finished the data processing steps and are ready to apply a machine learning algorithm. I have chosen a very popular classification algorithm, the Decision Tree, for this first machine learning project.

If this algorithm is new to you, you can refer to this article to learn the details and how it can be applied with Python. The speciality of this algorithm is that its logic is very simple and the process is not a black box like many other ML algorithms, which means we can see and understand how the decision making happens.

So let’s apply this ML model to the training set of Iris data. The DecisionTreeClassifier() of sklearn is the function here which we have imported in the beginning.

# Application of Decision Tree classification algorithm
dt=DecisionTreeClassifier()
# Fitting the dt model
dt.fit(x_train, y_train)

The model is thus fitted on the training set. In the screenshot of my Colab notebook below you can see that the classifier has several parameters specifying how the decision tree is formed. At this stage you don't need to bother about these specifications; we can discuss each of them and their function in another article.

Fitting the Decision Tree Classification model

Prediction using the trained model

To test the model we will first create a new data point. As this data has not been used in model building, the prediction will not be biased.

# Creating a new feature set i.e. a new flower's properties
x_new = np.array([[4.9, 3.0, 1.4, 0.2]])
# Predicting for the new data using the trained model
prediction = dt.predict(x_new)
print("Prediction:", prediction)

See the prediction result from the trained Decision Tree classifier. It gives 0, which represents the Iris species "Setosa". As discussed before, the Iris species are represented in the dataframe by the digits 0, 1 and 2.

Prediction for the new data

Let's now predict for the test set, the 20% of the data kept aside during model training. We will also use two metrics indicating the goodness of fit of the model.

y_pred = dt.predict(x_test)
print("Predictions for the test set:", y_pred)
# Metrics for goodness of fit
print("np.mean:", np.mean(y_pred == y_test))
print("dt.score:", dt.score(x_test, y_test))

And the output of the above piece of code is as below.

Prediction using the test set

You can see that the test accuracy score is 1.0! A perfect score on such a small test set should make us cautious rather than celebratory: decision trees are prone to overfitting, and a single 30-sample split is not a reliable measure of how well the model generalizes. So ideally we should try other machine learning models, with cross-validation, and check their performance.

So in the next section we will not take up a single ML algorithm; rather we will take a bunch of ML algorithms and test their performance side by side to choose the best performing one.

Application of more than one ML models simultaneously

In this section, we will fit multiple ML algorithms at a time to classify the Iris data and see which one of them is the most accurate. The ML algorithms we will use here are Linear Discriminant Analysis, Naive Bayes classifier, Logistic regression, Support Vector Machine, K-Nearest Neighbour classifier and also Decision tree classifier which we have already applied before. Here I am including it too just to compare it with the others.

Along with these ML models, I am going to introduce another family known as ensemble models. Their speciality is that an ensemble model uses more than one machine learning model at a time to achieve a more accurate estimation. See the figure below to understand the process.

A schematic diagram of an ensemble model

There are two kinds of ensemble models: Bagging and Boosting. I have incorporated both kinds here to compare them with the other machine learning algorithms. Here is a brief idea about the Bagging and Boosting ensemble techniques.

Bagging

The name stands for Bootstrap Aggregation. It is essentially a random sampling technique with replacement: once a sample unit is selected, it is replaced and remains available for future selection. This method works best with algorithms that tend to have high variance, like the decision tree algorithm.

Bagging trains each model separately on such a bootstrap sample and, for the final prediction, aggregates every model's estimate with equal weight.

The other ensemble modelling technique is:

Boosting

As an ensemble learning method, boosting also combines a number of models for prediction. It builds a sequence of weak learners, each trained to correct the errors of the previous ones, and weights them so that the combination becomes a stronger learner, thus improving the prediction.

The ensemble models we are going to use here are AdaBoostClassifier(), BaggingClassifier(), ExtraTreesClassifier(), GradientBoostingClassifier() and RandomForestClassifier(). All are from sklearn library.

Importing required libraries

# Importing libraries
from sklearn.model_selection import cross_val_score
from sklearn import ensemble
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import seaborn as sns

Application of all the models

Use the following lines of code to build, train and evaluate all the models. They also create a dataframe named ml_compare, which stores all the comparison metrics calculated here.

# Application of all the ML algorithms at a time
ml = []
ml.append(('LDA', LinearDiscriminantAnalysis()))
ml.append(('DTC', DecisionTreeClassifier()))
ml.append(('GNB', GaussianNB()))
ml.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
ml.append(('SVM', SVC(gamma='auto')))
ml.append(('KNN', KNeighborsClassifier()))
ml.append(("Ensemble_AdaBoost", ensemble.AdaBoostClassifier()))
ml.append(("Ensemble_Bagging", ensemble.BaggingClassifier()))
ml.append(("Ensemble_Extratree", ensemble.ExtraTreesClassifier()))
ml.append(("Ensemble_GradientBoosting", ensemble.GradientBoostingClassifier()))
ml.append(("Ensemble_RandomForest", ensemble.RandomForestClassifier()))

ml_cols=[]
ml_compare=pd.DataFrame(columns=ml_cols)
row_index=0
# Model evaluation
for name, model in ml:
  model.fit(x_train,y_train)
  predicted=model.predict(x_test)
  kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
  cv_results = cross_val_score(model, x_train, y_train, cv=kfold, scoring='accuracy')
  ml_compare.loc[row_index, 'Model used']=name
  ml_compare.loc[row_index,"Cross Validation Score" ]=round(cv_results.mean(),4)
  ml_compare.loc[row_index,"Cross Value SD" ]=round(cv_results.std(),4)
  ml_compare.loc[row_index,'Train Accuracy'] = round(model.score(x_train, y_train), 4)
  ml_compare.loc[row_index,"Test accuracy" ]=round(model.score(x_test, y_test),4)

  row_index+=1

ml_compare

All the models are trained on the train set and then tested with the test data. The goodness-of-fit statistics are stored in ml_compare. So let's see what ml_compare tells us. The output is below.

Comparative table of cross validation score of all the models

Visual comparison of the models

Although the models can be compared from the table above, it is always easier if there is a way to visualize the differences. So let's create a bar chart of the train accuracy we calculated above (you can do the same with the cross-validation score). Use the following lines of code to create the bar chart with the help of the matplotlib and seaborn libraries.

# Creating plot to show the train accuracy
plt.subplots(figsize=(13,5))
sns.barplot(x="Model used", y="Train Accuracy",data=ml_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('Model Train Accuracy Comparison')
plt.show()

When the above code executes, the following bar chart is created showing the train accuracy of all the ML algorithms.

The verdict

So, we have classified the Iris data using different types of machine learning and ensemble models. The results show that they are all more or less accurate in identifying the Iris species correctly. But if we still need to pick one of them as the best, we can do so based on the comparative table above as well as the graph.

In this instance, Linear Discriminant Analysis and Support Vector Machine perform slightly better than the others. But this can vary depending on the size of the data, and ML scores do change between executions. Check your own result, see which one you find best, and let me know through the comments below.

Conclusion

So, congratulations, you have successfully completed your very first machine learning project with Python. You have used a popular and classic data set to apply several machine learning algorithms. Being a multiclass data set, it is an ideal example of a real-world classification problem.

To find out the best performing model, we have applied the six most popular Machine Learning algorithms along with several ensemble models. To start with the model building process, first of all, the data set has been divided into training and testing sets.

The training set is to build and train the model. The test data set is an independent data set kept aside while building the model, to test the model’s performance. This is an empirical process of model validation when independent data collection is not possible. For this project, we have taken an 80:20 ratio for train and test data set.

Finally, a number of comparison metrics were used to find the model with the highest accuracy. These are essentially the standard steps of any machine learning project. As this is your first machine learning project, I have shown every step in detail. As you gain experience you may skip some of them at your convenience.

So, please let me know your experience with the article. Any problem you faced while executing the code or any other queries post them in the comment section below, I will love to answer them.

Data exploration is now super easy with D-tale

D-tale data exploration tool

This article introduces a really easy data exploration tool for Python. You just have to install and import this simple module. It integrates with any Python IDE you are using, and D-tale is ready with all its data exploration features and a very easy user interface.

Data exploration is a very basic yet very important step of data analysis. You need to understand the data and the relationships between the variables before you dive into advanced analysis. Basic data exploration techniques like visual interpretation, calculating summary statistics, identifying outliers and performing mathematical operations on variables are very effective for gaining a quick idea about your data.

These data exploration steps are necessary for any data science projects. Even in machine learning and deep learning projects also we filter our data through these data exploration techniques. And they involve writing a few lines of Python code which are usually repetitive in nature.

This is mostly mechanical work, and writing reusable code helps only a bit, because you still need to tweak the code every time a new data set comes in. Every time we type "dataset.head()" we wish there were a user interface for these basic tasks; it would be a big time saver.

So here comes D-tale to rescue us. D-tale is actually a lightweight web client developed over the Pandas data structure. It provides an easy user interface to perform several data exploration tasks without the need of writing any code.

What is D-tale?

D-tale is an open-source tool that grew out of a SAS-to-Python conversion effort, for visualizing data held in Pandas dataframes. It encapsulates all the coding for Pandas data operations in the backend, so that you don't need to write the same thing repeatedly.

It started out as a wrapper around SAS's insight function written in Perl script and was eventually converted into D-tale on top of Pandas. D-tale also integrates easily with Python terminals and IPython notebooks; you just need to install it in Python and then import it.

You can refer to this link for further information about this tool. It is from the developers and also contains some useful resources. Here is a good video resource by the developer of D-tale, Andrew Schonfeld, from FlaskCon 2020.

A video lesson on D-tale by Andrew Schonfeld

I have been using it for some time and really like it. It has made some of my regular, repetitive data exploration tasks very easy and saves a lot of my time.

Here I will discuss in detail how it can be installed and used, with screenshots from my computer taken while I installed it.

Installation

The installation is also a breeze; within seconds you can install it and start using it. Just open your Anaconda Powershell Prompt from the Windows Start menu. See the image below.

Opening Anaconda Powershell Prompt from start

Now type the following command in Anaconda Powershell Prompt to install the D-tale in your windows.

# To install D-tale (use either the conda or the pip command)
conda install dtale -c conda-forge
pip install -U dtale

Below is a screenshot from my computer's Anaconda shell. You need to run only one of the two commands, as they work in a similar way. The only difference is that pip installs from the Python Package Index whereas conda installs packages from the Anaconda repository.

Now you are ready to use the D-tale. Open your Jupyter notebook and execute the following codes.

# To import Pandas
import pandas as pd
# To import D-tale
import dtale

Importing D-tale module in Jupyter notebook

Example data set

The example data set I have used here for demonstration purposes was downloaded from kaggle.com. The data, collected by the "National Institute of Diabetes and Digestive and Kidney Diseases", contains vital parameters of diabetes patients belonging to the Pima Indian heritage.

Here is a glimpse of the first ten rows of the data set. I imported the data set in CSV format using the usual pd.read_csv() command, and to show the table I used dtale.show().
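A minimal sketch of those two steps (the file name diabetes.csv is just a placeholder for wherever the downloaded Kaggle file is saved):

# Reading the downloaded CSV file into a Pandas dataframe
df = pd.read_csv("diabetes.csv")
# Opening the dataframe in the D-tale interface
dtale.show(df)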

D-tale working pane

The data set has several physiological parameters of diabetes patients as independent variables. The dependent variable indicates whether the patient suffers from diabetes: it is a binary variable where 1 means the person has diabetes and 0 means they do not.

Data exploration with D-tale

Now the Jupyter notebook displays the data, and you can click the arrow button in the top left-hand corner to open all the data manipulation tools. As the image below shows, the left panel has several options like describe, build column, correlations, charts etc.

Data exploration tools in D-tale

Descriptive statistics for variables

This option describes a variable with some descriptive or summary statistics. It does the same task as Pandas' df.describe(). D-tale gives you the same result without writing any code; just click "describe" in the left panel.

In the image below you can see the descriptive statistics of the variable "Pregnancies" displayed along with a box-and-whisker plot. Select any other variable from the left menu and the summary statistics of that variable will be displayed.

Describing a variable

Calculation of correlation among the variables

Here is an example of calculating the correlations among the variables. Just by clicking correlations, D-tale creates a neat correlation table for all the variables. The depth of the colours is very useful for spotting correlated variables at a glance; here the darker shades indicate higher correlation.

Correlations between the variables

Preparing charts

Chart creation is a very basic yet very useful data exploration technique. Using D-tale you can create different types of charts like bar, line, scatter, pie, heatmap etc. Through its interface, D-tale does away with writing several lines of code. Below is an example of creating a scatter plot with this tool.

When you select the chart option from the left panel of D-tale, a new browser tab opens with the following options. You need to select the variables for the scatter plot; there are options to choose the X and Y variables. You can also use the group-by option if there is a categorical variable.

Chart creation wizard

If you desire, also select any of the aggregation options available there, or simply go for the scatter option above. A scatter plot between the two variables will be displayed. Below is the scatter plot with all the options for your reference.

Sections of a chart created in D-tale

The scatter plot comes with some tool options, as shown in the image above. These tools help you dig further into the plot's details: you can investigate any particular point of interest with options like box select or lasso select, change the axes settings, see the data on hover and so on.

Other very helpful options for the chart are also available, as shown in the figure: popping the chart out into another tab to compare it with another, a link you can copy and share, export of the chart as static HTML that can be attached to an e-mail, data export to CSV, and finally copying the underlying Python code for further customization.

Highlighting the outliers

Another very good and useful feature of D-tale is to highlight the variable wise outliers in a single click. See the below image.

Highlighting the outliers

Creating a pie chart

Here is an example of a pie chart created with D-tale. The pie chart is a very popular format for showing the proportional distribution of different components. Creating one follows the same simple process: just choose pie chart and then select the variables you want to display.

Pie chart

Bar plot

Another popular chart format is the bar plot. It reveals many important properties of the variables and the relations between them. For example, here I have created a bar plot of the mean blood pressure against the age of different individuals. It is a very effective way to see how blood pressure varies with age, which is otherwise not easily identifiable from the raw data.

Creating the bar plot works the same way and is very easy. Different aggregation options are available here too; for example, I have chosen the mean to display the blood pressure along the Y axis.

Creating bar plot

Code export

This is a very useful option D-tale provides. You can get the code behind the particular data exploration step you performed, make any desired changes, or simply study it to learn how to write the equivalent standard code.

Here is the code snippet below used for creating the bar plot above.

Code export window

Conclusion

This article presents a very helpful data exploration tool which can make your regular data analysis task much easier and quicker. It is a light application and uses Pandas data manipulation libraries underneath.

Its simple and neat user interface integrates easily with any IDE you use. Data analysts need a quick idea about the data at hand so that they can plan their advanced analytical tasks. So D-tale can be a tool of choice for them, saving the considerable time required for writing regular repetitive lines of code.

I hope the article will be helpful. I have tried to provide as much information as possible so that you can install and apply it straight away. Do share your experience: how do you find it, is it helpful? Let me know your opinion and any further queries or doubts by commenting below.

How to set up your deep learning workstation: the most comprehensive guide

Set up a deep learning workstation

This article contains a detailed, step-by-step guide to setting up a deep learning workstation with Ubuntu 20.04. It is actually documentation of the process I followed on my own computer. I have repeated this process a number of times, and every time I thought I should have documented it, because proper documentation makes the next setup quick and error-free.

I have mentioned the most common mistakes and errors during the process and how to avoid or troubleshoot them. Bookmarking this page can help you quickly refer it whenever you get stuck in any of the steps.

I have done this complete setup a few times on both my old and new laptops, which have completely different configurations, so I hope the problems I faced are the most common ones. It took me considerable time to fix all those issues, mainly by visiting discussion groups like StackOverflow, the Ubuntu discussion forum and many other threads and blogs.

I have compiled them all in one place here, so that you don't have to visit multiple sites and can refer to this post alone to complete the whole installation. This documentation should save a lot of your valuable time.

Prerequisites to set up deep learning workstation

I assume that you already have Ubuntu on your computer. If not, please install the latest version of Ubuntu. It is the most famous open-source Linux distribution and is available for free download here. Although it is possible to run deep learning Keras models on Windows, it is not recommended.

Why should you use Ubuntu for deep learning? Refer to this article

Another prerequisite for running deep learning models is a good quality GPU. I advise you to have an NVIDIA GPU in your computer for satisfactory performance. It is not a strict requirement, but running sequence models with recurrent neural networks or image processing with convolutional neural networks on a CPU alone is a difficult proposition.

Such models may take hours to give results when run on a CPU, whereas a modern NVIDIA GPU may need merely 5-10 minutes. If you do not want to invest in a GPU, an alternative is to use a cloud computing service and pay an hourly rent.

However, in the long run, using such a service may cost you more than upgrading your local system. So my suggestion is: if you are serious about deep learning and expect even moderate use, go for a good workstation setup.

The main steps to set up a deep learning workstation

Now I assume that you have completed all the prerequisites for setting up your deep learning experiments. It is a somewhat time-consuming process. You will require a stable internet connection to download various files; depending on the speed, the complete process may take 2-3 hours (with an internet speed of 1 Gbps it took me about 2 hours). The main steps to set up a deep learning workstation are as follows:

  • Updating the Linux system packages
  • Installation of Python pip command. It is the very basic command going to be used to install other components
  • Installing the Basic Linear Algebra Subprogram (BLAS) library required for mathematical operation.
  • HDF5 data frame installation to store hierarchical data
  • Installation of Graphviz to visualize Keras model
  • CUDA and cuDNN NVIDIA graphics drivers installation
  • Installation of TensorFlow as the backend of Keras
  • Keras installation
  • Installation of Theano (optional)

So, we will now proceed with the step by step installation process

Updating the Linux system packages

The following commands will upgrade the Linux system packages. You have to type them in the Ubuntu terminal; the keyboard shortcut to open the terminal is "Ctrl+Alt+T". Open the terminal and execute the following lines.

$ sudo apt-get update
$ sudo apt-get --assume-yes upgrade

Installing the Python-pip command

The pip command is for installing and managing Python packages. Whichever packages we install next, this pip command will be used. It is a replacement for the earlier easy_install command. Run the following command to install python-pip.

$ sudo apt-get install python-pip python-dev

This should install pip on your computer, but sometimes there are exceptions, as happened to me. See the screenshot of my Ubuntu terminal below; it says “Unable to locate package python-pip”.

This was a big problem at first, as I was clueless about why it was happening; on my old computer I had used the command many times without any issue. After scouring the internet for several hours I found the solution: it has to do with the Python version installed on your computer.

If you are also facing the problem (most likely if using a new computer) then first check the python version with this command.

$ ls /bin/python*

If it returns a Python 2 version (for example Python 2.7), the original python-pip package name applies; if it returns a newer Python 3 version like Python 3.8, use python3-pip to install pip. So the command will be as below.

$ sudo apt-get install python3-pip

On older Ubuntu releases, a bare "python" in package names refers to Python 2, so Python 3 packages have to be mentioned explicitly with the python3 prefix; on Ubuntu 20.04, Python 3 is the default. So, to install the Python 3 versions, use the following code.

# Installing Python3
$ sudo apt-get install python3-pip python3-dev

Installation steps for the Python scientific suite in Ubuntu

The process discussed here is for Windows and Linux operating systems. Mac users need to install the Python scientific suite via Anaconda, from the Anaconda repository. The Anaconda documentation is continuously updated and very vivid, with every step described in detail.

Installation of the BLAS library

Installing the Basic Linear Algebra Subprograms (BLAS) library is the first step in setting up your deep learning workstation. One thing Mac users should keep in mind is that this installation does not include Graphviz and HDF5; they have to be installed separately.

Here we will install OpenBLAS using the following command.

$ sudo apt-get install build-essential cmake git unzip \
pkg-config libopenblas-dev liblapack-dev

Installation of Python basic libraries

In the next step we install the basic Python libraries like NumPy, Pandas, Matplotlib, SciPy etc. These are the core Python libraries required for any kind of mathematical operation, so whether it is machine learning, deep learning or any other computation-intensive task, we will need them.

So use the following command in the Ubuntu terminal to install this scientific suite in one go (on Ubuntu 20.04, use the python3- prefixed equivalents of these packages).

# installation of Python basic libraries
$ sudo apt-get install python-pandas python-numpy python-scipy python-matplotlib python-yaml

Installation of HDF5

The Hierarchical Data Format (HDF) version 5 is an open-source file format which supports large, complex and heterogeneous data. It was developed at the National Center for Supercomputing Applications (NCSA) to store large numeric data efficiently in binary form, and it builds on the earlier hierarchical formats HDF4 and NetCDF.

HDF5 data format allows the developer to organize his machine learning/deep learning data in a file directory structure very similar to what we use in any computer. This directory structure can be used to maintain the hierarchy of the data.

If we compare it with a computer's file system, the "directory" or "folder" corresponds to a "group" and the "files" correspond to "datasets" in HDF5. It matters in deep learning because Keras models are saved to and loaded from disk in this format.
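As a minimal sketch, not part of the original post and assuming a trained Keras model object named model already exists, saving and reloading a model in HDF5 looks like this:

# Saving a trained Keras model to an HDF5 file and loading it back
from tensorflow.keras.models import load_model
model.save('my_model.h5')                    # writes architecture, weights and optimizer state to HDF5
restored_model = load_model('my_model.h5')   # reads the model back from the HDF5 file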

Run the following command to install HDF5 in your machine

# Install HDF5 data format to save the Keras models
$ sudo apt-get install libhdf5-serial-dev python-h5py

Installation of modules to visualize Keras model

In the next step we install two packages called Graphviz and pydot-ng. These two packages are needed to visualize a Keras model. The commands for installing them are as follows:

# Install graphviz
$ sudo apt-get install graphviz
# Install pydot-ng
$ sudo pip install pydot-ng

These two packages will definitely help you when inspecting the deep learning models you create, but for the time being you can skip their installation and proceed with the GPU configuration; Keras also works without them.
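For reference, here is a minimal sketch of how these packages are typically used once a Keras model object named model exists (this snippet is not from the original post):

# Visualizing a Keras model as a PNG image (uses graphviz and a pydot package under the hood)
from tensorflow.keras.utils import plot_model
plot_model(model, to_file='model.png', show_shapes=True)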

Installation of opencv package

Use the following code to install opencv package

# Install opencv
$ sudo apt-get install python-opencv

Setting up GPU for deep learning

Here comes the most important part. As you know, the GPU plays an important role in deep learning modelling. In this section we set up GPU support by installing two components, namely CUDA and cuDNN. To function properly, they need an NVIDIA GPU.

Although you can run your Keras models on the CPU alone, training will take much longer than on a GPU. So again, if you are serious about deep learning modelling, plan to procure an NVIDIA GPU (using a cloud service and paying hourly rent is an alternative).

Let's concentrate on setting up the GPU, assuming your computer already has a recent NVIDIA card.

CUDA installation

To install CUDA, visit the NVIDIA download page at https://developer.nvidia.com/cuda-downloads. You will land on the following page. It will ask you to select the OS you are using; as we are using Ubuntu here (to know why Ubuntu is the preferred OS, read this article), choose Linux and then Ubuntu.

CUDA installation-OS selection

Then it will ask other specifications of your workstation environment. Select them as per your existing specifications. Like here I have selected OS as Linux. I am using a Dell Latitude 3400 laptop which is a 64 bit computer, so in next option I selected x86_64; the Linux distribution is Ubuntu version 20.04.

Finally, you have to select the installer type. I selected the network installer, mainly because it has a comparatively smaller download size and I was using mobile internet at the time. You can choose any of the local installation options if internet bandwidth is not a constraint; the plus point of a local installation is that you download everything only once.

CUDA installation-specification selection

Once all the specifications are given, NVIDIA will show you the installer commands. Copy the code from there and run it in the Ubuntu terminal. It uses Ubuntu's apt to install the packages, which is the easiest way to install CUDA.

CUDA installation code
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
$ sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
$ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
$ sudo apt-get update
$ sudo apt-get -y install cuda

Install cuDNN

“cuDNN is a powerful library for Machine Learning. It has been developed to help developers like yourself to accelerate the next generation of world changing applications.”

NVIDIA.com

To download the specific cuDNN file for your operating system and Linux distribution, you have to visit the NVIDIA download page.

Downloading cuDNN

To download the library, you have to create an account with NVIDIA. It is a compulsory step.

NVIDIA membership for Downloading cuDNN

Fill in the necessary fields.

NVIDIA membership for Downloading cuDNN

As you finish registration a window with some optional settings will appear. You can skip them and proceed for the next step.

NVIDIA membership for Downloading cuDNN

A short survey by NVIDIA is the next step. Although it asks about your experience as a developer, you can fill it in with any of the options just to navigate to the download page.

Download survey for cuDNN

Now the page with several download options will appear and you have to choose according to your specifications. I have selected the following debian file for my workstation.

Selecting the OS for cuDNN download

Download the file (the file size is around 300mb in my case). Now to install the library, first change the directory to enter in the download folder and execute the install command.

Once you are in the directory where the library has been downloaded (by default it is the download folder of your computer) run the command below. Use the filename in place of **** in the command.

$ sudo dpkg -i ******.deb

You can follow the installation process from this page. With this the cuDNN installation is completed.

Installation of TensorFlow

The next step is the installation of TensorFlow. It is very simple; just execute the command below to install TensorFlow using the pip command. (Recent TensorFlow releases include GPU support in the same package; with the older 1.x releases, the GPU build was the separate tensorflow-gpu package.)


# Installing TensorFlow using pip3 command for Python3
$ sudo pip3 install tensorflow

Installing Keras

This is the final step of setting up your deep learning workstation and you are good to go. You can run the simple below command.

$ sudo pip3 install keras

Or you can install it from GitHub. The benefit of installing Keras from GitHub is that you get lots of example scripts which you can run to test your machine; these are a very good source of learning.

$ git clone https://github.com/fchollet/keras
$ cd keras
$ sudo python setup.py install

Optional installation of Theano

Installing Theano is optional, as we have already installed TensorFlow. However, it can be handy when building Keras code and switching between the TensorFlow and Theano backends. Execute the command below to install Theano:

$ sudo pip3 install theano

Congratulations!!! You have finished all the installations and completed the setup of your deep learning workstation. You are now ready to execute your first deep learning neural network code.
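As a quick, minimal sanity check (a sketch of my own, not from the original guide), you can verify that TensorFlow sees the GPU and that Keras builds a model:

# Verifying the setup: TensorFlow version, visible GPUs and a tiny Keras model
import tensorflow as tf
from tensorflow.keras import layers, models

print(tf.__version__)                          # installed TensorFlow version
print(tf.config.list_physical_devices('GPU'))  # should list your NVIDIA GPU
model = models.Sequential([layers.Dense(10, activation='relu', input_shape=(4,)),
                           layers.Dense(3, activation='softmax')])
model.summary()                                # prints the layer structure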

I hope this article proves helpful in setting up your deep learning workstation. It is a lengthy article, but it covers the technical details you may need if you run into difficulty during the process. A little knowledge about every component you are installing also helps you make further changes to the setup.

Let me know how you find this article by commenting below. Please mention any information I missed or any doubts you have about the process, and I will try my best to help.

Why is Ubuntu the best for deep learning frameworks?

Ubuntu for deep learning

Why use Ubuntu for deep learning? This is the question this article tries to answer. After reading this article you will not have any doubt regarding which platform you should use for your deep learning experiments.

I was quite happy with my Windows 10 and Colab/Jupyter notebook combination for all of my Artificial Intelligence (AI)/Machine Learning (ML)/Deep Learning (DL) programming, until I decided to start some serious work with deep learning neural network models.

Is it really important?

“[M]achines of this character can behave in a very complicated manner when the number of units is large”

Alan Turing (1948), “Intelligent Machinery”, page 6

As soon as I started building my first model, the limitations of my working environment came to my notice. I was reading deep learning threads on forums like Quora and Reddit, and someone mentioned in a reply that Ubuntu is a better choice for serious deep learning work.

It struck me that it was probably not wise to continue with Windows for advanced deep learning and AI applications. But it was just a hunch at that time, and I needed strong, logical reasons before making up my mind to switch from the only OS platform I had ever used.

So, I started scouring the internet, reading blogs and discussion forum threads to find out whether switching platforms was really worth my time, because getting acquainted with a completely new OS takes time, and time is money for me.

If you are also in the learning phase and serious about deep learning, this article will help you make an informed decision about which platform to use, because switching your working environment at a later stage of learning wastes valuable time and forces a lot of rework.

I have already done the heavy lifting for you and present a clear description of the topic, so that you get all your questions answered in one place.

So let’s start with an introduction to Ubuntu. Although this is not an article on Ubuntu itself (you can find many good articles on that), here is a very brief idea before we look at its special features concerning deep learning.

What is Ubuntu?

Ubuntu is one of the most popular Linux distributions. It is developed by Canonical, the company founded by Mark Shuttleworth. It is also one of the most famous open-source technologies, which means all the features and applications it offers are completely free. And it is undeniable that being free automatically puts an application miles ahead in popularity.

“What commercialism has brought into Linux has been the incentive to make a good distribution that is easy to use and that has all the packaging issues worked out.”

Linus Torvalds, Principal developer of the Linux kernel

Ubuntu gets a new release roughly twice a year, while Long Term Support (LTS) releases arrive every two years with updated security patches. It has three main editions: Core, Desktop and Server.

The Core edition is mainly for those working on IoT devices and robotics. The Desktop edition is for everyday users doing office tasks as well as programming. The Server edition is for client-server architectures and is generally meant for industry use.

Why is Ubuntu preferred for deep learning?

The Ubuntu version I installed recently is 20.04, the latest release of this distro at the time of writing. It is much improved over its predecessor, and the additional support it provides for AI, ML and DL programmers is impressive.
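
If you are not sure which release you are running, Ubuntu will tell you directly:

# Prints the distributor, description, release number and codename
$ lsb_release -a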

The MicroK8s feature

“Given its smaller footprint, MicroK8s is ideal for IoT devices- you can even use it on a Raspberry Pi device!”

Kubernetes.io, Technical blogs
MicroK8s integration in Ubuntu

The user interface has improved a lot, and the installation process has become very easy (it was always smooth, though). Ubuntu 20.04 now comes with support for ZFS (a file system focused on high availability and data integrity) and an integrated module called MicroK8s, so AI and DL developers no longer have to install it separately.

MicroK8s lets AI application modules be set up and deployed blazingly fast. It comes with automatic updates and security patches built in. Quite obviously, with this version of Ubuntu you will spend much less time configuring the environment.
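
For context, on Ubuntu 20.04 MicroK8s is delivered as a snap, so getting a small local cluster running looks roughly like this (add-on names can differ between MicroK8s releases):

# Install MicroK8s and wait for the local cluster to come up
$ sudo snap install microk8s --classic
$ sudo microk8s status --wait-ready

# Enable a few common add-ons, including GPU support
$ sudo microk8s enable dns storage gpu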

Kubeflow

It is another deep learning edge of Ubuntu 20.04 and comes as an add-on to MicroK8s. Kubeflow was developed by Google, with Canonical collaborating to bring it to Ubuntu, especially for machine learning applications. It provides built-in GPU acceleration for deep learning R&D.

What is Kubeflow?

Kubeflow is deployed on top of Kubernetes and does away with many of the barriers to creating production-ready ML stacks. It provides developers with enhanced AI and ML capabilities, including edge computing. Researchers and developers involved in cutting-edge work get a secure production environment with strict confinement and complete isolation.

Kubeflow architecture
Source: Kubeflow blog by Thea Lamkin

The security provided by the Kubeflow and Kubernetes integration is unparalleled. Many AI/ML/DL development add-ons like Jaeger, Istio, CoreDNS, Prometheus and Knative come integrated with it and can be deployed with a single command, as sketched below.
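
In the MicroK8s releases current around Ubuntu 20.04, Canonical exposed Kubeflow as one of these add-ons; later releases deploy it differently, so treat this as a sketch of that era:

# Single-command Kubeflow deployment on MicroK8s (20.04-era add-on)
$ microk8s enable kubeflow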

The programming edge of Ubuntu

When it comes to programming, Ubuntu is undoubtedly the leader. Not only AI, ML and DL programming but almost any development or application-building task is well served when the operating system is Ubuntu.

It has excellent library support, with vast numbers of examples and tutorials readily available. The community support for the open-source software used with Ubuntu is massive, so most issues you face get solved quickly. Updates are also regular, irrespective of which version you are using.

The enhanced Graphics Processing Unit

A powerful GPU is an important component for serious ML/DL programming, and Ubuntu has an edge here too. NVIDIA, the most respected name in the GPU manufacturing industry, has put in a great deal of effort to support Ubuntu with CUDA to its maximum capacity.

Ubuntu 20.04 also gives users the option to use external graphics cards through Thunderbolt adapters, or to add them through dedicated PCI slots.
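
On the practical side, Ubuntu also ships the ubuntu-drivers utility, which detects NVIDIA cards and installs the recommended proprietary driver for them:

# List detected GPUs and the driver packages Ubuntu recommends
$ ubuntu-drivers devices

# Install the recommended NVIDIA driver (reboot afterwards)
$ sudo ubuntu-drivers autoinstall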

So it is no surprise that deep learning frameworks like Keras, TensorFlow, OpenCV and PyTorch all favour Ubuntu over other operating systems. World leaders in advanced AI/ML/DL research and development, from the autonomous car sector to CERN and the LHC, and well-known brands like Samsung, NVIDIA and Uber, all use Ubuntu for their research activities.

Advanced features and hardware support

The hardware support Ubuntu comes with is also exceptional. Ubuntu provides hardware certification for manufacturers, which means high compatibility is assured on certified machines, with tight BIOS integration and factory-level quality assurance.

To achieve this hardware quality, Canonical deals directly with hardware manufacturers, developing partnerships with the major ones in order to provide an operating system that ships preloaded and pretested.

The support team is, as usual, ready for any kind of troubleshooting at any time. With all these assurances, developers can concentrate fully on their R&D.

Finally, the software

Canonical’s Ubuntu has its own open-source software collection. The supported devices are compatible at both the board and component level, and all versions, old or new, ship the same package of software. This has several advantages.

The large Linux user base runs different versions of Ubuntu and can switch between them seamlessly, which is possible only because the same software packages are available across all versions. Developers can easily test their applications locally before launching them on the web for global users.

This collection of open-source software makes it possible to create AI models quickly, and software development and debugging on IoT hardware is fast and easy before deployment.

The snapcraft tool

Snapcraft, the app store for Linux

This is another major feature that makes Ubuntu a clear winner as a programming OS. Snap is a format for packaging and distributing containerized applications, and the automatic updates in Ubuntu are safe to install and run largely because of this snap feature.

Snapcraft is a command-line tool which creates snaps, making it very easy to package applications. User feedback collected through the snap ecosystem is of immense importance to developers, as it provides insights about the software and helps drive further improvement.
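
To make this concrete, here are a few everyday snap and snapcraft commands (the application name is just an example):

# Install an application as a snap
$ sudo snap install code --classic

# See what is installed and pull the latest revisions
$ snap list
$ sudo snap refresh

# Snapcraft itself ships as a snap; init scaffolds a new package definition
$ sudo snap install snapcraft --classic
$ snapcraft init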

For example, a study by Canonical revealed that most Ubuntu users never updated their software, so based on this feedback they started to provide automatic updates. Canonical also does not need to provide support for many older versions, as the complete user base moves to the latest version almost simultaneously.

Massive online support base

Being an open-source platform, Ubuntu has a massive online support and documentation repository. Users can ask their questions at any time through services like Slack and Skype, and the Ubuntu support community is very vibrant; here you can even expect a reply from the development team itself.

Popular question-and-answer communities like Quora and Reddit also have threads on Ubuntu-related queries; I personally have had many of my questions answered there. Even if you have a unique problem that has not been answered before, you can post it on any of these platforms, and it is highly likely that within a few hours you will get genuinely helpful suggestions from either regular users or the Ubuntu support/development team itself.

Final words

By the time you finish reading this article, you should have a clear idea of why you might pick Ubuntu as your machine learning or deep learning programming platform. I have tried my best to put together all the information I gathered from many articles, online and offline.

I invested a lot of time researching this topic to be completely sure before diving deeper into advanced learning. It is an important decision, no doubt. I have had bitter experiences before, where I put a lot of effort into learning a particular application and then, one day, due to some limitation, had to backtrack and change that platform or application.

Starting fresh from scratch meant a lot of rework and wasted time, which could have been avoided if I had done thorough research at the very beginning. So I learned my lesson and made no such mistake this time, and I hope this will also help you make an informed decision.

So, please let me know if you find the article useful by commenting below. Any queries, doubts or suggestions are welcome, and I will try to improve the post further based on your comments.