An introduction to Keras: the most popular Deep Learning framework

In this article, I am going to discuss Keras, a very popular deep learning framework in Python. Its modular architecture makes working with deep learning a smooth and fast experience. Keras handles the higher-level modelling part of deep learning and runs on both the GPU and the CPU of your workstation.

So it is no surprise that Keras with TensorFlow is one of the most popular and widely used deep learning frameworks. This post will introduce you to the framework so that you feel comfortable when you start using it for your deep learning experiments.

By the end of this article you will have an introductory idea of Keras, including:

  • What Keras is and why it is so popular in deep learning research
  • Important features of the Keras framework
  • The modular architecture of Keras
  • The back ends of Keras: TensorFlow, Theano and Microsoft’s Cognitive Toolkit (CNTK)
  • A brief comparison between the frameworks

So let’s start with the most basic question…

What is Keras?

Keras is an open-source library written in Python for building neural network models. It lets us run deep learning experiments quickly while remaining user friendly, modular and extensible.

Keras was developed by Francois Chollet as part of the research project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System). Chollet is an engineer with Google and also the creator of the Xception deep neural network model.

Popularity of Keras as a deep learning framework

Keras is distributed under the permissive MIT licence, which makes it freely available even for commercial use. This can be considered one of the big reasons for its popularity.

The image below shows the Google Trends search volume (in thousands) for some popular deep learning frameworks.

Popularity of Keras in deep learning

TensorFlow, being the default Python library for all kinds of basic tensor operations, has always been popular in the machine learning domain. So has the scikit-learn library, which is well known for its many shallow learning algorithms.

But you can see that Keras is also a popular keyword in spite of its specialized use in neural models. We have to keep in mind that deep learning itself is a very specialized field, used mainly by researchers in specific industries.

Keras was conceived for researchers conducting deep learning experiments in academia and in research organizations such as NASA, NIH and CERN. Researchers at CERN’s Large Hadron Collider have also used Keras in their search for unknown particles through deep learning.

Besides famous brands like Google, Uber and Netflix, many startups also depend heavily on Keras for the deep learning applications in their R&D activities.

On the machine learning competition website Kaggle, top-5 winning teams have used Keras to solve the challenges, and the number is steadily increasing.

Important features of Keras

  • The same code runs on both CPU and GPU, and Keras is compatible with both Python 2.7 and 3.5.
  • Deployment is easy thanks to the full deployment capabilities of the TensorFlow platform.
  • Keras models can run directly in the browser by exporting them to JavaScript, and on iOS, Android and other devices through TensorFlow Lite.
  • Be it a convolutional network or a recurrent network, Keras can handle both, and even a combination of them.
  • Keras is popular in scientific research because arbitrary research ideas are easy to implement: it offers low-level flexibility while providing high-level convenience that speeds up experimentation cycles.
  • Tasks related to computer vision or sequence processing are fast and smooth to implement in Keras.
  • Keras can build almost any kind of deep learning model, be it a generative adversarial network, a neural Turing machine or a multi-input/multi-output model.

The modular functionality of Keras

Keras is designed to handle only the high-level model-building part of the deep learning ecosystem. It leaves the lower-level work to a well-optimized tensor library: all the basic tensor operations and transformations are handled by a specialized back end such as Google’s TensorFlow library.

The modular functionality of Keras

In this modular approach, Keras can also use two other back ends: Theano and Microsoft’s CNTK. These are very popular deep learning execution engines, and they are not tied to Keras; each can also be used on its own. You can switch to any of them at any time if you find that it is better or faster for a particular task.

All three of these libraries enable Keras to run on both the CPU and the GPU.

Here are the three back ends in brief.
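As a hedged illustration of what switching can look like: in the multi-backend releases of Keras, the back end could be selected with the KERAS_BACKEND environment variable (or the "backend" field of the ~/.keras/keras.json file). The sketch below only sets that variable. It assumes a multi-backend Keras installation with the chosen engine installed, which is why the import itself is left as a comment.

```python
import os

# Assumption: a multi-backend Keras release is installed.
# Accepted values there were "tensorflow", "theano" and "cntk".
os.environ["KERAS_BACKEND"] = "theano"

# Keras reads the variable at import time, so the import must come afterwards:
# import keras

print(os.environ["KERAS_BACKEND"])
```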


Keras with TensorFlow back-end

TensorFlow is a deep learning library developed by Google to handle all tensor operations smoothly. Built on TensorFlow 2.0, Keras has the potential to provide an industry-standard framework that scales to large GPU setups.

In 2017, Google decided to support Keras in TensorFlow’s core library. Francois Chollet’s view was that Keras is built as an interface for high-level deep learning model building, not as a fully functional machine learning framework.

When Keras code runs on the CPU, TensorFlow itself wraps a low-level library called Eigen.


As the project itself describes it, “Eigen is a C++ template library for performing linear algebra: matrices, vectors, numerical solvers and related algorithms”.


When TensorFlow runs on the GPU, the library it wraps is the NVIDIA CUDA Deep Neural Network library (cuDNN). It is a “GPU-accelerated library of primitives for deep neural networks”, as mentioned on the NVIDIA developer website.

The cuDNN library provides highly tuned implementations of standard deep learning routines such as forward and backward convolution, pooling, normalization and activation layers.


Theano in back end

Theano is also a Python library, specialized in mathematical operations such as matrix manipulation and transformations, with tight integration with NumPy. It was developed by the MILA lab of the University of Montreal, Quebec, Canada. It can also run on both CPU and GPU.

Theano builds symbolic graphs of the computation, from which the gradients needed for gradient descent are calculated.


Keras with Microsoft’s Cognitive Toolkit as back-end

This is Microsoft’s Cognitive Toolkit, formerly known as the Computational Network Toolkit. It is a free toolkit that is easy to use but gives commercial-grade performance in training deep learning algorithms.

In 2017, soon after Google decided to support Keras in TensorFlow’s core library, Microsoft also decided to make CNTK available as a back end of Keras.

CNTK helps to “describe the neural network as a series of computational steps via a directed graph”.

Of these three back ends, TensorFlow is the most frequently used because of its robustness. It is highly scalable too.

A comparison between the frameworks

To conclude this article, I will summarize the features of Keras and how it differs from the other popular frameworks. The four frameworks Keras, TensorFlow, Theano and CNTK are compared in the table below.

We already know the basic difference between them: from the point of view of deep learning execution, Keras is the implementation interface whereas the others act as back ends. But there are some other key differences too. Let’s go through them point by point.

| Feature | Keras | TensorFlow | Theano | CNTK |
|---|---|---|---|---|
| Licence | MIT licence | Apache 2.0 | BSD | MIT |
| Developed by | Francois Chollet | Google Brain | University of Montreal, Quebec, Canada | Microsoft Research |
| Written in | Python | C++, CUDA, Python | Python | C++ |
| Platforms | Linux, MacOS, Windows | Linux, MacOS, Windows, Android | Cross platform | Linux, MacOS, Windows |
| API | High-level API | Low-level API | Low-level API | Low-level API |
| Use | Used in experimentation with deep learning | Popularly used for all kinds of machine learning | Used for multidimensional matrix operations | CUDA support and parallel execution feature |
| Deep learning type | Used for all deep learning algorithms | Supports reinforcement learning and other algorithms | Not very smooth performer in AWS | A deprecated deep learning framework |
| OpenMP support | Yes (with Theano as back end) | Yes | Yes | Yes |
| OpenCL support | Yes (with Theano, TensorFlow and PlaidML as back end) | Yes | Not yet, work in progress | No |
| Support for reinforcement learning | No | Yes | No | No |
| Performance | Quite fast with Theano and TensorFlow in back end | Optimized for big models, memory requirement may be higher | Run time and memory competitive, support for multiple GPUs; compile time is longer | Comparable to Theano and TensorFlow |
| Additional features | Support for Android, multi-GPU support | Android support, multi-GPU support, support for distributed windows OS | Multi-GPU support | Cross-platform support |

Difference between Deep Learning frameworks


This is all four frameworks at a glance. It is better not to use this table to rank the frameworks, as they do not work independently; rather, they work together and make up for each other’s weaknesses. In this way, we get a really powerful deep learning execution engine.

I hope this article has given you an introductory idea of Keras as a deep learning framework. This information is meant to make you feel comfortable before you take the first step.

As you go deeper into deep learning applications, you will discover more interesting facts, and the points presented here will make more sense to you.

If you found this article helpful, please let me know through the comments below. If I have missed anything, or if you have any questions, doubts or suggestions, please put them in the comments too. I would be glad to answer them.

A detailed discussion on tensors: why are they so important in deep learning?

This article is all about tensors, the basic data structure of deep learning. All inputs, outputs and transformations in deep learning are represented through tensors. Depending on the complexity of the data, tensors with different numbers of dimensions play the role of the data container.

So it goes without saying that to improve your deep learning skills you must be confident in your knowledge of tensors. You should be fluent with their properties and mathematical treatment. This article will introduce you to tensors. By the end of it, you will be familiar with the following topics:

  • What are tensors
  • Properties of tensors like dimension, rank, shape etc.
  • Use of tensors in deep learning
  • Real-life examples of tensor application

The importance of tensors can be understood from the fact that Google has built a complete machine learning library, TensorFlow, around them. So in this article I will try to clarify the basic idea of a tensor, the different types of tensors and their applications, with executable Python code.

Tensors with different dimensions

I will also try to keep it as simple as possible. The mathematical parts will be presented with the help of Python scripts, as this is much easier to follow for those with little or no mathematical background. Some basic knowledge of matrices will certainly help you learn quickly.

So let’s start the article with the most obvious question:

What is a Tensor?

A tensor is nothing but a container of data. It generalizes the arrays and matrices you may know from NumPy. In tensor terms, a matrix is a two-dimensional (2-D) tensor. In a similar way, a vector is a one-dimensional tensor, whereas a scalar is a zero-dimensional tensor.

An image has three dimensions: height, width and depth. So a 3-D tensor is required to store an image. Likewise, when there is a collection of images, another dimension for the number of images gets added, so we need a container with four dimensions; a 4-D tensor serves the purpose. To store videos, 5-D tensors are used.

Generally in neural networks we use tensors of up to four dimensions, but the number can go higher depending on the complexity of the data. Tensors can be thought of as a generalization of NumPy matrices to an arbitrary number of dimensions.

Scalar data

These are tensors with zero dimensions. Single numbers of types such as float32 or float64 are scalar data. A scalar has rank zero because it has zero axes. NumPy’s ndim attribute displays the number of axes of an array. See the following code applied to a scalar data structure.

Scalar data as tensors
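For readers who want to type it in, here is a minimal sketch of such a check, assuming NumPy is installed. The variable name x and the value 5 are my own illustrative choices.

```python
import numpy as np

# A single number wrapped in a NumPy array is a 0-D tensor
x = np.array(5)  # illustrative value

print(x)        # the stored value
print(x.ndim)   # number of axes (the rank)
```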

You can try these simple snippets and check the results. If you are just getting familiar with the Python interpreter, this can be a good start.

Vector data

These are one-dimensional (1-D) tensors, so the rank is one. It is often confusing to differentiate between an n-dimensional vector and an n-dimensional tensor. For example, consider the following vector:

\begin{bmatrix} 1 & 6 & 7 & 9 & 10 & 8 \end{bmatrix}

It is a six-dimensional vector with one axis, not a 6-D tensor. A 6-D tensor has six axes, with any number of entries along each axis.

A vector: 1-D tensor
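The same check can be sketched for the vector above, again assuming NumPy. The variable name vector is illustrative.

```python
import numpy as np

# The six-entry vector from the text
vector = np.array([1, 6, 7, 9, 10, 8])

print(vector.ndim)   # one axis, so this is a 1-D tensor
print(len(vector))   # six entries along that single axis
```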


Matrix data

These are 2-D tensors with two axes. A matrix has rows and columns, hence two axes and a rank of two. Again we can check this with the ndim attribute. Let’s take a NumPy matrix of shape (3,4), which means the matrix has 3 rows and 4 columns.

\begin{bmatrix} 2 & 3 & 3 & 9 \\ 4 & 10 & 5 & 8 \\ 4 & 6 & 9 & 2 \end{bmatrix}

So let’s check its rank in the same way as we did for the scalar and the vector:

A matrix: 2-D tensor
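A minimal sketch of that check for the matrix above, assuming NumPy. The variable name matrix is illustrative.

```python
import numpy as np

# The 3 x 4 matrix from the text: one inner list per row
matrix = np.array([[2, 3, 3, 9],
                   [4, 10, 5, 8],
                   [4, 6, 9, 2]])

print(matrix.ndim)  # two axes: rows and columns
```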

While you are writing the code, be extra cautious with the matrix input. Mismatched opening and closing brackets often cause errors.

Tensors with higher dimensions

As I mentioned at the beginning, the tensors we commonly use have up to four dimensions, but with video data the number can go up to five. Data structures with up to two dimensions are easy to understand; beyond that, they become a little difficult to visualize.

3D tensors

In this section we will discuss some high dimensional tensors and the way they store data. So, let’s start with 3-D tensors. Let’s consider the following tensor and try to identify its three dimensions.

A matrix with 3 axes

You can see it is actually a data structure containing three matrices, each with 3 rows and 4 columns. See this image to understand the shape of this tensor. Let’s create a variable to store the data and check its rank with the ndim attribute.

# High dimensional tensors

See the output below to understand the structure of a 3-D tensor. It is a collection of matrices. Thus, unlike a single matrix with two axes, a 3-D tensor has three axes.

3-D tensor
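The values of the 3-D tensor in the figure are not reproduced in the text, so the sketch below uses placeholder numbers. What matters is the structure: three matrices, each with 3 rows and 4 columns. The name tensor_3d is illustrative.

```python
import numpy as np

# Placeholder values 0..35 arranged as three 3 x 4 matrices
tensor_3d = np.arange(36).reshape(3, 3, 4)

print(tensor_3d)          # three stacked matrices
print(tensor_3d.ndim)     # three axes
print(tensor_3d[0].ndim)  # each slice along the first axis is a matrix
```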

4-D tensors

In the same way, if several such 3-D tensors are grouped together, another dimension is created, making the result a 4-D tensor. See the image for a hypothetical 4-D tensor, in which three cubes are grouped together. Such 4-D tensors are very useful for storing images for image recognition in deep learning.

In the same fashion we can have tensors of even higher dimensions. Though tensors of up to 4 dimensions are the most common, 5-D tensors are sometimes used to store videos. Theoretically there is no limit on the number of dimensions; any number n of dimensions can be used to store data in an organized manner.
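Grouping 3-D tensors into a 4-D tensor can be sketched with NumPy’s stack function. The cube here holds placeholder values and is my own small example, not the tensor from the figure.

```python
import numpy as np

cube = np.arange(36).reshape(3, 3, 4)   # one 3-D tensor with placeholder values
stacked = np.stack([cube, cube, cube])  # group three of them along a new first axis

print(cube.ndim, stacked.ndim)  # the grouping adds one axis
print(stacked.shape)
```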

5-D tensors

These tensors are needed when the data has yet another dimension. Video data is an ideal example where 5-D tensors are used.

Take a 5-minute video of 1080p HD resolution: what will be the dimensions of its data structure? Let’s calculate in a simple way. Each frame is 1080 x 1920 pixels. The duration of the video is 5 x 60 = 300 seconds.

Now if the video is sampled at 10 frames/second, the total number of frames is 300 x 10 = 3000. Suppose the colour depth of the video is 3. So for this video the tensor has 4 dimensions, with shape (3000, 1080, 1920, 3).

So a single video clip is a 4-D tensor. Now if we want to store multiple videos, say 10 video clips with 1080p HD resolution, we need a 5-D tensor. With the clips along the first axis, the shape of this 5-D tensor is (10, 3000, 1080, 1920, 3).

This is an enormous amount of data. If we fed such huge data directly into deep learning, the training process would be never-ending. So this kind of data needs size reduction and several preprocessing steps before it is used as input to a neural network.
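To put a number on “enormous”, here is a small back-of-the-envelope sketch. It assumes each value is stored as a single byte (uint8), which is my assumption and not something stated above.

```python
import math

frames = 5 * 60 * 10                  # 300 seconds sampled at 10 frames/second
clip_shape = (frames, 1080, 1920, 3)  # one video clip as a 4-D tensor

values_per_clip = math.prod(clip_shape)
# Assumption: one byte per value (uint8)
gigabytes_per_clip = values_per_clip / 1e9
gigabytes_for_ten_clips = 10 * gigabytes_per_clip

print(values_per_clip)
print(round(gigabytes_per_clip, 2))
print(round(gigabytes_for_ten_clips, 1))
```

Under that assumption a single clip already needs roughly 18.7 GB and ten clips about 187 GB, far more than the memory of a typical workstation.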

Shape of tensors

The shape states the length of each axis of a tensor. It is a tuple of integers, one per dimension.

A vector has only one axis/dimension so the shape is just a single element. The vector we used here as an example has 6 elements so its shape is (6,).

The matrix we discussed above as a 2-D tensor has the shape (3,4), as it consists of 3 rows and 4 columns.

Likewise, for a 3-D tensor the shape tuple contains the lengths of all three axes. The example we took here has shape (3,3,4). See the image below to visualize it.

Shape of a 3-D tensor
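These shapes can be checked with NumPy’s shape attribute. The sketch below reuses the vector and matrix from earlier; the 3-D tensor is filled with zeros as a placeholder, since only its shape matters here.

```python
import numpy as np

vector = np.array([1, 6, 7, 9, 10, 8])
matrix = np.array([[2, 3, 3, 9], [4, 10, 5, 8], [4, 6, 9, 2]])
tensor_3d = np.zeros((3, 3, 4))  # placeholder values

# Print the rank and the shape tuple of each tensor
for name, t in [("vector", vector), ("matrix", matrix), ("3-D tensor", tensor_3d)]:
    print(name, t.ndim, t.shape)
```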

Again, the 4-D tensor we took as an example above has the shape (3,3,7,4), as it groups three separate cubes together. The image below presents a higher-dimensional figure to help understand its dimensions and shape.

Shape of a 4-D tensor

Real life examples of tensors as data container

By now I think the basics of tensors are clear to you. You know how a tensor stores data as a data structure. We took small examples of commonly used data structures as tensors.

But in general the tensors used for real-life problem solving are much more complex. Deep learning for image recognition often deals with thousands of images stored in a database. So which data structure should we use to handle such complex data? Here tensors come to our rescue.

The MNIST data set with handwritten digits

Let’s take a real-life example of such an image database. We will take the same MNIST data we used for handwritten digit recognition in an earlier blog post. Its training set stores 60000 images of handwritten digits, held in a 3-D tensor with shape (sample_size, height, width).

Let’s load the database. It is a default database in the Keras library.

#Loading the MNIST data set 
from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

The Python code we used earlier in this article can be applied here too, to check the number of axes of the training data set.

# Checking the axes of train_images
print(train_images.ndim)

The above line of code returns the rank as 3. Next we check the shape of the data structure storing the whole training set.

# Checking the shape of the tensor
print(train_images.shape)

The shape it returns is (60000,28,28), which means it holds 60000 images, each of size 28×28 pixels.

Let’s look at an image from the data set. As I have mentioned, the data set contains images of handwritten digits and is a classic example data set for feature recognition. Although it is essentially a solved data set, many deep learning enthusiasts still use it to test the efficiency of their new models.

So here is the code for displaying the 10th digit from this data set. We will use the pyplot module from the matplotlib library.

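# Printing the 10th image from the MNIST data set
import matplotlib.pyplot as plt
plt.imshow(train_images[9], cmap=plt.cm.binary)  # index 9 is the 10th image (zero-based)
plt.show()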

The output of the above code will be 10th image of a handwritten digit. See below if you recognize it 🙂

Sample from MNIST data set

Stock price data

In the Indian stock market the price of each stock changes every minute. A stock’s high, low and final price for each minute of a trading day are very important data for traders.

See, for example, the candlestick chart of a particular stock on a particular trading day. The chart shows the stock price behaviour from 10:00 AM to 3:00 PM. That means a total of 5 hours of the trading day i.e. 5 x 60=300 minutes.

So the shape of this data structure will be (300,3). That makes it a 2-D tensor. This tensor stores a stock’s high, low and final price for each minute of a particular day.

Candlestick chart of stock prices

Now what if we want to store the stock’s price for a whole week? Then the shape becomes (trading_week, minutes, stock_price); if there are 5 trading days that week, it is (5,300,3). That makes it a 3-D tensor.

And what if we want to store the prices of a number of stocks for a particular week, say 10 different stocks? Another dimension gets added. It becomes a 4-D tensor with shape (trading_week, minutes, stock_price, stocks_number), i.e. (5,300,3,10).

Now think of mutual funds, which are collections of stocks. If we consider someone’s portfolio holding different mutual funds, then to store the high, low and final price of all the stocks of that portfolio for a whole trading week we need a 5-D tensor. The shape of such a tensor will be (trading_week, minutes, stock_price, stocks_number, mutual_funds).
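The way these shapes build up can be sketched with empty NumPy arrays. The arrays below hold zeros instead of real prices, and the variable names are illustrative.

```python
import numpy as np

minutes, prices = 300, 3                # 10:00 AM to 3:00 PM; high, low and final price

one_day = np.zeros((minutes, prices))   # 2-D tensor for one stock on one day
one_week = np.stack([one_day] * 5)      # 5 trading days along a new first axis
ten_stocks = np.stack([one_week] * 10, axis=-1)  # 10 stocks along a new last axis

print(one_day.shape)
print(one_week.shape)
print(ten_stocks.shape)
```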


So tensors and their different properties should now be clear to you. I know there are some new terms that may seem a little confusing at first, so I have summarised them once again in the table below for a quick revision.

| Type of tensor | Uses | Rank/axes | Shape |
|---|---|---|---|
| 0-D Tensor | Storing single value | 0 | Single element |
| 1-D Tensor | Storing vector data | 1 | Length of an array |
| 2-D Tensor | Storing data in matrices | 2 | Rows, columns / samples, features |
| 3-D Tensor | Time series, single image | 3 | Width, height, colour depth (in case of an image) / samples, time lags, features (in case of time series) |
| 4-D Tensor | Storing images | 4 | Width, height, colour depth, no. of images / samples, channels, height, width |
| 5-D Tensor | Storing videos | 5 | Sample, frame, height, width, channels |

Different types of tensors

Please also refer to the articles mentioned in the reference for further reading. They are very informative and will help you brush up your knowledge.

I hope that you have found the article helpful. If you have any questions or doubts regarding the topic, please put them in the comments below. I would be glad to answer them.

Follow this blog for forthcoming articles where I am going to discuss more advanced topics on tensors and deep learning in general.

Also, if you liked the post, subscribe to the blog so that you get a notification whenever a new post is added.


How to develop a deep learning model for handwritten digit recognition?

This article describes how to develop a basic deep learning neural network model for handwritten digit recognition. I will take a very basic example to demonstrate the potential of a deep learning model. Through this example you will get an elementary idea of the following key points:

  • How deep learning performs the task of image recognition
  • The process of building a neural network
  • Basic concepts of neural network layers
  • Testing and validating its performance
  • Creating some diagnostics for its evaluation

Even if you have no prior exposure to programming or statistical/mathematical concepts, you will not face any problem understanding the article.

Deep learning models have their most prominent use in feature recognition, especially in image recognition, voice recognition and many other fields. In our daily lives we frequently see biometric identification of individuals, pattern recognition in smartphones, fingerprint scanners, voice-assisted search in digital devices, chatbots answering questions without any human intervention, weather prediction tools and so on.

So it is very pertinent to ask: “how can we develop a deep learning model which can perform a very simple image recognition task?” Let’s explore it here.

Develop a deep learning model

All of the applications mentioned above make use of deep learning, which learns from historical labelled data. The model uses this labelled data to learn the pattern and then predicts a result for a new input as accurately as possible. So it is a learning process, and it is “deep” because of the layers in it. We will see here how a deep learning model learns the pattern to match handwritten digits.

I will discuss every section along with its purpose, and I hope you will find it interesting to see how deep learning works. I have chosen this example because of its popularity. You will find many book chapters and blogs on it too.

At the end of this article, under “reference”, I have listed all the sources I referred to when writing the Python code for this handwritten digit recognition model. You can refer to them to further enrich your knowledge; all of them are very good sources of information.

By the time you finish reading the article you will have a basic knowledge of a neural network, how it works and its basic components. I will be using the popular Modified National Institute of Standards and Technology data set, or MNIST data for short.

The MNIST data

It is an image data set of handwritten digits from 0 to 9. The greyscale images have a resolution of 28×28 pixels; there are 60000 images for training and 10000 images for testing. It is a very common data set for testing machine learning models.

In this article we will build the model by writing Python code and, more importantly, discuss the particulars and components of a neural network model, so that in the process of building the model you get a clear understanding of each step and develop the confidence to apply it further.

So let’s start coding:

Calling the basic libraries to develop the deep learning model

These are the very basic python libraries required for different tasks.

# Importing required basic libraries
from numpy import mean
from matplotlib import pyplot
from sklearn.model_selection import KFold
from numpy import std
from keras import models
from keras import layers

Loading the data

The MNIST data is a default data set in the Keras library, so you only need to load it with the following code. The data is stored in four NumPy arrays. You can check the array types to confirm.

#Loading the data set (MNIST data is pre included in Keras)
from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

The following lines of code produce a sample set of images from the MNIST data set.

# Example of first few images of MNIST data set
for i in range(9):
	# define subplot
	pyplot.subplot(330 + 1 + i)
	# plot raw pixel data
	pyplot.imshow(train_images[i], cmap=pyplot.get_cmap('gray'))
# show the figure
pyplot.show()

See the grey-scale images below to get an idea of the handwritten digit images.

First few handwritten images from MNIST data set

The training and testing data set

The following code describes the training data set. It consists of 60000 samples from the whole MNIST data set. The images are 28×28 pixels in size. The code also prints the labels of the training samples.

#The training data set
print("Training data shape: ",train_images.shape)
print("Size of training data:",len(train_labels))
print("Train labels:",train_labels)
Description of training data

In the same way, the following code describes the testing data set, which has 10000 samples.

#The testing data set
print("Testing data shape: ",test_images.shape)
print("Size of test data:",len(test_labels))
print("Test labels:",test_labels)
Description of test data

Building the model

In this deep learning neural network model, notice the specification of the layers. Layers are the main component of any neural net model. The neural network performs a data distillation process through them: the layers act as sieves that progressively refine the output of the network.

The first layer receives the raw inputs and passes them to the next layer for higher-order feature recognition. The next layer does the same with more refinement to identify more complex features. Unlike classical machine learning, deep learning is capable of learning which features to identify at each step.

Through this process the model output gradually comes closer to the desired output. The feature extraction and recognition process continues over several iterations until the model’s performance is satisfactory.

network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))

Dense layers

The neural net here consists of dense layers, which means the layers are fully connected. There can be many layers depending on the complexity of the features to be identified. In this case, only two such layers have been used.

The first layer has 512 units. Its input shape is 28*28 = 784, which is the number of pixels in each grey-scale image after flattening. The second layer is a 10-way softmax layer, which means it returns an array of 10 probability scores that sum to 1. Each score is the probability that the handwritten image belongs to one of the 10 digits from 0 to 9.
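To make the softmax idea concrete, here is a small NumPy sketch of the softmax function applied to ten made-up scores. The scores are purely illustrative and do not come from the trained network.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the maximum for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Ten illustrative raw scores, one per digit 0..9
scores = np.array([1.2, 0.3, -0.8, 2.5, 0.0, -1.1, 0.7, 3.1, -0.4, 0.9])
probs = softmax(scores)

print(probs.round(3))
print(probs.sum())            # the ten probabilities add up to 1
print(int(np.argmax(probs)))  # the digit with the highest probability
```

The predicted digit is simply the position of the largest probability.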

Data preprocessing

A neural network generally trains better on small input values, for example in the interval [0,1]. So the data here needs some preprocessing before being used by the deep learning model. It can be done by dividing each value by the maximum value of the variable. But to do that we need to convert the data type from unsigned integers to floats.

The transformation used here converts the data type from uint8 to float32, and the value range changes from [0,255] to [0,1]. The images are also reshaped from 28×28 arrays into vectors of 784 values, matching the input shape of the first layer.

# Flattening the images and scaling the pixel values to the interval [0,1]
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
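If you want to see the effect of this preprocessing without loading MNIST, the sketch below applies the same two steps to four randomly generated stand-in images. The random data and the names images and flat are my own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Four stand-in "images" with pixel values 0..255, like the MNIST arrays
images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

# Same two steps as above: flatten each image, then scale to [0,1]
flat = images.reshape((4, 28 * 28)).astype("float32") / 255

print(images.shape, images.dtype)
print(flat.shape, flat.dtype)
print(flat.min(), flat.max())
```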

Compilation of the model

Compiling the model involves some critical terms: “loss”, “metrics” and “optimizer”, each with a special function. The loss function used here is categorical crossentropy. It estimates the error the model produces and calculates the loss score. This function is well suited to multi-class classification, as in this case with 10 digit classes.

Depending on this loss score, the optimizer adjusts the weights of the layers. It is a kind of parameter adjustment of the model. This process goes on until the model achieves an acceptable level of estimation.
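The behaviour of the loss score can be sketched in a few lines of NumPy. This is a simplified single-sample version of categorical crossentropy written for illustration, not the Keras implementation; the two prediction vectors are made up.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one sample: minus the log of the probability given to the true class."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(-np.sum(y_true * np.log(y_pred)))

# The true digit is 3, written as a one-hot vector
y_true = np.zeros(10)
y_true[3] = 1.0

confident = np.full(10, 0.01)  # 0.91 on the correct class, 0.01 elsewhere
confident[3] = 0.91
unsure = np.full(10, 0.1)      # every digit equally likely

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)

print(round(loss_confident, 4))
print(round(loss_unsure, 4))
```

A confident correct prediction gives a loss close to zero, while a prediction that spreads the probability evenly gives a much larger loss. This is the signal the optimizer uses to adjust the weights.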

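#Compilation step
network.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])  # the optimizer choice is an assumption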

The metric the model uses here is “accuracy”, the fraction of images that are classified correctly. The higher the accuracy, the better the model.

# Encoding the labels into categories
from keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
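What this encoding produces can be mimicked in plain NumPy. The sketch below one-hot encodes five illustrative labels; it is only meant to show the resulting layout, not to replace to_categorical.

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9])  # illustrative digit labels

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(10, dtype="float32")[labels]

print(one_hot.shape)  # one row per label, one column per digit class
print(one_hot[0])     # a single 1 at position 5
```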

Model fitting

Now it’s time to fit the model to the training data. The number of epochs is 5 and the batch size is 128, which means the model passes over the whole training set 5 times, updating its weights after every batch of 128 samples.

# Fitting the model
network.fit(train_images, train_labels, epochs=5, batch_size=128)
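As a quick piece of arithmetic on those two settings, the sketch below counts how many weight updates such a run involves. It assumes the usual behaviour where the last batch of an epoch is simply smaller than the rest.

```python
import math

n_samples, batch_size, epochs = 60000, 128, 5

steps_per_epoch = math.ceil(n_samples / batch_size)        # batches needed to see every sample once
last_batch = n_samples - (steps_per_epoch - 1) * batch_size  # size of the final, smaller batch
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)
print(last_batch)
print(total_updates)
```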

In the output we can see two metrics, loss and accuracy, which tell us how the model is performing on the training data. From the last epoch we can see that the final accuracy and loss the model achieves are 0.9891 and 0.0363 respectively, which is quite impressive.

Model fitting

Testing the model

We achieved a training accuracy of 0.9891 for the model. Let’s see how the model performs with the testing data.

# Testing the model with testing data
test_loss, test_acc = network.evaluate(test_images, test_labels)
print('test_acc:', test_acc)

The test accuracy is 0.9783. That is quite good, although there is a drop in accuracy compared with training. When a model performs better on the training data than on the testing data, it may be an indication of overfitting.

Accuracy of the test

Evaluating the model

Now we know how to develop a deep learning model but how can we evaluate it? This step is as important as building the model.

We are using the MNIST data here as a practice example for the application of the network model. So, although the MNIST data is an old data set and effectively solved, we would like to evaluate our model’s performance with this data. To do that, we will apply the k-fold cross-validation process.

In this evaluation process, the data set is divided into k groups, or folds. Then the model fitting process is repeated k times, once for each group, hence the name k-fold cross-validation. In each round, the data in that particular group is used to test the model and the rest of the data is used as training data.
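The splitting itself can be sketched in a few lines of plain Python. This is only an illustration of the idea, with a hypothetical helper `k_fold_indices`; it is not how scikit-learn implements it.

```python
import random

def k_fold_indices(n_samples, k, seed=1):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        # the last fold takes any leftover samples
        stop = n_samples if i == k - 1 else start + fold_size
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx

folds = list(k_fold_indices(60000, 5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

With 60000 samples and k = 5, every round trains on 48000 samples and tests on the remaining 12000.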

To perform this evaluation task we will develop three separate functions: one for performing the k-fold cross-validation, one for making plots that show the learning pattern of all the validation steps, and one for displaying the accuracy scores.

Finally, all three functions will be called from the evaluation function. The neural network model developed above is used in this evaluation process.

Writing the function for k-fold cross validation

This is the first function, which performs the validation process. We use 5-fold validation here: fewer folds will not be enough to validate the model, while too many will take a long time to complete, so it is a kind of trade-off between the two.

The KFold class from the scikit-learn library will automatically shuffle the samples into different groups and split them to form the training and test data.

# k-fold cross-validation
from sklearn.model_selection import KFold

def evaluate_nn(X, Y, n_folds=5):
	scores, histories = list(), list()
	kfold = KFold(n_folds, shuffle=True, random_state=1)
	# data splitting
	for train_ix, test_ix in kfold.split(X):
		# train and test
		train_images, train_labels, test_images, test_labels = X[train_ix], Y[train_ix], X[test_ix], Y[test_ix]
		# fit model (the same `network` object built above is reused in every fold)
		history = network.fit(train_images, train_labels, epochs=5, batch_size=128, validation_data=(test_images, test_labels), verbose=0)
		# model evaluation
		_, acc = network.evaluate(test_images, test_labels, verbose=0)
		print('> %.3f' % (acc * 100.0))
		# scores
		scores.append(acc)
		histories.append(history)
	return scores, histories

Creating plots for learning behaviour

Here we create a function to plot the learning pattern through the evaluation steps. There will be two separate plots: one for the loss and the other for the accuracy.

from matplotlib import pyplot

def summary_result(histories):
  for i in range(len(histories)):
    # creating plot for crossentropy loss
    pyplot.subplot(2, 2, 1)
    pyplot.title('categorical_crossentropy Loss')
    pyplot.plot(histories[i].history['loss'], color='red', label='train')
    pyplot.plot(histories[i].history['val_loss'], color='green', label='test')
    # creating plot for model accuracy
    pyplot.subplot(2, 2, 2)
    pyplot.title('Model accuracy')
    pyplot.plot(histories[i].history['accuracy'], color='red', label='train')
    pyplot.plot(histories[i].history['val_accuracy'], color='green', label='test')
  # display the figure once all folds have been drawn
  pyplot.show()

Printing accuracy scores

This function prints the accuracy scores. It also calculates summary statistics, namely the mean and standard deviation of the accuracy scores, and finally draws a Box-Whisker plot to show the distribution.

from numpy import mean, std
from matplotlib import pyplot

def acc_scores(scores):
	# printing accuracy and its summary
	print('Accuracy: mean=%.3f std=%.3f, n=%d' % (mean(scores)*100, std(scores)*100, len(scores)))
	# Box-Whisker plot
	pyplot.boxplot(scores)
	pyplot.show()

Finally the evaluation function

This is the final evaluation module where we will use all the above functions to get the overall evaluation results.

def evaluation_result():
	# model evaluation
	scores, histories = evaluate_nn(train_images, train_labels)
	# train and test curves
	summary_result(histories)
	# model accuracy scores
	acc_scores(scores)

# run the function
evaluation_result()

Here 5-fold cross-validation has been used. As the whole data set contains 60000 samples, each of these k groups will have 12000 samples.

Evaluation results

See below the accuracy for all five evaluation steps. For the first step the accuracy is 98%, and for the remaining four the accuracy scores are above 99%, which is quite satisfactory.

Accuracy score for k-fold cross validation

We have also produced some diagnostic curves to display the learning efficiencies in all five folds. The following plots show the learning behaviour by plotting the loss and accuracy of the training processes in each fold. The red line represents the loss/accuracy of the model with the training data set whereas the green line represents the test result.

You can see that except for one case the test and training lines have converged.

Plots to show learning behaviour through cross validation steps

The Box-Whisker plot below represents the distribution of the accuracy scores across all folds for ease of understanding.

Distribution of the accuracy score over all the evaluation steps

The mean and standard deviation of all the accuracy scores are as below:

Mean and standard deviation of accuracy


So, now you know how to develop a very basic deep learning model for image recognition. You have just started with deep learning models, and many things may be new to you, especially if you don’t have any programming background. But not to worry: with more practice, things will improve.

You can reproduce the result using the code from here. But it is completely okay if you don’t understand all the steps of the coding part. Take it as a fun activity of Python programming as well as deep learning.

A very necessary step in learning neural networks, or in general any new thing, is to immerse yourself in that field and develop a deep interest in it. I hope that this blog will serve this purpose. It is not meant to overwhelm you with the complexity of the technique, but rather to demonstrate its potential in the simplest and most practical way.

So, what are you waiting for? Start writing your first deep learning program taking the help of this article. Make small changes and see how the output changes.

Don’t get frustrated if it gives an error (which will be very common when you first start programming). Try to explore the reason for the error. Use the StackOverflow site to learn about the error. You will find an answer there for almost all of your queries. And this is the most effective way to learn.

Please, don’t forget to comment below if you find the article useful. It will help me to plan more such articles.



Machine learning vs. data science: how are they different?

Machine learning vs data science

Machine learning and data science are two major keywords of recent times that almost all fields of science depend on. If data science is indispensable for exploring the knowledge hidden in the data, then machine learning is what drives progress through feature engineering. But the question is: are they very different? In this article, these two fields will be discussed point by point, showing where they differ and whether there are any similarities.

The Venn Diagram of sciences

I got a good representation of how data science overlaps with the machine learning domain through a Venn diagram from this website. Drew Conway proposed this concept in 2010.

Venn diagram showing relation between data science and machine learning

Now with this Venn diagram, the association between all these fields is pretty clear. The lowermost circle essentially indicates the domain knowledge of a particular field. For example, it may be agricultural crop production or population dynamics. A data scientist should know about their particular domain besides having core knowledge of programming and statistics/mathematics.

Further, you can see that data science is common to all three domains, whereas machine learning lies in the intersection of statistics/mathematics knowledge and hacking skill. The major difference between the two lies here. Data science, being a broader concept, requires special subject knowledge for analysis, whereas machine learning is a more coding and programming oriented field.

Let’s dive into a more elaborate discussion of these differences…

Domain differences

To start with, let’s be clear about their domains. Data science is a much bigger term. It comprises multiple disciplines like information technology, modelling and business management. Machine learning, in comparison, is a more specific term within data science, where the algorithm learns from the data.

Unlike data science, machine learning is more practical than theoretical. Data science has a much more extensive theoretical base and is amenable to mathematical analysis. Machine learning, on the other hand, is mainly computer-program based and needs coding skills.

Let’s first discuss these two fields.

Machine Learning

As we have seen in the above Venn Diagram, data science and machine learning have common uses. Data science uses the tools of machine learning to study transactional data for useful prediction. Machine learning helps in pattern discovery from the data.

Machine learning is actually learning from the data. The historical data trains the machine learning algorithms to make accurate predictions. Such a learning process is called supervised learning.

There are situations where no such labelled training data is available. For these, there are machine learning algorithms which work without labelled examples. This type of machine learning is known as unsupervised machine learning. The results here are usually less accurate than in the supervised case, but the situation is also different.

Another kind of machine learning is known as reinforcement learning. It is one of the most advanced and popular kinds of machine learning. Here, too, the training data is absent, and the algorithm learns from its own experience.

Deep learning is again a special field of machine learning. Let’s briefly discuss it too.

Deep learning

Deep learning is a subfield of Machine learning which is again a subfield of Artificial Intelligence. In this context deep learning deals with the data as machine learning does; the difference lies in the learning process. Scalability is also a point where these two processes are different from each other. 

Deep learning is an especially superior method when the data in hand is very vast. It is very efficient at taking advantage of a large data set. In conventional machine learning or regression models, the accuracy does not increase beyond a certain level; a deep learning algorithm, by contrast, goes on improving the model as it is trained with more and more data.

The deep learning process is a black box method. That means we only see the inputs and the output. What goes on in between, and how the network works, remains obscure.

The name deep learning actually refers to the hidden layers of the training process. The backpropagation algorithm takes the feedback signal from the output to adjust the weights used in the hidden layers and refines the output in the next cycle. This process goes on until we get a satisfactory model.
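To illustrate the feedback loop described above, here is a minimal sketch of backpropagation for the smallest possible network: one input, one hidden unit and one output. The weights, input, target and learning rate are hypothetical values chosen only for illustration.

```python
import math

def forward(w1, w2, x):
    """One hidden unit with tanh activation, followed by a linear output."""
    h = math.tanh(w1 * x)
    return h, w2 * h

def loss(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2

def backprop(w1, w2, x, target):
    """Gradients of the loss with respect to w1 and w2 via the chain rule."""
    h, y = forward(w1, w2, x)
    d_y = y - target                 # dL/dy, the feedback signal at the output
    d_w2 = d_y * h                   # dL/dw2
    d_h = d_y * w2                   # signal passed back to the hidden unit
    d_w1 = d_h * (1 - h ** 2) * x    # dL/dw1, using tanh'(z) = 1 - tanh(z)^2
    return d_w1, d_w2

w1, w2, x, target, lr = 0.8, -0.4, 1.5, 0.5, 0.1
before = loss(w1, w2, x, target)
g1, g2 = backprop(w1, w2, x, target)
w1, w2 = w1 - lr * g1, w2 - lr * g2   # one weight update
after = loss(w1, w2, x, target)
print(before, after)
```

A single update already lowers the loss; training repeats this cycle many times over many samples.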

Data science

We can consider data science as a bridge between the traditional statistical and mathematical sciences and their application to solving real-world problems. The theoretical knowledge of basic sciences often remains unused. Data science makes this knowledge applicable to practical problems.

More lightly, we can say that a data scientist must have more programming skill than most scientists and more statistical skill than a programmer. No surprise that the mere mention of data science in anyone’s CV makes them eligible for an enhanced pay package.

Since almost all organizations are generating data at an exponential rate, they need data scientists to get meaningful insights out of it. Moreover, after the explosion of internet users, the data generated online is enormous. Data science applies data modelling and data warehousing to keep track of this ever-growing data.

Necessary skills to be a data scientist

A data scientist needs to be proficient in both theoretical concepts and programming languages like R and Python. Only a person with a good understanding of the underlying statistical concepts can develop a sound algorithm for their implementation.

But a data scientist’s job does not end here. Knowledge of these two core subjects is essential, no doubt. But to become a successful data scientist, a person must provide a complete business solution. When an organisation appoints a data scientist, they are supposed to analyse the data to gain insights into potential business opportunities and provide a roadmap.

So, a data scientist should also possess knowledge of the particular business domain and communication skill. Without effective communication and result interpretation, even a good analytical report may lead to a disappointing result. So none of these four pillars of success is less important.

The four pillars of data science

I got a good representation of these four pillars of data science through a Venn diagram from this website; it was originally created by David Taylor, a biotechnologist, in his article “Battle of Data Science Venn Diagrams”.

Four pillars of data science

These four different streams are considered as the pillars of data science. But why so? Let’s take real-world examples to understand how data science plays an important role in our daily lives.

Example 1: online shopping

Think about your online shopping experience. Whenever you log in to your favourite online shopping platform, you get deals on items you like. Items are organized according to your interests. Have you ever thought about how on earth the website does that?

Every time you visit the online retail website, search for things of interest or purchase something, you generate data. The website stores the historical data of your interaction with the shopping platform. If anyone with data science skills analyses this data properly, they may know your purchase behaviour even better than you do.

Example 2: Indian Railways

Indian Railways is the fourth-largest railway network in the world. Every day it operates thousands of trains, on which crores of passengers travel across the country. It has a track length of over 70,000 km.

Such a vast network generates a huge amount of data every day. The ticket booking system, train operation, biometrics, crew management, train schedules: in every aspect, the data generated is big data. And the historical data is no less than a gold mine of information on Indian passengers’ travel trends over the years.

Applying data science to this big data reveals very important information that enables the authority to take accurate decisions, such as in which season there is a rush of passengers and additional trains need to run, which routes are profitable, when to run special trains, and many more.

So in a nutshell, the main tasks of data science are:

  • Filtering the required data from big data
  • Cleaning the raw data to make it amenable to analysis
  • Data visualization
  • Data analysis
  • Interpretation and valid conclusion


As we discussed all of them at length, we came to know that in spite of many similarities, these two subjects have some differences in their application. So, now it’s time to point out the specific differences between machine learning and data science. Here they are:

Data science | Machine Learning
Based on extensive theoretical concepts of statistics and mathematics | Knowledge of computer programming and computer science fundamentals are essential
Generally performs various data operations | It is a subset of Artificial Intelligence
Gives emphasis on data visualization | Data evaluation and modelling is required for the feature engineering
It extracts insights from the data by cleaning, visualizing and interpreting data | It learns from data and finds out the hidden pattern
Knowledge of programming languages like R, Python, SAS, Scala etc. is essential | Knowledge of probability and statistics is essential
A data scientist should have knowledge of machine learning | Requires in-depth knowledge of programming skills
Popular tools used in data science are like Tableau, Matlab, Apache Spark etc. | Popular tools used in machine learning are like IBM Watson Studio, Microsoft Azure ML Studio etc.
Structured and unstructured data are the key ingredients | Here statistical models are the key players
It has its applications in fraud detection, trend prediction, credit risk analysis etc. | Image classification, speech recognition, feature extraction are some popular applications of machine learning
Difference between data science and machine learning


To end with, I would like to summarize the whole discussion by saying that data science is a comparatively newer field of science and in great demand across organizations, mainly because of its immense power to provide insights by analyzing big data which would otherwise have no meaning to the organisations.

On the other hand, machine learning is an approach which enables the computer to learn from the data. A data scientist should have knowledge of machine learning in order to unlock its full potential. So, the two fields do have some overlapping parts and complementary skills.

I hope the article contains sufficient discussion to help you understand the similarities as well as the differences between machine learning and data science. If you have any questions or doubts, please comment below. I would be glad to answer them.

Evolution of Deep Learning: a detailed discussion

Evolution of Deep learning

The evolution of deep learning has experienced many ups and downs over the last few decades. At times it rose to the peak of popularity and expectations were high; then setbacks in experimental trials suddenly created a loss of confidence and disappointment. This article will cover the journey of deep learning neural networks from their inception to their recent overwhelming popularity.

Background of Machine learning

This all started with very basic concepts of probabilistic modelling. These are very elementary statistical concepts from the school syllabus. This was the time even before the invention of the term machine learning. All models, functions were solely crafted by the human mind. 

Probabilistic models

These models were the first step towards the evolution of deep learning. They were developed with real-world problems in mind, where variables have relationships between them. Combinations of dependent and independent variables were used as inputs to these functions. These models are based on extensive mathematical theory and are more theoretical than practical.

Some popular probabilistic models are as below:

Naive-Bayes classification

It is basically Bayes’ theorem with a naive assumption, hence the name. The concept behind this modelling was established long ago, during the 18th century. The assumption here is that “all the features in the input data are independent”.

For example, suppose a data set has records of persons with or without diabetes and their corresponding sex, height and age. Naive Bayes will assume that there is no correlation between the features sex, height and age, and that they contribute independently towards the disease. This assumption is called class conditional independence.

So how does this assumption help us to calculate the probability? Suppose there is a hypothesis H which can be true or false, and this hypothesis is affected by an event e. We are interested in calculating the probability of the hypothesis being true given that the event is observed. So, we need to calculate: P(H|e)

According to Bayes’ theorem:

$$P(H \mid e) = \frac{P(e \mid H)\,P(H)}{P(e)} \qquad (1)$$

Here, P(H|e) is called the posterior probability of the hypothesis given the information of event e, and it cannot be computed directly. So, we need to break it down as in equation 1. Now we can calculate each of the probabilities separately from the frequency table and then calculate the posterior probability.

You can read the whole process of calculation here.

P(H) is the prior probability of the hypothesis before observing the event.
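As a rough worked example of the calculation, the sketch below computes a posterior probability from a tiny frequency table. The records are invented purely for illustration, and the probability of the evidence is obtained by normalising over the two possible values of the hypothesis.

```python
# Hypothetical records: (sex, age_group, has_diabetes)
records = [
    ('male', 'old', True), ('male', 'old', True), ('female', 'old', True),
    ('male', 'young', True), ('female', 'old', False), ('male', 'young', False),
    ('female', 'young', False), ('female', 'young', False),
    ('male', 'young', False), ('female', 'old', False),
]

def posterior(records, sex, age_group):
    """P(diabetes | sex, age_group) under the naive independence assumption."""
    scores = {}
    for label in (True, False):
        rows = [r for r in records if r[2] == label]
        prior = len(rows) / len(records)                          # P(H)
        p_sex = sum(r[0] == sex for r in rows) / len(rows)        # P(sex | H)
        p_age = sum(r[1] == age_group for r in rows) / len(rows)  # P(age | H)
        scores[label] = prior * p_sex * p_age                     # P(e | H) * P(H)
    evidence = sum(scores.values())                               # P(e)
    return scores[True] / evidence

print(round(posterior(records, 'male', 'old'), 3))
```

The two feature probabilities are simply multiplied together, which is exactly where the independence assumption enters.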

Logistic regression

This regression modelling technique is so basic and popular for almost all classification problems that it can be considered the “Hello World” of Machine Learning. Yes, you have read it right. It is a process for classification problems. Don’t let the word regression in the name mislead you.

It is originally a regression process, which becomes a classification process when a decision threshold is applied to the prediction. Deciding the threshold for the classification process is very important, and tricky too.

We need to choose the decision threshold depending on the particular case in hand. There can be four types of responses in classification problems: “true positive”, “true negative”, “false positive” and “false negative” (read details about them here). We have to fix the probability of one type of occurrence while reducing another, depending on its severity.

Example and basic concept 

For example, take the case of a severe crime, where it is to be decided whether the person should be hanged or not. It is a binary classification problem with two outputs: guilty or not guilty. Here the true positive case is the person being found guilty when he actually has committed the crime. On the other hand, the false positive is the person being found guilty when he has not committed the crime.

So, no doubt the false positive case here is of a very serious type and should be avoided at any cost. Hence, while fixing the decision threshold, you should try to reduce the probability of false positives while fixing the probability of true positive cases.
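The effect of the decision threshold can be sketched with a small counting function. In the example, wrongly finding an innocent person guilty is a false positive. The cases and probabilities below are hypothetical and serve only to show how moving the threshold trades one kind of error for another.

```python
def confusion_counts(y_true, probs, threshold):
    """Count TP, TN, FP, FN when predicting positive for prob >= threshold."""
    tp = tn = fp = fn = 0
    for truth, p in zip(y_true, probs):
        predicted = p >= threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1
    return {'tp': tp, 'tn': tn, 'fp': fp, 'fn': fn}

# Hypothetical cases: 1 = actually guilty, with a predicted probability of guilt
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.95, 0.80, 0.55, 0.70, 0.40, 0.10]

for threshold in (0.5, 0.9):
    print(threshold, confusion_counts(y_true, probs, threshold))
```

Raising the threshold removes the false positive here, but at the cost of more false negatives.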

Unlike linear regression, which predicts the response of a continuous variable, in logistic regression we predict the probability of the positive outcome of a binary response variable. And unlike linear regression, which follows a linear function, logistic regression uses a sigmoid function.

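The equation for logistic regression:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

Here x is the input variable, and \beta_0 and \beta_1 are the coefficients the model learns from the data.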
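A minimal sketch of how the sigmoid turns a linear score into a probability, and how a threshold then turns that probability into a class, is given below. The coefficients `b0` and `b1` are hypothetical values chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, b0, b1, threshold=0.5):
    """Return the predicted probability and the class it implies."""
    p = sigmoid(b0 + b1 * x)
    return p, int(p >= threshold)

# Hypothetical coefficients, chosen only for illustration
b0, b1 = -4.0, 2.0
for x in (0.0, 2.0, 4.0):
    print(x, predict(x, b0, b1))
```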

Initial stages of evolution of Deep Learning

The theoretical model of deep learning came in 1943 from Walter Pitts, a logician, and Warren McCulloch, a neuroscientist. The model was called the McCulloch-Pitts neuron and is still regarded as a fundamental study on deep learning.

The first evidence of neural networks in use comes from some toys for children made during the 1950s. In 1950, the legendary mathematician Alan Turing proposed the concept of Machine Learning and even gave hints about the genetic algorithm in his famous paper “Computing machinery and intelligence”.

Alan Turing

In 1952, Arthur Samuel, then working at IBM, developed one of the first machine learning programs, which learned to play checkers. He later coined the term Machine Learning in 1959 and is known as the father of machine learning.

“The perceptron: the perceiving and recognizing automaton”, a research paper published in the year 1957 by Frank Rosenblatt, set the foundation of the Deep Learning network.

In 1965, the mathematicians Alexey Ivakhnenko and V.G. Lapa arguably developed the first working deep learning network. For this contribution, Ivakhnenko is considered the father of deep learning by many.

The first winter period

The period between 1974 and 1980 is considered the first winter period. It was a long, rough period for AI research. A critical report on AI research submitted by Professor Sir James Lighthill, at the request of the UK parliament, played a major role in initiating this period.

The report was very critical of AI research in the United Kingdom and was of the opinion that nothing had been achieved in the name of AI research. It held that all expectations about AI and deep learning were hype and that the creation of a robot was nothing but a mirage. Such comments were very disappointing and resulted in the withdrawal of research funding for most AI research.

Invention of Backpropagation algorithm

Then, during the 1980s, the famous Backpropagation algorithm with Stochastic Gradient Descent (SGD) was invented for training neural networks. This can be considered a path-breaking discovery as far as deep learning is concerned. These algorithms are still the most popular among deep learning enthusiasts, and they led to the first successful application of neural networks.

In 1989, we got to see the first real-life application of a neural net. It was Yann LeCun who made this possible through his tireless effort at Bell Labs to combine the ideas of backpropagation and convolutional neural networks.

Yann LeCun

The network was named after LeCun as LeNet. It found its first real-world problem-solving use in identification of handwritten codes. It was so efficient in identifying the codes that United States Postal Service adopted this technology in 1990 for identifying the digits of ZIP codes on the mail envelopes. 

Yet another winter period, however brief

In spite of the success achieved by LeNet, in the same year, 1990, the advent of the Support Vector Machine pushed neural networks almost to extinction. The SVM gained popularity very fast, mainly because of its easy interpretability and state-of-the-art performance.

It was also a technology that came out of the famous Bell Labs. Vladimir Vapnik and Corinna Cortes pioneered its invention. The work on it had started long before, in 1963, with an older linear formulation by Vapnik and Alexey Chervonenkis. It was this continuing effort that resulted in the revolutionary Support Vector Machine of 1990.

Support Vector Machine: a new player in the field

This new modelling technique is mainly built on a kernel trick to calculate the decision boundary between two classes of observations. Except in a few cases, it is very difficult to separate the classes in a two-dimensional space. It becomes far easier in a higher-dimensional space, where a hyperplane can separate them. In a two-dimensional space, a hyperplane is just a straight line. This process of transforming the mode of representation is known as the kernel trick.

Below is an example of what I mean by higher-dimensional representation for classification.

SVM: data representation in higher dimension

In figure A, two classes of observations, the red and blue classes, are separated using a straight line. It is a straightforward case and the classification is easy. But consider figure B: here a straight line cannot separate the points.

As a new third axis has been introduced in figure C, we can see that the classes can now be easily separated. Now how will it look if we convert the figure back to its two-dimensional version? See figure D.

So, a curved line has now separated the classes very effectively. This is what a support vector machine does. It finds a hyperplane to separate the points, and then any new point gets its class depending on which side of the hyperplane it lies.
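The idea of adding a third axis can be sketched with a few hypothetical points: one class sits near the origin and the other surrounds it, so no straight line separates them in two dimensions. The choice of z = x² + y² as the extra coordinate is an assumption made for this illustration.

```python
# Hypothetical 2-D points: class 0 lies near the origin, class 1 on a ring around it
inner = [(0.5, 0.0), (-0.3, 0.4), (0.0, -0.6), (0.2, 0.2)]
outer = [(2.0, 0.0), (-1.5, 1.5), (0.0, -2.2), (1.6, -1.4)]

def lift(point):
    """Add a third coordinate z = x^2 + y^2 (squared distance from the origin)."""
    x, y = point
    return (x, y, x * x + y * y)

def classify(point, z_cut=1.0):
    """In 3-D the flat plane z = z_cut separates the classes."""
    return int(lift(point)[2] > z_cut)

print([classify(p) for p in inner])
print([classify(p) for p in outer])
```

Seen from the original two dimensions, the flat plane z = 1 is the circle x² + y² = 1, which is the kind of curved boundary shown in figure D.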

Kernel trick

An SVM looks for the hyperplane that maximizes the margin between itself and the closest data points. The kernel trick makes this process very easy by removing the need to calculate the new coordinates of the points in the new representation space. The kernel function only needs to calculate the distance between pairs of points in that space.

This kernel function is not something that the SVM learns from the data. It is solely crafted by the human mind. It maps any two points in the original space to the distance between them in the new representation space. The hyperplane itself is then learned from the data.
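A small numerical check may make the trick more tangible. The sketch below uses a degree-2 polynomial kernel, chosen here only as an example, and shows that it gives the same dot product and the same squared distance as an explicit mapping `phi` into three dimensions, without ever computing the new coordinates.

```python
import math

def phi(v):
    """Explicit map of a 2-D point into a 3-D representation space."""
    x, y = v
    return (x * x, math.sqrt(2) * x * y, y * y)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def poly_kernel(a, b):
    """Degree-2 polynomial kernel, computed entirely in the original 2-D space."""
    return dot(a, b) ** 2

def kernel_distance_sq(a, b):
    """Squared distance in the new space, obtained from kernel values only."""
    return poly_kernel(a, a) - 2 * poly_kernel(a, b) + poly_kernel(b, b)

a, b = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(a), phi(b))   # needs the new coordinates
via_kernel = poly_kernel(a, b)   # does not
print(explicit, via_kernel, kernel_distance_sq(a, b))
```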

Pros of SVM

  • The process is very accurate when only a limited amount of data is available
  • It has a strong mathematical base, and in-depth mathematical analysis is possible in SVM
  • Interpretation is very easy
  • The popularity of this process was instant and unprecedented

It also suffers from some weaknesses like:

  • Scalability is an issue. When the data set is vast it is not very suitable.
  • Modern-day databases contain huge numbers of images carrying enormous amounts of information, and they need an efficient recognition process. SVM is not the preferred candidate here.
  • It is a shallow method, so it depends on manual feature engineering, which is not easy.

Decision tree

Around 2000, another classification technique made its debut and instantly became very popular. It even surpassed the popularity of SVM, mainly because of its simplicity and ease of visualization and interpretation. It also uses an algorithm which consumes very limited resources. So, a low-configuration computing system is not a constraint for the application of the decision tree. Some of its other benefits are:

  • The decision tree has the great advantage of being capable of handling both numerical and categorical variables. Many other modelling techniques can handle only one kind of variable.
  • Requires little data preprocessing, which saves a lot of the user’s time.
  • The assumptions are not too rigid and the model can deviate slightly from them.
  • The decision tree model validation uses statistical tests and the reliability is easy to establish.
  • As it is a white box model, the logic behind it is visible to us and we can easily interpret the result, unlike a black-box model such as an artificial neural network.

But it does suffer from some limitations. For instance, it has a problem of overfitting, which means that its performance on the training data is not reproduced when an independent data set is used for prediction. It is quick to produce a result, but the result often lacks satisfactory accuracy.

However, since its inception in 2000, it continued its golden run till 2010.

Random forest

This technique came to improve on the weaknesses of the decision tree. As the decision tree was already popular for its simplicity, random forest took no time to win the hearts of machine learning enthusiasts.

As it overcomes the limitations of the decision tree, it became the most practical and robust among the shallow ML algorithms. Random forest is actually an ensemble of decision trees, i.e. it is a collection of decision trees where each tree has been trained on a different dataset. The more decision trees a random forest model includes, the more robust and accurate its result becomes. It is just as we consider a forest robust if it has many trees.

Random forest: ensemble of decision tree

Random forest actually makes a final prediction from the prediction obtained from each of the decision tree models to overcome the weakness of a single decision tree model. In this sense, the random forest is a bagging type of ensemble technique. 
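The bagging idea can be sketched in plain Python with the simplest possible tree, a one-split “stump”. Everything below (the toy data, the stump learner, the number of trees) is a hypothetical illustration of bootstrap sampling plus majority voting, not a full random forest, which would also pick random subsets of features.

```python
import random
from collections import Counter

def fit_stump(rows):
    """Pick the threshold on x that misclassifies the fewest rows of (x, label)."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        errors = sum((x >= threshold) != bool(label) for x, label in rows)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]

def fit_forest(rows, n_trees, seed=0):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]

def predict(forest, x):
    """Majority vote over the individual stumps."""
    votes = Counter(int(x >= threshold) for threshold in forest)
    return votes.most_common(1)[0][0]

# Hypothetical 1-D data: label 1 for larger x
rows = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
forest = fit_forest(rows, n_trees=25)
print(predict(forest, 0.7), predict(forest, 4.2))
```

Individual stumps can disagree because each one sees a different bootstrap sample; the vote smooths out those differences.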

We can get an idea of random forest’s popularity from the fact that in 2010 it became the most liked machine learning algorithm on the famous data science competition website Kaggle.

Gradient boosting was then the only other approach which came up as a close competitor of random forest. This technique ensembles weak machine learning models, mainly decision trees. And it was quick to outperform random forest.
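For comparison with bagging, here is a minimal sketch of boosting for a regression problem with squared error: each round fits a weak one-split stump to the errors left by the previous rounds. The data, the number of rounds and the learning rate are hypothetical, and real gradient boosting libraries add many refinements on top of this.

```python
def fit_stump(xs, residuals):
    """Best single split on x: predict the mean residual on each side."""
    best = None
    for split in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        l_mean, r_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - l_mean) ** 2 for r in left) + sum((r - r_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, split, l_mean, r_mean)
    return best[1:]

def boost(xs, ys, n_rounds=20, lr=0.5):
    """Each round fits a weak stump to the current residuals and adds a fraction of it."""
    preds = [sum(ys) / len(ys)] * len(ys)   # start from the mean
    history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        split, l_mean, r_mean = fit_stump(xs, residuals)
        preds = [p + lr * (l_mean if x < split else r_mean) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return preds, history

# Hypothetical 1-D regression data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 3.3, 6.0, 6.2]
preds, history = boost(xs, ys)
print(round(history[0], 4), round(history[-1], 4))
```

Unlike the trees in a random forest, which are trained independently, each stump here depends on all the stumps before it.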

On Kaggle, the gradient boosting ensemble approach very soon overtook random forest. And even today, this technique is the most used machine learning method, along with deep learning, in almost all Kaggle competitions.

Dark Knight rises: The neural network era starts

Although the neural network had not shown its potential consistently since 1980, its success, when demonstrated by researchers such as those from IBM, surprised the whole world with intelligent machines like Deep Blue and Watson.

The dedicated deep learning scientists putting their hard work into research never had any doubt about its potential and what it is capable of doing. The only constraint was that, till then, the research work was very scattered.

A coordinated research effort was very much required to establish its potential beyond any doubt. The year 2010 marked the dawn of a new era when, for the first time, such an effort was initiated by Yann LeCun of New York University, Yoshua Bengio of the University of Montreal, Geoffrey Hinton and his group at the University of Toronto, and IDSIA in Switzerland.

From this group of researchers, Dan Ciresan of IDSIA first showed the world some successful applications of modern deep learning in 2011. Using GPU-trained deep neural networks that he developed, he won some prestigious academic image classification competitions.

The ImageNet

The entry of Geoffrey Hinton and his group from the University of Toronto in the ImageNet image classification competition started a significant chapter in the history of deep learning neural nets in the year 2012.

Screenshot of ImageNet

In the same year, a team headed by Alex Krizhevsky and guided by Geoffrey Hinton recorded a top-five accuracy of 83.6% in this image classification challenge. This was considerably higher than the top-five accuracy of 74.3% achieved by classical computer vision approaches in the year 2011.

The ImageNet challenge was considered to be solved when, by 2015, the winning entry, a deep convolutional network (convnet), had improved the image classification accuracy up to 96.4%. Since then, the deep convolutional neural net has dominated the machine learning domain.

The deep convolutional neural net gained worldwide recognition after this overwhelming success. Since then, at all major computer conferences and programmers' meets, almost all machine learning solutions presented have been based on the deep convolutional neural net.

In some other fields too, like natural language processing and speech recognition, the deep convolutional neural net is a dominant technology, replacing earlier tools like decision trees, SVMs and random forests.

A good example of a major player switching to the deep convolutional neural net from other technologies is the European Organization for Nuclear Research (CERN). The largest particle physics laboratory in the world, it ultimately switched to the deep convolutional neural net to identify new particles generated by the Large Hadron Collider (LHC); earlier it was using decision tree-based machine learning methods for this task.


This article presented a detailed history of how deep learning has come a long way to reach today’s popularity and use in many fields across different scientific disciplines. It was a journey with many peaks and valleys, which started way back in 1980.

The different empirical statistical methods and machine learning algorithms that preceded deep learning made way for deep learning techniques, mainly because of their high accuracy with large amounts of data.

It registered many successes and then suddenly fell out of favour for not being able to meet the high expectations. Yet it always had true potential, being more a practical technique than an empirical one.

Now the question is, what does the future hold for deep learning? What new surprises are in store? The answer is really tough. The history we discussed here is evidence that many of them are already here to revolutionize our lives.

So the next major breakthrough may be just around the corner, or it may still take years. But the field is always evolving and full of promise for blending machines with true intelligence. After all, it learns from data, so it will not repeat its history of failures.

