Unsupervised Machine Learning: a detailed discussion

Unsupervised Machine Learning

Unsupervised Machine Learning is a kind of Machine Learning in which the algorithm identifies hidden patterns in the data on its own. It is used when no labeled data is available to train the algorithm.

Unlike Supervised Machine Learning, the input dataset here is not tagged with known answers. In many cases we have to deal with situations that are completely new: the experimenter has no prior experience with the data at hand, and its distribution and parameters are unknown.

In such cases Supervised Learning is not feasible, so we have to turn to Unsupervised Machine Learning. The main problem with this approach is that there is no test dataset labeled with the correct answers against which to check the accuracy of the learning process. That is why its results are harder to verify and are generally regarded as less accurate than those of supervised learning.

Learning process of a baby

Unsupervised learning resembles the way babies learn. At first they learn on their own, with no one teaching them, and they start identifying objects from experience.

The learning process of a baby is similar to unsupervised learning
Photo by Guillaume de Germain on Unsplash

For example, babies see humans from birth, and no one teaches them what characterizes a human. Yet whenever a baby sees a new person, it matches the characteristics it already knows and recognizes the newcomer as a human being. This is a very basic example of unsupervised learning.

Application of Unsupervised Machine Learning

Although this approach tends to be less accurate, it is useful for finding hidden patterns in data.

Speech recognition

You might have used Google’s speech recognition tool. It is a handy way to convert your speech into text, and when you have a lot of text to write you can certainly use it to your advantage. I use it frequently while writing my articles in Google Docs.

The point is that the technology behind such speech recognition tools can make use of unsupervised machine learning. Annotating voice recordings with the corresponding text is very costly, so labeled data is often too scarce to train the algorithm on labels alone.

Detection of anomaly

Unsupervised learning also comes in handy for detecting extreme values in a dataset. Such values are generally outliers: erroneous observations caused by mechanical faults or mistakes during data collection, or unusual records such as fraudulent transactions in a bank statement.
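As a rough illustration of the idea, the sketch below flags extreme values without using any labels. It relies on the modified z-score (based on the median and the median absolute deviation), which is only one of many possible techniques; the function name, the threshold of 3.5 and the transaction amounts are assumptions made up for this example.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # at least half the values are identical; nothing to compare against
    outliers = []
    for v in values:
        modified_z = 0.6745 * (v - med) / mad
        if abs(modified_z) > threshold:
            outliers.append(v)
    return outliers


# Hypothetical transaction amounts; one of them is far from the rest.
transactions = [52.0, 48.5, 50.1, 49.9, 51.3, 47.8, 950.0, 50.6]
print(find_outliers(transactions))
```

Note that no observation was ever marked as "fraud" or "normal" beforehand; the unusual value stands out purely because of its distance from the bulk of the data.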

Clustering of data

Clustering is the grouping of data on the basis of some similarity. It reveals the structure of the data and helps in designing a classifier.

Finding hidden patterns and features in the data

Unsupervised learning uncovers hidden patterns and features of the data, which in turn helps in categorization.

Issues with unsupervised machine learning

The process has some inherent issues that you must consider before applying it:

  • Unsupervised learning results are generally less accurate than those of supervised learning, since there are no known answers to learn from.
  • Performing unsupervised learning is usually more complicated than performing supervised learning.
  • The model cannot be validated against known answers because labeled data is lacking.

Types of unsupervised machine learning

Unsupervised machine learning problems can be grouped into two broad categories: clustering and association.

Clustering

Clustering is of great importance in any discussion of unsupervised learning. This technique finds similarities in uncategorized data and groups the data points into clusters. Clustering is hugely beneficial for gathering basic information about the data at hand, and for finding patterns and features in a dataset that is otherwise completely unknown to the researchers.

Clustering in unsupervised machine learning

We can decide how many clusters to create. The clusters are formed so that the within-cluster variance is lower than the between-cluster variance. In terms of similarity, members of the same cluster are similar to each other, whereas members of different clusters are dissimilar.
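To make the within-cluster versus between-cluster comparison concrete, here is a small sketch on one-dimensional data. It uses sums of squared deviations rather than variances proper, which is an assumption made to keep the arithmetic simple; the function names and the two example groupings are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def within_and_between(clusters):
    """Return (within, between) sums of squares for a list of 1-D clusters."""
    all_points = [x for cluster in clusters for x in cluster]
    grand_mean = mean(all_points)
    within = 0.0
    between = 0.0
    for cluster in clusters:
        centre = mean(cluster)
        # Spread of the members around their own cluster mean.
        within += sum((x - centre) ** 2 for x in cluster)
        # Spread of the cluster mean around the overall mean, weighted by size.
        between += len(cluster) * (centre - grand_mean) ** 2
    return within, between


# The same six values grouped sensibly and grouped badly.
good = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
bad = [[1.0, 10.0, 3.0], [2.0, 11.0, 12.0]]

for name, grouping in (("good", good), ("bad", bad)):
    within, between = within_and_between(grouping)
    print(f"{name}: within={within:.2f} between={between:.2f}")
```

For the sensible grouping the within-cluster figure is much smaller than the between-cluster figure, while for the bad grouping the relation is reversed.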

We perform this clustering through several approaches. 

Hierarchical clustering

Here every data point is treated as an individual cluster to start with. Then, on the basis of similarity, the most similar data points are merged into a single cluster. This process continues until the chosen number of clusters is reached.
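The merging procedure can be sketched in a few lines. The version below assumes Euclidean distance and "single linkage" (two clusters are as close as their closest pair of points); other linkage rules exist, and the function names and sample points are made up for illustration.

```python
from math import dist


def single_linkage(a, b):
    """Distance between two clusters: the closest pair of points."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points, n_clusters):
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and the number of points")
    # Start with every point as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the two most similar (closest) clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i.
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (9.0, 1.0)]
for cluster in agglomerate(points, 3):
    print(cluster)
```

Calling `agglomerate(points, 1)` keeps merging until a single cluster holding all the points remains. The nested loops compare every pair of clusters on each pass, so this sketch is only suitable for small datasets.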

Probabilistic clustering

As the name suggests, here we cluster on the basis of a probability distribution. For example, suppose there are keywords like

“Boys’ school”

“Girls’ school”

“Girls’ college”

“Boys’ college”

Then the clusters can form two categories: either “boy” and “girl”, or “school” and “college”.

Exclusive clustering

Here the data points are exclusive to a particular category, so we form the clusters directly according to that exclusivity. No data point can belong to more than one cluster.

Overlapping clustering

In contrast to exclusive clustering, in overlapping clustering a data point can belong to more than one cluster. To achieve such clustering, we use fuzzy sets.

Clustering algorithms

There are some popular algorithms to perform clustering. In this article, I will briefly discuss them. Each of them will have an elaborate discussion in separate articles.

K-means 

K-means clustering is a type of clustering in which data points are grouped into k clusters. If k is large, the clusters are small; if k is small, the clusters are bigger.

Every cluster has a central value called the centroid, which is the heart of the cluster. The distance of a data point from each centroid determines which cluster it belongs to.
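A minimal sketch of the usual K-means loop is given below: assign each point to its nearest centroid, move each centroid to the mean of its members, and repeat until nothing changes. Picking the starting centroids at random from the data, the fixed seed, the iteration limit and all names are assumptions of this sketch, not a reference implementation.

```python
import random
from math import dist


def k_means(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    # Assumption: use k randomly chosen data points as starting centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # leave an empty cluster's centroid alone
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

On data with less clearly separated groups the result can depend on the starting centroids, which is why practical implementations usually try several starts.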

K-Nearest Neighbors

It is a simple algorithm that performs well when there is a significant distance between the groups of sample data points. It is one of the simplest classification methods and is often listed alongside unsupervised techniques, although strictly speaking it needs labeled neighbors to classify a new point. It takes considerable time when the dataset is large.
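The sketch below shows the nearest-neighbor idea in its usual form: a new point takes the majority label of its k closest reference points. It assumes that a few labeled reference points are available, which is how the method is normally applied; the names, labels and coordinates are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(labeled_points, query, k=3):
    """Return the majority label among the k reference points nearest to query."""
    # The distance to every stored point is computed, which is why
    # the method becomes slow on large datasets.
    nearest = sorted(labeled_points, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


reference = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((1.0, 0.5), "A"),
    ((8.0, 8.0), "B"), ((9.0, 9.5), "B"), ((8.5, 9.0), "B"),
]
print(knn_predict(reference, (2.0, 1.5)))
print(knn_predict(reference, (8.2, 8.8)))
```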

Principal Component Analysis

It is a variable reduction technique. The basic objective of PCA is to compute a smaller number of new variables that retain as much as possible of the variance explained by the original variables.
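As a hedged sketch of the idea, the code below finds only the first principal component of a small dataset and reports how much of the total variance it retains. It uses power iteration on the covariance matrix instead of a full eigendecomposition, which is a simplification chosen to stay within the standard library; the function name and the sample data are assumptions.

```python
def first_principal_component(rows, iterations=200):
    """Return (direction, explained_ratio, scores) for the first component."""
    n = len(rows)
    dims = len(rows[0])
    # Centre each variable on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centred = [[r[j] - means[j] for j in range(dims)] for r in rows]
    # Sample covariance matrix of the original variables.
    cov = [
        [sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(dims)]
        for a in range(dims)
    ]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges towards the direction of largest variance.
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dims)) for a in range(dims)]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    # Variance captured along that direction, relative to the total variance.
    captured = sum(
        vec[a] * sum(cov[a][b] * vec[b] for b in range(dims)) for a in range(dims)
    )
    total = sum(cov[a][a] for a in range(dims))
    # The new variable: each observation projected onto the direction.
    scores = [sum(c[j] * vec[j] for j in range(dims)) for c in centred]
    return vec, captured / total, scores


# Two strongly correlated variables, so one new variable should suffice.
data = [(2.0, 1.9), (4.0, 4.1), (6.0, 6.2), (8.0, 7.8), (10.0, 10.1)]
direction, ratio, scores = first_principal_component(data)
print("direction:", [round(x, 3) for x in direction])
print("share of variance retained:", round(ratio, 4))
```

Here two original variables are replaced by a single new one (`scores`) with very little loss of variance. The sketch assumes the data is not constant and that the starting vector is not orthogonal to the main direction.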

Hierarchical clustering

This is the agglomerative technique described earlier. It is hierarchical in the sense that it starts by treating each data point as a cluster and then keeps merging the closest clusters. If no number of clusters is fixed in advance, the process continues until only one cluster remains.

Fuzzy K-means

This is a more general form of K-means clustering. Here, too, clusters are formed around centroid values. The difference is that in simple K-means a data point either belongs to a cluster or does not, with no in-between position, whereas the fuzzy K-means algorithm assigns each data point a degree of membership in every cluster, often read as a probability, depending on its distance from the centroids. K-means clustering is simply a special case of fuzzy K-means in which that membership is either 1 or 0.
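The membership idea can be sketched as follows. The code covers only the membership step for given centroids, using the commonly quoted fuzzy c-means formula with a "fuzziness" parameter; the function name, the default fuzziness of 2 and the example centroids are assumptions of this sketch.

```python
from math import dist


def memberships(point, centroids, fuzziness=2.0):
    """Degree of membership of point in each cluster; the degrees sum to 1."""
    distances = [dist(point, c) for c in centroids]
    zeros = distances.count(0.0)
    if zeros:
        # A point sitting exactly on a centroid belongs fully to that cluster.
        return [1.0 / zeros if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (fuzziness - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


centroids = [(0.0, 0.0), (10.0, 0.0)]
print(memberships((2.0, 0.0), centroids))  # mostly in the first cluster
print(memberships((5.0, 0.0), centroids))  # halfway between the two
# As fuzziness approaches 1 the memberships approach 0 or 1, as in plain K-means.
print(memberships((2.0, 0.0), centroids, fuzziness=1.01))
```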

Association

Association is also about identifying patterns or features in large databases. Unsupervised machine learning uses association rules to find interesting relationships between variables. For example, the students in a class can be analyzed with association rules based on their choice of subjects.
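To illustrate with the student example, the sketch below computes two standard measures for a rule such as "students who choose maths also choose physics": support (how often the subjects occur together) and confidence (how often the rule holds when its first part is true). The subject choices and function names are made up for this example.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule cannot be assessed
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical subject choices, one set per student.
choices = [
    {"maths", "physics"},
    {"maths", "physics", "chemistry"},
    {"maths", "statistics"},
    {"biology", "chemistry"},
    {"maths", "physics", "statistics"},
]

print(f"support(maths, physics) = {support(choices, {'maths', 'physics'}):.2f}")
print(f"confidence(maths -> physics) = {confidence(choices, {'maths'}, {'physics'}):.2f}")
```

Rules with both high support and high confidence are the "interesting relationships" that association mining looks for.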

Summary

So, we can summarize some important points about unsupervised machine learning as follows:

  • Unsupervised machine learning is the type of machine learning in which we do not use any labeled data.
  • With no labeled data, there is no supervision of the result and no validation against known answers.
  • It is generally less accurate than supervised machine learning.
  • Unsupervised learning is more complicated than supervised learning.
  • Unsupervised learning proves helpful when we have no prior idea about the data and its distribution and parameters are unknown.
  • The two main methods of unsupervised machine learning are clustering and association.
