Unsupervised machine learning is a kind of machine learning in which the algorithm identifies hidden patterns in the data on its own. It is used when no labeled data is available to train the algorithm.
Unlike supervised machine learning, the input dataset here is not tagged with known answers. In many cases we have to deal with situations that are completely new: the experimenter has no prior experience with the data in hand, and its distribution and parameters are unknown.
In such cases supervised learning is not feasible, so we have to turn to unsupervised machine learning. The main problem with this approach is that there is no test dataset labeled with the correct answers against which to check the accuracy of the learning process. That is why its results are harder to verify, and are generally considered less accurate, than those of supervised learning.
Learning process of a baby
The application of unsupervised learning resembles the way babies learn. At first they learn by themselves; no one teaches them. They start identifying objects from their own experience.
For example, from birth they see humans, and no one teaches them the characteristics of a human. Yet whenever a baby sees a new person, it matches those characteristics and recognizes the newcomer as a human being. This is a very basic example of unsupervised learning.
Application of Unsupervised Machine Learning
Although this approach is generally less accurate, it is useful for finding hidden patterns in the data.
You might have used Google’s speech recognition tool. It is a handy way to convert your speech into text, and when you have a lot of text to write you can certainly use it to your advantage. I also use it frequently when writing my articles in Google Docs.
The point is that the technology behind this kind of speech recognition draws on unsupervised machine learning. Annotating voice recordings with the matching text is very costly, so labeled data is scarce for training the algorithm.
Detection of anomaly
Unsupervised learning also comes in handy for detecting extreme values in a dataset. Such values are generally outliers: erroneous observations caused by mechanical faults or mistakes during data collection, fraudulent entries in a bank's transaction records, and the like.
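As a small illustration of spotting extreme values without any labels, here is a minimal sketch that flags outliers with the modified z-score, which is built on the median and the median absolute deviation. The function name `mad_outliers`, the threshold of 3.5, and the transaction amounts are all assumptions made up for this example; it is one simple technique among many, not the method any particular bank uses.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against, so nothing is flagged
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical transaction amounts; 950.0 is the planted anomaly.
amounts = [20.5, 22.0, 19.8, 21.1, 20.9, 950.0, 21.7]
flagged = mad_outliers(amounts)
print(flagged)
```

The median is used instead of the mean because a single extreme value would otherwise drag the centre and the spread towards itself and partly hide the very outlier we are looking for.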
Clustering of data
Clustering is the grouping of data on the basis of some similarity. It reveals the structure of the data and helps in designing a classifier.
Finding hidden patterns and features of the data
Unsupervised learning finds all kinds of hidden patterns and features of the data, which in turn helps in categorization.
Issues with unsupervised machine learning
The process has some inherent issues which you must consider before applying it:
- Unsupervised learning results are generally less accurate than those of supervised learning, which is to be expected when there are no known answers to learn from.
- Performing unsupervised learning is much more complicated than performing supervised learning.
- Validating the model against known answers is not possible, because there is no labeled data.
Types of unsupervised machine learning
Unsupervised machine learning can be further grouped into two broad categories: clustering and association problems.
Clustering is of great importance when we discuss unsupervised learning. This technique finds similarities in uncategorized data and groups the data points into clusters. Clustering is hugely beneficial for gathering basic information about the data in hand and for finding patterns and features of a dataset that is otherwise completely unknown to the researchers.
We can decide how many clusters to create. The clusters are formed so that the within-cluster variance is lower than the between-cluster variance. In terms of similarity, members of the same cluster are similar to one another, whereas members of different clusters are dissimilar.
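To make the within-cluster versus between-cluster idea concrete, here is a minimal sketch that computes both quantities for one-dimensional data. It uses sums of squared deviations rather than variances proper, and the function name `within_and_between` and the two example groupings are assumptions invented for the illustration.

```python
def mean(xs):
    return sum(xs) / len(xs)


def within_and_between(clusters):
    """Return (within, between) sums of squares for a list of 1-D clusters."""
    all_values = [v for c in clusters for v in c]
    grand_mean = mean(all_values)
    # Within: how far each member lies from its own cluster mean.
    within = sum((v - mean(c)) ** 2 for c in clusters for v in c)
    # Between: how far each cluster mean lies from the overall mean,
    # weighted by the cluster size.
    between = sum(len(c) * (mean(c) - grand_mean) ** 2 for c in clusters)
    return within, between


# The same six values grouped sensibly and grouped badly.
good_grouping = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]
bad_grouping = [[1.0, 12.0, 3.0], [11.0, 2.0, 13.0]]

print(within_and_between(good_grouping))
print(within_and_between(bad_grouping))
```

For the sensible grouping the within-cluster figure is small and the between-cluster figure is large; for the mixed-up grouping the relationship is reversed, even though the total spread of the six values is the same in both cases.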
We perform this clustering through several approaches.
In agglomerative clustering, every data point is considered an individual cluster to start with. Then, on the basis of similarity, the most similar data points are merged to form a single cluster. This process continues until the chosen number of clusters is reached.
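The merging procedure described above can be sketched in a few lines. This is a minimal illustration for one-dimensional points, assuming single linkage (the distance between two clusters is the distance between their closest members) and a target of at least one cluster; the function name `agglomerate` and the sample values are made up for the example.

```python
def agglomerate(points, n_clusters):
    """Merge 1-D points bottom-up until n_clusters remain (single linkage)."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]


values = [1.0, 1.5, 5.0, 5.2, 9.0]
print(agglomerate(values, 3))
```

Other linkage rules (complete, average) only change the line that computes `d`; the loop that repeatedly merges the closest pair stays the same.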
In probabilistic clustering, as the name suggests, the clustering is done on the basis of a probability distribution. For example, if there are keywords like
Then the clusters can form two categories: either “boy” and “girl”, or “school” and “college”.
In exclusive clustering, the data points are exclusive to a particular category, so the clusters are formed directly according to that exclusivity. No data point can belong to more than one cluster.
In contrast to exclusive clustering, overlapping clustering allows a data point to belong to more than one cluster. To achieve such clustering, we use fuzzy sets.
There are some popular algorithms to perform clustering. In this article, I will briefly discuss them. Each of them will have an elaborate discussion in separate articles.
K-means clustering is a type of clustering where data points are grouped into k clusters. If the value of k is large, then the clusters are small; if k is small, the clusters are bigger.
Every cluster has a central value called the centroid, which is the heart of the cluster. The distance of each data point from the centroids determines which cluster it belongs to: a point joins the cluster whose centroid is nearest.
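Here is a minimal sketch of how centroids and distances work together in K-means. It alternates between assigning every point to its nearest centroid and moving each centroid to the mean of its members. Taking the first k points as the starting centroids is a simplifying assumption of this example (real implementations usually pick them more carefully), and the function name `kmeans` and the sample points are invented for the illustration.

```python
from math import dist


def kmeans(points, k, iterations=20):
    """Cluster 2-D points into k groups; returns (centroids, groups)."""
    centroids = list(points[:k])  # naive start: the first k points (assumption)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            groups[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c, members in enumerate(groups):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has settled
        centroids = new_centroids
    return centroids, groups


points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 1.5), (8.5, 9.0)]
centroids, groups = kmeans(points, k=2)
print(centroids)
print(groups)
```

With these six points the two groups separate cleanly into the points near (1, 1.5) and the points near (8.5, 8.5). Running the same function with a larger k on the same data would split them into smaller groups.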
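K-Nearest Neighbors
It is a simple algorithm and performs well when there is a significant distance between the groups of sample data points. Strictly speaking, it is a supervised classification method, because it labels a new point by a vote among its k nearest labeled neighbors; it is often discussed alongside clustering because it relies on the same idea of distance. It takes considerable time when the dataset is large.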
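A minimal sketch of the nearest-neighbor vote follows. Note that the example needs a handful of reference points that already carry labels; those labels, the function name `knn_predict`, and the coordinates are all assumptions made up for the illustration.

```python
from collections import Counter
from math import dist


def knn_predict(labelled, query, k=3):
    """Return the majority label among the k reference points nearest to query."""
    # Every reference point is compared with the query; this full scan is
    # why the method slows down as the dataset grows.
    nearest = sorted(labelled, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical reference points forming two well-separated groups.
labelled = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.3), "A"),
    ((6.0, 6.0), "B"), ((6.2, 5.9), "B"), ((5.8, 6.1), "B"),
]

print(knn_predict(labelled, (1.1, 1.0)))
print(knn_predict(labelled, (6.0, 6.1)))
```

Because the two groups are far apart, the three nearest neighbors of each query all come from the same group and the vote is unanimous. When groups sit close together the vote becomes less reliable.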
Principal Component Analysis
It is a dimension reduction technique. The basic objective of PCA is to compute a smaller number of new variables that retain as much as possible of the variance explained by the original variables.
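To show what "fewer new variables that keep the variance" means in practice, here is a minimal sketch that finds only the first principal component. It builds the sample covariance matrix and applies power iteration, which is one simple way of finding the dominant direction; libraries normally use a full eigendecomposition or SVD instead. The function name `first_principal_component` and the data are assumptions for this example, and the code assumes the rows are not all identical.

```python
from math import sqrt


def first_principal_component(rows, steps=100):
    """Return (direction, share of total variance it explains)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalize; the vector converges to the direction of largest variance.
    v = [1.0] * d
    for _ in range(steps):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    explained = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    total = sum(cov[a][a] for a in range(d))
    return v, explained / total


# Two variables that move together almost perfectly (y is roughly 2x).
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)]
component, ratio = first_principal_component(rows)
print(component, ratio)
```

Here the two original variables carry almost the same information, so a single new variable along the returned direction keeps nearly all of the variance and the second dimension can be dropped with little loss.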
Hierarchical clustering is hierarchical in the sense that it starts by treating each data point as a cluster and then goes on forming larger clusters by merging the closest ones. This process continues until only one cluster remains.
Fuzzy K-means is a more generalized form of K-means clustering. Here too, clusters are formed around centroid values. The difference is that in simple K-means a data point either belongs to a cluster or does not, with no in-between position, whereas the fuzzy K-means algorithm assigns each data point a membership probability for each cluster depending on its distance from that cluster's centroid. K-means clustering is simply a special case of fuzzy K-means in which the probability is either 1 or 0.
Association is also about identifying patterns or features in a large database. Unsupervised machine learning uses association rules to find interesting relationships between variables. For example, the students in a class can be analyzed with association rules based on their choice of subjects.
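The membership idea can be sketched as follows. This is a minimal illustration of the membership formula commonly used in fuzzy c-means, for one-dimensional data with fixed centroids; it does not include the step that updates the centroids. The function name `fuzzy_memberships`, the fuzziness parameter `m` (assumed greater than 1), and the sample values are assumptions of this example.

```python
def fuzzy_memberships(point, centroids, m=2.0):
    """Return the degree of membership of a 1-D point in each cluster."""
    distances = [abs(point - c) for c in centroids]
    # A point sitting exactly on a centroid belongs fully to that cluster.
    if 0.0 in distances:
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    power = 2.0 / (m - 1.0)
    # The closer the point is to a centroid relative to the others,
    # the larger its membership in that cluster; memberships sum to 1.
    return [
        1.0 / sum((d_i / d_k) ** power for d_k in distances)
        for d_i in distances
    ]


centroids = [0.0, 10.0]
near_first = fuzzy_memberships(2.0, centroids)
halfway = fuzzy_memberships(5.0, centroids)
print(near_first)
print(halfway)
```

A point close to the first centroid gets most of its membership there, while a point exactly halfway between the two is split evenly. As `m` is brought closer to 1 the memberships are pushed towards 0 and 1, which is the hard assignment of ordinary K-means.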
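Taking the subject-choice example, a minimal sketch of the two basic measures behind association rules looks like this. Support is the share of students who chose a given set of subjects together, and confidence is how often the second subject appears among students who chose the first. The function names, the student data, and the subjects are all hypothetical, and the code assumes the left-hand side of the rule occurs at least once.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical subject choices of five students.
choices = [
    {"maths", "physics"},
    {"maths", "physics", "chemistry"},
    {"maths", "statistics"},
    {"biology", "chemistry"},
    {"maths", "physics", "statistics"},
]

print(support(choices, {"maths", "physics"}))
print(confidence(choices, {"maths"}, {"physics"}))
```

In this made-up class, three of the five students chose maths and physics together, and three of the four students who chose maths also chose physics, so the rule "maths implies physics" has a support of about 0.6 and a confidence of about 0.75.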
So, we can summarize the important points about unsupervised machine learning as follows:
- Unsupervised machine learning is the type of machine learning where we don’t use any labeled data.
- There is no labeled data, so the result cannot be supervised or validated against known answers.
- It is generally less accurate than supervised machine learning.
- Unsupervised learning is more complicated than supervised learning.
- Unsupervised learning proves helpful when we have no idea about the data and its distribution and parameters are unknown.
- The two main methods of conducting unsupervised machine learning are clustering and association.