Machine Learning algorithms can be categorized mainly into two bunches:
- supervised learning: we are provided with data which are already labeled, hence our aim will be finding, once provided with a new observation, its category (in case of a classification task) or its numerical value (in case of a regression task);
- unsupervised learning: in this case, data are not endowed with a primary classification, hence our aim will be, basically, clustering those data into multiple classes, based on the inner structure and common patterns our algorithm can identify.
I’ve already been talking about supervised learning in one of my previous article (you can learn more about it here). Now, I’m going to talk about the unsupervised way and, more specifically, how to set your analysis environment when you want to cluster your unlabeled data.
For this purpose, I will employ one of the pre-built dataset available in the Python library Sklearn, extracting both features and labels, but ignoring those latter, so that we are simulating an unsupervised environment. Then, I will build my algorithm, called K-means.
import sklearn import pandas as pd from sklearn.datasets import load_wine raw_data = load_wine() features, target = load_wine(return_X_y=True) #since my raw data are a dictionary, I want to extract my features' names names=raw_data['feature_names'] df=pd.DataFrame(features) df.columns=names
To have a more intuitive idea of how the K-means algorithm actually works, however, I’d rather prefer working with only two features, or variables, so that we can clearly visualize it in a 2-D plot.
To do that, we need to introduce the Principal Component Analysis (PCA). PCA is a method used to reduce dimensionality in high-dimensional datasets, like the one we are dealing with.
Principal components are new variables that are constructed as linear combinations of the initial variables. These combinations are done in such a way that these new variables are uncorrelated and most of the information within the initial variables is stored into the first components. So, the idea is that k-dimensional data give you k principal components, but PCA tries to put maximum possible information in the first ones, so that, if you want to reduce your dataset’s dimensionality, you can focus your analysis on the first few components without suffering a great penalty in terms of information loss.
In this analysis, what measures the amount of information is variance, and principal components can be geometrically seen as the directions of high-dimensional data which capture the maximum amount of variance and project it onto a smaller dimensional subspace while keeping most of the information. Hence, the first principal component accounts for the largest possible variance; the second component will, intuitively, account for the second largest variance (under one condition: it has to be uncorrelated with the first principal component) and so forth.
I won’t dive too much into the maths behind, but I want to make clear the steps that our algorithm will be following:
- First, it computes the eigenvectors and eigenvalues from the covariance/correlation matrix
- It sorts eigenvalues in descending order and chooses the k largest values, where k is the number of components we want to work with
- It builds the projection matrix W of the k eigenvectors
- It transforms the original dataset via W, to obtain a k-dimensional space
Now let’s see what happens when we apply it to our dataset:
from sklearn.decomposition import PCA pca = PCA(2) projected = pca.fit_transform(features) print(features.shape)
As you can see, from 13 variables we are now dealing with only 2 variables, that, again, are not the same as our original variables: they are new components, resulting from the linear combination of pre-existing variables.
Let’s visualize them:
Nice, now let’s set up our algorithm. As anticipated, I will employ the unsupervised algorithm K-means, and before using it I want to provide a quick explanation about the idea behind it.
The first step of this algorithm is creating, among our unlabeled observations, c new observations, randomly located, called ‘centroids’. The number of centroids will be representative of the number of output classes (which, remember, we do not know). Now, an iterative process will start, made of two steps:
- first, for each centroid, the algorithm finds the nearest points (in terms of distance that is usually computed as Euclidean distance) to that centroid, and assigns them to its category;
- second, for each category (represented by one centroid), the algorithm computes the average of all the points which has been attributed to that class. The output of this computation will be the new centroid for that class.
Every time the process is reiterated, some observations, initially classified together with one centroid, might be redirected to another one. Furthermore, after several reiterations, the change in centroids’ location should be less and less important since the initial random centroids are converging to the real ones. This process ends when there is no more change in centroids’ position.
Now, how can we decide the number of centroids? Well, in our dataset, if we inspect it accurately, we can see that it already stores the target column, hence we know that the number of classes is equal to three. However, let’s forget it for a moment and think about a task where we have no idea of the number of classes which might be resulting.
There are many methods that could be employed for this task. However, in this article, I’m going to explain and use the so-called ‘Elbow method’. The idea is that what we would like to observe within our clusters is a low level of variation, which is measured with the within-cluster sum of squares (WCSS). And it is intuitive to understand that, the higher the number of centroids, the lower the WCSS. In particular, if we have as many centroids as the number of our observations, each WCSS will be equal to zero. However, if we remember the law of parsimony, we know that setting the highest number possible of centroids would be inconsistent.
The idea is picking that number of centroids after which, by increasing the number of centroids, we have an irrelevant decrease in WCSS. The relation I just described can be represented with the following graph:
The idea is that, if the plot is an arm, the elbow of the arm is the optimal number of centroids.
Now let’s apply this rule to our dataset:
from sklearn.cluster import KMeans wcss =  K = range(1,15) for k in K: km = KMeans(n_clusters=k) km = km.fit(pca_df)
plt.plot(K, wcss, 'bx-') plt.xlabel('Number of centroids') plt.ylabel('WCSS') plt.title('Elbow Method For Optimal k')
We can see from this graph that our elbow corresponds to a number of centroids equal to three (and we are secretly happy because we know that this is exactly the number of our actual classes).
Now that we know that the best K-means has 3 centroids, let’s deploy it on our dataset:
kmeans = KMeans(n_clusters=3, random_state=0).fit(pca_df) y_kmeans = kmeans.predict(pca_df) plt.scatter(pca_df['First component'], pca_df['Second component'], c=y_kmeans, s=50, alpha=0.5,cmap='viridis') centers = kmeans.cluster_centers_ plt.scatter(centers[:, 0], centers[:, 1], c='red', s=50)
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=50)
So in this article we had a glimpse of how an unsupervised task could be set. Unsupervised learning is extremely performant in finding patterns in data which aren’t so evident and which might contain a relevant value. It means that you can use unsupervised models even in a supervised task: even though you have your data already labeled, you might be interested in inferring their inner structure and find new classes or patterns which wouldn’t be emerging in a supervised task, where you are ‘biased’ and limit your analysis to the existing labels.
Moreover, if your data are not labeled, like in the example above, you can first create a set of classes with unsupervised models, and then train supervised algorithms so that they will be able to directly identify the membership of new observations.
What is important to keep in mind, regardless of the kind of task (supervised or unsupervised) is setting your environment properly, inspecting your data and make them as simple (in terms of the number of variables) as possible, with barely any loss in terms of information.