Unsupervised learning is the field of Machine Learning that deals with unlabeled data. This means the final goal of our algorithm is not to find the proper class membership of a new observation based on its features. Instead, the algorithm can only separate the available entries into two or more groups, based on the similarities among them.

Basically, starting from the features available in the training set, the algorithm captures the links and interconnections among the features of each observation and across different observations, with the aim of clustering them into similar groups.

In this article, I’m going to set up an unsupervised learning task concerning customer segmentation. The idea is to identify two or more target sub-populations among customers, so that product analysis and business strategy can pursue proper customer satisfaction for each cluster.

Exploratory Analysis

So let’s start by importing the dataset we are going to use (you can find it on Kaggle):

import pandas as pd
df=pd.read_csv("Mall_Customers.csv")
df.head()

Before starting with our clustering models, let’s spend some time on an exploratory analysis. The idea is to look for clues, such as correlations between features, the number of potential clusters, and so forth.

First, let’s check that our dataset does not exhibit missing values:

df.isnull().sum()

We can start by visualizing some straightforward summaries and statistics. Let’s start with looking at the frequency of males and females:

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='Gender', data=df)
plt.title('Distribution of Gender')
plt.show()

As you can see, there are more females than males. Of course, one might think this is simply because females tend to purchase or spend more than males. However, it is worth spending a bit more time on how the data were collected. Imagine, for instance, that these data came from a survey sent by e-mail to a sample of customers (how many times have you received similar e-mails?). In that case, the higher number of females might simply mean that they were more willing to answer the survey: the imbalance would then reflect a trait of the respondents rather than a greater tendency to purchase or spend.

With that being said, let’s proceed further. Now I’m interested in visualizing the distribution of ages and incomes for both genders:

plt.hist('Age', data=df[df['Gender'] == 'Male'], alpha=0.5, color='g', label='Male')
plt.hist('Age', data=df[df['Gender'] == 'Female'], alpha=0.1, color='b',label='Female')
plt.title('Distribution of Age by Gender')
plt.xlabel('Age')
plt.legend()

It seems that females are, on average, younger than males, and their age range (the difference between maximum and minimum age) is a bit narrower. We can easily visualize this with the aid of a boxplot:

sns.set(style="whitegrid")
sns.boxplot(x="Gender", y="Age", data=df)

Now let’s proceed in the same way with Annual Income:

plt.hist('Annual Income (k$)', data=df[df['Gender'] == 'Male'], alpha=0.5, color='g', label='Male')
plt.hist('Annual Income (k$)', data=df[df['Gender'] == 'Female'], alpha=0.1, color='b', label='Female')
plt.title('Distribution of Annual Income by Gender')
plt.xlabel('Annual Income (k$)')
plt.legend()
plt.show()

sns.set(style="whitegrid")
sns.boxplot(x="Gender", y="Annual Income (k$)", data=df)

As you can see, there is an outlier within the males group (actually, if you look at the dataset, there are two males with that extreme income).
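
If you want to inspect those observations directly, a quick way (without hard-coding any income threshold) is to sort the dataframe by income and look at the top rows:

# Show the customers with the highest annual incomes
print(df.sort_values('Annual Income (k$)', ascending=False).head())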

Well, now let’s investigate some possible correlations with the aid of seaborn’s heatmap():

# numeric_only=True skips non-numeric columns such as Gender (needed in recent pandas versions)
sns.heatmap(df.corr(numeric_only=True), annot=True)

Without considering CustomerID (which is just the index of the df), we can see that there is an interesting negative correlation (-0.33) between age and spending score, while the other correlations are negligible and do not lead to any significant consideration.
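
If you prefer the exact figure rather than reading it off the heatmap, you can compute that single correlation directly (using the column names as they appear in the Kaggle file):

# Pearson correlation between Age and Spending Score
print(df['Age'].corr(df['Spending Score (1-100)']))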

K-Means

So for our first clustering, let’s consider only those two variables: Age and Spending Score. Note that, in a real analysis, it is not good practice to pick a few variables and discard the others just to reduce dimensionality. For that purpose, Principal Component Analysis is surely a better choice, since it preserves the highest possible amount of information while reducing the number of variables (you can read more about PCA here).
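
Just to give an idea of what that alternative would look like, here is a minimal sketch of applying PCA to the numeric features of this dataset (standardizing first, as PCA generally requires); the variable names are mine and the snippet is purely illustrative, it is not used in the rest of the article:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Keep only the numeric features (drop the ID and the categorical Gender column)
X = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

# Standardize so that each feature contributes comparably to the components
X_scaled = StandardScaler().fit_transform(X)

# Project onto 2 principal components and check how much variance they retain
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)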

For this first clustering, I’m going to use K-Means, which I explained in my previous article here.

Let’s first have a look at the scatter plot:

sns.scatterplot(x='Age', y='Spending Score (1-100)', data=df)
plt.title('Age to Spending Score')

To set the number of clusters we want to obtain, we will use the Elbow method.
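
As a quick reminder, the quantity the method tracks is the within-cluster sum of squares (WCSS), which is exactly what scikit-learn exposes as inertia_: for a given number of clusters k,

$$WCSS = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

where $\mu_j$ is the centroid of cluster $C_j$. WCSS always decreases as k grows, so we look for the “elbow”, the point after which adding more clusters stops producing a substantial drop: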

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
wcss = []
K = range(1,15)
for k in K:
    km = KMeans(n_clusters=k)
    # columns 2 and 4 of the dataframe are Age and Spending Score (1-100)
    km = km.fit(df.iloc[:,[2,4]])
    wcss.append(km.inertia_)
plt.plot(K, wcss, 'bx-')
plt.xlabel('Number of centroids')
plt.ylabel('WCSS')
plt.title('Elbow Method For Optimal k')
plt.show()

Since the within-cluster sum of squares (WCSS) starts plateauing from the 4th cluster, we will set k=4.

kmeans = KMeans(n_clusters=4, random_state=0).fit(df.iloc[:,[2,4]])
y_kmeans = kmeans.predict(df.iloc[:,[2,4]])

plt.scatter(df['Age'], df['Spending Score (1-100)'], c=y_kmeans, s=50, alpha=0.5, cmap='viridis')
# plot the cluster centroids as well (first column is Age, second is Spending Score)
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, marker='X')
plt.xlabel('Age')
plt.ylabel('Spending Score (1-100)')

So a first set of targets might be the following (a quick numeric profile of these clusters is sketched right after the list):

  • Young-medium spending score
  • Young-high spending score
  • Old-medium spending score
  • Low spending score (regardless of the age)
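
To attach labels like these a bit less by eye, a per-cluster profile helps. The sketch below assumes the kmeans and y_kmeans objects fitted above; the helper dataframe name is mine:

# Attach the cluster label to each customer and profile the clusters
df_age = df[['Age', 'Spending Score (1-100)']].copy()
df_age['Cluster'] = y_kmeans
print(df_age.groupby('Cluster').mean())
print(df_age['Cluster'].value_counts())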

Now let’s consider a further experiment. This time we will use two different features: Spending Score and Annual Income:

sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=df)
plt.title('Annual Income to Spending Score')
plt.show()
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
wcss = []
K = range(1,15)
for k in K:
    km = KMeans(n_clusters=k)
    # columns 3 and 4 of the dataframe are Annual Income (k$) and Spending Score (1-100)
    km = km.fit(df.iloc[:,[3,4]])
    wcss.append(km.inertia_)
plt.plot(K, wcss, 'bx-')
plt.xlabel('Number of centroids')
plt.ylabel('WCSS')
plt.title('Elbow Method For Optimal k')
plt.show()

This time, the optimal number of clusters is 5:

kmeans = KMeans(n_clusters=5, random_state=0).fit(df.iloc[:,[3,4]])
y_kmeans = kmeans.predict(df.iloc[:,[3,4]])

plt.scatter(df['Annual Income (k$)'], df['Spending Score (1-100)'], c=y_kmeans, s=50, alpha=0.5, cmap='viridis')
# plot the cluster centroids as well (first column is Annual Income, second is Spending Score)
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, marker='X')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')

It seems that our algorithm was able to separate our data nicely into 5 clusters, which might be:

  • Low income-low spending score
  • Low income-high spending score
  • Medium income-medium spending score
  • High income-low spending score
  • High income-high spending score

It is worth noticing that extreme income values (low and high) correspond to extreme spending tendencies (low and high). On the contrary, people with a medium income tend to spend accordingly, neither limiting themselves nor exceeding.
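
The same per-cluster profiling used above makes this observation concrete; again, the helper dataframe name is mine and the snippet assumes the second kmeans fit:

# Average income and spending score for each of the 5 clusters
df_income = df[['Annual Income (k$)', 'Spending Score (1-100)']].copy()
df_income['Cluster'] = y_kmeans
print(df_income.groupby('Cluster').mean().round(1))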

Of course, since we only used a few variables (and we didn’t perform any PCA), we cannot conclude that these are the only two relevant results. Nevertheless, by proceeding with further, similar analyses, in accordance with your KPIs, you will be able to extract relevant information.
