- Experience with the specific topic: Novice
- Professional experience: No industry experience
Knowledge of machine learning is not required, but the reader should be familiar with basic data analysis (e.g., descriptive analysis) and the programming language Python. To follow along, download the sample dataset here.
Introduction to K-means Clustering
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:
- The centroids of the K clusters, which can be used to label new data
- Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows you to find and analyze the groups that have formed organically. The "Choosing K" section below describes how the number of groups can be determined.
Each centroid of a cluster is a collection of feature values which define the resulting groups. Examining the centroid feature weights can be used to qualitatively interpret what kind of group each cluster represents.
This introduction to the K-means clustering algorithm covers:
- Common business cases where K-means is used
- The steps involved in running the algorithm
- A Python example using delivery fleet data
The K-means clustering algorithm is used to find groups which have not been explicitly labeled in the data. This can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets. Once the algorithm has been run and the groups are defined, any new data can be easily assigned to the correct group.
This is a versatile algorithm that can be used for any type of grouping. Some examples of use cases are:
- Behavioral segmentation:
- Segment by purchase history
- Segment by activities on application, website, or platform
- Define personas based on interests
- Create profiles based on activity monitoring
- Inventory categorization:
- Group inventory by sales activity
- Group inventory by manufacturing metrics
- Sorting sensor measurements:
- Detect activity types in motion sensors
- Group images
- Separate audio
- Identify groups in health monitoring
- Detecting bots or anomalies:
- Separate valid activity groups from bots
- Group valid activity to clean up outlier detection
In addition, monitoring if a tracked data point switches between groups over time can be used to detect meaningful changes in the data.
The Κ-means clustering algorithm uses iterative refinement to produce a final result. The algorithm inputs are the number of clusters Κ and the data set. The data set is a collection of features for each data point. The algorithms starts with initial estimates for the Κ centroids, which can either be randomly generated or randomly selected from the data set. The algorithm then iterates between two steps:
1. Data assigment step:
Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. More formally, if ci is the collection of centroids in set C, then each data point x is assigned to a cluster based on
where dist( · ) is the standard (L2) Euclidean distance. Let the set of data point assignments for each ith cluster centroid be Si.
2. Centroid update step:
In this step, the centroids are recomputed. This is done by taking the mean of all data points assigned to that centroid's cluster.
The algorithm iterates between steps one and two until a stopping criteria is met (i.e., no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached).
This algorithm is guaranteed to converge to a result. The result may be a local optimum (i.e. not necessarily the best possible outcome), meaning that assessing more than one run of the algorithm with randomized starting centroids may give a better outcome.
The algorithm described above finds the clusters and data set labels for a particular pre-chosen K. To find the number of clusters in the data, the user needs to run the K-means clustering algorithm for a range of K values and compare the results. In general, there is no method for determining exact value of K, but an accurate estimate can be obtained using the following techniques.
One of the metrics that is commonly used to compare results across different values of K is the mean distance between data points and their cluster centroid. Since increasing the number of clusters will always reduce the distance to data points, increasing K will always decrease this metric, to the extreme of reaching zero when K is the same as the number of data points. Thus, this metric cannot be used as the sole target. Instead, mean distance to the centroid as a function of K is plotted and the "elbow point," where the rate of decrease sharply shifts, can be used to roughly determine K.
A number of other techniques exist for validating K, including cross-validation, information criteria, the information theoretic jump method, the silhouette method, and the G-means algorithm. In addition, monitoring the distribution of data points across groups provides insight into how the algorithm is splitting the data for each K.
Example: Applying K-Means Clustering to Delivery Fleet Data
As an example, we'll show how the K-means algorithm works with a sample dataset of delivery fleet driver data. For the sake of simplicity, we'll only be looking at two driver features: mean distance driven per day and the mean percentage of time a driver was >5 mph over the speed limit. In general, this algorithm can be used for any number of features, so long as the number of data samples is much greater than the number of features.
Step 1: Clean and Transform Your Data
For this example, we've already cleaned and completed some simple data transformations. A sample of the data as a is shown below.
The chart below shows the dataset for 4,000 drivers, with the distance feature on the x-axis and speeding feature on the y-axis.
Step 2: Choose K and Run the Algorithm
Start by choosing K=2. For this example, use the Python packages scikit-learn and NumPy for computations as shown below:
The cluster labels are returned in .
Step 4: Iterate Over Several Values of K
Test how the results look for K=4. To do this, all you need to change is the target number of clusters in the function.
The chart below shows the resulting clusters. We see that four distinct groups have been identified by the algorithm; now speeding drivers have been separated from those who follow speed limits, in addition to the rural vs. urban divide. The threshold for speeding is lower with the urban driver group than for the rural drivers, likely due to urban drivers spending more time in intersections and stop-and-go traffic.
Additional Notes and Alternatives
Feature engineering is the process of using domain knowledge to choose which data metrics to input as features into a machine learning algorithm. Feature engineering plays a key role in K-means clustering; using meaningful features that capture the variability of the data is essential for the algorithm to find all of the naturally-occurring groups.
Categorical data (i.e., category labels such as gender, country, browser type) needs to be encoded or separated in a way that can still work with the algorithm.
Feature transformations, particularly to represent rates rather than measurements, can help to normalize the data. For example, in the delivery fleet example above, if total distance driven had been used rather than mean distance per day, then drivers would have been grouped by how long they had been driving for the company rather than rural vs. urban.
A number of alternative clustering algorithms exist including DBScan, spectral clustering, and modeling with Gaussian mixtures. A dimensionality reduction technique, such as principal component analysis, can be used to separate groups of patterns in data. You can read more about alternatives to K-means in this post.
One possible outcome is that there are no organic clusters in the data; instead, all of the data fall along the continuous feature ranges within one single group. In this case, you may need to revisit the data features to see if different measurements need to be included or a feature transformation would better represent the variability in the data. In addition, you may want to impose categories or labels based on domain knowledge and modify your analysis approach.
For more information on K-means clustering, visit the scikit learn site.
Want to keep learning? Download our new study from Forrester about the tools and practices keeping companies on the forefront of data science.
Step 3: Review the Results
The chart below shows the results. Visually, you can see that the K-means algorithm splits the two groups based on the distance feature. Each cluster centroid is marked with a star.
- Group 1 Centroid = (50, 5.2)
- Group 2 Centroid = (180.3, 10.5)
Using domain knowledge of the dataset, we can infer that Group 1 is urban drivers and Group 2 is rural drivers.
Now let's look at the popular algorithm for hard clustering called
K-means and see if we can connect it to the general form of the expectation maximization.
So, it would be really cool to drive,
it is a special case of EM and to feed it in the general framework.
So hard clustering.
We have a data set of points and if we want to assign each data point to
some of the clusters and now we can't say that suddenly the point,
it belongs to several clusters simultaneously.
Now which data point should be assigned to one and just one cluster, okay?
The K-means algorithm suggest
you to solve this hard clustering problem in the following way.
First of all, let's initialize the parameters randomly.
And here are the parameters are just means so the locations of the cluster.
We don't have weights by and we don't have
some shapes for the clusters Sigma, just locations.
And then in iterations,
we're repeating two sub steps until convergence.
So the first sub step is to find for each data point,
find the closest cluster and assign these data points to this cluster.
So each data point is assigned to belong to the closest cluster to it.
And by closest, I mean according to Euclidean distance.
And on the next sub step,
K-means suggest you to update the parameters by finding,
so for example, to update the first Gaussian,
the first cluster centroid Mu_1
we'll have to find all the data points which we are assigned
to the first cluster on the first sub step and
then find their average which will be the center of this cluster,
updated center and then we repeat these two things until convergence.
So if you look carefully at this algorithm,
it looks really similar to the EM, right?
We also have random initializations and then we in iterations repeat
two steps which also look really close to the stuff we had in the EM.
First of all, we compute some property for
each data point and then we update the parameters by using these properties.
Let's see if we can somehow convert our Gaussian Mixture Model and EM applied to it,
so to obtain exactly this K-means algorithm.
So to prove that the K-means is just a special case of
this EM algorithm applied to Gaussian Mixture Model or maybe to some other model.
So first of all, we have to say that we don't have these additional parameters, right?
So let's say that the covariance matrices of the shapes are all
identical matrices which means that all the shapes of each Gaussian is
just uniform circle with fixed radius and the prior weights pi are all uniform also.
So they equal to one divided by the number of clusters.
This way we will have only Mu as the parameter of our Gaussian Mixture Model or kind of
restricted Gaussian Mixture Model and then
the density of each data point given that we know the cluster,
now looks like this.
So, it's some normalization times exponent of
point five times the Euclidean distance between x1 and Mu_c where Mu_c
is the center of the Gaussian number c. You can prove that
this is exactly the formula for multidimensional Gaussian, multivarian Gaussian.
If you have identical covariance matrix sigma,
if it equals to identical matrix.
Now, let's look at the E and M steps of this restricted Gaussian Mixture Model.
On the E -step, the expectation-maximization algorithm suggests you
to find the minimum of the KL divergence
between the variational distribution q
and the posterior distribution p(t) given data and parameters.
And if we are going to do it,
we will find not hard assignments but soft assignments, right?
As we obtained in the Gaussian Mixture Model.
So for each data point this q will say with
which probability it belongs to one cluster or another or one Gaussian or another.
And this is not what we want.
We want to derive K-means from this model.
So we want this step to be hard clustering.
How can we do it? Well, let's find
not the best optimal q but q among some family of simple qs.
So let's look at the q among all delta functions.
Where delta function means that this probability distribution takes
one value of probability one and all the other values with probability zero.
It's really certain about what the outcome will be.
So, let's imagine that the posterior distribution looks like this.
So our data point belongs to the cluster number two with probability
like 0.6 and cluster number one with probability 0.4. What do you think?
What the delta function approximation will be?
So what will be the closest q to it according to
KL distance among the family of delta functions?
Well, it will be a distribution like this.
It will put all the probability mass into the class that
had the highest probability according to our posterior.
So by restricting the set of possible qs on the E-step,
we're actually approximating the actual posterior distribution with delta functions.
And this way the optimal q on the E-step will look like this, right?
It's some delta function.
So it's probability one then t_i equals to some predefined value and zero otherwise.
And what is c_i? What does this predefined value?
Well, it's the most probable configuration according to our posterior distribution.
So, we have to find c_i which maximizes
the posterior probability of latent variable t_i given the data and the parameters theta.
Recall that the posterior distribution itself
is proportional to the full joint distribution which is x,
u and t as p of t and
parameters and we also have some normalization constant but we don't care about it
because we want to maximize this thing with respect to
t_i or with respect to c which like respect to the value of t_i.
So normalization doesn't change anything.
And if you recall,
this thing equals to normalization times the x given t which is exponent of
minus 0.5 times Euclidean distance and times
the prior and we agree that the prior will be just uniform.
So it doesn't depend on c and we also don't care about when we maximize this thing.
So finally, we want to maximize this expression with respect to
the value of t_i and we have normalization one divided by Z and we have.
pi of c which both actually doesn't depend on c. So,
we can throw it away and will not change anything.
And maximizing the exponent of minus something is the same as minimizing something.
So we can just minimize 0.5 times Euclidean distance.
And finally, we have an expression like this.
So our optimal q from the E-step is the same as,
is just a delta function which takes the value of arg minimum of the Euclidean distance.
So the closest cluster center with probability
one and it has probability zero to be anything else.
So, this is exactly like we have in K-means,
which means that this restricted Gaussian Mixture Model if you apply EM to it
and if you restrict your possible values
of the variational distribution q on the E- step to be delta functions,
you will get exactly the same first sub step of the iteration.
So now, let's derive the formula for
the M-step on the blackboard and see how it's connected,
how does the formula for the second sub step of K-means and x to
the M-step of the expectation maximization
apply to this restricted Gaussian Mixture Model with
this particular choice of q which is delta function.