
Clustering

K-Means, DBSCAN, GMM. Discovering hidden structure in unlabeled data.

Written by Omansh

Everything we've built so far has depended on pre-existing labels: we train on labeled examples so the model can predict labels for new ones. In the real world, though, labels are often expensive to collect or simply unavailable. What if you don't have labels? What if you just have data and want to find natural groupings or patterns within it?

This is where clustering comes in. Unlike the supervised models we've seen until now, it's an unsupervised learning approach that finds structure in unlabeled data. Today I'm building three fundamentally different clustering algorithms from scratch: K-Means (the fast and simple standard), DBSCAN (for weird-shaped clusters and noise), and Gaussian Mixture Models (a probabilistic alternative).

The Clustering Problem

At its core, clustering is about grouping similar things together. Given a dataset with no labels, we want to partition it into groups (clusters) such that points within a group are more similar to each other than to points in other groups.

The General Problem

Given n data points in d-dimensional space, assign each point to one of k groups (or mark it as noise) such that some objective is optimized.

But what does “similar” mean? And what makes a good grouping? That's where different clustering algorithms diverge—they each have different notions of similarity and cluster quality.

Customer Segmentation

Group customers by purchasing behavior to target marketing campaigns. E-commerce companies cluster users into “bargain hunters,” “brand loyal,” “impulse buyers,” etc.

Image Segmentation

Partition an image into regions. Medical imaging uses clustering to separate tissues, tumors, or organs in scans.

Document Organization

Group similar documents together. News aggregators cluster articles about the same story from different sources.

Anomaly Detection

Identify outliers that don't fit any cluster. Fraud detection systems use this to find unusual transactions.

K-Means Clustering

The Classic Centroid-Based Approach

K-Means is probably the most widely used clustering algorithm. The idea is beautifully simple: a cluster is defined by its center point (centroid), and each data point belongs to whichever centroid it's closest to.

[Interactive demo: K-Means clustering. Watch centroids move and points reassign until convergence, with the iteration count, current phase, inertia (SSE), and cluster sizes tracked as the algorithm cycles through initialize, assign, and update.]

The Algorithm

1. Initialize: Randomly select k points as the initial centroids.
2. Assignment: Assign each point to its nearest centroid.
3. Update: Move each centroid to the mean of its assigned points.
4. Repeat: Iterate until the centroids stop moving (convergence).
kmeans_core.py
import numpy as np
from scipy.spatial.distance import cdist

def _initialize_centroids(self, X):
    # Pick k distinct data points as the starting centroids
    n_samples, n_features = X.shape
    indices = np.random.choice(n_samples, self.n_clusters, replace=False)
    return X[indices]

def _assign_clusters(self, X, centroids):
    # Label each point with the index of its nearest centroid
    distances = cdist(X, centroids, metric='euclidean')
    return np.argmin(distances, axis=1)

def _update_centroids(self, X, labels):
    # Move each centroid to the mean of the points assigned to it
    centroids = np.zeros((self.n_clusters, X.shape[1]))
    for k in range(self.n_clusters):
        cluster_points = X[labels == k]
        if len(cluster_points) > 0:  # avoid NaNs if a cluster ends up empty
            centroids[k] = cluster_points.mean(axis=0)
    return centroids

The cdist function computes the pairwise Euclidean distances between all points and all centroids. For each point, argmin gives us the index of the closest centroid.
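To make that concrete, here's a tiny illustrative example (my addition, not from the post's source) showing what cdist and argmin produce:

import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])   # 3 points
centroids = np.array([[0.5, 0.5], [10.0, 10.0]])      # 2 centroids

distances = cdist(X, centroids, metric='euclidean')   # shape (3, 2): point-to-centroid distances
labels = np.argmin(distances, axis=1)                 # index of the nearest centroid per point
print(labels)                                          # [0 0 1]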

kmeans_fit.py
def fit(self, X):
    self.centroids_ = self._initialize_centroids(X)

    for iteration in range(self.max_iters):
        labels = self._assign_clusters(X, self.centroids_)
        new_centroids = self._update_centroids(X, labels)

        # Check for convergence: total squared movement of the centroids
        centroid_shift = np.sum((new_centroids - self.centroids_) ** 2)
        if centroid_shift < self.tol:
            self.converged_ = True
            break

        self.centroids_ = new_centroids

    self.labels_ = self._assign_clusters(X, self.centroids_)
    return self

Why K-Means Works

K-Means is actually optimizing an objective function called inertia (or within-cluster sum of squares):

Minimize: Σᵢ Σⱼ∈Cᵢ ||xⱼ − μᵢ||²

Sum of squared distances from each point to its cluster centroid

Where μᵢ is the centroid of cluster Cᵢ. The assignment and update steps are performing coordinate descent on this objective, which is guaranteed to decrease (or stay the same) at each iteration.
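As a sanity check, here's a small standalone helper (my addition, not part of the class above) that computes this objective directly from fitted labels and centroids. If you log it every iteration of fit, you should see it never increase.

import numpy as np

def inertia(X, labels, centroids):
    # Within-cluster sum of squares: squared distance from each point to its assigned centroid
    total = 0.0
    for k in range(len(centroids)):
        cluster_points = X[labels == k]
        total += np.sum((cluster_points - centroids[k]) ** 2)
    return total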

K-Means Limitations

K-Means assumes spherical clusters of similar size. It struggles with non-spherical shapes (elongated, curved, or irregular), clusters of very different sizes or densities, and outliers (they pull centroids away from the true cluster centers).

Choosing K: How Many Clusters?

The Elbow Method

Choosing k is critical. If it's too low, you'll lump distinct groups together. If it's too high, you'll split natural clusters into artificial subgroups. The elbow method helps you find the sweet spot.

[Interactive demo: the elbow method. Plot inertia (SSE) against the number of clusters k; for this dataset the suggested elbow is k = 4, the point of diminishing returns where adding more clusters no longer significantly reduces inertia.]

The idea: start with k = 1 and iterate upward, calculating inertia at each step. Plot inertia vs. k—you'll see it decreasing. Look for the “elbow” where adding more clusters stops significantly reducing inertia. That's your optimal k.
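A sketch of that sweep, assuming the KMeans class built above (its constructor arguments are my guess) and the inertia() helper from earlier:

import numpy as np
import matplotlib.pyplot as plt

# Toy data: four well-separated blobs, so the elbow should land around k = 4
rng = np.random.default_rng(0)
centers = [(0, 0), (4, 4), (0, 4), (4, 0)]
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(100, 2)) for c in centers])

ks = range(1, 11)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k).fit(X)
    inertias.append(inertia(X, model.labels_, model.centroids_))

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (SSE)')
plt.show()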

DBSCAN: Finding Clusters by Density

Density-Based Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) takes a completely different approach. Instead of defining clusters by centroids, it defines them by density—a cluster is a region where points are densely packed together, separated by regions of low density.

Core Points

Points with at least minPts neighbors within radius ε. They form the “bulk” of clusters.

Border Points

Within ε of a core point but don't have enough neighbors themselves. They're on the edges.

Noise Points

Not core points and not within ε of any core point. These are outliers.

[Interactive demo: DBSCAN clustering. Adjust ε (epsilon) and minPts and watch points get labeled as core, border, or noise: a core point has at least minPts neighbors within radius ε, a border point sits within ε of a core point but lacks enough neighbors of its own, and noise points belong to no cluster.]
dbscan_fit.py
def fit(self, X):
    n_samples = X.shape[0]
    self.labels_ = np.full(n_samples, -1, dtype=int)  # -1 = noise / unassigned
    cluster_id = 0

    for point_idx in range(n_samples):
        if self.labels_[point_idx] != -1:  # Already assigned to a cluster
            continue

        neighbors = self._get_neighbors(X, point_idx)

        if len(neighbors) < self.min_samples:  # Not a core point
            continue  # Stays as noise (may become a border point later)

        # Start a new cluster and grow it from this core point
        self.labels_[point_idx] = cluster_id
        seed_set = list(neighbors)
        i = 0

        while i < len(seed_set):
            neighbor_idx = seed_set[i]

            if self.labels_[neighbor_idx] == -1:
                self.labels_[neighbor_idx] = cluster_id

            # If the neighbor is itself a core point, its neighbors join the cluster too
            neighbor_neighbors = self._get_neighbors(X, neighbor_idx)
            if len(neighbor_neighbors) >= self.min_samples:
                for new_neighbor in neighbor_neighbors:
                    if new_neighbor not in seed_set:
                        seed_set.append(new_neighbor)

            i += 1

        cluster_id += 1

    return self
DBSCAN's genius is that it defines clusters by connectivity through density. Two points are in the same cluster if you can “walk” from one to the other through a chain of core points, where each step is at most ε distance.

Parameter | Too Small | Too Large
ε (epsilon) | Everything is noise | Everything merges into one cluster
minPts | Almost every point qualifies as core, so noise gets absorbed into clusters | Denser core requirement: more points flagged as noise, sparse clusters split, more border points

DBSCAN Strengths

Unlike K-Means, DBSCAN can find clusters of arbitrary shapes, handles clusters of different sizes, identifies noise and outliers automatically, and doesn't require specifying the number of clusters in advance (though with a single ε it can still struggle when cluster densities vary widely).
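Here's how using it might look, assuming the class above is called DBSCAN and takes eps and min_samples in its constructor (both names are my assumption):

import numpy as np

# Two tight blobs plus a few scattered outliers
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
               rng.normal(loc=(3, 3), scale=0.3, size=(50, 2)),
               rng.uniform(-2, 5, size=(5, 2))])

model = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = model.labels_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
n_noise = int(np.sum(labels == -1))
print(f"{n_clusters} clusters, {n_noise} noise points")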

Gaussian Mixture Models

Probabilistic Clustering

Both K-Means and DBSCAN make hard assignments—each point belongs to exactly one cluster (or is noise). Gaussian Mixture Models (GMMs) take a probabilistic approach: each point has a probability of belonging to each cluster.

The Generative Model

GMM assumes your data was generated by sampling from a mixture of k Gaussian (normal) distributions, each representing a different cluster:

πₖ — the mixture weight: the probability of choosing cluster k
μₖ — the mean vector of cluster k
Σₖ — the covariance matrix of cluster k (captures its shape and orientation)

Together these define the mixture density p(x) = Σₖ πₖ · N(x | μₖ, Σₖ): to generate a point, first pick a cluster according to the weights, then sample from that cluster's Gaussian.
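Here's a small, self-contained sketch of that generative story (the weights, means, and covariances below are made up purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.6, 0.4])                                # pi_k: mixture weights
means = np.array([[0.0, 0.0], [5.0, 5.0]])                    # mu_k: component means
covs = np.array([np.eye(2), [[2.0, 0.8], [0.8, 1.0]]])        # Sigma_k: component covariances

n = 500
components = rng.choice(len(weights), size=n, p=weights)      # which Gaussian each point comes from
X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in components])
# X now looks like the kind of data a GMM is designed to model: elliptical, possibly overlapping clusters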

[Interactive demo: Gaussian Mixture Model with soft assignments. Step through EM iterations and watch the Gaussians, mixture weights, and log-likelihood update as the E-step computes the probability each point belongs to each Gaussian and the M-step re-estimates the means, covariances, and weights.]

GMMs are trained using the Expectation-Maximization (EM) algorithm, which alternates between two steps until convergence:

E-Step (Expectation)

Given current parameters, compute the responsibility: the probability that each point was generated by each Gaussian.

M-Step (Maximization)

Given these probabilities, update μ, Σ, and π to maximize the likelihood of the data.

γᵢₖ = [πₖ × N(xᵢ | μₖ, Σₖ)] / Σⱼ [πⱼ × N(xᵢ | μⱼ, Σⱼ)]

Posterior probability of cluster k given point xᵢ (Bayes' theorem)

gmm_em.py
import numpy as np
from scipy.stats import multivariate_normal

def _e_step(self, X):
    # Responsibility: probability that each point came from each Gaussian
    n_samples = X.shape[0]
    responsibilities = np.zeros((n_samples, self.n_components))

    for k in range(self.n_components):
        dist = multivariate_normal(mean=self.means_[k], cov=self.covariances_[k])
        responsibilities[:, k] = self.weights_[k] * dist.pdf(X)

    # Normalize so each row sums to 1 across clusters
    responsibilities /= responsibilities.sum(axis=1, keepdims=True)
    return responsibilities

def _m_step(self, X, responsibilities):
    n_samples, n_features = X.shape
    Nk = responsibilities.sum(axis=0)  # Effective number of points per cluster

    # Update mixture weights
    self.weights_ = Nk / n_samples

    # Update means (responsibility-weighted average of the data)
    self.means_ = (responsibilities.T @ X) / Nk[:, np.newaxis]

    # Update covariances (responsibility-weighted scatter around each mean)
    for k in range(self.n_components):
        diff = X - self.means_[k]
        self.covariances_[k] = (responsibilities[:, k][:, np.newaxis] * diff).T @ diff / Nk[k]
        self.covariances_[k] += np.eye(n_features) * 1e-6  # Regularization for numerical stability
EM is guaranteed to increase (or keep constant) the log-likelihood at each iteration, so it always converges to a local optimum. Like K-Means, it's sensitive to initialization.
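Tying the two steps together, the full training loop might look like the sketch below. It reuses the methods and the multivariate_normal import above; the _initialize_parameters helper and the max_iters and tol attributes are my assumptions about the rest of the class.

def _log_likelihood(self, X):
    # log p(X) under the current mixture: sum over points of log(sum_k pi_k * N(x | mu_k, Sigma_k))
    likelihoods = np.zeros((X.shape[0], self.n_components))
    for k in range(self.n_components):
        dist = multivariate_normal(mean=self.means_[k], cov=self.covariances_[k])
        likelihoods[:, k] = self.weights_[k] * dist.pdf(X)
    return np.sum(np.log(likelihoods.sum(axis=1)))

def fit(self, X):
    self._initialize_parameters(X)  # e.g., random means, identity covariances, uniform weights (assumed helper)
    prev_ll = -np.inf

    for _ in range(self.max_iters):
        responsibilities = self._e_step(X)
        self._m_step(X, responsibilities)

        ll = self._log_likelihood(X)
        if ll - prev_ll < self.tol:  # log-likelihood has stopped improving
            break
        prev_ll = ll

    return self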

Comparing the Three

Each algorithm has different strengths and assumptions. The best choice depends entirely on your data's structure and your requirements.

[Interactive demo: clustering algorithm comparison. Pick a dataset shape and an algorithm to see how many clusters it finds and whether it's a good fit; on well-separated, spherical clusters, K-Means finds all three and handles the shape well, since it assumes spherical clusters of similar size.]
Algorithm | Cluster Shape | # Clusters | Noise Handling | Speed
K-Means | Spherical | Must specify | Poor (outliers skew centroids) | Very fast
DBSCAN | Arbitrary | Automatic | Excellent (explicit noise detection) | Medium
GMM | Elliptical | Must specify | Moderate (low probability) | Slower

K-Means

Best for: Well-separated, spherical clusters of similar size. When speed matters and you know k.

DBSCAN

Best for: Arbitrary-shaped clusters, noisy data, unknown number of clusters. When outlier detection matters.

GMM

Best for: Overlapping clusters, soft assignments needed, elliptical cluster shapes. Probabilistic interpretation.

Evaluating Clusters

Unlike supervised learning, we don't have ground truth labels to measure accuracy. How do we know if our clustering is good? Several metrics help evaluate cluster quality:

Silhouette Score

For each point, the silhouette score measures how similar it is to its own cluster compared to other clusters:

s(i) = (b(i) − a(i)) / max(a(i), b(i))

a(i) = mean distance to own cluster, b(i) = mean distance to nearest other cluster

+1: the point is very close to its own cluster and far from others (perfect)
0: the point is on the border between two clusters
−1: the point is probably in the wrong cluster
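A minimal from-scratch version of the score, assuming at least two clusters and data small enough for an O(n²) distance matrix:

import numpy as np

def silhouette_score(X, labels):
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                    # exclude the point itself
        if not same.any():                 # singleton cluster: score conventionally 0
            scores.append(0.0)
            continue
        a = dists[i, same].mean()          # mean distance to own cluster
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))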

Calinski-Harabasz Index

Ratio of between-cluster variance to within-cluster variance. Higher is better—favors clusters that are dense and well-separated.


Davies-Bouldin Index

Average similarity between each cluster and its most similar cluster. Lower is better—minimize within-cluster scatter, maximize between-cluster separation.
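Both indices are only a few lines to implement, but scikit-learn already ships them; a quick usage sketch (X is the clustered data, labels the assignments from any fitted model above):

from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

# X: the data that was clustered; labels: cluster assignments from any fitted model above
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))  # higher is better
print("Davies-Bouldin:", davies_bouldin_score(X, labels))        # lower is better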

When to Use What

Choose K-Means When

  • Clusters are roughly spherical
  • You know the number of clusters
  • Speed is critical (large datasets)
  • Clusters are well-separated
  • You need a simple baseline

Choose DBSCAN When

  • Clusters have arbitrary shapes
  • You don't know the number of clusters
  • Data contains noise and outliers
  • Clusters have varying densities
  • You need outlier detection

Choose GMM When

  • Clusters are elliptical
  • You need soft/probabilistic assignments
  • Clusters overlap significantly
  • Data is Gaussian-ish
  • You need uncertainty estimates

Common Pitfalls

  • Not scaling features before clustering (see the snippet after this list)
  • Using K-Means for non-spherical data
  • Wrong ε/minPts for DBSCAN
  • Too many components in GMM
  • Ignoring the elbow method
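On that first pitfall: distance-based algorithms are sensitive to feature ranges, so standardize before clustering. A quick sketch using scikit-learn's StandardScaler together with the from-scratch KMeans from earlier (its constructor arguments are assumed):

from sklearn.preprocessing import StandardScaler

# X: your raw feature matrix of shape (n_samples, n_features)
X_scaled = StandardScaler().fit_transform(X)          # zero mean, unit variance per feature
labels = KMeans(n_clusters=3).fit(X_scaled).labels_   # cluster on the scaled data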

Final Thoughts

The clustering algorithm you choose depends entirely on your dataset, task at hand, and constraints. Though it may require some experimentation, these three algorithms cover a wide range of use cases. When in doubt, start with K-Means as a baseline, try DBSCAN if you see noise or non-spherical shapes, and consider GMM when you need probabilistic assignments.

Happy coding, and don't get too clustered.