Everything we've built so far has depended on pre-existing labels: supervised models learn from labeled examples in order to predict labels for new ones. Unfortunately, in the real world, labels are expensive to collect and often simply unavailable. What if you don't have labels? What if you just have data and want to find natural groupings or patterns within it?
This is where clustering comes in. Unlike the supervised models we've covered so far, clustering is an unsupervised learning approach that finds structure in unlabeled data. Today I'm building three fundamentally different clustering algorithms from scratch: K-Means (the fast, simple standard), DBSCAN (for oddly shaped clusters and noisy data), and Gaussian Mixture Models (a probabilistic alternative).
The Clustering Problem
At its core, clustering is about grouping similar things together. Given a dataset with no labels, we want to partition it into groups (clusters) such that points within a group are more similar to each other than to points in other groups.
The General Problem
Given n data points in d-dimensional space, assign each point to one of k groups (or mark it as noise) such that some objective is optimized.
But what does “similar” mean? And what makes a good grouping? That's where different clustering algorithms diverge—they each have different notions of similarity and cluster quality.
Customer Segmentation
Group customers by purchasing behavior to target marketing campaigns. E-commerce companies cluster users into “bargain hunters,” “brand loyal,” “impulse buyers,” etc.
Image Segmentation
Partition an image into regions. Medical imaging uses clustering to separate tissues, tumors, or organs in scans.
Document Organization
Group similar documents together. News aggregators cluster articles about the same story from different sources.
Anomaly Detection
Identify outliers that don't fit any cluster. Fraud detection systems use this to find unusual transactions.
K-Means Clustering
The Classic Centroid-Based Approach
K-Means is probably the most widely used clustering algorithm. The idea is beautifully simple: a cluster is defined by its center point (centroid), and each data point belongs to whichever centroid it's closest to.
Interactive K-Means Clustering
Watch centroids move and points reassign until convergence
The Algorithm
```python
import numpy as np
from scipy.spatial.distance import cdist

def _initialize_centroids(self, X):
    # Pick n_clusters distinct data points at random as the starting centroids
    n_samples, n_features = X.shape
    indices = np.random.choice(n_samples, self.n_clusters, replace=False)
    return X[indices]

def _assign_clusters(self, X, centroids):
    # Distance from every point to every centroid; each point joins the nearest
    distances = cdist(X, centroids, metric='euclidean')
    return np.argmin(distances, axis=1)

def _update_centroids(self, X, labels):
    # Move each centroid to the mean of the points currently assigned to it
    centroids = np.zeros((self.n_clusters, X.shape[1]))
    for k in range(self.n_clusters):
        cluster_points = X[labels == k]
        if len(cluster_points) > 0:  # guard against empty clusters
            centroids[k] = cluster_points.mean(axis=0)
    return centroids
```

The cdist function computes the pairwise Euclidean distances between all points and all centroids. For each point, argmin gives us the index of the closest centroid.
```python
def fit(self, X):
    self.centroids_ = self._initialize_centroids(X)

    for iteration in range(self.max_iters):
        labels = self._assign_clusters(X, self.centroids_)
        new_centroids = self._update_centroids(X, labels)

        # Check for convergence
        centroid_shift = np.sum((new_centroids - self.centroids_) ** 2)
        if centroid_shift < self.tol:
            self.converged_ = True
            break

        self.centroids_ = new_centroids

    self.labels_ = self._assign_clusters(X, self.centroids_)
    return self
```
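To see it in action, here's a minimal usage sketch. It assumes the methods above are bundled into a KMeans class whose constructor takes n_clusters, max_iters, and tol (names inferred from the attributes used in the code), and it generates toy data with scikit-learn's make_blobs:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Three well-separated blobs: the kind of data K-Means handles best
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# KMeans is assumed to wrap the methods shown above
km = KMeans(n_clusters=3, max_iters=100, tol=1e-4)
km.fit(X)

print(km.centroids_)             # (3, 2) array of cluster centers
print(np.bincount(km.labels_))   # number of points assigned to each cluster
```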
Why K-Means Works

K-Means is actually optimizing an objective function called inertia (the within-cluster sum of squares):
J = Σᵢ Σ_{x ∈ Cᵢ} ‖x − μᵢ‖²  (the sum of squared distances from each point to its cluster centroid)

where μᵢ is the centroid of cluster Cᵢ. The assignment and update steps perform coordinate descent on this objective: each step can only decrease it (or leave it unchanged), so K-Means always converges, though possibly to a local minimum rather than the global one.
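If you want to track this number yourself, inertia is easy to compute from the fitted labels and centroids. A small helper sketch (compute_inertia is my own name, not part of the class above):

```python
import numpy as np

def compute_inertia(X, labels, centroids):
    """Within-cluster sum of squared distances to each point's assigned centroid."""
    # centroids[labels] broadcasts each point's own centroid to shape (n_samples, n_features)
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))
```

We'll reuse this helper for the elbow method below.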
K-Means Limitations

For all its popularity, K-Means has well-known weaknesses: you have to choose k up front, the random initialization can land it in a poor local minimum, it assumes roughly spherical clusters of similar size, and outliers can drag centroids away from where they belong.
Choosing K: How Many Clusters?
The Elbow Method
Choosing k is critical. If it's too low, you'll lump distinct groups together. If it's too high, you'll split natural clusters into artificial subgroups. The elbow method helps you find the sweet spot.
The Elbow Method
Finding the optimal number of clusters
The idea: run K-Means for k = 1, 2, 3, … and record the inertia at each value. Plot inertia against k; it always decreases as k grows, but look for the “elbow” where adding another cluster stops reducing inertia by much. That's your k.
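A sketch of that loop, reusing the compute_inertia helper from earlier and the assumed KMeans class (the constructor arguments are again my guess at how the code above is packaged):

```python
import matplotlib.pyplot as plt

ks = range(1, 11)
inertias = []

for k in ks:
    km = KMeans(n_clusters=k, max_iters=100, tol=1e-4)
    km.fit(X)
    inertias.append(compute_inertia(X, km.labels_, km.centroids_))

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('Inertia')
plt.title('Elbow method')
plt.show()
```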
DBSCAN: Finding Clusters by Density
Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) takes a completely different approach. Instead of defining clusters by centroids, it defines them by density—a cluster is a region where points are densely packed together, separated by regions of low density.
Core Points
Points with at least minPts neighbors within radius ε. They form the “bulk” of clusters.
Border Points
Within ε of a core point but don't have enough neighbors themselves. They're on the edges.
Noise Points
Not core points and not within ε of any core point. These are outliers.
Interactive DBSCAN Clustering
Density-based clustering with core, border, and noise points
```python
def fit(self, X):
    n_samples = X.shape[0]
    self.labels_ = np.full(n_samples, -1, dtype=int)  # -1 = noise (or not yet assigned)
    cluster_id = 0

    for point_idx in range(n_samples):
        if self.labels_[point_idx] != -1:  # Already assigned to a cluster
            continue

        neighbors = self._get_neighbors(X, point_idx)

        if len(neighbors) < self.min_samples:  # Not a core point
            continue  # Stays as noise (may become a border point later)

        # Start a new cluster from this core point
        self.labels_[point_idx] = cluster_id
        seed_set = list(neighbors)
        i = 0

        while i < len(seed_set):
            neighbor_idx = seed_set[i]

            if self.labels_[neighbor_idx] == -1:
                self.labels_[neighbor_idx] = cluster_id

            # If this neighbor is itself a core point, its neighbors join the cluster too
            neighbor_neighbors = self._get_neighbors(X, neighbor_idx)
            if len(neighbor_neighbors) >= self.min_samples:
                for new_neighbor in neighbor_neighbors:
                    if new_neighbor not in seed_set:
                        seed_set.append(new_neighbor)

            i += 1

        cluster_id += 1

    return self
```

DBSCAN's genius is that it defines clusters by connectivity through density. Two points are in the same cluster if you can “walk” from one to the other through a chain of core points, where each step is at most ε distance.
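The fit method above leans on a _get_neighbors helper that isn't shown. A minimal sketch, assuming the class stores the radius as self.eps and using a brute-force cdist query (consistent with the K-Means code):

```python
import numpy as np
from scipy.spatial.distance import cdist

def _get_neighbors(self, X, point_idx):
    """Indices of every point within eps of X[point_idx] (the point itself included)."""
    # Brute-force distance query; a KD-tree or ball tree scales better for large datasets
    distances = cdist(X[point_idx:point_idx + 1], X, metric='euclidean')[0]
    return np.where(distances <= self.eps)[0]
```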
| Parameter | Too Small | Too Large |
|---|---|---|
| ε (epsilon) | Everything is noise | Everything is one cluster |
| minPts | Noise gets absorbed into clusters; spurious small clusters appear | Sparse clusters dissolve into noise; more points marked as noise |
DBSCAN Strengths

DBSCAN's big wins: it discovers the number of clusters on its own, it handles arbitrarily shaped clusters, and it explicitly labels outliers as noise instead of forcing every point into a cluster.
Gaussian Mixture Models
Probabilistic Clustering
Both K-Means and DBSCAN make hard assignments—each point belongs to exactly one cluster (or is noise). Gaussian Mixture Models (GMMs) take a probabilistic approach: each point has a probability of belonging to each cluster.
The Generative Model
GMM assumes your data was generated by sampling from a mixture of k Gaussian (normal) distributions, each representing a different cluster:

p(x) = Σₖ πₖ · N(x | μₖ, Σₖ)

where the mixture weights πₖ sum to 1, μₖ is the mean of component k, and Σₖ is its covariance matrix.
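To make the generative story concrete, here's a small sketch that samples a dataset from a two-component mixture; the weights, means, and covariances are made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture parameters (made up, not fitted to anything)
weights = np.array([0.6, 0.4])               # mixture weights pi_k, summing to 1
means = np.array([[0.0, 0.0], [4.0, 4.0]])   # component means mu_k
covs = np.array([
    [[1.0, 0.3], [0.3, 1.0]],                # component covariances Sigma_k
    [[0.5, 0.0], [0.0, 2.0]],
])

n_samples = 500
# Step 1: choose a component for each sample according to the weights
components = rng.choice(len(weights), size=n_samples, p=weights)
# Step 2: draw each sample from its chosen component's Gaussian
X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in components])
```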
Interactive Gaussian Mixture Model
Probabilistic clustering with soft assignments
GMMs are trained using the Expectation-Maximization (EM) algorithm, which alternates between two steps until convergence:
E-Step (Expectation)
Given current parameters, compute the responsibility: the probability that each point was generated by each Gaussian.
M-Step (Maximization)
Given these probabilities, update μ, Σ, and π to maximize the likelihood of the data.
γᵢₖ = πₖ · N(xᵢ | μₖ, Σₖ) / Σⱼ πⱼ · N(xᵢ | μⱼ, Σⱼ)

This is the posterior probability of cluster k given point xᵢ (Bayes' theorem): each component's prior weight times its likelihood for the point, normalized over all components.
```python
import numpy as np
from scipy.stats import multivariate_normal

def _e_step(self, X):
    n_samples = X.shape[0]
    responsibilities = np.zeros((n_samples, self.n_components))

    for k in range(self.n_components):
        # Weighted density of every point under component k
        dist = multivariate_normal(mean=self.means_[k], cov=self.covariances_[k])
        responsibilities[:, k] = self.weights_[k] * dist.pdf(X)

    # Normalize so each row sums to 1 (probabilities across clusters)
    responsibilities /= responsibilities.sum(axis=1, keepdims=True)
    return responsibilities

def _m_step(self, X, responsibilities):
    n_samples, n_features = X.shape
    Nk = responsibilities.sum(axis=0)  # Effective number of points per cluster

    # Update mixture weights
    self.weights_ = Nk / n_samples

    # Update means (responsibility-weighted average of the points)
    self.means_ = (responsibilities.T @ X) / Nk[:, np.newaxis]

    # Update covariances (responsibility-weighted scatter around each mean)
    for k in range(self.n_components):
        diff = X - self.means_[k]
        self.covariances_[k] = (responsibilities[:, k][:, np.newaxis] * diff).T @ diff / Nk[k]
        self.covariances_[k] += np.eye(n_features) * 1e-6  # Regularization keeps Sigma invertible
```
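The _e_step and _m_step above still need an outer loop. A minimal fit sketch, assuming weights_, means_, and covariances_ were already initialized somewhere (for example from a quick K-Means run) and that max_iters and tol are constructor arguments:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit(self, X):
    prev_log_likelihood = -np.inf

    for iteration in range(self.max_iters):
        responsibilities = self._e_step(X)
        self._m_step(X, responsibilities)

        # Log-likelihood of the data under the current mixture parameters
        likelihood = np.zeros(X.shape[0])
        for k in range(self.n_components):
            dist = multivariate_normal(mean=self.means_[k], cov=self.covariances_[k])
            likelihood += self.weights_[k] * dist.pdf(X)
        log_likelihood = np.sum(np.log(likelihood))

        if log_likelihood - prev_log_likelihood < self.tol:
            break
        prev_log_likelihood = log_likelihood

    # Hard labels for convenience: the most responsible component per point
    self.labels_ = np.argmax(self._e_step(X), axis=1)
    return self
```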
Comparing the Three

Each algorithm has different strengths and assumptions. The best choice depends entirely on your data's structure and your requirements.
Clustering Algorithm Comparison
See how different algorithms handle various data shapes
| Algorithm | Cluster Shape | # Clusters | Noise Handling | Speed |
|---|---|---|---|---|
| K-Means | Spherical | Must specify | Poor (outliers skew centroids) | Very fast |
| DBSCAN | Arbitrary | Automatic | Excellent (explicit noise detection) | Medium |
| GMM | Elliptical | Must specify | Moderate (outliers just get low probabilities) | Slower |
K-Means
Best for: Well-separated, spherical clusters of similar size. When speed matters and you know k.
DBSCAN
Best for: Arbitrary-shaped clusters, noisy data, unknown number of clusters. When outlier detection matters.
GMM
Best for: Overlapping clusters, soft assignments needed, elliptical cluster shapes. Probabilistic interpretation.
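A rough side-by-side sketch makes the contrast concrete. The class names (KMeans, DBSCAN, GMM) and constructor arguments are assumptions about how you've packaged the code in this post; the data is scikit-learn's two-moons toy set, which is deliberately non-spherical:

```python
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-spherical, so centroid-based methods struggle
X, _ = make_moons(n_samples=400, noise=0.07, random_state=0)

km = KMeans(n_clusters=2, max_iters=100, tol=1e-4)
km.fit(X)

db = DBSCAN(eps=0.2, min_samples=5)
db.fit(X)

gmm = GMM(n_components=2, max_iters=100, tol=1e-4)
gmm.fit(X)

# Expect K-Means and GMM to slice each moon roughly in half,
# while DBSCAN recovers the two moons and labels stray points as -1.
```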
Evaluating Clusters
Unlike supervised learning, we don't have ground truth labels to measure accuracy. How do we know if our clustering is good? Several metrics help evaluate cluster quality:
Silhouette Score
For each point, the silhouette score measures how similar it is to its own cluster compared to other clusters:
s(i) = (b(i) − a(i)) / max(a(i), b(i))

where a(i) is the mean distance from point i to the other points in its own cluster and b(i) is the mean distance to the points in the nearest other cluster. Scores near +1 mean the point sits firmly inside its cluster, scores near 0 mean it lies on a boundary, and negative scores suggest it was probably assigned to the wrong cluster.
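A from-scratch sketch of the silhouette score (silhouette_score_scratch is my own helper name; in practice sklearn.metrics.silhouette_score does the same job):

```python
import numpy as np
from scipy.spatial.distance import cdist

def silhouette_score_scratch(X, labels):
    """Mean silhouette over all clustered points; noise labeled -1 is skipped."""
    labels = np.asarray(labels)
    distances = cdist(X, X)
    cluster_ids = [c for c in set(labels.tolist()) if c != -1]
    scores = []

    for i in range(len(X)):
        if labels[i] == -1:
            continue
        same = labels == labels[i]
        same[i] = False  # a(i) excludes the point itself
        others = [c for c in cluster_ids if c != labels[i]]
        if not same.any() or not others:
            continue  # undefined for singleton clusters or a single cluster overall
        a = distances[i, same].mean()
        b = min(distances[i, labels == c].mean() for c in others)
        scores.append((b - a) / max(a, b))

    return float(np.mean(scores))
```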
Calinski-Harabasz Index
Ratio of between-cluster variance to within-cluster variance. Higher is better—favors clusters that are dense and well-separated.
Davies-Bouldin Index
Average similarity between each cluster and its most similar cluster. Lower is better—minimize within-cluster scatter, maximize between-cluster separation.
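In practice all three metrics are one-liners in scikit-learn. A quick sketch, where X and labels are whatever dataset and cluster assignment you're evaluating:

```python
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

# Filter out DBSCAN's -1 noise labels first, since they aren't a real cluster
print(silhouette_score(X, labels))          # higher is better, range [-1, 1]
print(calinski_harabasz_score(X, labels))   # higher is better
print(davies_bouldin_score(X, labels))      # lower is better
```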
When to Use What
✓ Choose K-Means When
- • Clusters are roughly spherical
- • You know the number of clusters
- • Speed is critical (large datasets)
- • Clusters are well-separated
- • You need a simple baseline
✓ Choose DBSCAN When
- • Clusters have arbitrary shapes
- • You don't know the number of clusters
- • Data contains noise and outliers
- • Clusters have varying densities
- • You need outlier detection
✓ Choose GMM When
- • Clusters are elliptical
- • You need soft/probabilistic assignments
- • Clusters overlap significantly
- • Data is Gaussian-ish
- • You need uncertainty estimates
⚠ Common Pitfalls
- • Not scaling features before clustering (see the sketch after this list)
- • Using K-Means for non-spherical data
- • Wrong ε/minPts for DBSCAN
- • Too many components in GMM
- • Ignoring the elbow method
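On that first pitfall: every algorithm here works in raw feature units, so a feature with a large scale dominates every distance and covariance. A minimal fix, sketched with scikit-learn's StandardScaler and the assumed KMeans class from earlier:

```python
from sklearn.preprocessing import StandardScaler

# Without scaling, a feature measured in the thousands (e.g. income) drowns out
# one measured in single digits (e.g. purchase count) in every distance calculation.
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=3, max_iters=100, tol=1e-4)
km.fit(X_scaled)
```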
Final Thoughts
Happy coding, and don't get too clustered.