By now, you've seen datasets with 10, 50, maybe even 100 features. But what does that look like? How do you visualize a point in 100-dimensional space? How do you understand the structure of data when you can't even plot it?
The answer is that you can't. At least not directly. Humans are stuck in 3D, and even 3D is pushing it for most visualizations. This is where dimensionality reduction comes in—taking high-dimensional data and finding a lower-dimensional representation that preserves what's important.
Today I'm building four fundamental approaches from scratch: PCA (variance-based), LDA (class-separation-based), t-SNE (neighborhood-based), and UMAP (topology-based). Each takes a different philosophy on what's important to preserve when squashing down dimensions.
The Dimensionality Reduction Problem
At its core, dimensionality reduction is a mapping problem. You have data in d dimensions, and you want to find a representation in k dimensions (k < d) that preserves some important property of the data.
The Formal Problem
Given X ∈ ℝⁿˣᵈ (n samples, d features), find a transformation f such that Y = f(X) where Y ∈ ℝⁿˣᵏ, and Y preserves some property of X.
The challenge is that you're throwing away information—going from d dimensions to k dimensions means losing (d−k) dimensions worth of data. The art is in losing the right information.
Why Reduce Dimensions?
More than just visualization
Visualization
Humans can't visualize beyond 3D. Reducing to 2D or 3D lets us see patterns, clusters, and outliers that are invisible in high dimensions.
Computational Efficiency
Algorithms scale badly with dimensions. Training models, computing distances, and running clustering all speed up dramatically with fewer features.
Curse of Dimensionality
In high dimensions, all points become nearly equidistant. A 100D unit hypercube has over 99.99% of its volume in the corners (outside the inscribed sphere), so distance-based methods like KNN stop working. The quick simulation below makes this concrete.
Noise Reduction
Often, high-dimensional data lives on a lower-dimensional manifold. The extra dimensions are just noise. Finding the underlying manifold removes noise.
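Here's that quick simulation of the curse of dimensionality (a minimal sketch; the sample size and dimensions are arbitrary). As d grows, the farthest point is barely farther away than the nearest one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample random points in the unit hypercube and compare the nearest and
# farthest distances from one query point as the dimension grows.
for d in [2, 10, 100, 1000]:
    X = rng.random((200, d))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from point 0
    print(f"d={d:4d}  min={dists.min():.2f}  max={dists.max():.2f}  "
          f"max/min={dists.max() / dists.min():.2f}")
```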
Linear vs Nonlinear Methods
Linear methods (PCA, LDA) find linear projections—they multiply your data by a matrix. They're fast and well-understood but can only capture linear relationships.
Nonlinear methods (t-SNE, UMAP) can capture curved, twisted manifolds. They're more powerful but slower and harder to interpret. The right choice depends on your data structure and what you're optimizing for.
PCA: Maximizing Variance
Finding directions of maximum variance
Principal Component Analysis is the workhorse of dimensionality reduction. The idea is beautifully simple: find directions in feature space where the data varies the most, and project onto those directions.
The Intuition
Imagine data on people's heights and weights. These are correlated—tall people tend to be heavier. If you plot them, you get an elongated ellipse. The first principal component points along the long axis (maximum variance). The second points along the short axis (perpendicular, second-most variance).
Interactive: PCA Visualization. Drag the slider to rotate the projection axis; PC1 captures maximum variance.
PCA finds the directions of maximum variance by analyzing the covariance matrix. The covariance matrix captures how features vary together. If two features tend to increase together, they have positive covariance. If one increases when the other decreases, negative covariance. If they vary independently, zero covariance.
What most people miss is that the covariance matrix encodes the shape of your data. If your data forms an elongated ellipse, the covariance matrix captures the direction of elongation. The eigenvectors point along the axes of the ellipse; the eigenvalues tell you how stretched it is in each direction.
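Here's a tiny sketch of that claim (synthetic data; the numbers are arbitrary): generate correlated 2D data, then check that the covariance matrix's top eigenvector points along the elongated axis.

```python
import numpy as np

rng = np.random.default_rng(42)

# Correlated "height vs weight"-style data, stretched roughly along y = x
x = rng.normal(size=500)
data = np.column_stack([x, x + 0.3 * rng.normal(size=500)])

cov = np.cov(data.T)                             # 2x2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues

print("covariance:\n", cov)
print("eigenvalues:", eigenvalues)
print("top eigenvector:", eigenvectors[:, -1])   # ≈ [0.69, 0.72]: the long axis
```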
The PCA Algorithm
```python
def fit(self, X):
    # Step 1: Center the data
    self.mean_ = np.mean(X, axis=0)
    X_centered = X - self.mean_

    # Step 2: Compute covariance matrix
    cov_matrix = np.cov(X_centered.T)

    # Step 3: Eigendecomposition
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

    # Step 4: Sort by eigenvalue (descending)
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Step 5: Keep top k components
    self.components_ = eigenvectors[:, :self.n_components].T
    self.explained_variance_ = eigenvalues[:self.n_components]

    return self

def transform(self, X):
    # Step 6: Project onto principal components
    X_centered = X - self.mean_
    return X_centered @ self.components_.T
```

Why does this work? For a unit vector w, the variance of the projected data is Var(Xw) = wᵀCw, where C is the covariance matrix. The direction w that maximizes this is the eigenvector of C with the largest eigenvalue.
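As a sanity check, the from-scratch version should agree (up to sign) with scikit-learn's PCA; a quick sketch using random data as a stand-in:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 samples, 10 features

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)             # center and project onto the top 2 PCs

print(X_2d.shape)                       # (200, 2)
print(pca.explained_variance_)          # variance captured by PC1 and PC2
print(pca.explained_variance_ratio_)    # the same as a fraction of total variance
```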
Interactive: Explained Variance (Scree Plot). Choose enough components to capture ~90-95% of variance; for the example dataset shown, the elbow method suggests keeping 2 components (52.8% of variance).
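In code, choosing k is just a cumulative sum over the sorted eigenvalues. A minimal sketch (the 95% threshold is a common convention, not a rule):

```python
import numpy as np

def choose_k(eigenvalues, target=0.95):
    """Smallest k whose cumulative explained-variance ratio reaches the target."""
    ratios = eigenvalues / eigenvalues.sum()      # eigenvalues sorted descending
    cumulative = np.cumsum(ratios)
    return int(np.argmax(cumulative >= target)) + 1

# Variance concentrated in the first two directions -> keep 2 components
print(choose_k(np.array([6.0, 3.6, 0.2, 0.1, 0.1])))   # 2
```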
PCA Limitations
The big ones: PCA only captures linear structure, and it completely ignores class labels. The next three methods each tackle one of those gaps.
LDA: Maximizing Class Separation
Supervised dimensionality reduction
Linear Discriminant Analysis is like PCA's supervised cousin. Instead of maximizing variance, it maximizes the separation between classes while minimizing variance within classes. This makes it perfect for preprocessing before classification.
The Intuition
Imagine two classes of points. PCA might find that the direction of maximum variance goes right through both classes, mixing them together. LDA finds the direction that best separates the classes—pulling class means apart while keeping each class tight.
Interactive: LDA vs PCA Class Separation. LDA maximizes class separation while PCA just finds maximum variance; in the 1D projection comparison, notice how LDA separates the classes while PCA can mix them.
LDA optimizes the ratio of between-class scatter to within-class scatter. Between-class scatter measures how far apart the class means are. Within-class scatter measures how spread out points are within each class. We want high between-class scatter (classes separated) and low within-class scatter (classes compact).
J(w) = (wᵀSᴮw) / (wᵀSᵂw), where Sᴮ = between-class scatter and Sᵂ = within-class scatter.
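To make the ratio concrete, here's a small sketch (the function name and toy data are mine) that scores a single candidate direction w; the fit method below then finds the optimal directions by solving an eigenvalue problem instead of searching:

```python
import numpy as np

def fisher_score(w, X, y):
    # J(w) = (wᵀ Sᴮ w) / (wᵀ Sᵂ w) for one candidate direction w
    w = w / np.linalg.norm(w)
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    sb = np.zeros((d, d))   # between-class scatter
    sw = np.zeros((d, d))   # within-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - overall_mean).reshape(-1, 1)
        sb += len(Xc) * (diff @ diff.T)
        sw += np.cov(Xc.T) * (len(Xc) - 1)
    return float(w @ sb @ w) / float(w @ sw @ w)

# Two Gaussian blobs separated along the x-axis
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1, size=(100, 2)),
               rng.normal([5, 0], 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

print(fisher_score(np.array([1.0, 0.0]), X, y))   # large: x separates the classes
print(fisher_score(np.array([0.0, 1.0]), X, y))   # near zero: y does not
```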
```python
from scipy.linalg import eigh  # solves the generalized eigenvalue problem

def fit(self, X, y):
    classes = np.unique(y)
    n_classes = len(classes)

    # Compute class means and global mean
    self.mean_ = np.mean(X, axis=0)
    self.class_means_ = {}
    for c in classes:
        self.class_means_[c] = np.mean(X[y == c], axis=0)

    # Within-class scatter matrix
    within_class_scatter = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        X_c = X[y == c]
        within_class_scatter += np.cov(X_c.T) * (len(X_c) - 1)

    # Between-class scatter matrix
    between_class_scatter = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        n_c = np.sum(y == c)
        mean_diff = (self.class_means_[c] - self.mean_).reshape(-1, 1)
        between_class_scatter += n_c * (mean_diff @ mean_diff.T)

    # Solve the generalized eigenvalue problem: Sᴮ v = λ Sᵂ v
    eigenvalues, eigenvectors = eigh(between_class_scatter, within_class_scatter)

    # Take top (n_classes - 1) components
    idx = np.argsort(eigenvalues)[::-1]
    self.components_ = eigenvectors[:, idx[:self.n_components]].T

    return self
```

t-SNE: Preserving Local Structure
Neighborhood preservation through probability
t-distributed Stochastic Neighbor Embedding takes a completely different approach. Instead of linear projections, it uses gradient descent to position points in low dimensions so that points that are neighbors in high dimensions stay neighbors, while dissimilar points are pushed apart (their exact distances aren't preserved).
The Intuition
t-SNE models pairwise similarities. In high-D space, it computes “how similar is point i to point j?” as a probability. In low-D space, it tries to match these probabilities. The algorithm iteratively adjusts positions to minimize the difference between high-D and low-D similarity distributions.
Interactive: t-SNE Unrolling Manifolds
Compare how PCA vs t-SNE handle non-linear structure
PCA (linear) projects to the plane of maximum variance, but this mixes up points that are far apart along the manifold. Notice how different colors overlap—they shouldn't!
High-Dimensional Similarities: For each point i, we compute a probability distribution over all other points based on Gaussian similarity. The bandwidth σᵢ is chosen for each point to match a target perplexity—roughly the “effective number of neighbors.”
pⱼ|ᵢ = exp(−‖xᵢ − xⱼ‖² / 2σᵢ²) / Σₖ≠ᵢ exp(−‖xᵢ − xₖ‖² / 2σᵢ²), the conditional probability that point i would pick point j as a neighbor.
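The bandwidth search isn't shown here, so here's a rough sketch of the usual approach (helper names are mine): bisect σᵢ until the row's perplexity, 2^H(P), hits the target.

```python
import numpy as np

def perplexity_of(p):
    # Perplexity = 2^H(P), where H is the Shannon entropy in bits
    p = np.clip(p, 1e-12, None)
    return 2 ** (-np.sum(p * np.log2(p)))

def conditional_p(dist_sq, sigma):
    # p(j|i) for one point: Gaussian affinities over squared distances
    # to every *other* point (self excluded), normalized to sum to 1
    p = np.exp(-dist_sq / (2 * sigma ** 2))
    return p / p.sum()

def sigma_for_perplexity(dist_sq, target=30.0, n_iter=50):
    # Bisect the bandwidth until the row's perplexity matches the target
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        sigma = (lo + hi) / 2
        if perplexity_of(conditional_p(dist_sq, sigma)) > target:
            hi = sigma   # too many effective neighbors: shrink the bandwidth
        else:
            lo = sigma
    return sigma
```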
Low-Dimensional Similarities: In the embedding, we use a t-distribution (Student's t with 1 degree of freedom) instead of Gaussian. The heavier tails help avoid the “crowding problem” where dissimilar points get squeezed together.
qᵢⱼ = (1 + ‖yᵢ − yⱼ‖²)⁻¹ / Σₖ≠ₗ (1 + ‖yₖ − yₗ‖²)⁻¹. The t-distribution has heavier tails than a Gaussian, allowing distant points to stay far apart.
The objective is to minimize the KL divergence between P (high-D) and Q (low-D): KL(P ‖ Q) = Σᵢ Σⱼ pᵢⱼ log(pᵢⱼ / qᵢⱼ).
```python
def fit_transform(self, X):
    n = len(X)

    # Compute high-D pairwise similarities (with perplexity)
    P = self._compute_pairwise_affinities(X, self.perplexity)
    P = (P + P.T) / (2 * n)   # Symmetrize

    # Initialize embedding randomly
    Y = np.random.randn(n, self.n_components) * 0.01

    # Gradient descent
    for iteration in range(self.n_iter):
        # Compute low-D similarities (t-distribution)
        Q = self._compute_q_distribution(Y)

        # Compute gradient: ∂KL/∂yᵢ = 4 Σⱼ (pᵢⱼ - qᵢⱼ)(yᵢ - yⱼ)(1 + ‖yᵢ - yⱼ‖²)⁻¹
        gradients = np.zeros_like(Y)
        for i in range(n):
            diff = Y[i] - Y                       # (n, n_components)
            dist_sq = np.sum(diff ** 2, axis=1)
            q_factor = 1 / (1 + dist_sq)
            pq_diff = (P[i] - Q[i]) * q_factor
            gradients[i] = 4 * np.sum(pq_diff[:, np.newaxis] * diff, axis=0)

        # Update positions
        Y = Y - self.learning_rate * gradients

    return Y
```

t-SNE Quirks
The main ones, covered in the comparison below: it's O(n²) and slow, results change from run to run, and distances between well-separated clusters in the embedding don't mean much.
UMAP: Topology-Preserving Embedding
Faster, better global structure
UMAP (Uniform Manifold Approximation and Projection) arrived in 2018. Like t-SNE, it preserves local structure, but it's faster, preserves more global structure, and has better theoretical foundations in topology and manifold theory.
The Intuition
UMAP assumes your data lies on a manifold embedded in high-dimensional space. It constructs a fuzzy k-nearest-neighbor graph representing the manifold topology, then finds a low-dimensional layout of that graph. Edge weights represent “fuzzy membership”—not binary in/out, but partial membership strength.
Interactive: UMAP Cluster Separation
UMAP finds 2D layouts preserving cluster structure from 3D
Fuzzy Membership: Instead of “point j is a neighbor of i” (yes/no), UMAP says “point j has 0.8 membership in i's neighborhood.” Closer points get higher membership; distant points get membership near 0. The membership function adapts to each point's local density.
```python
def fit_transform(self, X):
    n = len(X)
    graph = np.zeros((n, n))   # fuzzy k-NN graph (edge weight = membership strength)

    # Step 1: Build fuzzy k-NN graph.
    # For each point, find k nearest neighbors,
    # the local radius ρ (distance to the nearest neighbor),
    # and the local scale σ (from a perplexity-like target).
    for i in range(n):
        distances = compute_distances(X[i], X)
        neighbors = np.argsort(distances)[1:self.n_neighbors + 1]

        rho = distances[neighbors[0]]   # Distance to nearest neighbor
        sigma = binary_search_for_sigma(distances, neighbors)

        # Fuzzy membership: exp(-max(0, d - ρ) / σ)
        for j in neighbors:
            weight = np.exp(-max(0, distances[j] - rho) / sigma)
            graph[i, j] = weight

    # Symmetrize via fuzzy union: a + b - a*b
    graph = graph + graph.T - graph * graph.T

    # Step 2: Optimize low-D layout (minimize cross-entropy)
    # (compute_distances, binary_search_for_sigma, initialize_embedding,
    #  and gradient_step are sketched helpers, not full implementations)
    Y = initialize_embedding(n)

    for epoch in range(n_epochs):
        # Attraction: pull connected points together
        # Repulsion: push disconnected points apart
        Y = gradient_step(Y, graph, learning_rate)

    return Y
```

UMAP's key hyperparameters:

| Parameter | Effect | Default |
|---|---|---|
| n_neighbors | Local vs global structure | 15 |
| min_dist | How tightly points cluster | 0.1 |
| metric | Distance metric for neighbors | euclidean |
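In practice you'd reach for the umap-learn package rather than a from-scratch version; a minimal sketch (assuming umap-learn is installed, with random data as a stand-in):

```python
import numpy as np
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))   # stand-in for real high-dimensional data

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="euclidean", random_state=42)
embedding = reducer.fit_transform(X)   # (1000, 2) layout of the fuzzy graph

print(embedding.shape)
```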
Comparing the Four
Each algorithm has different strengths and assumptions. The best choice depends on your data structure, whether you have labels, and what you want to preserve.
| Method | Type | Preserves | Speed | Labels? |
|---|---|---|---|---|
| PCA | Linear | Variance | Very Fast | No |
| LDA | Linear | Class Separation | Very Fast | Yes |
| t-SNE | Nonlinear | Local Neighborhoods | Slow | No |
| UMAP | Nonlinear | Topology | Fast | No* |
*Unsupervised by default, though UMAP also has a supervised mode that can use labels.
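To try all four on the same data, scikit-learn and umap-learn cover them out of the box; a quick sketch on the digits dataset (not the from-scratch classes built above):

```python
import umap
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 1797 samples, 64 features, 10 classes

embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X),
    "LDA": LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y),
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X),
    "UMAP": umap.UMAP(n_neighbors=15, random_state=0).fit_transform(X),
}

for name, emb in embeddings.items():
    print(name, emb.shape)   # each is (1797, 2), ready to scatter-plot
```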
PCA
- ✓ Fast, deterministic, interpretable
- ✓ Components have meaning (variance directions)
- ✓ Inverse transform exists
- ✗ Only captures linear structure
- ✗ Ignores class labels
LDA
- ✓ Uses class labels for better separation
- ✓ Great preprocessing for classification
- ✓ Fast and deterministic
- ✗ Assumes Gaussian classes
- ✗ Limited to (k−1) dimensions for k classes
t-SNE
- ✓ Excellent cluster visualization
- ✓ Captures nonlinear structure
- ✓ Reveals hidden clusters
- ✗ Slow (O(n²))
- ✗ Non-deterministic
- ✗ Global distances meaningless
UMAP
- ✓ Fast (O(n log n))
- ✓ Better global structure than t-SNE
- ✓ Scales to millions of points
- ✓ Theoretical foundation
- ✗ Hyperparameter sensitive
- ✗ Can create false structure
When to Use What
Use PCA when...
- You need a fast, interpretable baseline
- Data is roughly linear
- You want to preserve global variance
- You need inverse transform (reconstruction)
- Preprocessing for another algorithm
Use LDA when...
- You have class labels and care about classification
- You want maximum class separation
- Data is roughly Gaussian per class
- Preprocessing for a classifier
Use t-SNE when...
- Visualization is the primary goal
- You want to reveal cluster structure
- Dataset is < 10,000 points
- You're okay with non-deterministic results
Use UMAP when...
- Dataset is large (t-SNE would be too slow)
- You need some global structure preservation
- Preparing for downstream clustering
- You want to iterate quickly on hyperparameters
Final Thoughts
Happy coding, and I hope you UMAP your dimensions correctly.