All of the models we've worked with involve a training phase where the algorithm learns parameters or builds a structure. But what if I told you there's an algorithm that doesn't really “learn” anything during training?
K-Nearest Neighbors (KNN) is beautifully simple. In order to classify a new point, just look at the k closest training examples and let them vote. No weights to optimize, no trees to grow, no parameters to tune beyond choosing k itself. It's one of the most intuitive algorithms in machine learning, and yet it's surprisingly powerful.
The Intuition
The assumption behind KNN is simple: similar things should be close together in feature space. If you want to know what a new data point is, look at its neighbors. Birds of a feather flock together.
For Classification
Find the k nearest neighbors, see which class is most common among them, and predict that class.
For Regression
Find the k nearest neighbors, average their values, and predict that average.
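To make that concrete before we get to the full classifier later in the post, here is a minimal NumPy sketch of the regression case (the knn_regress helper is illustrative, not part of any library):

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=5):
    """Average the targets of the k nearest training points (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)

    # Distance from the query point to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))

    # Indices of the k closest points, then the mean of their targets
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Example: predict a value at x = 2.5 from a few 1-D training points
X = [[1.0], [2.0], [3.0], [10.0]]
y = [1.1, 1.9, 3.2, 9.8]
print(knn_regress(X, y, np.array([2.5]), k=3))  # ~2.07, the mean of 1.1, 1.9, 3.2
```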
[Interactive demo: KNN Classification. Click anywhere in the visualization to place a test point and watch its neighbors vote.]
Distance Metrics
Measuring “Nearness”
The entire algorithm hinges on one question: what does “nearest” mean? We need a way to measure distance between points in feature space.
Euclidean Distance
The most common metric is Euclidean distance, which is just the straight-line distance you learned in geometry:
$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$ (the square root of the sum of squared differences)
Manhattan Distance
Manhattan distance (also called L1 distance or taxicab distance) measures distance as if you're walking on a grid of city blocks:
$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|$$ (the sum of absolute differences)
Minkowski Distance
Minkowski distance is a generalization that includes both Euclidean and Manhattan as special cases:
$$d(\mathbf{x}, \mathbf{y}) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}$$ (a generalized distance with parameter $p$)
[Interactive demo: Distance Metrics Comparison. See how different metrics measure "distance", from straight-line Euclidean (as the crow flies) to grid-walking Manhattan.]
The parameter p controls how much we penalize differences in individual dimensions. Higher p means we care more about the single largest difference.
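A quick numerical sketch of that effect, using SciPy's distance helpers: for the points (0, 0) and (3, 4), the Minkowski distance slides from the Manhattan value (7) through the Euclidean value (5) toward the largest single-coordinate difference (4) as p grows.

```python
import numpy as np
from scipy.spatial import distance

u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])

print(distance.euclidean(u, v))   # 5.0: straight-line distance
print(distance.cityblock(u, v))   # 7.0: Manhattan distance

# As p grows, the Minkowski distance approaches max(|3 - 0|, |4 - 0|) = 4
for p in (1, 2, 4, 10, 100):
    print(p, distance.minkowski(u, v, p=p))
```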
| Metric | When to Use | Properties |
|---|---|---|
| Euclidean (p=2) | Default choice, continuous features | Sensitive to outliers, assumes comparable scales |
| Manhattan (p=1) | High dimensions, different feature types | Less sensitive to outliers, good for grid-like data |
| Minkowski (p>2) | Custom tuning needed | Flexible, approaches max difference as p increases |
Feature Scaling Matters
Because KNN is driven entirely by distances, a feature measured in the tens of thousands will drown out one measured in single digits. Standardize or normalize your features before fitting, as in the sketch below.
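A minimal sketch, assuming scikit-learn is available: put a StandardScaler in front of the classifier inside a Pipeline, so the scaling statistics learned at fit time are reused at prediction time (the toy numbers below are made up for illustration).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Toy data: feature 0 lives in the tens of thousands, feature 1 in single digits
X = np.array([[50_000, 2], [52_000, 9], [90_000, 3], [95_000, 8]], dtype=float)
y = np.array([0, 0, 1, 1])

# Without scaling, feature 0 would dominate every distance computation
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[60_000, 5]]))
```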
The KNN Algorithm: Step by Step
Beautifully Lazy
The training phase is dead simple: we literally just store the training data. This is called “lazy learning” or “instance-based learning” because all the work happens at prediction time.
```python
import numpy as np
from collections import Counter  # used by the prediction code below

def fit(self, X, y):
    """Store training data. That's it."""
    self.X_train = np.array(X)
    self.y_train = np.array(y).flatten()
    self.n_classes_ = len(np.unique(self.y_train))
    return self
```

The prediction phase is where all the computation happens:
1. Calculate distances to all training points
2. Sort by distance
3. Select the k nearest neighbors
4. Vote (classification) or average (regression)
```python
def _get_neighbors(self, x):
    """Get k nearest neighbors for a single sample."""
    distances = []
    for i, x_train in enumerate(self.X_train):
        dist = self._calculate_distance(x, x_train)
        distances.append((dist, i))

    # Sort by distance and get k nearest
    distances.sort(key=lambda item: item[0])
    k_nearest = distances[:self.n_neighbors]
    return k_nearest

def _calculate_distance(self, x1, x2):
    """Calculate distance based on chosen metric."""
    if self.metric == 'euclidean':
        return np.sqrt(np.sum((x1 - x2) ** 2))
    elif self.metric == 'manhattan':
        return np.sum(np.abs(x1 - x2))
    else:  # minkowski
        return np.power(np.sum(np.abs(x1 - x2) ** self.p), 1 / self.p)
```

```python
def predict(self, X):
    """Predict class labels for samples in X."""
    X = np.array(X)
    predictions = []

    for x in X:
        neighbors = self._get_neighbors(x)
        neighbor_labels = [self.y_train[idx] for _, idx in neighbors]

        # Majority vote
        most_common = Counter(neighbor_labels).most_common(1)[0][0]
        predictions.append(most_common)

    return np.array(predictions)
```

Choosing K: The Bias-Variance Tradeoff
The Only Hyperparameter
The hyperparameter k controls the bias-variance tradeoff in KNN:
k = 1 (very small)
High variance, low bias. The model is extremely sensitive to noise and outliers. Decision boundaries are jagged and can overfit.
k = moderate (5-20)
Balanced variance and bias. Smooth enough to ignore noise, flexible enough to capture patterns.
k = n (very large)
Low variance, high bias. Predicts the majority class everywhere. Extremely smooth but loses all local information.
[Interactive demo: Bias-Variance Tradeoff in KNN. See how k affects the decision boundary complexity.]
A good starting point is k = √n where n is the number of training samples. Use cross-validation to tune from there. For binary classification, use odd numbers to avoid ties. Generally, larger k for noisy data, smaller k when you have lots of clean data.
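Here's a sketch of that tuning loop with scikit-learn, using the iris dataset as a stand-in for your own data: score a range of odd k values with 5-fold cross-validation and keep the best.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # n = 150, so sqrt(n) is about 12

# Try odd k values around sqrt(n) and keep the best cross-validated accuracy
candidates = list(range(1, 32, 2))
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in candidates]
best_k = candidates[int(np.argmax(scores))]
print(best_k, max(scores))
```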
Weighted Voting
Not All Neighbors Are Equal
The basic KNN treats all k neighbors equally. But intuitively, closer neighbors should matter more than farther ones.
Uniform Weighting (default)
Each of the k neighbors gets one vote, regardless of distance.
Distance Weighting
Closer neighbors get more weight: weight = 1/distance
```python
def predict_weighted(self, X):
    """Predict using distance-weighted voting."""
    X = np.array(X)
    predictions = []

    for x in X:
        neighbors = self._get_neighbors(x)

        # Weight by inverse distance
        class_weights = {}
        for dist, idx in neighbors:
            label = self.y_train[idx]
            weight = 1 / (dist + 1e-8)  # Avoid division by zero
            class_weights[label] = class_weights.get(label, 0) + weight

        # Predict class with highest total weight
        prediction = max(class_weights, key=class_weights.get)
        predictions.append(prediction)

    return np.array(predictions)
```

The Curse of Dimensionality
The Achilles' Heel
Here's where KNN runs into serious problems: high-dimensional spaces. As dimensions increase, distances become less meaningful.
The Problem
- All points become roughly equidistant from each other
- The concept of “nearest” neighbor breaks down
- You need exponentially more data to maintain the same density
This is because in $d$ dimensions, the volume of a hypersphere grows as $r^d$. To maintain the same density of points, you need exponentially more samples as $d$ increases.
[Chart: Points Needed to Cover Space (spacing = 0.1), plotted against the number of dimensions.]
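Here's a small sketch of the equidistance effect with plain NumPy: sample random points in the unit cube and compare the nearest and farthest distances from a query point. The relative gap between them collapses as the dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Relative gap between the farthest and nearest point from a random query:
# it shrinks toward zero as dimensionality grows ("distance concentration")
for d in (2, 10, 100, 1000):
    points = rng.random((1000, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    print(d, round((dists.max() - dists.min()) / dists.min(), 3))
```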
Solutions
Reduce the dimensionality before applying KNN (feature selection or a projection such as PCA), or switch to the approximate-search techniques covered in "Beyond the Basics" below.
When to Use KNN
KNN is often used as a baseline because it's so simple to implement and understand. If a more complex model can't beat KNN, something's probably wrong.
✓ KNN Shines When
- Small to medium-sized datasets
- Low-dimensional features (< 20)
- Decision boundaries are irregular
- You need interpretable predictions
- Training time doesn't matter
- Semi-supervised learning scenarios
✗ Avoid KNN When
- High-dimensional data
- You need fast predictions
- Your dataset is huge (millions+)
- Features are on different scales
- Your data is sparse
- More noise than signal
Beyond the Basics
KNN's simplicity is both its strength and weakness. Here are some techniques to address its limitations and scale it to real-world problems:
Ball Trees & KD-Trees
Data structures that enable O(log n) neighbor searches instead of O(n). Essential for large datasets.
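With scikit-learn, you don't build these structures by hand; KNeighborsClassifier exposes them through its algorithm parameter. A minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 'kd_tree' (or 'ball_tree') builds a tree index at fit time, so queries
# no longer require scanning every training point
clf = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
clf.fit(X, y)
print(clf.predict(X[:3]))
```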
Approximate Nearest Neighbors
Libraries like Annoy, Faiss, and ScaNN sacrifice some accuracy for massive speedups on million-scale datasets.
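As one example, here's a rough sketch with Annoy (treat the exact calls as from memory and double-check them against the library's docs):

```python
import numpy as np
from annoy import AnnoyIndex

dim = 64
rng = np.random.default_rng(0)

# Index 10,000 random vectors under Euclidean distance
index = AnnoyIndex(dim, 'euclidean')
for i in range(10_000):
    index.add_item(i, rng.random(dim).tolist())
index.build(10)  # 10 random-projection trees: more trees, better recall

# Approximate 5 nearest neighbors of a query vector
query = rng.random(dim).tolist()
print(index.get_nns_by_vector(query, 5))
```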
Locality Sensitive Hashing
Hash functions that map similar items to the same buckets, enabling sublinear search time.
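A from-scratch sketch of the random-hyperplane flavor of LSH: hash each point by the sign pattern of a few random projections, so nearby points (in angle) tend to land in the same bucket and only that bucket needs to be searched.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_bits = 32, 12

# Each point's hash is the sign pattern of its projections onto random hyperplanes
planes = rng.normal(size=(n_bits, dim))

def lsh_hash(v):
    return tuple((planes @ v > 0).astype(int))

# Index a dataset into hash buckets
data = rng.normal(size=(5000, dim))
buckets = defaultdict(list)
for i, v in enumerate(data):
    buckets[lsh_hash(v)].append(i)

# Query: only the points sharing the query's bucket are candidate neighbors
query = data[0] + 0.01 * rng.normal(size=dim)
candidates = buckets[lsh_hash(query)]
print(len(candidates), 0 in candidates)  # the near-duplicate point 0 is usually here
```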
Weighted Features
Not all features are equally important. Learn feature weights through cross-validation or gradient descent.
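One lightweight way to do that (a sketch of the cross-validation route, not the only option): multiply each feature by a candidate weight before fitting, since scaling a feature by w is the same as weighting it by w^2 in squared Euclidean distance.

```python
import numpy as np
from itertools import product
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Grid-search per-feature scale factors; scaling by w weights the feature by w^2
best_weights, best_score = None, -1.0
for w in product([0.5, 1.0, 2.0], repeat=X.shape[1]):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                            X * np.array(w), y, cv=5).mean()
    if score > best_score:
        best_weights, best_score = w, score
print(best_weights, best_score)
```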
Kernel KNN
Use kernel functions to implicitly map data to higher dimensions where it might be more separable.
Happy coding, and keep your neighbors close.