This one was a bit math heavy, but here we go. Support Vector Machines (SVMs) are one of the most elegant algorithms in machine learning. While we've built classifiers that find a decision boundary, SVMs find the best decision boundary—one with maximum margin from both classes.
The beauty of SVMs lies in their mathematical foundation. They transform a seemingly simple geometric idea into a sophisticated optimization problem, then use a clever trick (the kernel trick) to handle non-linear patterns. Understanding SVMs deeply means understanding convex optimization, Lagrange multipliers, and duality theory.
Let's build an SVM from scratch and work through all the math. It's going to get intense, but I'll try my best to make it as intuitive as possible.
The Maximum Margin Intuition
Finding the optimal decision boundary
Imagine you have two classes of data points that are linearly separable. There are infinitely many lines you could draw to separate them. Which one should you choose?
SVMs choose the line that's as far as possible from the nearest points of both classes. This distance is called the margin, and the line that maximizes it is the maximum margin hyperplane.
Why maximize the margin?
Intuitively, a larger margin means the classifier is more “confident” and should generalize better to new data. Points near the decision boundary are uncertain; staying far from them reduces the chance of misclassification due to noise or variation.
Interactive: Maximum Margin Hyperplane
Support Vectors Are Key
The points that lie exactly on the margin boundaries are called support vectors. These are the critical points that completely determine the decision boundary. You could remove all other points and get the same hyperplane. This is why the algorithm is called Support Vector Machine.
The Primal Optimization Problem
Formalizing the optimization
Let's formalize this. We have training data {(x₁, y₁), …, (xₙ, yₙ)} where xᵢ ∈ ℝᵈ and yᵢ ∈ {−1, +1}.
We want to find a hyperplane defined by:
w · x + b = 0, where w is the normal vector and b is the bias.
For a point xᵢ, the signed distance to the hyperplane is:
dᵢ = (w · xᵢ + b) / ||w||   (the signed distance from xᵢ to the hyperplane)
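As a quick sanity check, here's a minimal sketch that evaluates this distance for a handful of toy points against a made-up hyperplane (w, b, and the points are illustrative assumptions, not fitted values):

import numpy as np

# Illustrative hyperplane and toy points (not fitted values)
w = np.array([1.0, -1.0])
b = 0.5
X = np.array([[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [-1.0, 1.5]])
y = np.array([1, 1, -1, -1])

# Signed distance of each point to the hyperplane w·x + b = 0
dist = (X @ w + b) / np.linalg.norm(w)

# Multiplying by y_i flips the sign so correctly classified points come out positive
margin = y * dist
print(dist)
print(margin)
print("closest points:", np.argsort(margin)[:2])  # candidates for support vectors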
For the hyperplane to correctly classify all points with margin γ, we need yᵢ(w · xᵢ + b) / ||w|| ≥ γ for every i.
The yᵢ ensures we're measuring distance on the correct side (positive for +1 class, negative for -1 class). By convention, we scale w and b so that the closest points (support vectors) satisfy yᵢ(w · xᵢ + b) = 1.
This sets γ = 1/||w||, so maximizing the margin becomes:

minimize (1/2)||w||²   subject to   yᵢ(w · xᵢ + b) ≥ 1 for all i
The (1/2) is for mathematical convenience (makes the derivative cleaner). This is a convex quadratic programming problem with linear inequality constraints.
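To make that concrete, here's a hedged sketch that hands this exact program to a generic solver (scipy.optimize.minimize with SLSQP) on a four-point toy set. Real SVM libraries don't train this way; the point is only to show that the problem really is a small constrained minimization. The data and starting point are assumptions.

import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (assumed for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    w = params[:2]
    return 0.5 * np.dot(w, w)            # (1/2)||w||^2

def margin_constraints(params):
    w, b = params[:2], params[2]
    return y * (X @ w + b) - 1           # each entry must be >= 0

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]),
               constraints=[{"type": "ineq", "fun": margin_constraints}],
               method="SLSQP")
w_opt, b_opt = res.x[:2], res.x[2]
print("w:", w_opt, "b:", b_opt, "margin:", 1 / np.linalg.norm(w_opt))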
Soft Margin: Allowing Mistakes
Handling real-world data
Real data isn't always linearly separable. We need to allow some points to be on the wrong side of the margin or even misclassified. We introduce slack variables ξᵢ ≥ 0 for each point:
ξᵢ = 0: point correctly classified with full margin.
0 < ξᵢ ≤ 1: correctly classified but inside the margin.
ξᵢ > 1: misclassified (on the wrong side of the boundary).
Now we want to minimize margin violations while still maximizing the margin:

minimize (1/2)||w||² + C Σξᵢ   subject to   yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for all i
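A useful identity: for fixed w and b, the optimal slack is ξᵢ = max(0, 1 − yᵢ(w · xᵢ + b)), which is exactly the hinge loss. That means the soft-margin objective can be evaluated directly, as in this small sketch (the weights and data are made-up illustrative values):

import numpy as np

def soft_margin_objective(w, b, X, y, C):
    # Optimal slack for fixed (w, b) is the hinge loss max(0, 1 - y_i(w·x_i + b))
    slack = np.maximum(0, 1 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * slack.sum()

# Made-up values, purely for illustration
X = np.array([[2.0, 2.0], [0.5, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
print(soft_margin_objective(np.array([1.0, 1.0]), -1.5, X, y, C=1.0))  # 2.0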
The parameter C controls the tradeoff between margin size and violations:
Large C (> 100)
Few violations allowed, focus on classifying training data correctly. Narrow margin. May overfit.
Small C (< 1)
More violations allowed, focus on maximizing margin. Wider margin. May underfit.
Lagrangian and Duality
The mathematical machinery
Solving the primal problem directly is hard with inequality constraints. This is where Lagrange multipliers and duality theory come in. The Lagrangian function incorporates the constraints via multipliers αᵢ ≥ 0:
# Lagrangian for soft-margin SVM
L(w, b, ξ, α, β) = (1/2)||w||² + C * Σξᵢ
                   - Σαᵢ[yᵢ(w · xᵢ + b) - 1 + ξᵢ]
                   - Σβᵢξᵢ

To find the optimal point, we take derivatives with respect to the primal variables (w, b, ξ) and set them to zero. These are the KKT (Karush-Kuhn-Tucker) stationarity conditions:
∂L/∂w = 0  ⟹  w = Σαᵢyᵢxᵢ   (the weights are a linear combination of support vectors)
∂L/∂b = 0  ⟹  Σαᵢyᵢ = 0   (the sum of weighted labels must be zero)
∂L/∂ξᵢ = 0  ⟹  αᵢ = C − βᵢ, hence 0 ≤ αᵢ ≤ C   (the alphas are bounded by C)
Substituting these back into the Lagrangian gives us the dual problem:

maximize   Σαᵢ − (1/2) Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ (xᵢ · xⱼ)
subject to   0 ≤ αᵢ ≤ C   and   Σαᵢyᵢ = 0
This is beautiful because instead of optimizing over w (d-dimensional), we're optimizing over α (n-dimensional, but sparse). The dual problem only depends on inner products xᵢ·xⱼ—this will be crucial for the kernel trick.
Understanding the Dual Variables
• αᵢ = 0: Point is correctly classified outside the margin (doesn't affect boundary)
• 0 < αᵢ < C: Point is a support vector on the margin boundary
• αᵢ = C: Point violates the margin (bounded support vector)
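Once the dual is solved, α is all you need. Here's a minimal sketch of recovering w from the stationarity condition and predicting with the equivalent dual form. The two-point toy set below is chosen so that α = [0.25, 0.25], b = −1 happens to be the exact solution; treat the numbers as illustrative.

import numpy as np

def recover_weights(alpha, X, y):
    # KKT stationarity: w = sum_i alpha_i * y_i * x_i
    return (alpha * y) @ X

def decision_function(x_new, alpha, X, y, b):
    # Equivalent dual form: f(x) = sum_i alpha_i y_i (x_i · x) + b
    return np.sum(alpha * y * (X @ x_new)) + b

# Two-point toy problem with a known solution
X = np.array([[2.0, 2.0], [0.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])
b = -1.0

print(recover_weights(alpha, X, y))                             # [0.5 0.5]
print(decision_function(np.array([2.0, 2.0]), alpha, X, y, b))  # +1.0, on the margin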
Sequential Minimal Optimization (SMO)
Efficient optimization
Now we need to solve the dual problem. Traditional quadratic programming solvers are slow. The SMO algorithm, invented by John Platt, is much faster.
The key insight: if we fix all α values except two, the problem has an analytical closed-form solution. SMO repeatedly selects pairs of α values and optimizes them jointly.
Why Two Variables?
We can't optimize one α at a time because of the constraint Σαᵢyᵢ = 0; changing a single α would violate it. But with two variables, we can write:

α₁y₁ + α₂y₂ = constant (equal to minus the sum of αᵢyᵢ over all the αs we're holding fixed)
This means α₁ is determined by α₂, giving us a one-dimensional optimization problem.
def smo_algorithm(X, y, C, tol=1e-3, max_passes=5):
    """Sequential Minimal Optimization for SVM."""
    n_samples = len(y)
    alpha = np.zeros(n_samples)
    b = 0
    E = -y.copy()  # Error cache

    passes = 0
    while passes < max_passes:
        num_changed = 0

        for i in range(n_samples):
            # Check KKT conditions for alpha[i]
            Ei = E[i]

            if (y[i] * Ei < -tol and alpha[i] < C) or \
               (y[i] * Ei > tol and alpha[i] > 0):

                # Select j != i (maximize |Ei - Ej|)
                j = select_second_alpha(i, Ei, E)
                Ej = E[j]

                # Save old alphas
                alpha_i_old = alpha[i]
                alpha_j_old = alpha[j]

                # Compute bounds L and H
                if y[i] != y[j]:
                    L = max(0, alpha[j] - alpha[i])
                    H = min(C, C + alpha[j] - alpha[i])
                else:
                    L = max(0, alpha[i] + alpha[j] - C)
                    H = min(C, alpha[i] + alpha[j])

                if L == H:
                    continue

                # Compute eta (second derivative)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if eta >= 0:
                    continue

                # Update alpha[j]
                alpha[j] = alpha_j_old - y[j] * (Ei - Ej) / eta
                alpha[j] = np.clip(alpha[j], L, H)

                # Update alpha[i]
                alpha[i] = alpha_i_old + y[i] * y[j] * (alpha_j_old - alpha[j])

                # Update bias and error cache
                b = update_bias(b, alpha, i, j, Ei, Ej, ...)
                E = update_errors(E, alpha, X, y, b)

                num_changed += 1

        if num_changed == 0:
            passes += 1
        else:
            passes = 0

    return alpha, b

The analytical solution for updating α₂ is:
α₂ ← α₂ − y₂(E₁ − E₂) / η, where η = 2K(x₁,x₂) − K(x₁,x₁) − K(x₂,x₂) and Eᵢ = f(xᵢ) − yᵢ is the prediction error.
After updating α₂ (clipped to bounds [L, H]), we compute α₁:
# Update alpha_1 from the constraint
alpha_1_new = alpha_1_old + y1 * y2 * (alpha_2_old - alpha_2_new)

# Update bias using KKT conditions
b1 = b - E1 - y1*(alpha_1_new - alpha_1_old)*K[i1,i1] \
       - y2*(alpha_2_new - alpha_2_old)*K[i1,i2]
b2 = b - E2 - y1*(alpha_1_new - alpha_1_old)*K[i1,i2] \
       - y2*(alpha_2_new - alpha_2_old)*K[i2,i2]

if 0 < alpha_1_new < C:
    b_new = b1
elif 0 < alpha_2_new < C:
    b_new = b2
else:
    b_new = (b1 + b2) / 2

The Kernel Trick
The magic of implicit feature maps
Here's where things get really cool. Look at the dual formulation—it only uses inner products xᵢ·xⱼ. What if we replace those inner products with a more general function K(xᵢ, xⱼ)?
If K is a valid kernel (satisfies Mercer's condition), it corresponds to an inner product in some (possibly infinite-dimensional) feature space:

K(xᵢ, xⱼ) = φ(xᵢ) · φ(xⱼ)
The magic: we never need to compute φ(x) explicitly! The kernel computes the inner product in the transformed space directly.
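Here's a tiny numerical check of that claim using my own toy example: the degree-2 polynomial kernel (x · z)² gives exactly the same number as explicitly mapping 2-D points through φ(x) = (x₁², √2·x₁x₂, x₂²) and taking an ordinary dot product.

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D inputs
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly2_kernel(x, z):
    # Same quantity, computed without ever building phi
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(np.dot(phi(x), phi(z)))   # 16.0
print(poly2_kernel(x, z))       # 16.0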
The Kernel Trick Visualized
Different kernels work best for different data distributions
Popular kernels include:
def linear_kernel(x1, x2):
    return np.dot(x1, x2)

No transformation. Standard linear SVM.

def poly_kernel(x1, x2, d=3, r=1):
    return (r + np.dot(x1, x2)) ** d

Maps to polynomial feature space. Captures curves and interactions.

def rbf_kernel(x1, x2, gamma=0.1):
    return np.exp(-gamma * np.linalg.norm(x1 - x2) ** 2)

Maps to infinite dimensions! Most versatile kernel.
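Whichever kernel you choose, training needs the Gram matrix K[i, j] = K(xᵢ, xⱼ) and prediction only needs kernel evaluations against the support vectors. Here's a hedged sketch, assuming X is an (n, d) NumPy array and that alpha and b come from a solver like the SMO routine above:

import numpy as np

def gram_matrix(X, kernel):
    # K[i, j] = kernel(x_i, x_j), the matrix the dual/SMO training step works with
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

def kernel_decision(x_new, alpha, X, y, b, kernel):
    # f(x) = sum_i alpha_i y_i K(x_i, x) + b; only support vectors (alpha_i > 0) contribute
    sv = alpha > 1e-8
    return sum(a * yi * kernel(xi, x_new)
               for a, yi, xi in zip(alpha[sv], y[sv], X[sv])) + b

# e.g. kernel_decision(x, alpha, X, y, b, rbf_kernel)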
Why Kernels Work
Mercer's condition says that if K produces a positive semidefinite kernel matrix for every finite set of points, then some feature map φ exists with K(x, z) = φ(x) · φ(z). The dual never asks for φ itself, only for those inner products, so swapping the dot product for K quietly moves the linear SVM into that richer feature space.
Hyperparameters and Tuning
Tuning for performance
SVMs have several key hyperparameters that can play a large role in how the model trains:
Interactive: Hyperparameter Effects
Adjust the parameters to see how they affect the decision boundary and model complexity
Small C: wide margin, more violations allowed. May underfit.
Moderate C: balanced tradeoff. Good default starting point.
Large C: narrow margin, fewer violations. May overfit.
| Parameter | Effect | Tuning Guide |
|---|---|---|
| C | Regularization strength | Start with 1, try {0.1, 1, 10, 100} |
| γ (gamma) | RBF kernel width | Start with 1/n_features, try {0.01, 0.1, 1, 10} |
| degree | Polynomial degree | Usually 2 or 3, higher prone to overfitting |
| kernel | Feature mapping | Start with RBF, try linear if data is high-dim |
A general tuning strategy:
1. Start with the RBF kernel (most versatile)
2. Grid search C ∈ {0.1, 1, 10, 100} and γ ∈ {0.01, 0.1, 1, 10}
3. Use cross-validation to find the best combination (see the sketch after this list)
4. If RBF doesn't work well, try polynomial or linear
5. Look at support vector count—too few = underfitting, too many = overfitting
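In scikit-learn terms, steps 1 through 3 might look like the sketch below; the dataset is a stand-in and the grid values simply mirror the table above.

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Stand-in dataset; swap in your own X, y
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.01, 0.1, 1, 10],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
print("support vectors:", search.best_estimator_.n_support_.sum())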
When to Use SVMs
SVMs are not the flashiest models in 2025, but they remain devastatingly useful. They shine in specific situations:
✓ SVMs Shine When
- Dataset isn't huge (100s to 10,000s of samples)
- Many more features than samples (high-dimensional)
- You need a strong baseline before going deep
- Clear margin of separation exists
- You want probabilistic outputs (with Platt scaling)
- Memory efficiency matters at prediction time
✗ Avoid SVMs When
- Dataset is massive (millions of samples)
- Training time matters (kernel SVMs scale O(n²) to O(n³))
- You need online learning (streaming data)
- Kernel selection is unclear
- Data is very noisy with lots of overlap
- Interpretability of features is critical
Beyond the Basics
Though we've built SVMs from scratch, here are various ways to extend and optimize:
Multi-class SVMs
SVMs are inherently binary classifiers. For multi-class, use one-vs-rest (OvR) or one-vs-one (OvO) strategies. Sklearn handles this automatically.
SVM Regression (SVR)
SVMs can do regression too! Instead of maximizing margin between classes, SVR fits a tube of width ε around the data and penalizes points outside it.
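For example, a minimal scikit-learn sketch (the noisy sine data is just a stand-in):

import numpy as np
from sklearn.svm import SVR

# Stand-in regression data: a noisy sine curve
X = np.linspace(0, 6, 100).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.RandomState(0).randn(100)

# epsilon sets the half-width of the penalty-free tube around the fit
model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print(model.predict([[1.5]]))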
Online SVMs
Pegasos and other algorithms enable stochastic gradient descent for SVMs, making them feasible for larger datasets.
Kernel Approximation
Random Fourier Features and Nyström approximation let you approximate RBF kernels in linear time, enabling SVMs on millions of samples.
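Here's a sketch of that idea with scikit-learn's RBFSampler (random Fourier features) feeding a plain linear SVM; the dataset size and n_components are placeholder choices:

from sklearn.kernel_approximation import RBFSampler
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

# Placeholder dataset
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Approximate the RBF feature map with random Fourier features,
# then train an ordinary linear SVM in that explicit space
rff = RBFSampler(gamma=0.1, n_components=300, random_state=0)
X_feat = rff.fit_transform(X)
clf = LinearSVC().fit(X_feat, y)
print(clf.score(X_feat, y))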
Platt Scaling
SVMs output distances, not probabilities. Platt scaling fits a sigmoid to convert SVM outputs to calibrated probability estimates.
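In scikit-learn this is a single flag: probability=True fits the Platt sigmoid internally via cross-validation (toy data assumed):

from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)

# probability=True fits a sigmoid on the decision values (Platt scaling)
clf = SVC(kernel="rbf", probability=True).fit(X, y)
print(clf.predict_proba(X[:3]))   # calibrated class probabilities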
Happy coding, and I hope you got the support you needed.