Random forests taught us the power of ensembles. Combine many weak learners and you get a strong learner. But random forests build trees independently and average them. What if instead of building trees in parallel, we built them sequentially, where each new tree corrects the mistakes of all previous trees?
That's gradient boosting. It's one of the most powerful machine learning algorithms, dominating competitions and real-world applications. XGBoost, LightGBM, and CatBoost are all variations of this core idea. Today we're building gradient boosting from scratch to understand exactly how it works—from the initial prediction to the sequential tree fitting to the final ensemble.
This is going to be math-heavy, but I'll make it as intuitive as possible. By the end, you'll understand why gradient boosting is called “gradient” boosting and how it's essentially gradient descent in function space.
Interactive: Bagging vs Boosting. Compare the two fundamental ensemble approaches: in bagging (random forests), trees are built independently and in parallel, each seeing a different bootstrap sample of the data, while boosting builds trees sequentially.
The Core Idea: Learning from Mistakes
Imagine you're trying to predict house prices. Your first model predicts $200k for every house (just using the mean). Obviously this is terrible—some houses should be $100k, others $500k.
Instead of starting over with a completely different model, what if you built a second model that predicts the errors of the first model? Then your combined prediction would be:

Prediction(x) = Model₁(x) + Model₂(x)

Each new model predicts what previous models got wrong.
The second model doesn't try to predict house prices directly. It predicts “how much did the first model miss by?” If the first model predicted $200k for a house actually worth $300k, the second model learns to predict +$100k for that type of house.
Now you have a better model. But it's still not perfect. So you add a third model that predicts the errors of (Model₁ + Model₂). And a fourth that predicts the errors of (Model₁ + Model₂ + Model₃). Each model incrementally improves the ensemble by focusing on what previous models got wrong.
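To make this concrete, here's a toy numerical version of that story. All numbers are made up for illustration, and the 80% correction simply stands in for whatever a real second model would learn from the residuals.

```python
import numpy as np

prices = np.array([100_000, 200_000, 300_000], dtype=float)

model_1 = np.full(3, prices.mean())   # predicts $200k for every house
residuals = prices - model_1          # [-100000., 0., 100000.]

# A second model trained on these residuals learns corrections like "+$100k
# for that type of house". Here we pretend it recovers 80% of each error.
model_2_correction = 0.8 * residuals
combined = model_1 + model_2_correction

print(prices - combined)              # remaining errors: [-20000., 0., 20000.]
```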
Interactive: Sequential Residual Fitting. Each tree fits the residuals (errors) of the previous ensemble; this is boosting.
Why Decision Trees?
Gradient boosting can work with any base learner—linear models, neural networks, even other ensembles—but decision trees are the standard choice. Why?
- Non-linearity: trees handle complex interactions without feature engineering.
- Weak by design: shallow trees (depth 3-5) underfit on their own, which is perfect for boosting.
- Fast training: trees train quickly, making sequential training feasible.
- Mixed data types: no need to encode categoricals or scale features.
In gradient boosting, we typically use shallow regression trees (even for classification, as we'll see). These are stumps or small trees that output a single numeric value at each leaf. Deep trees would overfit and wouldn't ensemble well.
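To see just how weak the base learner is, here's a quick sketch (assuming NumPy and scikit-learn): a depth-3 regression tree can emit at most 2³ = 8 distinct leaf values, so on its own it badly underfits a smooth target, which is exactly what boosting wants.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

weak_tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# A depth-3 tree has at most 8 leaves, so its output is a coarse step function.
print(len(np.unique(weak_tree.predict(X))))   # <= 8 distinct predicted values
```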
Gradient Boosting for Regression
Step-by-step implementation
The algorithm proceeds in steps. First, we initialize with a constant prediction:
Initialize with Mean
F₀(x) = mean(y)

Start with the simplest possible prediction: the average of all target values. This minimizes MSE for a constant prediction.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit(self, X, y):
    self.initial_prediction = np.mean(y)
    F = np.full(len(y), self.initial_prediction)  # Current predictions
    self.trees = []

    for m in range(self.n_estimators):
        # Step 1: Compute residuals
        residuals = y - F

        # Step 2: Fit tree to residuals
        tree = DecisionTreeRegressor(max_depth=self.max_depth)
        tree.fit(X, residuals)
        self.trees.append(tree)

        # Step 3: Update predictions
        predictions = tree.predict(X)
        F += self.learning_rate * predictions
```

Let's break down what happens at each iteration:
Compute Residuals

rᵢ = yᵢ − Fₘ₋₁(xᵢ)

The residual is how far off our current prediction is. For MSE loss, the negative gradient is exactly the residual.

Fit Tree to Residuals

hₘ(x) ← fit(X, residuals)

Train a regression tree where the target is the residuals, not the original y. This tree learns "how much should I add to fix errors?"

Update the Ensemble

Fₘ(x) = Fₘ₋₁(x) + ν · hₘ(x)

Add the tree's predictions, scaled by the learning rate ν, to our running ensemble prediction.
After M iterations, the final ensemble prediction is:

F_M(x) = F₀(x) + ν · (h₁(x) + h₂(x) + … + h_M(x))

The initial prediction plus the sum of all tree corrections, scaled by the learning rate.
```python
def predict(self, X):
    # Start with the initial mean prediction
    F = np.full(len(X), self.initial_prediction)

    # Add each tree's contribution
    for tree in self.trees:
        F += self.learning_rate * tree.predict(X)

    return F
```

The Gradient in Gradient Boosting
Gradient descent in function space
Here's where the “gradient” comes from. Gradient boosting is gradient descent in function space. In normal gradient descent, we minimize a loss L(w) by updating the parameters:

w ← w − η · ∂L/∂w

Standard gradient descent updates the parameters.
In gradient boosting, we minimize a loss L(F) by updating the function F:

F ← F − ν · ∂L/∂F

Gradient boosting updates the function itself.
∂L/∂F isn't a single number—it's a vector with one value for each training sample: the gradient of the loss with respect to each prediction. Its negative is what each new tree is fit to. For MSE loss L = ½(y − F)²:

−∂L/∂F = y − F

For MSE, the negative gradient equals the residuals.
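If you want to convince yourself numerically, a quick finite-difference check (a throwaway sketch, not part of the implementation) shows that the negative gradient of the MSE loss really is the residual vector:

```python
import numpy as np

def mse_loss(y, F):
    return 0.5 * np.sum((y - F) ** 2)

y = np.array([3.0, -0.5, 2.0])
F = np.array([2.5, 0.0, 1.0])

# Differentiate the loss w.r.t. each prediction F_i numerically.
eps = 1e-6
grad = np.array([
    (mse_loss(y, F + eps * np.eye(3)[i]) - mse_loss(y, F - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])

print(-grad)   # ~[ 0.5 -0.5  1. ]  (negative gradient)
print(y - F)   #  [ 0.5 -0.5  1. ]  (residuals)
```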
The Key Insight

Fitting a tree to the residuals is the same as fitting it to the negative gradient of the loss. Each tree therefore approximates one gradient descent step, and adding it to the ensemble (scaled by ν) moves the predictions downhill on the loss.
Interactive: Gradient Descent in Function Space. The gradient tells us how to adjust predictions; trees approximate this gradient.
Different Loss Functions
Beyond MSE
The beauty of gradient boosting is you can use any differentiable loss function. Just compute its gradient with respect to predictions.
Interactive: Loss Functions & Their Gradients. Different loss functions lead to different gradient signals for tree fitting.
| Loss | Formula | Negative Gradient | Best For |
|---|---|---|---|
| MSE (L2) | ½(y - F)² | y - F (residuals) | General regression |
| MAE (L1) | \|y - F\| | sign(y - F) | Robust to outliers |
| Huber | Combination of L1/L2 | Smooth combination | Best of both worlds |
| Log-loss | −[y log(p) + (1−y) log(1−p)] | y - p (for classification) | Binary classification |
Each loss function gives different pseudo-residuals to fit. MSE penalizes errors quadratically, so large errors (and outliers) dominate. MAE only uses the sign of the error, which makes it robust to outliers. Huber smoothly transitions between the two.
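As a sketch of how little the algorithm cares which loss you pick, here is a generic boosting loop that only needs a function returning the negative gradient at the current predictions. The helper names (neg_gradient_mse, neg_gradient_mae, boost_with_loss) are illustrative, not from any library, and the constant initialization with the mean is only exact for MSE.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def neg_gradient_mse(y, F):
    return y - F                 # plain residuals

def neg_gradient_mae(y, F):
    return np.sign(y - F)        # only the sign of each error

def boost_with_loss(X, y, neg_gradient, n_estimators=100, learning_rate=0.1, max_depth=3):
    F0 = np.mean(y)              # constant start (the MSE-optimal constant)
    F = np.full(len(y), F0)
    trees = []
    for _ in range(n_estimators):
        pseudo_residuals = neg_gradient(y, F)                        # "what to fix"
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, pseudo_residuals)
        F += learning_rate * tree.predict(X)                         # take the step
        trees.append(tree)
    return F0, trees
```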
Gradient Boosting for Classification
Predicting log-odds
Classification is trickier because labels are discrete (0 or 1), but we need continuous predictions to fit trees to. The trick: predict log-odds instead of probabilities, and use log-loss (binary cross-entropy).
Initialization
Start with the log-odds of the positive class:

F₀(x) = log(p / (1 − p)),  where p = mean(y)

Initialize with the log-odds of the positive-class rate.
```python
class GradientBoostingClassifier:
    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def fit(self, X, y):
        # Initialize with log-odds
        p_positive = np.mean(y)
        self.initial_prediction = np.log(p_positive / (1 - p_positive))
        F = np.full(len(y), self.initial_prediction)
        self.trees = []

        for m in range(self.n_estimators):
            # Convert log-odds to probabilities
            probabilities = self._sigmoid(F)

            # Negative gradient of log-loss = y - p
            residuals = y - probabilities

            # Fit tree to negative gradient
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            self.trees.append(tree)

            # Update log-odds
            F += self.learning_rate * tree.predict(X)

    def predict_proba(self, X):
        F = np.full(len(X), self.initial_prediction)
        for tree in self.trees:
            F += self.learning_rate * tree.predict(X)
        return self._sigmoid(F)

    def predict(self, X):
        return (self.predict_proba(X) > 0.5).astype(int)
```

For log-loss L = −[y log(p) + (1−y) log(1−p)], the negative gradient with respect to the log-odds F is simply y − p, the difference between the true labels and the predicted probabilities. This is remarkably similar to MSE residuals!
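A quick usage sketch on synthetic data. It assumes the class above also has an __init__ that stores n_estimators, learning_rate, and max_depth; that part isn't shown in the snippet.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical constructor arguments, assuming the __init__ described above.
clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
clf.fit(X_train, y_train)

print(clf.predict_proba(X_test[:5]))                       # probabilities in (0, 1)
print("test accuracy:", np.mean(clf.predict(X_test) == y_test))
```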
Why Gradient Boosting Works So Well
The power of sequential error correction
Gradient boosting dominates machine learning competitions and real-world applications for several fundamental reasons:
Sequential Error Correction
Unlike random forests where each tree independently solves the whole problem, gradient boosting trees specialize. First trees learn major patterns, later trees focus on the hardest cases.
Principled Optimization
It's literally minimizing loss by taking steps in the direction of the negative gradient. Any differentiable loss works: ranking, survival analysis, custom business metrics.
Automatic Feature Interactions
Trees naturally capture interactions like "if income > 50k AND age < 30". No manual feature engineering required.
Implicit Regularization
Small learning rates ensure no single tree dominates. Early trees can't overfit because contributions are dampened. The ensemble builds up slowly.
Robust to Scale
Trees make splits based on ordering, not absolute values. No preprocessing needed for different scales, skewed distributions, or mixed types.
Free Feature Importance
Every split provides an information gain. Sum gains for each feature across all trees for a built-in importance measure.
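As a rough sketch of that idea applied to the scratch regressor built earlier (it assumes the fitted model exposes its scikit-learn trees as model.trees, as in the fit() code above), the per-tree impurity-based gains can simply be summed across the ensemble:

```python
import numpy as np

def ensemble_feature_importance(model, n_features):
    """Sum each tree's impurity-based gains per feature, then renormalize."""
    importance = np.zeros(n_features)
    for tree in model.trees:
        importance += tree.feature_importances_   # sklearn's per-tree (normalized) gains
    return importance / importance.sum()
```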
Hyperparameters
The complete guide
Gradient boosting has many hyperparameters, but understanding their interactions makes tuning manageable.
The Core Tradeoff: n_estimators × learning_rate
These two work together to control the ensemble's capacity:
Interactive: Learning Rate & n_estimators Tradeoff. A lower learning rate plus more trees generalizes better (with early stopping).
Best practice: pair a small learning rate with many trees and use early stopping; this almost always generalizes better than a large learning rate with few trees.
| Parameter | Typical Range | Effect | When to Increase |
|---|---|---|---|
| learning_rate | 0.01 - 0.3 | Step size per tree | Underfitting (with more trees) |
| n_estimators | 100 - 1000 | Number of trees | With lower learning rate |
| max_depth | 3 - 8 | Tree complexity | Complex feature interactions |
| min_samples_leaf | 1 - 50 | Leaf size regularization | Overfitting on noise |
| subsample | 0.5 - 1.0 | Row sampling per tree | Large datasets, overfitting |
| max_features | sqrt(n) - n | Column sampling per split | High-dimensional data |
The Golden Rules
- Small learning rate + many trees + early stopping almost always beats large learning rate + few trees (see the sketch after this list)
- Shallow trees (depth 3-5) work better than deep trees for most problems
- Subsampling (0.5-0.8) improves performance for large datasets
- Cross-validation is essential for finding optimal hyperparameters
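Here's a sketch of the first rule in practice, using scikit-learn's GradientBoostingRegressor, whose staged_predict() yields predictions after each additional tree. The dataset and split are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Small learning rate, generous tree budget.
gbr = GradientBoostingRegressor(learning_rate=0.05, n_estimators=1000,
                                max_depth=3, subsample=0.8, random_state=0)
gbr.fit(X_train, y_train)

# Validation error after each tree; "early stopping" = keep the best prefix.
val_errors = [mean_squared_error(y_val, pred) for pred in gbr.staged_predict(X_val)]
best_m = int(np.argmin(val_errors)) + 1
print("best number of trees on the validation set:", best_m)
```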
Modern Variations: XGBoost, LightGBM, CatBoost
Production-ready implementations
Our implementation is the foundation, but production libraries add many optimizations that make them 10-100x faster and more accurate:
XGBoost

- Adds L1/L2 penalties on tree complexity to prevent overfitting
- Uses both first and second derivatives (the Hessian) for better splits
- Handles missing values by learning which direction to send them
- Parallelizes tree building across features (not across trees)

LightGBM

- Grows the leaf with the maximum gain instead of growing level-wise, creating asymmetric trees
- Keeps samples with large gradients and randomly samples those with small gradients, focusing on hard examples
- Bundles mutually exclusive features together, reducing dimensionality
- Bins continuous features into discrete buckets for much faster splits

CatBoost

- Handles target leakage in categorical features by using different permutations of the data
- Builds balanced (symmetric) binary trees, making prediction extremely fast
- Encodes categorical features automatically, without manual one-hot encoding
- Excellent GPU support for training and inference
Which Library to Use?
| Library | Best For | Strengths | Weaknesses |
|---|---|---|---|
| XGBoost | General purpose, competitions | Most mature, great documentation | Slower than LightGBM on large data |
| LightGBM | Large datasets, speed | Fastest training, memory efficient | Can overfit with default params |
| CatBoost | Categorical features | Best for categoricals, robust defaults | Slower training than LightGBM |
| sklearn | Learning, prototyping | Simple API, well integrated | Much slower, fewer features |
Quick Recommendations: reach for XGBoost as the general-purpose default, LightGBM when training speed or dataset size is the bottleneck, CatBoost when your data is full of categorical features, and sklearn's gradient boosting for learning and quick prototyping.
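For reference, all three libraries expose scikit-learn-style estimators, so switching between them is mostly a matter of changing one constructor. This is a sketch that assumes the xgboost, lightgbm, and catboost packages are installed; only the parameter names for the same three knobs differ.

```python
# Roughly equivalent models in each library.
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4)
lgbm_model = LGBMRegressor(n_estimators=500, learning_rate=0.05, max_depth=4)
cat_model = CatBoostRegressor(iterations=500, learning_rate=0.05, depth=4, verbose=0)

# All three follow the familiar fit / predict interface:
# model.fit(X_train, y_train)
# preds = model.predict(X_test)
```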
Happy coding, and may your gradients always descend in the right direction.