Random forests taught us the power of ensembles. Combine many weak learners and you get a strong learner. But random forests build trees independently and average them. What if instead of building trees in parallel, we built them sequentially, where each new tree corrects the mistakes of all previous trees?
That's gradient boosting. It's one of the most powerful machine learning algorithms, dominating competitions and real-world applications. XGBoost, LightGBM, and CatBoost are all variations of this core idea. Today we're building gradient boosting from scratch to understand exactly how it works—from the initial prediction to the sequential tree fitting to the final ensemble.
This is going to be math-heavy, but I'll make it as intuitive as possible. By the end, you'll understand why gradient boosting is called “gradient” boosting and how it's essentially gradient descent in function space.
Interactive: Bagging vs Boosting. Compare the two fundamental ensemble approaches: in bagging (random forests), trees are built independently and in parallel, each seeing a different bootstrap sample of the data, while boosting builds trees sequentially.
The Core Idea: Learning from Mistakes
Imagine you're trying to predict house prices. Your first model predicts $200k for every house (just using the mean). Obviously this is terrible—some houses should be $100k, others $500k.
Instead of starting over with a completely different model, what if you built a second model that predicts the errors of the first model? Then your combined prediction would be:

Prediction(x) = Model₁(x) + Model₂(x)

Each new model predicts what previous models got wrong.
The second model doesn't try to predict house prices directly. It predicts “how much did the first model miss by?” If the first model predicted $200k for a house actually worth $300k, the second model learns to predict +$100k for that type of house.
Now you have a better model. But it's still not perfect. So you add a third model that predicts the errors of (Model₁ + Model₂). And a fourth that predicts the errors of (Model₁ + Model₂ + Model₃). Each model incrementally improves the ensemble by focusing on what previous models got wrong.
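To make this concrete, here's a toy numerical version of that story. All numbers are made up for illustration, and the 80% correction simply stands in for whatever a real second model would learn from the residuals.

```python
import numpy as np

prices = np.array([100_000, 200_000, 300_000], dtype=float)

model_1 = np.full(3, prices.mean())   # predicts $200k for every house
residuals = prices - model_1          # [-100000., 0., 100000.]

# A second model trained on these residuals learns corrections like "+$100k
# for that type of house". Here we pretend it recovers 80% of each error.
model_2_correction = 0.8 * residuals
combined = model_1 + model_2_correction

print(prices - combined)              # remaining errors: [-20000., 0., 20000.]
```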
Interactive: Sequential Residual Fitting. Each tree fits the residuals (errors) of the previous ensemble; this is boosting.
Why Decision Trees?
Gradient boosting can work with any base learner—linear models, neural networks, even other ensembles—but decision trees are the standard choice. Why?
- Non-linearity: trees handle complex interactions without feature engineering.
- Weak by design: shallow trees (depth 3-5) underfit on their own, which is perfect for boosting.
- Fast training: trees train quickly, making sequential training feasible.
- Mixed data types: no need to encode categoricals or scale features.
In gradient boosting, we typically use shallow regression trees (even for classification, as we'll see). These are stumps or small trees that output a single numeric value at each leaf. Deep trees would overfit and wouldn't ensemble well.
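To see just how weak the base learner is, here's a quick sketch (assuming NumPy and scikit-learn): a depth-3 regression tree can emit at most 2³ = 8 distinct leaf values, so on its own it badly underfits a smooth target, which is exactly what boosting wants.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

weak_tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# A depth-3 tree has at most 8 leaves, so its output is a coarse step function.
print(len(np.unique(weak_tree.predict(X))))   # <= 8 distinct predicted values
```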
Gradient Boosting for Regression
Step-by-step implementation
The algorithm proceeds in steps. First, we initialize with a constant prediction:
Initialize with Mean
F₀(x) = mean(y)

Start with the simplest possible prediction: the average of all target values. This minimizes MSE for a constant prediction.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit(self, X, y):
    self.initial_prediction = np.mean(y)
    F = np.full(len(y), self.initial_prediction)  # Current predictions
    self.trees = []

    for m in range(self.n_estimators):
        # Step 1: Compute residuals
        residuals = y - F

        # Step 2: Fit tree to residuals
        tree = DecisionTreeRegressor(max_depth=self.max_depth)
        tree.fit(X, residuals)
        self.trees.append(tree)

        # Step 3: Update predictions
        predictions = tree.predict(X)
        F += self.learning_rate * predictions
```

Let's break down what happens at each iteration:
Compute Residuals

rᵢ = yᵢ − Fₘ₋₁(xᵢ)

The residual is how far off our current prediction is. For MSE loss, the negative gradient is exactly the residual.

Fit Tree to Residuals

hₘ(x) ← fit(X, residuals)

Train a regression tree where the target is the residuals, not the original y. This tree learns "how much should I add to fix errors?"

Update the Ensemble

Fₘ(x) = Fₘ₋₁(x) + ν · hₘ(x)

Add the tree's predictions, scaled by the learning rate ν, to our running ensemble prediction.
After M iterations, the final ensemble prediction is:

F_M(x) = F₀(x) + ν · (h₁(x) + h₂(x) + … + h_M(x))

The initial prediction plus the sum of all tree corrections, scaled by the learning rate.
```python
def predict(self, X):
    # Start with the initial mean prediction
    F = np.full(len(X), self.initial_prediction)

    # Add each tree's contribution
    for tree in self.trees:
        F += self.learning_rate * tree.predict(X)

    return F
```

The Gradient in Gradient Boosting
Gradient descent in function space
Here's where the “gradient” comes from. Gradient boosting is gradient descent in function space. In normal gradient descent, we minimize a loss L(w) by updating the parameters:

w ← w − η · ∂L/∂w

Standard gradient descent updates the parameters.
In gradient boosting, we minimize a loss L(F) by updating the function F:

F ← F − ν · ∂L/∂F

Gradient boosting updates the function itself.
∂L/∂F isn't a single number—it's a vector with one value for each training sample: the gradient of the loss with respect to each prediction. Its negative is what each new tree is fit to. For MSE loss L = ½(y − F)²:

−∂L/∂F = y − F

For MSE, the negative gradient equals the residuals.
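If you want to convince yourself numerically, a quick finite-difference check (a throwaway sketch, not part of the implementation) shows that the negative gradient of the MSE loss really is the residual vector:

```python
import numpy as np

def mse_loss(y, F):
    return 0.5 * np.sum((y - F) ** 2)

y = np.array([3.0, -0.5, 2.0])
F = np.array([2.5, 0.0, 1.0])

# Differentiate the loss w.r.t. each prediction F_i numerically.
eps = 1e-6
grad = np.array([
    (mse_loss(y, F + eps * np.eye(3)[i]) - mse_loss(y, F - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])

print(-grad)   # ~[ 0.5 -0.5  1. ]  (negative gradient)
print(y - F)   #  [ 0.5 -0.5  1. ]  (residuals)
```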
The Key Insight

Fitting a tree to the residuals is the same as fitting it to the negative gradient of the loss. Each tree therefore approximates one gradient descent step, and adding it to the ensemble (scaled by ν) moves the predictions downhill on the loss.
Interactive: Gradient Descent in Function Space. The gradient tells us how to adjust predictions; trees approximate this gradient.
Different Loss Functions
Beyond MSE
The beauty of gradient boosting is you can use any differentiable loss function. Just compute its gradient with respect to predictions.
Interactive: Loss Functions & Their Gradients. Different loss functions lead to different gradient signals for tree fitting.
| Loss | Formula | Negative Gradient | Best For |
|---|---|---|---|
| MSE (L2) | ½(y - F)² | y - F (residuals) | General regression |
| MAE (L1) | \|y - F\| | sign(y - F) | Robust to outliers |
| Huber | Combination of L1/L2 | Smooth combination | Best of both worlds |
| Log-loss | −[y log(p) + (1−y) log(1−p)] | y - p (for classification) | Binary classification |
Each loss function gives different pseudo-residuals to fit. MSE penalizes errors quadratically, so large errors (and outliers) dominate. MAE only uses the sign of the error, which makes it robust to outliers. Huber smoothly transitions between the two.
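As a sketch of how little the algorithm cares which loss you pick, here is a generic boosting loop that only needs a function returning the negative gradient at the current predictions. The helper names (neg_gradient_mse, neg_gradient_mae, boost_with_loss) are illustrative, not from any library, and the constant initialization with the mean is only exact for MSE.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def neg_gradient_mse(y, F):
    return y - F                 # plain residuals

def neg_gradient_mae(y, F):
    return np.sign(y - F)        # only the sign of each error

def boost_with_loss(X, y, neg_gradient, n_estimators=100, learning_rate=0.1, max_depth=3):
    F0 = np.mean(y)              # constant start (the MSE-optimal constant)
    F = np.full(len(y), F0)
    trees = []
    for _ in range(n_estimators):
        pseudo_residuals = neg_gradient(y, F)                        # "what to fix"
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, pseudo_residuals)
        F += learning_rate * tree.predict(X)                         # take the step
        trees.append(tree)
    return F0, trees
```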
Gradient Boosting for Classification
Predicting log-odds
Classification is trickier because labels are discrete (0 or 1), but we need continuous predictions to fit trees to. The trick: predict log-odds instead of probabilities, and use log-loss (binary cross-entropy).
Initialization
Start with the log-odds of the positive class:

F₀(x) = log(p / (1 − p)),  where p = mean(y)

Initialize with the log-odds of the positive-class rate.
```python
class GradientBoostingClassifier:
    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def fit(self, X, y):
        # Initialize with log-odds
        p_positive = np.mean(y)
        self.initial_prediction = np.log(p_positive / (1 - p_positive))
        F = np.full(len(y), self.initial_prediction)
        self.trees = []

        for m in range(self.n_estimators):
            # Convert log-odds to probabilities
            probabilities = self._sigmoid(F)

            # Negative gradient of log-loss = y - p
            residuals = y - probabilities

            # Fit tree to negative gradient
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            self.trees.append(tree)

            # Update log-odds
            F += self.learning_rate * tree.predict(X)

    def predict_proba(self, X):
        F = np.full(len(X), self.initial_prediction)
        for tree in self.trees:
            F += self.learning_rate * tree.predict(X)
        return self._sigmoid(F)

    def predict(self, X):
        return (self.predict_proba(X) > 0.5).astype(int)
```

For log-loss L = −[y log(p) + (1−y) log(1−p)], the negative gradient with respect to the log-odds F is simply y − p, the difference between the true labels and the predicted probabilities. This is remarkably similar to MSE residuals!
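A quick usage sketch on synthetic data. It assumes the class above also has an __init__ that stores n_estimators, learning_rate, and max_depth; that part isn't shown in the snippet.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical constructor arguments, assuming the __init__ described above.
clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
clf.fit(X_train, y_train)

print(clf.predict_proba(X_test[:5]))                       # probabilities in (0, 1)
print("test accuracy:", np.mean(clf.predict(X_test) == y_test))
```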
Why Gradient Boosting Works So Well
The power of sequential error correction
Gradient boosting dominates machine learning competitions and real-world applications for several fundamental reasons:
Sequential Error Correction
Unlike random forests where each tree independently solves the whole problem, gradient boosting trees specialize. First trees learn major patterns, later trees focus on the hardest cases.
Principled Optimization
It's literally minimizing loss by taking steps in the direction of the negative gradient. Any differentiable loss works: ranking, survival analysis, custom business metrics.
Automatic Feature Interactions
Trees naturally capture interactions like "if income > 50k AND age < 30". No manual feature engineering required.
Implicit Regularization
Small learning rates ensure no single tree dominates. Early trees can't overfit because contributions are dampened. The ensemble builds up slowly.
Robust to Scale
Trees make splits based on ordering, not absolute values. No preprocessing needed for different scales, skewed distributions, or mixed types.
Free Feature Importance
Every split provides an information gain. Sum gains for each feature across all trees for a built-in importance measure.
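As a rough sketch of that idea applied to the scratch regressor built earlier (it assumes the fitted model exposes its scikit-learn trees as model.trees, as in the fit() code above), the per-tree impurity-based gains can simply be summed across the ensemble:

```python
import numpy as np

def ensemble_feature_importance(model, n_features):
    """Sum each tree's impurity-based gains per feature, then renormalize."""
    importance = np.zeros(n_features)
    for tree in model.trees:
        importance += tree.feature_importances_   # sklearn's per-tree (normalized) gains
    return importance / importance.sum()
```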
Hyperparameters
The complete guide
Gradient boosting has many hyperparameters, but understanding their interactions makes tuning manageable.
The Core Tradeoff: n_estimators × learning_rate
These two work together to control the ensemble's capacity:
Interactive: Learning Rate & n_estimators Tradeoff. A lower learning rate plus more trees generalizes better (with early stopping).
Best practice: pair a small learning rate with many trees and use early stopping; this almost always generalizes better than a large learning rate with few trees.
| Parameter | Typical Range | Effect | When to Increase |
|---|---|---|---|
| learning_rate | 0.01 - 0.3 | Step size per tree | Underfitting (with more trees) |
| n_estimators | 100 - 1000 | Number of trees | With lower learning rate |
| max_depth | 3 - 8 | Tree complexity | Complex feature interactions |
| min_samples_leaf | 1 - 50 | Leaf size regularization | Overfitting on noise |
| subsample | 0.5 - 1.0 | Row sampling per tree | Large datasets, overfitting |
| max_features | sqrt(n) - n | Column sampling per split | High-dimensional data |
The Golden Rules
- Small learning rate + many trees + early stopping almost always beats large learning rate + few trees (see the sketch after this list)
- Shallow trees (depth 3-5) work better than deep trees for most problems
- Subsampling (0.5-0.8) improves performance for large datasets
- Cross-validation is essential for finding optimal hyperparameters
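Here's a sketch of the first rule in practice, using scikit-learn's GradientBoostingRegressor, whose staged_predict() yields predictions after each additional tree. The dataset and split are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Small learning rate, generous tree budget.
gbr = GradientBoostingRegressor(learning_rate=0.05, n_estimators=1000,
                                max_depth=3, subsample=0.8, random_state=0)
gbr.fit(X_train, y_train)

# Validation error after each tree; "early stopping" = keep the best prefix.
val_errors = [mean_squared_error(y_val, pred) for pred in gbr.staged_predict(X_val)]
best_m = int(np.argmin(val_errors)) + 1
print("best number of trees on the validation set:", best_m)
```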
Modern Variations: XGBoost, LightGBM, CatBoost
Production-ready implementations
Our implementation is the foundation, but production libraries add many optimizations that make them 10-100x faster and more accurate:
XGBoost

- Adds L1/L2 penalties on tree complexity to prevent overfitting
- Uses both first and second derivatives (the Hessian) for better splits
- Handles missing values by learning which direction to send them
- Parallelizes tree building across features (not across trees)

LightGBM

- Grows the leaf with the maximum gain instead of growing level-wise, creating asymmetric trees
- Keeps samples with large gradients and randomly samples those with small gradients, focusing on hard examples
- Bundles mutually exclusive features together, reducing dimensionality
- Bins continuous features into discrete buckets for much faster splits

CatBoost

- Handles target leakage in categorical features by using different permutations of the data
- Builds balanced (symmetric) binary trees, making prediction extremely fast
- Encodes categorical features automatically, without manual one-hot encoding
- Excellent GPU support for training and inference
Which Library to Use?
| Library | Best For | Strengths | Weaknesses |
|---|---|---|---|
| XGBoost | General purpose, competitions | Most mature, great documentation | Slower than LightGBM on large data |
| LightGBM | Large datasets, speed | Fastest training, memory efficient | Can overfit with default params |
| CatBoost | Categorical features | Best for categoricals, robust defaults | Slower training than LightGBM |
| sklearn | Learning, prototyping | Simple API, well integrated | Much slower, fewer features |
Quick Recommendations: reach for XGBoost as the general-purpose default, LightGBM when training speed or dataset size is the bottleneck, CatBoost when your data is full of categorical features, and sklearn's gradient boosting for learning and quick prototyping.
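For reference, all three libraries expose scikit-learn-style estimators, so switching between them is mostly a matter of changing one constructor. This is a sketch that assumes the xgboost, lightgbm, and catboost packages are installed; only the parameter names for the same three knobs differ.

```python
# Roughly equivalent models in each library.
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4)
lgbm_model = LGBMRegressor(n_estimators=500, learning_rate=0.05, max_depth=4)
cat_model = CatBoostRegressor(iterations=500, learning_rate=0.05, depth=4, verbose=0)

# All three follow the familiar fit / predict interface:
# model.fit(X_train, y_train)
# preds = model.predict(X_test)
```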
Happy coding, and may your gradients always descend in the right direction.