
Neural Networks

The architecture of intelligence. Backpropagation and beyond.

Written by Omansh
25 min read

Neural networks are perhaps the most transformative technology of our generation. They power everything from the voice assistant on your phone to the recommendation engine suggesting what you should watch next. They translate languages, generate art, write code, and diagnose diseases. Yet beneath all this complexity lies a surprisingly elegant idea: layers of simple computational units, each performing basic math, that together can learn to approximate virtually any function.

In this deep dive, we're going to build neural networks from scratch. Not just the forward pass—we'll derive backpropagation, implement various optimizers, understand why initialization matters, and explore the regularization techniques that make deep learning work in practice. By the end, you'll understand not just what neural networks do, but why each component exists and when to use different techniques.

This is going to be comprehensive. Grab some coffee.

Why Neural Networks?

We've already covered linear regression, logistic regression, decision trees, SVMs, and ensemble methods. They're all powerful in their own right. So why do we need neural networks?

The fundamental limitation of traditional models is their representation capacity. Linear models can only learn linear decision boundaries. SVMs with kernels can learn non-linear boundaries, but you have to choose the right kernel. Decision trees can capture complex interactions, but they're fundamentally axis-aligned and struggle with smooth functions.

Neural networks are universal function approximators. Given enough neurons and layers, they can approximate any continuous function to arbitrary precision. More importantly, they learn their own features. Instead of hand-engineering features like “is this pixel bright?” or “does this sentence contain the word free?”, neural networks learn hierarchical representations automatically from raw data.

Feature Learning

Automatically discover relevant features from raw data—no manual feature engineering required.

Hierarchical Representations

Learn simple patterns first, then combine them into increasingly complex concepts.

Universal Approximation

Can theoretically learn any continuous function given enough capacity.

The Deep Learning Revolution

What changed wasn't the theory—neural networks have existed since the 1950s. What changed was compute (GPUs), data (the internet), and a few key algorithmic insights (better initialization, activation functions, and optimizers). These made training deep networks practical.

The Perceptron: Where It All Began

The simplest neural network

Let's start at the very beginning. In 1958, Frank Rosenblatt invented the perceptron, inspired by how biological neurons work. A neuron receives inputs, processes them, and fires (or doesn't) based on whether the combined signal exceeds some threshold.

The perceptron does exactly this mathematically:

ŷ=step(w·x+b)

Weighted sum of inputs, then threshold

If the weighted sum of inputs plus bias exceeds zero, output 1. Otherwise, output 0. This is essentially a linear classifier with a hard threshold.

x: Input features (the data we receive)
w: Weights (learned parameters that determine importance)
b: Bias (allows shifting the decision boundary)
ŷ: Output prediction (0 or 1)

The perceptron learns through a simple update rule: if it makes a mistake, adjust the weights in the direction that would have given the correct answer.

perceptron.py
import numpy as np

class Perceptron:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.n_iterations):
            for idx, x_i in enumerate(X):
                # Compute prediction
                linear_output = np.dot(x_i, self.weights) + self.bias
                y_pred = 1 if linear_output > 0 else 0

                # Update weights if prediction is wrong (update is 0 when correct)
                update = self.lr * (y[idx] - y_pred)
                self.weights += update * x_i
                self.bias += update

    def predict(self, X):
        linear_output = np.dot(X, self.weights) + self.bias
        return np.where(linear_output > 0, 1, 0)

The XOR Problem

In 1969, Minsky and Papert showed that a single perceptron cannot learn the XOR function. This seemingly simple limitation caused the first “AI winter” and nearly killed neural network research. The solution? Multiple layers.

Going Deeper: Multi-Layer Networks

Adding depth to learn complexity

A single perceptron can only learn linearly separable patterns. But stack multiple layers of neurons together, and suddenly you can learn arbitrarily complex functions. This is the multi-layer perceptron (MLP), also called a feedforward neural network.

Network Architecture

Input

Receives raw features. One neuron per input feature.

Hidden

Learns intermediate representations. This is where the magic happens.

Output

Produces final predictions. Structure depends on the task.

Each layer transforms its input through a linear transformation followed by a non-linear activation function. Without the non-linearity, stacking layers would be pointless—multiple linear transformations collapse into a single linear transformation. The activation function is what gives neural networks their power.

a[l]=g(W[l]a[l-1]+b[l])

Each layer: linear transformation → non-linear activation

The notation can be confusing at first: a[l] is the activation (output) of layer l, W[l] is the weight matrix for layer l, and b[l] is the bias vector. The input layer is l=0, and a[0] = x (our input data).

Forward Propagation

From input to prediction

Forward propagation is simply computing the output of the network by passing data through each layer sequentially. For each layer, we compute a linear transformation and apply an activation function.

Step 1
Linear Transformation (Pre-activation)
z[l] = W[l] · a[l-1] + b[l]

Compute the weighted sum of inputs plus bias. This is sometimes called the 'logit' or 'pre-activation'.

Step 2
Non-Linear Activation
a[l] = g(z[l])

Apply the activation function element-wise. This introduces non-linearity, allowing the network to learn complex patterns.

Step 3
Repeat
For l = 1, 2, ..., L

Continue through all layers until we reach the output layer.

forward_propagation.py
import numpy as np

class NeuralNetwork:
    def __init__(self, layer_sizes, activation='relu'):
        """
        layer_sizes: list of integers, e.g., [784, 128, 64, 10]
        """
        self.layer_sizes = layer_sizes
        self.num_layers = len(layer_sizes)
        self.activation = activation

        # Initialize weights and biases
        self.weights = []
        self.biases = []
        for i in range(1, self.num_layers):
            W = np.random.randn(layer_sizes[i], layer_sizes[i-1]) * 0.01
            b = np.zeros((layer_sizes[i], 1))
            self.weights.append(W)
            self.biases.append(b)

    def forward(self, X):
        """Forward pass through the network."""
        self.activations = [X]  # Store for backprop
        self.z_values = []      # Pre-activation values

        a = X
        for i in range(len(self.weights) - 1):
            z = self.weights[i] @ a + self.biases[i]
            self.z_values.append(z)
            a = self._activate(z)  # Hidden layers use ReLU/etc
            self.activations.append(a)

        # Output layer (softmax for classification, linear for regression)
        z = self.weights[-1] @ a + self.biases[-1]
        self.z_values.append(z)
        a = self._output_activation(z)
        self.activations.append(a)

        return a

    def _activate(self, z):
        if self.activation == 'relu':
            return np.maximum(0, z)
        elif self.activation == 'sigmoid':
            return 1 / (1 + np.exp(-z))
        elif self.activation == 'tanh':
            return np.tanh(z)

    def _output_activation(self, z):
        # Softmax over classes (rows), stabilized by subtracting the max
        exp_z = np.exp(z - np.max(z, axis=0, keepdims=True))
        return exp_z / np.sum(exp_z, axis=0, keepdims=True)

Matrix Dimensions

Keep track of shapes! If layer l−1 has n[l−1] neurons and layer l has n[l] neurons, then W[l] has shape (n[l], n[l−1]). This is a common source of bugs when implementing neural networks from scratch.

Interactive: Neural Network Forward Pass

Adjust inputs to see activations flow through the network


Activation Functions

The non-linear magic

The activation function is what transforms a neural network from a fancy linear model into a universal function approximator. Without non-linearity, no matter how many layers you stack, the result is mathematically equivalent to a single linear transformation. Let's explore the most important activation functions and when to use each.

Sigmoid (Logistic)

σ(z) = 1 / (1 + e^(−z))

Output range: (0, 1)

✓ Pros
  • Smooth, differentiable everywhere
  • Output interpretable as probability
  • Good for output layer in binary classification
✗ Cons
  • Vanishing gradients for large |z|
  • Output not zero-centered
  • Computationally expensive (exp)

Tanh (Hyperbolic Tangent)

tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))

Output range: (-1, 1)

✓ Pros
  • Zero-centered outputs
  • Stronger gradients than sigmoid
  • Often works better in hidden layers
✗ Cons
  • Still has vanishing gradient problem
  • Computationally expensive

ReLU (Rectified Linear Unit)

ReLU(z)=max(0, z)

Output range: [0, ∞)

✓ Pros
  • No vanishing gradient for positive inputs
  • Computationally efficient
  • Sparse activation (biological plausibility)
  • De facto standard for hidden layers
✗ Cons
  • “Dying ReLU” problem (neurons stuck at 0)
  • Not zero-centered
  • Not differentiable at z=0

Leaky ReLU & Variants

LeakyReLU(z) = max(αz, z), where α ≈ 0.01

Small negative slope prevents dying neurons

Variants:
  • Parametric ReLU (PReLU): α is learned during training
  • ELU: α(e^z − 1) for z < 0, smoother than Leaky ReLU
  • SELU: Self-normalizing, automatically maintains mean ≈ 0 and var ≈ 1

Softmax (Output Layer)

softmax(z)_i = e^(z_i) / Σ_j e^(z_j)

Converts logits to probability distribution that sums to 1

Used exclusively for the output layer in multi-class classification. Each output represents P(class = i | input), and all outputs sum to 1.
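To make the formula concrete, here is a minimal NumPy sketch of softmax with the standard max-subtraction trick for numerical stability; the logit values are made up for illustration:

import numpy as np

def softmax(z):
    """Convert a vector of logits into a probability distribution."""
    z = z - np.max(z)  # subtracting the max doesn't change the result but prevents overflow
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, 1.0, 0.1])  # hypothetical logits for 3 classes
probs = softmax(logits)
print(probs)        # approximately [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0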

Activation | Use Case | Hidden Layers? | Output Layer?
ReLU | Default choice for most networks | ✓ Yes (default) | ✗ No
Leaky ReLU | When dying ReLU is a problem | ✓ Yes | ✗ No
Sigmoid | Binary classification output | ✗ Avoid | ✓ Binary
Tanh | RNNs, some hidden layers | Sometimes | ✗ Rarely
Softmax | Multi-class classification | ✗ Never | ✓ Multi-class
Linear (none) | Regression output | ✗ Never | ✓ Regression

The Vanishing Gradient Problem

Sigmoid and tanh “saturate” for large inputs—their gradients approach zero. When you multiply many small gradients during backpropagation, the signal vanishes, making it nearly impossible to train deep networks. ReLU solved this by having gradient = 1 for all positive inputs.

Interactive: Activation Functions Comparison

Select functions to compare their shapes and derivatives


Try moving x to ±5 and see how sigmoid/tanh saturate while ReLU grows linearly.

Loss Functions

Measuring how wrong we are

The loss function quantifies how far our predictions are from the truth. It's what we minimize during training. The choice of loss function depends on your task and has significant implications for both training dynamics and final performance.

Mean Squared Error (MSE)

For: Regression

L = (1/n) Σ (y − ŷ)²

Penalizes large errors heavily due to squaring

When to use: Standard choice for regression. The squared term makes it differentiable everywhere and penalizes outliers more heavily. Use with linear output activation.

Mean Absolute Error (MAE)

For: Regression (robust to outliers)

L = (1/n) Σ |y − ŷ|

More robust to outliers than MSE

When to use: When your data has outliers that shouldn't dominate training. Gradient is constant (±1), which can cause instability near the minimum.

Binary Cross-Entropy (Log Loss)

For: Binary Classification

L=−(1/n) Σ [y log(ŷ) + (1−y) log(1−ŷ)]

Heavily penalizes confident wrong predictions

When to use: Binary classification with sigmoid output. The logarithm creates a steep gradient when predictions are confident but wrong, enabling faster learning.

Categorical Cross-Entropy

For: Multi-class Classification

L = −(1/n) Σ Σ_c y_c log(ŷ_c)

Generalizes binary cross-entropy to multiple classes

When to use: Multi-class classification with softmax output. Labels should be one-hot encoded. Only the true class contributes to the loss.

Huber Loss (Smooth L1)

For: Regression (best of both worlds)

L=½(y−ŷ)² if |y−ŷ| ≤ δ, else δ|y−ŷ| − ½δ²

MSE for small errors, MAE for large errors

When to use: Regression with potential outliers. Combines the smooth gradients of MSE near zero with the robustness of MAE for large errors.

Loss Function | Task | Output Activation | Key Property
MSE | Regression | Linear | Penalizes large errors heavily
MAE | Regression | Linear | Robust to outliers
Huber | Regression | Linear | Best of MSE and MAE
Binary Cross-Entropy | Binary Classification | Sigmoid | Steep gradients for wrong predictions
Categorical Cross-Entropy | Multi-class | Softmax | Only true class contributes
loss_functions.py
import numpy as np

class LossFunctions:
    @staticmethod
    def mse(y_true, y_pred):
        return np.mean((y_true - y_pred) ** 2)

    @staticmethod
    def mse_derivative(y_true, y_pred):
        return 2 * (y_pred - y_true) / y_true.shape[0]

    @staticmethod
    def binary_cross_entropy(y_true, y_pred):
        eps = 1e-15  # Prevent log(0)
        y_pred = np.clip(y_pred, eps, 1 - eps)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    @staticmethod
    def binary_cross_entropy_derivative(y_true, y_pred):
        eps = 1e-15
        y_pred = np.clip(y_pred, eps, 1 - eps)
        return (y_pred - y_true) / (y_pred * (1 - y_pred)) / y_true.shape[0]

    @staticmethod
    def categorical_cross_entropy(y_true, y_pred):
        eps = 1e-15
        y_pred = np.clip(y_pred, eps, 1 - eps)
        return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

    @staticmethod
    def softmax_cross_entropy_derivative(y_true, y_pred):
        # For softmax + cross-entropy, the derivative simplifies beautifully
        return (y_pred - y_true) / y_true.shape[0]
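Huber loss appears in the comparison table but not in the snippet above. Here is a minimal sketch of how it could be added, following the piecewise definition given earlier; the δ = 1.0 default is an illustrative choice:

import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """MSE-like for small errors, MAE-like for large errors."""
    error = y_true - y_pred
    small = np.abs(error) <= delta
    squared = 0.5 * error ** 2                          # quadratic region
    linear = delta * np.abs(error) - 0.5 * delta ** 2   # linear region
    return np.mean(np.where(small, squared, linear))

def huber_derivative(y_true, y_pred, delta=1.0):
    """Gradient w.r.t. predictions: the error, clipped to [-delta, delta]."""
    return np.clip(y_pred - y_true, -delta, delta) / y_true.shape[0]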

The Softmax + Cross-Entropy Combo

When using softmax output with categorical cross-entropy loss, the combined gradient simplifies to just ŷ − y. This elegant simplification is one reason this combination is so widely used for classification.

Backpropagation: Learning from Mistakes

The algorithm that makes learning possible

Backpropagation is arguably the most important algorithm in deep learning. Without it, we couldn't train neural networks at all. It answers a deceptively simple question: how should we change each weight to reduce the loss?

A neural network might have millions of weights. Computing how each one affects the loss through numerical approximation (nudging each weight and seeing what happens) would be impossibly slow. Backpropagation computes all gradients in a single backward pass through the network, making training tractable.

The Core Problem

We have a loss function L that measures how wrong our predictions are. We want to find the gradient ∂L/∂W for every weight W in the network. But here's the challenge: a weight in layer 1 affects layer 2, which affects layer 3, and so on. The effect of changing one weight ripples through the entire network.

Backpropagation solves this by systematically tracking how changes propagate through the network using the chain rule from calculus.

The Chain Rule: The Heart of Backprop

The chain rule says: if y depends on u, and u depends on x, then the rate of change of y with respect to x is the product of the intermediate rates of change.

dy/dx=dy/du×du/dx

Intuition: Think of it like currency conversion. To convert dollars to yen, you might go dollars → euros → yen. The total exchange rate is the product of the individual rates. Similarly, to find how L changes with W, we multiply the rates of change through each intermediate variable.

Let's build intuition with a concrete example. Consider a tiny network with just 2 neurons:

Worked Example: A 2-Layer Network

Input → w₁ → z₁ → σ → a₁ → w₂ → z₂ → σ → ŷ → L
Forward pass:
z₁ = w₁ × x + b₁ ← linear transformation
a₁ = σ(z₁) ← activation
z₂ = w₂ × a₁ + b₂
ŷ = σ(z₂)
L = (y − ŷ)² ← MSE loss
Question: How does w₁ affect L?
Chain rule: w₁ → z₁ → a₁ → z₂ → ŷ → L
∂L/∂w₁ = ∂L/∂ŷ × ∂ŷ/∂z₂ × ∂z₂/∂a₁ × ∂a₁/∂z₁ × ∂z₁/∂w₁

Let's compute each piece:

∂L/∂ŷ = 2(ŷ − y)

How does loss change with prediction?

If prediction is too high (ŷ > y), this is positive → we need to decrease ŷ

∂ŷ/∂z₂ = σ'(z₂) = ŷ(1 − ŷ)

How does sigmoid output change with input?

Sigmoid derivative. Largest when ŷ ≈ 0.5, near zero when ŷ ≈ 0 or ŷ ≈ 1

∂z₂/∂a₁ = w₂

How does z₂ change with a₁?

Just the weight! Larger w₂ means a₁ has more influence on z₂

∂a₁/∂z₁ = σ'(z₁) = a₁(1 − a₁)

Sigmoid derivative again

This is where vanishing gradients come from—if a₁ is near 0 or 1, gradient is tiny

∂z₁/∂w₁ = x

How does z₁ change with w₁?

Just the input! If x is large, w₁ has a big impact on z₁

Multiply them all together:

∂L/∂w₁=2(ŷ−y) · ŷ(1−ŷ) · w₂ · a₁(1−a₁) · x

The full gradient through 2 layers

The Intuition

Every term in this product has meaning:
2(ŷ−y): How wrong were we? (Error signal)
ŷ(1−ŷ): How “confident” was the output? (Uncertainty)
w₂: How much does layer 1 influence layer 2? (Pathway strength)
a₁(1−a₁): How uncertain was layer 1? (Gradient flow)
x: What was the actual input? (Input credit)

Notice something beautiful: each layer only needs information from the layer above it. We compute ∂L/∂ŷ first, then use it to compute ∂L/∂z₂, then ∂L/∂a₁, and so on. This is why it's called backpropagation—we propagate error signals backward through the network.
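To make the worked example concrete, here is a small sketch that evaluates the chain-rule product for ∂L/∂w₁ and compares it against a finite-difference estimate; the specific values of x, y, and the weights are arbitrary:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w1, w2, b1, b2, x, y):
    a1 = sigmoid(w1 * x + b1)
    y_hat = sigmoid(w2 * a1 + b2)
    return (y - y_hat) ** 2

# Arbitrary values, for illustration only
x, y = 1.5, 1.0
w1, w2, b1, b2 = 0.4, -0.6, 0.1, 0.2

# Analytic gradient: 2(ŷ−y) · ŷ(1−ŷ) · w₂ · a₁(1−a₁) · x
a1 = sigmoid(w1 * x + b1)
y_hat = sigmoid(w2 * a1 + b2)
analytic = 2 * (y_hat - y) * y_hat * (1 - y_hat) * w2 * a1 * (1 - a1) * x

# Finite-difference estimate: nudge w₁ and see how the loss changes
eps = 1e-6
numeric = (loss(w1 + eps, w2, b1, b2, x, y) - loss(w1 - eps, w2, b1, b2, x, y)) / (2 * eps)

print(analytic, numeric)  # the two values agree to several decimal places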

The General Backpropagation Algorithm

For a network with L layers, define the “error” at layer l as:

δ[l]=∂L/∂z[l]

This is the gradient of the loss with respect to the pre-activation (before applying the activation function). It tells us: “how much does the loss change if we nudge the weighted sum at this layer?”

Step 1
Output Layer Error
δ[L] = ∂L/∂a[L] ⊙ g'(z[L])

Start at the output. The error combines how wrong we were (∂L/∂a) with how sensitive the activation was (g').

💡 For softmax + cross-entropy, this beautifully simplifies to just (ŷ − y)—the difference between prediction and truth.

Step 2
Backpropagate Error
δ[l] = (W[l+1]ᵀ · δ[l+1]) ⊙ g'(z[l])

Each layer receives error from the layer above, weighted by how much it contributed (via W), and scaled by activation sensitivity.

💡 W[l+1]ᵀ distributes 'blame' back to the neurons that contributed most. g'(z[l]) gates whether that neuron can learn.

Step 3
Weight Gradients
∂L/∂W[l] = δ[l] · a[l-1]ᵀ

The gradient for a weight is the error at this layer times the activation that fed into it.

💡 If the input (a[l-1]) was large and the error (δ) is large, this weight contributed a lot to the error—update it more.

Step 4
Bias Gradients
∂L/∂b[l] = δ[l]

Bias gradient is just the error itself (since z = Wa + b, and ∂z/∂b = 1).

💡 The bias shifts everything uniformly, so its gradient is just 'how much should we shift?'

The symbol ⊙ denotes element-wise multiplication (Hadamard product). This is crucial—we're not doing matrix multiplication here, but multiplying corresponding elements. Each neuron's error gets scaled by its own activation derivative.

Why δ · aᵀ for Weight Gradients?

This is often confusing, so let's think about it carefully. We have W with shape (neurons_out, neurons_in), δ with shape (neurons_out, batch_size), and a with shape (neurons_in, batch_size).

The gradient ∂L/∂W_ij tells us: “how much does the loss change if we increase the weight connecting input neuron j to output neuron i?”

The answer is: (error at output neuron i) × (activation at input neuron j). If both are large, this weight is doing a lot of damage and needs a big correction. The outer product δ · aᵀ computes this for all weight pairs simultaneously.

(neurons_out × batch) @ (batch × neurons_in) = (neurons_out × neurons_in)

The Gradient Flow Problem

Look at the backprop formula again: δ[l] = (W[l+1]ᵀ · δ[l+1]) ⊙ g'(z[l])

We're multiplying by g'(z) at every layer. For sigmoid, g'(z) is at most 0.25. After 10 layers: 0.25^10 ≈ 0.000001. The gradient has effectively vanished!

Vanishing Gradients

When g'(z) < 1 at each layer, gradients shrink exponentially. Early layers barely learn.

Solution: Use ReLU (g' = 1 for z > 0), residual connections, or better initialization.

Exploding Gradients

If ||W|| > 1 and g'(z) ≥ 1, gradients can grow exponentially. Weights become NaN.

Solution: Gradient clipping, proper initialization, batch normalization.

Here's a complete, annotated implementation of backpropagation:

backpropagation.py
# Methods of the NeuralNetwork class from forward_propagation.py

def backward(self, y_true):
    """Compute gradients via backpropagation."""
    m = y_true.shape[1]  # batch size

    self.dW = [None] * len(self.weights)
    self.db = [None] * len(self.biases)

    # Output layer error (for softmax + cross-entropy)
    delta = self.activations[-1] - y_true

    # Backpropagate through layers
    for l in reversed(range(len(self.weights))):
        a_prev = self.activations[l]

        # Weight and bias gradients
        self.dW[l] = (1/m) * (delta @ a_prev.T)
        self.db[l] = (1/m) * np.sum(delta, axis=1, keepdims=True)

        # Propagate error to previous layer
        if l > 0:
            delta = self.weights[l].T @ delta
            delta = delta * self._activation_derivative(self.z_values[l-1])

    return self.dW, self.db

def _activation_derivative(self, z):
    """Compute derivative of activation function."""
    if self.activation == 'relu':
        return (z > 0).astype(float)
    elif self.activation == 'sigmoid':
        s = 1 / (1 + np.exp(-z))
        return s * (1 - s)
    elif self.activation == 'tanh':
        return 1 - np.tanh(z) ** 2

Computational Efficiency

Backpropagation is efficient because it reuses computations. Each gradient only requires information from the layer above (which we just computed) and the forward pass values (which we cached). Computing all gradients takes roughly the same time as a single forward pass.

The full training loop puts forward and backward propagation together:

training_loop.py
def train(self, X, y, epochs=1000, learning_rate=0.01):
    """Full training loop."""
    for epoch in range(epochs):
        # Forward pass
        y_pred = self.forward(X)

        # Compute loss
        loss = self.compute_loss(y, y_pred)

        # Backward pass
        dW, db = self.backward(y)

        # Update parameters
        for l in range(len(self.weights)):
            self.weights[l] -= learning_rate * dW[l]
            self.biases[l] -= learning_rate * db[l]

        if epoch % 100 == 0:
            print(f"Epoch {epoch}, Loss: {loss:.4f}")

Interactive: Backpropagation Flow

Watch gradients flow backward through the network

  • Forward: input travels through the network
  • Error: compare output to target
  • Backward: error propagates back
  • Update: adjust weights to reduce error

Backpropagation sends the error signal backward through the network, allowing each weight to know how much it contributed to the mistake — and adjust accordingly.

Optimization Algorithms

Smarter ways to update weights

Vanilla gradient descent works, but it's often slow and can get stuck in local minima or saddle points. Modern optimizers use various techniques to converge faster and more reliably.

SGD with Momentum

Momentum accumulates gradients from previous steps, building up velocity in consistent directions. Think of a ball rolling downhill—it builds up speed and can roll through small bumps and dips.

v = βv + (1−β)∇L
w ← w − α·v

β ≈ 0.9 typically. Smooths out oscillations and accelerates convergence.

When to use: Almost always. Momentum rarely hurts and often helps significantly, especially with noisy gradients or ill-conditioned problems.

RMSprop

RMSprop adapts the learning rate for each parameter based on the history of its gradients. Parameters with large gradients get smaller learning rates; parameters with small gradients get larger learning rates.

s = βs + (1−β)(∇L)²
w ← w − α·∇L/√(s + ε)

Adaptive learning rates based on gradient history

When to use: Good for RNNs and when different parameters need very different learning rates. Generally superseded by Adam in practice.

Adam (Adaptive Moment Estimation)

Adam combines momentum and RMSprop—it maintains both a running average of gradients (momentum) and a running average of squared gradients (adaptive learning rates). It's the default optimizer for most deep learning applications.

m = β₁m + (1−β₁)∇L   (momentum)
v = β₂v + (1−β₂)(∇L)²   (adaptive lr)
m̂ = m/(1−β₁ᵗ), v̂ = v/(1−β₂ᵗ)   (bias correction)
w ← w − α·m̂/√(v̂ + ε)

β₁=0.9, β₂=0.999, ε=1e-8 typically. Bias correction is crucial for early steps.

When to use: Default choice for most problems. Works well out of the box with lr=0.001, β₁=0.9, β₂=0.999.
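Here is a minimal NumPy sketch of the Adam update above, applied to a list of parameter arrays; the hyperparameter defaults follow the values quoted in the text:

import numpy as np

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m, self.v, self.t = None, None, 0

    def step(self, params, grads):
        """Update each parameter array in place, given its gradient."""
        if self.m is None:
            self.m = [np.zeros_like(p) for p in params]
            self.v = [np.zeros_like(p) for p in params]
        self.t += 1
        for i, (p, g) in enumerate(zip(params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g       # momentum
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g ** 2  # adaptive lr
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)                  # bias correction
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

In the training loop from earlier, this would replace the manual weight update; for instance, optimizer.step(self.weights + self.biases, dW + db) updates all parameters in one call.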

AdamW (Adam with Decoupled Weight Decay)

AdamW fixes a subtle issue with how Adam handles L2 regularization. In vanilla Adam, L2 regularization is added to the gradient, which interacts poorly with the adaptive learning rate. AdamW applies weight decay directly to the weights instead.

m = β₁m + (1−β₁)∇L   (momentum)
v = β₂v + (1−β₂)(∇L)²   (adaptive lr)
m̂ = m/(1−β₁ᵗ), v̂ = v/(1−β₂ᵗ)   (bias correction)
w ← w − α·(m̂/√(v̂ + ε) + λw)   (decoupled decay)

λ (weight decay) applied directly to weights, not through gradient. β₁=0.9, β₂=0.999.

When to use: State-of-the-art for transformers and large models. Preferred over Adam when using weight decay regularization.

Optimizer | Key Idea | Typical LR | Best For
SGD | Basic gradient descent | 0.01 - 0.1 | Simple problems, fine-tuning
SGD + Momentum | Accumulate gradient history | 0.01 - 0.1 | Computer vision (CNNs)
RMSprop | Adaptive per-parameter LR | 0.001 | RNNs
Adam | Momentum + Adaptive LR | 0.001 | Default for most tasks
AdamW | Adam + Decoupled weight decay | 0.001 | Transformers, large models

Learning Rate Schedules

Don't use a fixed learning rate! Common schedules include: Step decay (reduce by factor every N epochs), Cosine annealing (smooth decay to 0), Warm-up (start small, ramp up, then decay), and One-cycle (increase then decrease). Warmup is especially important for Adam/AdamW.
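As an illustration of the warm-up-then-decay pattern described above, here is a sketch of linear warm-up followed by cosine annealing; the step counts and base learning rate are placeholder values:

import numpy as np

def lr_schedule(step, base_lr=1e-3, warmup_steps=1000, total_steps=100_000):
    """Linear warm-up to base_lr, then cosine decay toward 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + np.cos(np.pi * progress))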

Weight Initialization

Starting from the right place

How you initialize weights has a massive impact on training. Initialize too small, and signals vanish. Initialize too large, and signals explode. The goal is to maintain the variance of activations and gradients as they propagate through layers.

The Problem with Bad Initialization

Too Small (e.g., N(0, 0.01))

Activations shrink exponentially with depth. By layer 10, signals are near zero. Gradients vanish and learning stops.

Too Large (e.g., N(0, 1))

Activations grow exponentially with depth. By layer 10, values overflow (NaN). Gradients explode and training diverges.

Xavier/Glorot Initialization

Designed for tanh and sigmoid activations. The variance is set to preserve signal magnitude in both forward and backward passes.

WN(0, 2/(nin + nout))

For tanh/sigmoid activations

When to use: Tanh or sigmoid activations in hidden layers. Also works reasonably with linear layers.

He/Kaiming Initialization

Modified for ReLU activations, which zero out half the inputs. The variance is doubled to account for this.

WN(0, 2/nin)

For ReLU and variants

When to use: ReLU, Leaky ReLU, and other rectified activations. This is the default for modern deep networks.

initialization.py
import numpy as np

def initialize_weights(layer_sizes, activation='relu'):
    """Initialize weights using appropriate scheme."""
    weights, biases = [], []

    for i in range(1, len(layer_sizes)):
        fan_in = layer_sizes[i-1]
        fan_out = layer_sizes[i]

        if activation in ['relu', 'leaky_relu']:
            # He initialization
            std = np.sqrt(2.0 / fan_in)
        else:
            # Xavier initialization
            std = np.sqrt(2.0 / (fan_in + fan_out))

        W = np.random.randn(fan_out, fan_in) * std
        b = np.zeros((fan_out, 1))  # Biases typically init to 0

        weights.append(W)
        biases.append(b)

    return weights, biases
Method | Formula | Use With
Xavier/Glorot | Var = 2/(n_in + n_out) | Tanh, Sigmoid
He/Kaiming | Var = 2/n_in | ReLU, Leaky ReLU
LeCun | Var = 1/n_in | SELU
Orthogonal | QR decomposition | RNNs, very deep nets
Biases are typically initialized to zero. For ReLU, some practitioners use small positive values (0.01) to ensure neurons are active initially, but zero usually works fine with proper weight initialization.

Regularization Techniques

Preventing overfitting

Neural networks have millions of parameters and can easily memorize training data instead of learning generalizable patterns. Regularization techniques add constraints or noise to prevent this overfitting.

L2 Regularization (Weight Decay)

Add a penalty proportional to the squared magnitude of weights. This encourages smaller weights, which leads to simpler, more generalizable models.

L_total = L + (λ/2) Σ ||W||²

λ controls regularization strength (typically 1e-4 to 1e-2)

When to use: Almost always as a baseline regularizer. Use AdamW for proper implementation with adaptive optimizers.
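In a from-scratch implementation like the one in this post, L2 regularization is just an extra term added to the loss and to each weight gradient. A minimal sketch (the λ value is illustrative):

import numpy as np

def l2_regularize(loss, weights, grads, lam=1e-4):
    """Add the (λ/2)·Σ||W||² penalty to the loss and λ·W to each weight gradient."""
    penalty = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    new_grads = [g + lam * W for g, W in zip(grads, weights)]
    return loss + penalty, new_grads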

L1 Regularization (Lasso)

Penalizes the absolute value of weights. Unlike L2, this encourages sparsity—many weights become exactly zero, effectively performing feature selection.

L_total = L + λ Σ |W|

Promotes sparsity in weights

When to use: When you want interpretable sparse models or automatic feature selection. Less common than L2 in deep learning.

Dropout

During training, randomly set a fraction of neurons to zero. This prevents neurons from co-adapting and forces the network to learn redundant representations. In the classic formulation you scale activations by the keep probability at test time; with inverted dropout (shown below) you scale during training instead, so nothing changes at inference.

dropout.py
import numpy as np

def dropout_forward(a, dropout_rate, training=True):
    """Apply dropout during forward pass."""
    if not training or dropout_rate == 0:
        return a, None

    # Create binary mask
    keep_prob = 1 - dropout_rate
    mask = np.random.binomial(1, keep_prob, size=a.shape) / keep_prob

    # Apply mask (inverted dropout - scale during training)
    return a * mask, mask

def dropout_backward(da, mask):
    """Backprop through dropout."""
    return da * mask if mask is not None else da

When to use: Default regularizer for fully connected layers. Typical rate: 0.5 for hidden layers, 0.2 for input layer. Less common in CNNs (use batch norm instead).

Early Stopping

Monitor validation loss during training. When it stops improving (or starts increasing), stop training and use the best model. This is one of the most effective and simple regularization techniques.

early_stopping.py
class EarlyStopping:
    def __init__(self, patience=10, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def __call__(self, val_loss, model):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            self.best_weights = model.get_weights()  # Save best
        else:
            self.counter += 1

        if self.counter >= self.patience:
            model.set_weights(self.best_weights)  # Restore best
            return True  # Stop training
        return False

When to use: Always! There's no reason not to use early stopping. Set patience high enough (10-20 epochs) to avoid stopping too early.

Data Augmentation

Create synthetic training examples by applying transformations that preserve labels. For images: rotations, flips, crops, color jittering. For text: synonym replacement, back-translation. Effectively increases dataset size without collecting more data.

Common augmentations:
  • Images: Random crop, flip, rotation, color jitter, mixup, cutout
  • Text: Synonym replacement, random deletion, back-translation
  • Audio: Time stretching, pitch shifting, noise injection
  • Tabular: SMOTE, feature noise, mixup

When to use: Whenever you have limited data. One of the most effective ways to improve generalization, especially for images.
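As a small illustration for images, here is a sketch of two of the augmentations listed above, a random horizontal flip and a random crop, written directly in NumPy; the crop size and flip probability are arbitrary choices:

import numpy as np

def augment_image(img, crop_size=28, flip_prob=0.5, rng=np.random):
    """Random horizontal flip, then a random crop. img: (H, W, C) array."""
    if rng.rand() < flip_prob:
        img = img[:, ::-1, :]                   # flip left-right
    h, w = img.shape[:2]
    top = rng.randint(0, h - crop_size + 1)     # random crop origin
    left = rng.randint(0, w - crop_size + 1)
    return img[top:top + crop_size, left:left + crop_size, :]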

Technique | How It Works | Typical Settings
L2 (Weight Decay) | Penalize large weights | λ = 1e-4 to 1e-2
L1 | Encourage sparse weights | λ = 1e-5 to 1e-3
Dropout | Randomly zero neurons | p = 0.2 to 0.5
Early Stopping | Stop when val loss stops improving | patience = 10-20
Data Augmentation | Create synthetic training data | Domain-specific

Interactive: Dropout Regularization

Toggle between training and inference to see dropout in action


During training, 50% of hidden neurons are randomly "dropped" (set to 0) each forward pass. This prevents co-adaptation.

Each training step sees a different “thinned” network, effectively training an ensemble of sub-networks; every forward pass draws a new random mask.

Normalization Techniques

Keeping activations stable

As data flows through a deep network, the distribution of activations can shift dramatically (this is called “internal covariate shift”). Normalization techniques stabilize these distributions, making training faster and more stable.

Batch Normalization

Normalize activations across the batch dimension. For each feature, compute mean and variance across all samples in the batch, then normalize. Learnable parameters γ and β allow the network to undo the normalization if needed.

x̂ = (x − μ_B) / √(σ²_B + ε),   y = γx̂ + β

μ_B and σ²_B computed per-feature across the batch

When to use: Standard in CNNs, applied after linear layers and before activations. Allows higher learning rates and acts as regularizer. Needs sufficiently large batch sizes (≥32).
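A minimal sketch of the batch-norm forward pass for a (features × batch) activation matrix, training mode only (the running statistics used at inference are omitted):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (features, batch). Normalize each feature across the batch, then scale and shift."""
    mu = np.mean(x, axis=1, keepdims=True)    # per-feature mean over the batch
    var = np.var(x, axis=1, keepdims=True)    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta               # learnable scale γ and shift β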

Layer Normalization

Normalize across features for each sample independently. Unlike batch norm, statistics are computed per-sample, so it works with any batch size and is essential for sequence models.

x̂ = (x − μ_L) / √(σ²_L + ε)

μ_L and σ²_L computed per-sample across features

When to use: Standard in transformers and RNNs. Works with batch size 1. Essential for sequence modeling where batch statistics don't make sense.
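The layer-norm counterpart differs only in the axis the statistics are computed over, per sample across features; a sketch using the same (features × batch) layout:

import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (features, batch). Normalize each sample across its features."""
    mu = np.mean(x, axis=0, keepdims=True)    # per-sample mean
    var = np.var(x, axis=0, keepdims=True)    # per-sample variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta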

Instance Normalization

Normalize per-sample, per-channel for images. Used in style transfer and GANs where you want to remove style information.

Group Normalization

Divide channels into groups, normalize within each group. Works with small batch sizes. Compromise between batch norm and instance norm.

RMSNorm

Simplified layer norm using only RMS, no mean centering. Faster and works well in large language models (LLMs).

Weight Normalization

Decouple weight magnitude from direction. Reparameterize W = g(v/||v||). Sometimes faster than batch norm.

Method | Normalizes Over | Best For
Batch Norm | Batch dimension | CNNs, large batch sizes
Layer Norm | Feature dimension | Transformers, RNNs
Instance Norm | Spatial dimensions | Style transfer, GANs
Group Norm | Channel groups | Small batch sizes, detection
RMSNorm | Feature dimension (RMS only) | LLMs, efficiency-focused

Placement Matters

Batch norm is typically applied after the linear layer but before the activation function. However, some architectures (especially transformers) apply layer norm before the attention/feedforward layers (pre-norm) rather than after (post-norm).

Practical Considerations

Making it work in the real world

Theory is one thing; getting neural networks to actually work is another. Here are the practical techniques that separate working models from frustrating debugging sessions.

Gradient Clipping

Prevent exploding gradients by capping gradient magnitudes. Essential for RNNs and sometimes helpful for very deep networks or when using large learning rates.

gradient_clipping.py
import numpy as np

def clip_gradients_by_norm(grads, max_norm=1.0):
    """Clip gradients to maximum norm."""
    total_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    clip_coef = max_norm / (total_norm + 1e-6)

    if clip_coef < 1:
        grads = [g * clip_coef for g in grads]

    return grads

def clip_gradients_by_value(grads, clip_value=1.0):
    """Clip gradients to [-clip_value, clip_value]."""
    return [np.clip(g, -clip_value, clip_value) for g in grads]

When to use: Always for RNNs/LSTMs. Use norm clipping (typically max_norm=1.0 or 5.0) rather than value clipping. Monitor gradient norms during training.

Learning Rate Finding

The learning rate is the most important hyperparameter. Too high and training diverges; too low and training takes forever. The LR finder technique: start with a tiny LR and exponentially increase it while recording loss. Plot loss vs. LR and pick a value just before loss starts increasing.

lr_finder.py
def find_lr(model, train_data, min_lr=1e-7, max_lr=10, steps=100):
    """Find optimal learning rate using the LR range test."""
    # Assumes a train_one_batch(model, train_data, lr) helper that runs one
    # minibatch update at the given learning rate and returns the loss.
    lrs, losses = [], []
    lr = min_lr
    lr_mult = (max_lr / min_lr) ** (1 / steps)

    for i in range(steps):
        loss = train_one_batch(model, train_data, lr)
        lrs.append(lr)
        losses.append(loss)

        lr *= lr_mult
        if loss > 4 * min(losses):  # Stop if loss explodes
            break

    # Plot lrs vs losses, pick LR where loss is still decreasing
    return lrs, losses

Debugging Neural Networks

1. Overfit a single batch first. If your model can't memorize 10 samples, something is fundamentally wrong.
2. Check your data pipeline. Visualize inputs, verify labels, ensure normalization is correct. Most bugs are in data, not models.
3. Monitor gradient norms. If they vanish (→0) or explode (→∞), you have initialization or architecture problems.
4. Start simple. Get a small model working first, then scale up. Don't debug a 100-layer network.
5. Use gradient checking. Numerically verify your backprop implementation matches finite differences (a sketch follows below).
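Here is a minimal sketch of gradient checking (point 5) for the NeuralNetwork class built in this post, comparing backprop's gradient for a single weight against a centered finite difference; it assumes the compute_loss method used in the training loop:

import numpy as np

def grad_check(model, X, y, layer=0, i=0, j=0, eps=1e-5):
    """Compare backprop's dW[layer][i, j] with a finite-difference estimate."""
    model.forward(X)
    dW, _ = model.backward(y)
    analytic = dW[layer][i, j]

    def loss_at(w_val):
        original = model.weights[layer][i, j]
        model.weights[layer][i, j] = w_val
        y_pred = model.forward(X)
        model.weights[layer][i, j] = original
        return model.compute_loss(y, y_pred)

    w = model.weights[layer][i, j]
    numeric = (loss_at(w + eps) - loss_at(w - eps)) / (2 * eps)
    rel_error = abs(analytic - numeric) / max(abs(analytic) + abs(numeric), 1e-12)
    return analytic, numeric, rel_error  # rel_error below ~1e-5 usually means backprop is correct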

Hyperparameter Starting Points

Learning Rate
Adam: 3e-4 to 1e-3. SGD: 0.01 to 0.1. Use finder.
Batch Size
32-256 typical. Larger = faster but may need LR adjustment.
Hidden Size
Start with power of 2: 64, 128, 256, 512. Wider often helps.
Depth
Start shallow (2-3 layers). Add depth if needed.

Beyond Feedforward: CNNs and RNNs

Specialized architectures for different data

The feedforward networks we've discussed so far treat input as a flat vector. But many real-world data types have structure: images have spatial structure, sequences have temporal structure. Specialized architectures exploit this structure.

Convolutional Neural Networks (CNNs)

CNNs are designed for data with grid-like topology, especially images. Instead of fully connected layers, they use convolutions—small filters that slide across the input, detecting local patterns.

Convolution

Local pattern detection. A 3×3 filter detects edges, textures, shapes in local regions.

Pooling

Downsampling. Max pool takes the maximum in each region, providing translation invariance.

Hierarchy

Early layers detect edges; deeper layers detect objects, faces, concepts.

Key properties:
  • Parameter sharing: Same filter applied everywhere, drastically reducing parameters
  • Translation equivariance: Detects patterns regardless of position
  • Local connectivity: Each neuron only sees a small region (receptive field)

Use cases: Image classification, object detection, segmentation, video analysis, and increasingly text processing (1D convolutions).
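To make "a small filter sliding across the input" concrete, here is a minimal sketch of a single-channel 2D convolution (technically cross-correlation, as in most deep learning libraries) with no padding and stride 1:

import numpy as np

def conv2d(image, kernel):
    """Slide a (kh, kw) filter over an (H, W) image; returns an (H-kh+1, W-kw+1) feature map."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)  # dot product with the local patch
    return out

# Example filter: a vertical edge detector that responds where intensity changes left to right
edge_kernel = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)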

Recurrent Neural Networks (RNNs)

RNNs process sequential data by maintaining a hidden state that gets updated at each timestep. This allows them to remember information from earlier in the sequence.

h_t = tanh(W_hh·h_{t−1} + W_xh·x_t)

Hidden state updated with previous state and current input

Variants:
  • LSTM: Gated cells with forget/input/output gates. Solves vanishing gradient problem.
  • GRU: Simplified LSTM with fewer parameters. Often works just as well.
  • Bidirectional: Process sequence in both directions for context from past and future.

Use cases: Language modeling, machine translation, speech recognition, time series. Note: Transformers have largely replaced RNNs for most NLP tasks.
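A minimal sketch of the vanilla RNN recurrence above, unrolled over a sequence; the weight shapes are illustrative and the bias term is omitted, as in the formula:

import numpy as np

def rnn_forward(x_seq, W_hh, W_xh, h0):
    """x_seq: (T, input_dim). Returns the hidden state at every timestep."""
    h = h0
    states = []
    for x_t in x_seq:                         # one step per timestep
        h = np.tanh(W_hh @ h + W_xh @ x_t)    # h_t = tanh(W_hh·h_{t-1} + W_xh·x_t)
        states.append(h)
    return np.stack(states)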

Transformers (The New Standard)

Transformers have revolutionized deep learning. Instead of recurrence, they use self-attention to relate different positions in a sequence directly. This allows parallel processing and captures long-range dependencies more effectively.

Key components:
  • Self-attention: Each position attends to all other positions with learned weights
  • Multi-head attention: Multiple attention patterns capture different relationships
  • Positional encoding: Injects position information (since there's no recurrence)
  • Layer norm + residuals: Enables training very deep models (100+ layers)

Impact: Powers GPT, BERT, and virtually all modern LLMs. Also successful in vision (ViT), protein folding (AlphaFold), and many other domains.
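A minimal sketch of the self-attention operation at the heart of the transformer (a single head, no masking); in a real model, Q, K, and V come from learned linear projections of the input:

import numpy as np

def softmax(z, axis=-1):
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q, K, V: (seq_len, d_k). Every position attends to every other position."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # similarity of each query with each key
    weights = softmax(scores, axis=-1)       # rows sum to 1: how much each position attends where
    return weights @ V                       # weighted sum of the values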

Architecture | Best For | Key Innovation
Feedforward (MLP) | Tabular data, simple tasks | Universal approximation
CNN | Images, spatial data | Local pattern detection, parameter sharing
RNN/LSTM | Sequences, time series | Hidden state captures temporal patterns
Transformer | NLP, increasingly everything | Self-attention, parallel processing

Which Architecture to Use?

Images? Start with CNNs (ResNet, EfficientNet) or Vision Transformers for large datasets. Text/Sequences? Transformers for most tasks; LSTMs if you need low latency or small models. Tabular? Often gradient boosting beats neural nets, but MLPs work too. Unsure? Start simple (MLP), then add inductive biases (convolutions, attention) if needed.

We've covered neural networks from first principles: perceptrons, forward propagation, activation functions, loss functions, backpropagation, optimizers, initialization, regularization, normalization, and architecture variants. This is the foundation that powers everything from simple classifiers to GPT-4.

The field moves fast—new architectures, training techniques, and scaling laws emerge constantly. But the fundamentals we've covered here remain relevant. Understanding why things work lets you adapt to new developments and debug when things don't work.

Happy coding, and may your gradients always flow.