
Neural Networks

The architecture of intelligence. Backpropagation and beyond.

Written by Omansh
25 min read

Neural networks are perhaps the most transformative technology of our generation. They power everything from the voice assistant on your phone to the recommendation engine suggesting what you should watch next. They translate languages, generate art, write code, and diagnose diseases. Yet beneath all this complexity lies a surprisingly elegant idea: layers of simple computational units, each performing basic math, that together can learn to approximate virtually any function.

In this deep dive, we're going to build neural networks from scratch. Not just the forward pass—we'll derive backpropagation, implement various optimizers, understand why initialization matters, and explore the regularization techniques that make deep learning work in practice. By the end, you'll understand not just what neural networks do, but why each component exists and when to use different techniques.

This is going to be comprehensive. Grab some coffee.

Why Neural Networks?

We've already covered linear regression, logistic regression, decision trees, SVMs, and ensemble methods. They're all powerful in their own right. So why do we need neural networks?

The fundamental limitation of traditional models is their representation capacity. Linear models can only learn linear decision boundaries. SVMs with kernels can learn non-linear boundaries, but you have to choose the right kernel. Decision trees can capture complex interactions, but they're fundamentally axis-aligned and struggle with smooth functions.

Neural networks are universal function approximators. Given enough neurons and layers, they can approximate any continuous function to arbitrary precision. More importantly, they learn their own features. Instead of hand-engineering features like “is this pixel bright?” or “does this sentence contain the word free?”, neural networks learn hierarchical representations automatically from raw data.

Feature Learning

Automatically discover relevant features from raw data—no manual feature engineering required.

Hierarchical Representations

Learn simple patterns first, then combine them into increasingly complex concepts.

Universal Approximation

Can theoretically learn any continuous function given enough capacity.

The Deep Learning Revolution

What changed wasn't the theory—neural networks have existed since the 1950s. What changed was compute (GPUs), data (the internet), and a few key algorithmic insights (better initialization, activation functions, and optimizers). These made training deep networks practical.

The Perceptron: Where It All Began

The simplest neural network

Let's start at the very beginning. In 1958, Frank Rosenblatt invented the perceptron, inspired by how biological neurons work. A neuron receives inputs, processes them, and fires (or doesn't) based on whether the combined signal exceeds some threshold.

The perceptron does exactly this mathematically:

ŷ=step(w·x+b)

Weighted sum of inputs, then threshold

If the weighted sum of inputs plus bias exceeds zero, output 1. Otherwise, output 0. This is essentially a linear classifier with a hard threshold.

x: Input features (the data we receive)
w: Weights (learned parameters that determine importance)
b: Bias (allows shifting the decision boundary)
ŷ: Output prediction (0 or 1)

The perceptron learns through a simple update rule: if it makes a mistake, adjust the weights in the direction that would have given the correct answer.

perceptron.py
import numpy as np

class Perceptron:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.n_iterations):
            for idx, x_i in enumerate(X):
                # Compute prediction
                linear_output = np.dot(x_i, self.weights) + self.bias
                y_pred = 1 if linear_output > 0 else 0

                # Update weights if prediction is wrong (update is 0 when correct)
                update = self.lr * (y[idx] - y_pred)
                self.weights += update * x_i
                self.bias += update

    def predict(self, X):
        linear_output = np.dot(X, self.weights) + self.bias
        return np.where(linear_output > 0, 1, 0)

The XOR Problem

In 1969, Minsky and Papert showed that a single perceptron cannot learn the XOR function. This seemingly simple limitation caused the first “AI winter” and nearly killed neural network research. The solution? Multiple layers.

Going Deeper: Multi-Layer Networks

Adding depth to learn complexity

A single perceptron can only learn linearly separable patterns. But stack multiple layers of neurons together, and suddenly you can learn arbitrarily complex functions. This is the multi-layer perceptron (MLP), also called a feedforward neural network.

Network Architecture

Input

Receives raw features. One neuron per input feature.

Hidden

Learns intermediate representations. This is where the magic happens.

Output

Produces final predictions. Structure depends on the task.

Each layer transforms its input through a linear transformation followed by a non-linear activation function. Without the non-linearity, stacking layers would be pointless—multiple linear transformations collapse into a single linear transformation. The activation function is what gives neural networks their power.

a[l]=g(W[l]a[l-1]+b[l])

Each layer: linear transformation → non-linear activation

The notation can be confusing at first: a[l] is the activation (output) of layer l, W[l] is the weight matrix for layer l, and b[l] is the bias vector. The input layer is l=0, and a[0] = x (our input data).

Forward Propagation

From input to prediction

Forward propagation is simply computing the output of the network by passing data through each layer sequentially. For each layer, we compute a linear transformation and apply an activation function.

Step 1
Linear Transformation (Pre-activation)
z[l] = W[l] · a[l-1] + b[l]

Compute the weighted sum of inputs plus bias. This is sometimes called the 'logit' or 'pre-activation'.

Step 2
Non-Linear Activation
a[l] = g(z[l])

Apply the activation function element-wise. This introduces non-linearity, allowing the network to learn complex patterns.

Step 3
Repeat
For l = 1, 2, ..., L

Continue through all layers until we reach the output layer.

forward_propagation.py
import numpy as np

class NeuralNetwork:
    def __init__(self, layer_sizes, activation='relu'):
        """
        layer_sizes: list of integers, e.g., [784, 128, 64, 10]
        """
        self.layer_sizes = layer_sizes
        self.num_layers = len(layer_sizes)
        self.activation = activation

        # Initialize weights and biases
        self.weights = []
        self.biases = []
        for i in range(1, self.num_layers):
            W = np.random.randn(layer_sizes[i], layer_sizes[i-1]) * 0.01
            b = np.zeros((layer_sizes[i], 1))
            self.weights.append(W)
            self.biases.append(b)

    def forward(self, X):
        """Forward pass through the network."""
        self.activations = [X]  # Store for backprop
        self.z_values = []      # Pre-activation values

        a = X
        for i in range(len(self.weights) - 1):
            z = self.weights[i] @ a + self.biases[i]
            self.z_values.append(z)
            a = self._activate(z)  # Hidden layers use ReLU/etc
            self.activations.append(a)

        # Output layer (softmax for classification, linear for regression)
        z = self.weights[-1] @ a + self.biases[-1]
        self.z_values.append(z)
        a = self._output_activation(z)
        self.activations.append(a)

        return a

    def _activate(self, z):
        if self.activation == 'relu':
            return np.maximum(0, z)
        elif self.activation == 'sigmoid':
            return 1 / (1 + np.exp(-z))
        elif self.activation == 'tanh':
            return np.tanh(z)

    def _output_activation(self, z):
        # Softmax over classes (rows), stabilized by subtracting the max
        exp_z = np.exp(z - np.max(z, axis=0, keepdims=True))
        return exp_z / np.sum(exp_z, axis=0, keepdims=True)

Matrix Dimensions

Keep track of shapes! If layer l−1 has n[l−1] neurons and layer l has n[l] neurons, then W[l] has shape (n[l], n[l−1]). This is a common source of bugs when implementing neural networks from scratch.

Interactive: Neural Network Forward Pass

Adjust inputs to see activations flow through the network


Activation Functions

The non-linear magic

The activation function is what transforms a neural network from a fancy linear model into a universal function approximator. Without non-linearity, no matter how many layers you stack, the result is mathematically equivalent to a single linear transformation. Let's explore the most important activation functions and when to use each.

Sigmoid (Logistic)

σ(z) = 1 / (1 + e^(−z))

Output range: (0, 1)

✓ Pros
  • Smooth, differentiable everywhere
  • Output interpretable as probability
  • Good for output layer in binary classification
✗ Cons
  • Vanishing gradients for large |z|
  • Output not zero-centered
  • Computationally expensive (exp)

Tanh (Hyperbolic Tangent)

tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))

Output range: (-1, 1)

✓ Pros
  • Zero-centered outputs
  • Stronger gradients than sigmoid
  • Often works better in hidden layers
✗ Cons
  • Still has vanishing gradient problem
  • Computationally expensive

ReLU (Rectified Linear Unit)

ReLU(z)=max(0, z)

Output range: [0, ∞)

✓ Pros
  • No vanishing gradient for positive inputs
  • Computationally efficient
  • Sparse activation (biological plausibility)
  • De facto standard for hidden layers
✗ Cons
  • “Dying ReLU” problem (neurons stuck at 0)
  • Not zero-centered
  • Not differentiable at z=0

Leaky ReLU & Variants

LeakyReLU(z) = max(αz, z), where α ≈ 0.01

Small negative slope prevents dying neurons

Variants:
  • Parametric ReLU (PReLU): α is learned during training
  • ELU: α(e^z − 1) for z < 0, smoother than Leaky ReLU
  • SELU: Self-normalizing, automatically maintains mean ≈ 0 and var ≈ 1

Softmax (Output Layer)

softmax(z)_i = e^(z_i) / Σ_j e^(z_j)

Converts logits to probability distribution that sums to 1

Used exclusively for the output layer in multi-class classification. Each output represents P(class = i | input), and all outputs sum to 1.
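To make the formula concrete, here is a minimal NumPy sketch of softmax with the standard max-subtraction trick for numerical stability; the logit values are made up for illustration:

import numpy as np

def softmax(z):
    """Convert a vector of logits into a probability distribution."""
    z = z - np.max(z)  # subtracting the max doesn't change the result but prevents overflow
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, 1.0, 0.1])  # hypothetical logits for 3 classes
probs = softmax(logits)
print(probs)        # approximately [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0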

Activation | Use Case | Hidden Layers? | Output Layer?
ReLU | Default choice for most networks | ✓ Yes (default) | ✗ No
Leaky ReLU | When dying ReLU is a problem | ✓ Yes | ✗ No
Sigmoid | Binary classification output | ✗ Avoid | ✓ Binary
Tanh | RNNs, some hidden layers | Sometimes | ✗ Rarely
Softmax | Multi-class classification | ✗ Never | ✓ Multi-class
Linear (none) | Regression output | ✗ Never | ✓ Regression

The Vanishing Gradient Problem

Sigmoid and tanh “saturate” for large inputs—their gradients approach zero. When you multiply many small gradients during backpropagation, the signal vanishes, making it nearly impossible to train deep networks. ReLU solved this by having gradient = 1 for all positive inputs.

Interactive: Activation Functions Comparison

Select functions to compare their shapes and derivatives


Try moving x to ±5 and see how sigmoid/tanh saturate while ReLU grows linearly.

Loss Functions

Measuring how wrong we are

The loss function quantifies how far our predictions are from the truth. It's what we minimize during training. The choice of loss function depends on your task and has significant implications for both training dynamics and final performance.

Mean Squared Error (MSE)

For: Regression

L = (1/n) Σ (y − ŷ)²

Penalizes large errors heavily due to squaring

When to use: Standard choice for regression. The squared term makes it differentiable everywhere and penalizes outliers more heavily. Use with linear output activation.

Mean Absolute Error (MAE)

For: Regression (robust to outliers)

L = (1/n) Σ |y − ŷ|

More robust to outliers than MSE

When to use: When your data has outliers that shouldn't dominate training. Gradient is constant (±1), which can cause instability near the minimum.

Binary Cross-Entropy (Log Loss)

For: Binary Classification

L=−(1/n) Σ [y log(ŷ) + (1−y) log(1−ŷ)]

Heavily penalizes confident wrong predictions

When to use: Binary classification with sigmoid output. The logarithm creates a steep gradient when predictions are confident but wrong, enabling faster learning.

Categorical Cross-Entropy

For: Multi-class Classification

L = −(1/n) Σ Σ_c y_c log(ŷ_c)

Generalizes binary cross-entropy to multiple classes

When to use: Multi-class classification with softmax output. Labels should be one-hot encoded. Only the true class contributes to the loss.

Huber Loss (Smooth L1)

For: Regression (best of both worlds)

L=½(y−ŷ)² if |y−ŷ| ≤ δ, else δ|y−ŷ| − ½δ²

MSE for small errors, MAE for large errors

When to use: Regression with potential outliers. Combines the smooth gradients of MSE near zero with the robustness of MAE for large errors.

Loss Function | Task | Output Activation | Key Property
MSE | Regression | Linear | Penalizes large errors heavily
MAE | Regression | Linear | Robust to outliers
Huber | Regression | Linear | Best of MSE and MAE
Binary Cross-Entropy | Binary Classification | Sigmoid | Steep gradients for wrong predictions
Categorical Cross-Entropy | Multi-class | Softmax | Only true class contributes
loss_functions.py
import numpy as np

class LossFunctions:
    @staticmethod
    def mse(y_true, y_pred):
        return np.mean((y_true - y_pred) ** 2)

    @staticmethod
    def mse_derivative(y_true, y_pred):
        return 2 * (y_pred - y_true) / y_true.shape[0]

    @staticmethod
    def binary_cross_entropy(y_true, y_pred):
        eps = 1e-15  # Prevent log(0)
        y_pred = np.clip(y_pred, eps, 1 - eps)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    @staticmethod
    def binary_cross_entropy_derivative(y_true, y_pred):
        eps = 1e-15
        y_pred = np.clip(y_pred, eps, 1 - eps)
        return (y_pred - y_true) / (y_pred * (1 - y_pred)) / y_true.shape[0]

    @staticmethod
    def categorical_cross_entropy(y_true, y_pred):
        eps = 1e-15
        y_pred = np.clip(y_pred, eps, 1 - eps)
        return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

    @staticmethod
    def softmax_cross_entropy_derivative(y_true, y_pred):
        # For softmax + cross-entropy, the derivative simplifies beautifully
        return (y_pred - y_true) / y_true.shape[0]
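Huber loss appears in the comparison table but not in the snippet above. Here is a minimal sketch of how it could be added, following the piecewise definition given earlier; the δ = 1.0 default is an illustrative choice:

import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """MSE-like for small errors, MAE-like for large errors."""
    error = y_true - y_pred
    small = np.abs(error) <= delta
    squared = 0.5 * error ** 2                          # quadratic region
    linear = delta * np.abs(error) - 0.5 * delta ** 2   # linear region
    return np.mean(np.where(small, squared, linear))

def huber_derivative(y_true, y_pred, delta=1.0):
    """Gradient w.r.t. predictions: the error, clipped to [-delta, delta]."""
    return np.clip(y_pred - y_true, -delta, delta) / y_true.shape[0]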

The Softmax + Cross-Entropy Combo

When using softmax output with categorical cross-entropy loss, the combined gradient simplifies to just ŷ − y. This elegant simplification is one reason this combination is so widely used for classification.

Backpropagation: Learning from Mistakes

The algorithm that makes learning possible

Backpropagation is arguably the most important algorithm in deep learning. Without it, we couldn't train neural networks at all. It answers a deceptively simple question: how should we change each weight to reduce the loss?

A neural network might have millions of weights. Computing how each one affects the loss through numerical approximation (nudging each weight and seeing what happens) would be impossibly slow. Backpropagation computes all gradients in a single backward pass through the network, making training tractable.

The Core Problem

We have a loss function L that measures how wrong our predictions are. We want to find the gradient ∂L/∂W for every weight W in the network. But here's the challenge: a weight in layer 1 affects layer 2, which affects layer 3, and so on. The effect of changing one weight ripples through the entire network.

Backpropagation solves this by systematically tracking how changes propagate through the network using the chain rule from calculus.

The Chain Rule: The Heart of Backprop

The chain rule says: if y depends on u, and u depends on x, then the rate of change of y with respect to x is the product of the intermediate rates of change.

dy/dx=dy/du×du/dx

Intuition: Think of it like currency conversion. To convert dollars to yen, you might go dollars → euros → yen. The total exchange rate is the product of the individual rates. Similarly, to find how L changes with W, we multiply the rates of change through each intermediate variable.

Let's build intuition with a concrete example. Consider a tiny network with just 2 neurons:

Worked Example: A 2-Layer Network

Input → w₁ → z₁ → σ → a₁ → w₂ → z₂ → σ → ŷ → L
Forward pass:
z₁ = w₁ × x + b₁ ← linear transformation
a₁ = σ(z₁) ← activation
z₂ = w₂ × a₁ + b₂
ŷ = σ(z₂)
L = (y − ŷ)² ← MSE loss
Question: How does w₁ affect L?
Chain rule: w₁ → z₁ → a₁ → z₂ → ŷ → L
∂L/∂w₁ = ∂L/∂ŷ × ∂ŷ/∂z₂ × ∂z₂/∂a₁ × ∂a₁/∂z₁ × ∂z₁/∂w₁

Let's compute each piece:

∂L/∂ŷ = 2(ŷ − y)

How does loss change with prediction?

If prediction is too high (ŷ > y), this is positive → we need to decrease ŷ

∂ŷ/∂z₂ = σ'(z₂) = ŷ(1 − ŷ)

How does sigmoid output change with input?

Sigmoid derivative. Largest when ŷ ≈ 0.5, near zero when ŷ ≈ 0 or ŷ ≈ 1

∂z₂/∂a₁ = w₂

How does z₂ change with a₁?

Just the weight! Larger w₂ means a₁ has more influence on z₂

∂a₁/∂z₁ = σ'(z₁) = a₁(1 − a₁)

Sigmoid derivative again

This is where vanishing gradients come from—if a₁ is near 0 or 1, gradient is tiny

∂z₁/∂w₁ = x

How does z₁ change with w₁?

Just the input! If x is large, w₁ has a big impact on z₁

Multiply them all together:

∂L/∂w₁=2(ŷ−y) · ŷ(1−ŷ) · w₂ · a₁(1−a₁) · x

The full gradient through 2 layers

The Intuition

Every term in this product has meaning:
2(ŷ−y): How wrong were we? (Error signal)
ŷ(1−ŷ): How “confident” was the output? (Uncertainty)
w₂: How much does layer 1 influence layer 2? (Pathway strength)
a₁(1−a₁): How uncertain was layer 1? (Gradient flow)
x: What was the actual input? (Input credit)

Notice something beautiful: each layer only needs information from the layer above it. We compute ∂L/∂ŷ first, then use it to compute ∂L/∂z₂, then ∂L/∂a₁, and so on. This is why it's called backpropagation—we propagate error signals backward through the network.
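To make the worked example concrete, here is a small sketch that evaluates the chain-rule product for ∂L/∂w₁ and compares it against a finite-difference estimate; the specific values of x, y, and the weights are arbitrary:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w1, w2, b1, b2, x, y):
    a1 = sigmoid(w1 * x + b1)
    y_hat = sigmoid(w2 * a1 + b2)
    return (y - y_hat) ** 2

# Arbitrary values, for illustration only
x, y = 1.5, 1.0
w1, w2, b1, b2 = 0.4, -0.6, 0.1, 0.2

# Analytic gradient: 2(ŷ−y) · ŷ(1−ŷ) · w₂ · a₁(1−a₁) · x
a1 = sigmoid(w1 * x + b1)
y_hat = sigmoid(w2 * a1 + b2)
analytic = 2 * (y_hat - y) * y_hat * (1 - y_hat) * w2 * a1 * (1 - a1) * x

# Finite-difference estimate: nudge w₁ and see how the loss changes
eps = 1e-6
numeric = (loss(w1 + eps, w2, b1, b2, x, y) - loss(w1 - eps, w2, b1, b2, x, y)) / (2 * eps)

print(analytic, numeric)  # the two values agree to several decimal places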

The General Backpropagation Algorithm

For a network with L layers, define the “error” at layer l as:

δ[l]=∂L/∂z[l]

This is the gradient of the loss with respect to the pre-activation (before applying the activation function). It tells us: “how much does the loss change if we nudge the weighted sum at this layer?”

Step 1
Output Layer Error
δ[L] = ∂L/∂a[L] ⊙ g'(z[L])

Start at the output. The error combines how wrong we were (∂L/∂a) with how sensitive the activation was (g').

💡 For softmax + cross-entropy, this beautifully simplifies to just (ŷ − y)—the difference between prediction and truth.

Step 2
Backpropagate Error
δ[l] = (W[l+1]ᵀ · δ[l+1]) ⊙ g'(z[l])

Each layer receives error from the layer above, weighted by how much it contributed (via W), and scaled by activation sensitivity.

💡 W[l+1]ᵀ distributes 'blame' back to the neurons that contributed most. g'(z[l]) gates whether that neuron can learn.

Step 3
Weight Gradients
∂L/∂W[l] = δ[l] · a[l-1]ᵀ

The gradient for a weight is the error at this layer times the activation that fed into it.

💡 If the input (a[l-1]) was large and the error (δ) is large, this weight contributed a lot to the error—update it more.

Step 4
Bias Gradients
∂L/∂b[l] = δ[l]

Bias gradient is just the error itself (since z = Wa + b, and ∂z/∂b = 1).

💡 The bias shifts everything uniformly, so its gradient is just 'how much should we shift?'

The symbol ⊙ denotes element-wise multiplication (Hadamard product). This is crucial—we're not doing matrix multiplication here, but multiplying corresponding elements. Each neuron's error gets scaled by its own activation derivative.

Why δ · aᵀ for Weight Gradients?

This is often confusing, so let's think about it carefully. We have W with shape (neurons_out, neurons_in), δ with shape (neurons_out, batch_size), and a with shape (neurons_in, batch_size).

The gradient ∂L/∂W_ij tells us: “how much does the loss change if we increase the weight connecting input neuron j to output neuron i?”

The answer is: (error at output neuron i) × (activation at input neuron j). If both are large, this weight is doing a lot of damage and needs a big correction. The outer product δ · aᵀ computes this for all weight pairs simultaneously.

(neurons_out × batch) @ (batch × neurons_in) = (neurons_out × neurons_in)

The Gradient Flow Problem

Look at the backprop formula again: δ[l] = (W[l+1]ᵀ · δ[l+1]) ⊙ g'(z[l])

We're multiplying by g'(z) at every layer. For sigmoid, g'(z) is at most 0.25. After 10 layers: 0.25^10 ≈ 0.000001. The gradient has effectively vanished!

Vanishing Gradients

When g'(z) < 1 at each layer, gradients shrink exponentially. Early layers barely learn.

Solution: Use ReLU (g' = 1 for z > 0), residual connections, or better initialization.

Exploding Gradients

If ||W|| > 1 and g'(z) ≥ 1, gradients can grow exponentially. Weights become NaN.

Solution: Gradient clipping, proper initialization, batch normalization.

Here's a complete, annotated implementation of backpropagation:

backpropagation.py
# Methods of the NeuralNetwork class from forward_propagation.py

def backward(self, y_true):
    """Compute gradients via backpropagation."""
    m = y_true.shape[1]  # batch size

    self.dW = [None] * len(self.weights)
    self.db = [None] * len(self.biases)

    # Output layer error (for softmax + cross-entropy)
    delta = self.activations[-1] - y_true

    # Backpropagate through layers
    for l in reversed(range(len(self.weights))):
        a_prev = self.activations[l]

        # Weight and bias gradients
        self.dW[l] = (1/m) * (delta @ a_prev.T)
        self.db[l] = (1/m) * np.sum(delta, axis=1, keepdims=True)

        # Propagate error to previous layer
        if l > 0:
            delta = self.weights[l].T @ delta
            delta = delta * self._activation_derivative(self.z_values[l-1])

    return self.dW, self.db

def _activation_derivative(self, z):
    """Compute derivative of activation function."""
    if self.activation == 'relu':
        return (z > 0).astype(float)
    elif self.activation == 'sigmoid':
        s = 1 / (1 + np.exp(-z))
        return s * (1 - s)
    elif self.activation == 'tanh':
        return 1 - np.tanh(z) ** 2

Computational Efficiency

Backpropagation is efficient because it reuses computations. Each gradient only requires information from the layer above (which we just computed) and the forward pass values (which we cached). Computing all gradients takes roughly the same time as a single forward pass.

The full training loop puts forward and backward propagation together:

training_loop.py
def train(self, X, y, epochs=1000, learning_rate=0.01):
    """Full training loop."""
    for epoch in range(epochs):
        # Forward pass
        y_pred = self.forward(X)

        # Compute loss
        loss = self.compute_loss(y, y_pred)

        # Backward pass
        dW, db = self.backward(y)

        # Update parameters
        for l in range(len(self.weights)):
            self.weights[l] -= learning_rate * dW[l]
            self.biases[l] -= learning_rate * db[l]

        if epoch % 100 == 0:
            print(f"Epoch {epoch}, Loss: {loss:.4f}")

Interactive: Backpropagation Flow

Watch gradients flow backward through the network

  • Forward: input travels through the network
  • Error: compare output to target
  • Backward: error propagates back
  • Update: adjust weights to reduce error

Backpropagation sends the error signal backward through the network, allowing each weight to know how much it contributed to the mistake — and adjust accordingly.

Optimization Algorithms

Smarter ways to update weights

Vanilla gradient descent works, but it's often slow and can get stuck in local minima or saddle points. Modern optimizers use various techniques to converge faster and more reliably.

SGD with Momentum

Momentum accumulates gradients from previous steps, building up velocity in consistent directions. Think of a ball rolling downhill—it builds up speed and can roll through small bumps and dips.

v = βv + (1−β)∇L
w ← w − α·v

β ≈ 0.9 typically. Smooths out oscillations and accelerates convergence.

When to use: Almost always. Momentum rarely hurts and often helps significantly, especially with noisy gradients or ill-conditioned problems.

RMSprop

RMSprop adapts the learning rate for each parameter based on the history of its gradients. Parameters with large gradients get smaller learning rates; parameters with small gradients get larger learning rates.

s = βs + (1−β)(∇L)²
w ← w − α·∇L/√(s + ε)

Adaptive learning rates based on gradient history

When to use: Good for RNNs and when different parameters need very different learning rates. Generally superseded by Adam in practice.

Adam (Adaptive Moment Estimation)

Adam combines momentum and RMSprop—it maintains both a running average of gradients (momentum) and a running average of squared gradients (adaptive learning rates). It's the default optimizer for most deep learning applications.

m = β₁m + (1−β₁)∇L   (momentum)
v = β₂v + (1−β₂)(∇L)²   (adaptive lr)
m̂ = m/(1−β₁ᵗ), v̂ = v/(1−β₂ᵗ)   (bias correction)
w ← w − α·m̂/√(v̂ + ε)

β₁=0.9, β₂=0.999, ε=1e-8 typically. Bias correction is crucial for early steps.

When to use: Default choice for most problems. Works well out of the box with lr=0.001, β₁=0.9, β₂=0.999.
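Here is a minimal NumPy sketch of the Adam update above, applied to a list of parameter arrays; the hyperparameter defaults follow the values quoted in the text:

import numpy as np

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m, self.v, self.t = None, None, 0

    def step(self, params, grads):
        """Update each parameter array in place, given its gradient."""
        if self.m is None:
            self.m = [np.zeros_like(p) for p in params]
            self.v = [np.zeros_like(p) for p in params]
        self.t += 1
        for i, (p, g) in enumerate(zip(params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g       # momentum
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g ** 2  # adaptive lr
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)                  # bias correction
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

In the training loop from earlier, this would replace the manual weight update; for instance, optimizer.step(self.weights + self.biases, dW + db) updates all parameters in one call.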

AdamW (Adam with Decoupled Weight Decay)

AdamW fixes a subtle issue with how Adam handles L2 regularization. In vanilla Adam, L2 regularization is added to the gradient, which interacts poorly with the adaptive learning rate. AdamW applies weight decay directly to the weights instead.

m = β₁m + (1−β₁)∇L   (momentum)
v = β₂v + (1−β₂)(∇L)²   (adaptive lr)
m̂ = m/(1−β₁ᵗ), v̂ = v/(1−β₂ᵗ)   (bias correction)
w ← w − α·(m̂/√(v̂ + ε) + λw)   (decoupled decay)

λ (weight decay) applied directly to weights, not through gradient. β₁=0.9, β₂=0.999.

When to use: State-of-the-art for transformers and large models. Preferred over Adam when using weight decay regularization.

Optimizer | Key Idea | Typical LR | Best For
SGD | Basic gradient descent | 0.01 - 0.1 | Simple problems, fine-tuning
SGD + Momentum | Accumulate gradient history | 0.01 - 0.1 | Computer vision (CNNs)
RMSprop | Adaptive per-parameter LR | 0.001 | RNNs
Adam | Momentum + Adaptive LR | 0.001 | Default for most tasks
AdamW | Adam + Decoupled weight decay | 0.001 | Transformers, large models

Learning Rate Schedules

Don't use a fixed learning rate! Common schedules include: Step decay (reduce by factor every N epochs), Cosine annealing (smooth decay to 0), Warm-up (start small, ramp up, then decay), and One-cycle (increase then decrease). Warmup is especially important for Adam/AdamW.
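As an illustration of the warm-up-then-decay pattern described above, here is a sketch of linear warm-up followed by cosine annealing; the step counts and base learning rate are placeholder values:

import numpy as np

def lr_schedule(step, base_lr=1e-3, warmup_steps=1000, total_steps=100_000):
    """Linear warm-up to base_lr, then cosine decay toward 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + np.cos(np.pi * progress))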

Weight Initialization

Starting from the right place

How you initialize weights has a massive impact on training. Initialize too small, and signals vanish. Initialize too large, and signals explode. The goal is to maintain the variance of activations and gradients as they propagate through layers.

The Problem with Bad Initialization

Too Small (e.g., N(0, 0.01))

Activations shrink exponentially with depth. By layer 10, signals are near zero. Gradients vanish and learning stops.

Too Large (e.g., N(0, 1))

Activations grow exponentially with depth. By layer 10, values overflow (NaN). Gradients explode and training diverges.

Xavier/Glorot Initialization

Designed for tanh and sigmoid activations. The variance is set to preserve signal magnitude in both forward and backward passes.

WN(0, 2/(nin + nout))

For tanh/sigmoid activations

When to use: Tanh or sigmoid activations in hidden layers. Also works reasonably with linear layers.

He/Kaiming Initialization

Modified for ReLU activations, which zero out half the inputs. The variance is doubled to account for this.

WN(0, 2/nin)

For ReLU and variants

When to use: ReLU, Leaky ReLU, and other rectified activations. This is the default for modern deep networks.

initialization.py
import numpy as np

def initialize_weights(layer_sizes, activation='relu'):
    """Initialize weights using appropriate scheme."""
    weights, biases = [], []

    for i in range(1, len(layer_sizes)):
        fan_in = layer_sizes[i-1]
        fan_out = layer_sizes[i]

        if activation in ['relu', 'leaky_relu']:
            # He initialization
            std = np.sqrt(2.0 / fan_in)
        else:
            # Xavier initialization
            std = np.sqrt(2.0 / (fan_in + fan_out))

        W = np.random.randn(fan_out, fan_in) * std
        b = np.zeros((fan_out, 1))  # Biases typically init to 0

        weights.append(W)
        biases.append(b)

    return weights, biases
Method | Formula | Use With
Xavier/Glorot | Var = 2/(n_in + n_out) | Tanh, Sigmoid
He/Kaiming | Var = 2/n_in | ReLU, Leaky ReLU
LeCun | Var = 1/n_in | SELU
Orthogonal | QR decomposition | RNNs, very deep nets
Biases are typically initialized to zero. For ReLU, some practitioners use small positive values (0.01) to ensure neurons are active initially, but zero usually works fine with proper weight initialization.

Regularization Techniques

Preventing overfitting

Neural networks have millions of parameters and can easily memorize training data instead of learning generalizable patterns. Regularization techniques add constraints or noise to prevent this overfitting.

L2 Regularization (Weight Decay)

Add a penalty proportional to the squared magnitude of weights. This encourages smaller weights, which leads to simpler, more generalizable models.

L_total = L + (λ/2) Σ ||W||²

λ controls regularization strength (typically 1e-4 to 1e-2)

When to use: Almost always as a baseline regularizer. Use AdamW for proper implementation with adaptive optimizers.
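In a from-scratch implementation like the one in this post, L2 regularization is just an extra term added to the loss and to each weight gradient. A minimal sketch (the λ value is illustrative):

import numpy as np

def l2_regularize(loss, weights, grads, lam=1e-4):
    """Add the (λ/2)·Σ||W||² penalty to the loss and λ·W to each weight gradient."""
    penalty = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    new_grads = [g + lam * W for g, W in zip(grads, weights)]
    return loss + penalty, new_grads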

L1 Regularization (Lasso)

Penalizes the absolute value of weights. Unlike L2, this encourages sparsity—many weights become exactly zero, effectively performing feature selection.

L_total = L + λ Σ |W|

Promotes sparsity in weights

When to use: When you want interpretable sparse models or automatic feature selection. Less common than L2 in deep learning.

Dropout

During training, randomly set a fraction of neurons to zero. This prevents neurons from co-adapting and forces the network to learn redundant representations. In the classic formulation you scale activations by the keep probability at test time; with inverted dropout (shown below) you scale during training instead, so nothing changes at inference.

dropout.py
import numpy as np

def dropout_forward(a, dropout_rate, training=True):
    """Apply dropout during forward pass."""
    if not training or dropout_rate == 0:
        return a, None

    # Create binary mask
    keep_prob = 1 - dropout_rate
    mask = np.random.binomial(1, keep_prob, size=a.shape) / keep_prob

    # Apply mask (inverted dropout - scale during training)
    return a * mask, mask

def dropout_backward(da, mask):
    """Backprop through dropout."""
    return da * mask if mask is not None else da

When to use: Default regularizer for fully connected layers. Typical rate: 0.5 for hidden layers, 0.2 for input layer. Less common in CNNs (use batch norm instead).

Early Stopping

Monitor validation loss during training. When it stops improving (or starts increasing), stop training and use the best model. This is one of the most effective and simple regularization techniques.

early_stopping.py
class EarlyStopping:
    def __init__(self, patience=10, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def __call__(self, val_loss, model):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            self.best_weights = model.get_weights()  # Save best
        else:
            self.counter += 1

        if self.counter >= self.patience:
            model.set_weights(self.best_weights)  # Restore best
            return True  # Stop training
        return False

When to use: Always! There's no reason not to use early stopping. Set patience high enough (10-20 epochs) to avoid stopping too early.

Data Augmentation

Create synthetic training examples by applying transformations that preserve labels. For images: rotations, flips, crops, color jittering. For text: synonym replacement, back-translation. Effectively increases dataset size without collecting more data.

Common augmentations:
  • Images: Random crop, flip, rotation, color jitter, mixup, cutout
  • Text: Synonym replacement, random deletion, back-translation
  • Audio: Time stretching, pitch shifting, noise injection
  • Tabular: SMOTE, feature noise, mixup

When to use: Whenever you have limited data. One of the most effective ways to improve generalization, especially for images.
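As a small illustration for images, here is a sketch of two of the augmentations listed above, a random horizontal flip and a random crop, written directly in NumPy; the crop size and flip probability are arbitrary choices:

import numpy as np

def augment_image(img, crop_size=28, flip_prob=0.5, rng=np.random):
    """Random horizontal flip, then a random crop. img: (H, W, C) array."""
    if rng.rand() < flip_prob:
        img = img[:, ::-1, :]                   # flip left-right
    h, w = img.shape[:2]
    top = rng.randint(0, h - crop_size + 1)     # random crop origin
    left = rng.randint(0, w - crop_size + 1)
    return img[top:top + crop_size, left:left + crop_size, :]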

Technique | How It Works | Typical Settings
L2 (Weight Decay) | Penalize large weights | λ = 1e-4 to 1e-2
L1 | Encourage sparse weights | λ = 1e-5 to 1e-3
Dropout | Randomly zero neurons | p = 0.2 to 0.5
Early Stopping | Stop when val loss stops improving | patience = 10-20
Data Augmentation | Create synthetic training data | Domain-specific

Interactive: Dropout Regularization

Toggle between training and inference to see dropout in action


During training, 50% of hidden neurons are randomly "dropped" (set to 0) each forward pass. This prevents co-adaptation.

Each training step sees a different “thinned” network, effectively training an ensemble of sub-networks; every forward pass draws a new random mask.

Normalization Techniques

Keeping activations stable

As data flows through a deep network, the distribution of activations can shift dramatically (this is called “internal covariate shift”). Normalization techniques stabilize these distributions, making training faster and more stable.

Batch Normalization

Normalize activations across the batch dimension. For each feature, compute mean and variance across all samples in the batch, then normalize. Learnable parameters γ and β allow the network to undo the normalization if needed.

x̂ = (x − μ_B) / √(σ²_B + ε),   y = γx̂ + β

μ_B and σ²_B computed per-feature across the batch

When to use: Standard in CNNs, applied after linear layers and before activations. Allows higher learning rates and acts as regularizer. Needs sufficiently large batch sizes (≥32).
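A minimal sketch of the batch-norm forward pass for a (features × batch) activation matrix, training mode only (the running statistics used at inference are omitted):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (features, batch). Normalize each feature across the batch, then scale and shift."""
    mu = np.mean(x, axis=1, keepdims=True)    # per-feature mean over the batch
    var = np.var(x, axis=1, keepdims=True)    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta               # learnable scale γ and shift β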

Layer Normalization

Normalize across features for each sample independently. Unlike batch norm, statistics are computed per-sample, so it works with any batch size and is essential for sequence models.

x̂ = (x − μ_L) / √(σ²_L + ε)

μ_L and σ²_L computed per-sample across features

When to use: Standard in transformers and RNNs. Works with batch size 1. Essential for sequence modeling where batch statistics don't make sense.
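The layer-norm counterpart differs only in the axis the statistics are computed over, per sample across features; a sketch using the same (features × batch) layout:

import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (features, batch). Normalize each sample across its features."""
    mu = np.mean(x, axis=0, keepdims=True)    # per-sample mean
    var = np.var(x, axis=0, keepdims=True)    # per-sample variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta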

Instance Normalization

Normalize per-sample, per-channel for images. Used in style transfer and GANs where you want to remove style information.

Group Normalization

Divide channels into groups, normalize within each group. Works with small batch sizes. Compromise between batch norm and instance norm.

RMSNorm

Simplified layer norm using only RMS, no mean centering. Faster and works well in large language models (LLMs).

Weight Normalization

Decouple weight magnitude from direction. Reparameterize W = g(v/||v||). Sometimes faster than batch norm.

Method | Normalizes Over | Best For
Batch Norm | Batch dimension | CNNs, large batch sizes
Layer Norm | Feature dimension | Transformers, RNNs
Instance Norm | Spatial dimensions | Style transfer, GANs
Group Norm | Channel groups | Small batch sizes, detection
RMSNorm | Feature dimension (RMS only) | LLMs, efficiency-focused

Placement Matters

Batch norm is typically applied after the linear layer but before the activation function. However, some architectures (especially transformers) apply layer norm before the attention/feedforward layers (pre-norm) rather than after (post-norm).

Practical Considerations

Making it work in the real world

Theory is one thing; getting neural networks to actually work is another. Here are the practical techniques that separate working models from frustrating debugging sessions.

Gradient Clipping

Prevent exploding gradients by capping gradient magnitudes. Essential for RNNs and sometimes helpful for very deep networks or when using large learning rates.

gradient_clipping.py
import numpy as np

def clip_gradients_by_norm(grads, max_norm=1.0):
    """Clip gradients to maximum norm."""
    total_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    clip_coef = max_norm / (total_norm + 1e-6)

    if clip_coef < 1:
        grads = [g * clip_coef for g in grads]

    return grads

def clip_gradients_by_value(grads, clip_value=1.0):
    """Clip gradients to [-clip_value, clip_value]."""
    return [np.clip(g, -clip_value, clip_value) for g in grads]

When to use: Always for RNNs/LSTMs. Use norm clipping (typically max_norm=1.0 or 5.0) rather than value clipping. Monitor gradient norms during training.

Learning Rate Finding

The learning rate is the most important hyperparameter. Too high and training diverges; too low and training takes forever. The LR finder technique: start with a tiny LR and exponentially increase it while recording loss. Plot loss vs. LR and pick a value just before loss starts increasing.

lr_finder.py
def find_lr(model, train_data, min_lr=1e-7, max_lr=10, steps=100):
    """Find optimal learning rate using the LR range test."""
    # Assumes a train_one_batch(model, train_data, lr) helper that runs one
    # minibatch update at the given learning rate and returns the loss.
    lrs, losses = [], []
    lr = min_lr
    lr_mult = (max_lr / min_lr) ** (1 / steps)

    for i in range(steps):
        loss = train_one_batch(model, train_data, lr)
        lrs.append(lr)
        losses.append(loss)

        lr *= lr_mult
        if loss > 4 * min(losses):  # Stop if loss explodes
            break

    # Plot lrs vs losses, pick LR where loss is still decreasing
    return lrs, losses

Debugging Neural Networks

1. Overfit a single batch first. If your model can't memorize 10 samples, something is fundamentally wrong.
2. Check your data pipeline. Visualize inputs, verify labels, ensure normalization is correct. Most bugs are in data, not models.
3. Monitor gradient norms. If they vanish (→0) or explode (→∞), you have initialization or architecture problems.
4. Start simple. Get a small model working first, then scale up. Don't debug a 100-layer network.
5. Use gradient checking. Numerically verify your backprop implementation matches finite differences (a sketch follows below).
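Here is a minimal sketch of gradient checking (point 5) for the NeuralNetwork class built in this post, comparing backprop's gradient for a single weight against a centered finite difference; it assumes the compute_loss method used in the training loop:

import numpy as np

def grad_check(model, X, y, layer=0, i=0, j=0, eps=1e-5):
    """Compare backprop's dW[layer][i, j] with a finite-difference estimate."""
    model.forward(X)
    dW, _ = model.backward(y)
    analytic = dW[layer][i, j]

    def loss_at(w_val):
        original = model.weights[layer][i, j]
        model.weights[layer][i, j] = w_val
        y_pred = model.forward(X)
        model.weights[layer][i, j] = original
        return model.compute_loss(y, y_pred)

    w = model.weights[layer][i, j]
    numeric = (loss_at(w + eps) - loss_at(w - eps)) / (2 * eps)
    rel_error = abs(analytic - numeric) / max(abs(analytic) + abs(numeric), 1e-12)
    return analytic, numeric, rel_error  # rel_error below ~1e-5 usually means backprop is correct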

Hyperparameter Starting Points

Learning Rate
Adam: 3e-4 to 1e-3. SGD: 0.01 to 0.1. Use finder.
Batch Size
32-256 typical. Larger = faster but may need LR adjustment.
Hidden Size
Start with power of 2: 64, 128, 256, 512. Wider often helps.
Depth
Start shallow (2-3 layers). Add depth if needed.

Beyond Feedforward: CNNs and RNNs

Specialized architectures for different data

The feedforward networks we've discussed so far treat input as a flat vector. But many real-world data types have structure: images have spatial structure, sequences have temporal structure. Specialized architectures exploit this structure.

Convolutional Neural Networks (CNNs)

CNNs are designed for data with grid-like topology, especially images. Instead of fully connected layers, they use convolutions—small filters that slide across the input, detecting local patterns.

Convolution

Local pattern detection. A 3×3 filter detects edges, textures, shapes in local regions.

Pooling

Downsampling. Max pool takes the maximum in each region, providing translation invariance.

Hierarchy

Early layers detect edges; deeper layers detect objects, faces, concepts.

Key properties:
  • Parameter sharing: Same filter applied everywhere, drastically reducing parameters
  • Translation equivariance: Detects patterns regardless of position
  • Local connectivity: Each neuron only sees a small region (receptive field)

Use cases: Image classification, object detection, segmentation, video analysis, and increasingly text processing (1D convolutions).
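To make "a small filter sliding across the input" concrete, here is a minimal sketch of a single-channel 2D convolution (technically cross-correlation, as in most deep learning libraries) with no padding and stride 1:

import numpy as np

def conv2d(image, kernel):
    """Slide a (kh, kw) filter over an (H, W) image; returns an (H-kh+1, W-kw+1) feature map."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)  # dot product with the local patch
    return out

# Example filter: a vertical edge detector that responds where intensity changes left to right
edge_kernel = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)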

Recurrent Neural Networks (RNNs)

RNNs process sequential data by maintaining a hidden state that gets updated at each timestep. This allows them to remember information from earlier in the sequence.

h_t = tanh(W_hh·h_{t−1} + W_xh·x_t)

Hidden state updated with previous state and current input

Variants:
  • LSTM: Gated cells with forget/input/output gates. Solves vanishing gradient problem.
  • GRU: Simplified LSTM with fewer parameters. Often works just as well.
  • Bidirectional: Process sequence in both directions for context from past and future.

Use cases: Language modeling, machine translation, speech recognition, time series. Note: Transformers have largely replaced RNNs for most NLP tasks.
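A minimal sketch of the vanilla RNN recurrence above, unrolled over a sequence; the weight shapes are illustrative and the bias term is omitted, as in the formula:

import numpy as np

def rnn_forward(x_seq, W_hh, W_xh, h0):
    """x_seq: (T, input_dim). Returns the hidden state at every timestep."""
    h = h0
    states = []
    for x_t in x_seq:                         # one step per timestep
        h = np.tanh(W_hh @ h + W_xh @ x_t)    # h_t = tanh(W_hh·h_{t-1} + W_xh·x_t)
        states.append(h)
    return np.stack(states)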

Transformers (The New Standard)

Transformers have revolutionized deep learning. Instead of recurrence, they use self-attention to relate different positions in a sequence directly. This allows parallel processing and captures long-range dependencies more effectively.

Key components:
  • Self-attention: Each position attends to all other positions with learned weights
  • Multi-head attention: Multiple attention patterns capture different relationships
  • Positional encoding: Injects position information (since there's no recurrence)
  • Layer norm + residuals: Enables training very deep models (100+ layers)

Impact: Powers GPT, BERT, and virtually all modern LLMs. Also successful in vision (ViT), protein folding (AlphaFold), and many other domains.
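A minimal sketch of the self-attention operation at the heart of the transformer (a single head, no masking); in a real model, Q, K, and V come from learned linear projections of the input:

import numpy as np

def softmax(z, axis=-1):
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q, K, V: (seq_len, d_k). Every position attends to every other position."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # similarity of each query with each key
    weights = softmax(scores, axis=-1)       # rows sum to 1: how much each position attends where
    return weights @ V                       # weighted sum of the values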

Architecture | Best For | Key Innovation
Feedforward (MLP) | Tabular data, simple tasks | Universal approximation
CNN | Images, spatial data | Local pattern detection, parameter sharing
RNN/LSTM | Sequences, time series | Hidden state captures temporal patterns
Transformer | NLP, increasingly everything | Self-attention, parallel processing

Which Architecture to Use?

Images? Start with CNNs (ResNet, EfficientNet) or Vision Transformers for large datasets. Text/Sequences? Transformers for most tasks; LSTMs if you need low latency or small models. Tabular? Often gradient boosting beats neural nets, but MLPs work too. Unsure? Start simple (MLP), then add inductive biases (convolutions, attention) if needed.

We've covered neural networks from first principles: perceptrons, forward propagation, activation functions, loss functions, backpropagation, optimizers, initialization, regularization, normalization, and architecture variants. This is the foundation that powers everything from simple classifiers to GPT-4.

The field moves fast—new architectures, training techniques, and scaling laws emerge constantly. But the fundamentals we've covered here remain relevant. Understanding why things work lets you adapt to new developments and debug when things don't work.

Happy coding, and may your gradients always flow.