Bayes' theorem is one of the most fundamental formulas in statistics, and for good reason. It gives us a principled way to update our beliefs as we gather evidence, turning prior knowledge and new observations into posterior probabilities. In my life, I'm constantly using this theorem as a way of thinking. In machine learning, we can leverage this to build classifiers that think in terms of probabilities rather than boundaries.
Instead of asking “where should I draw the line between classes?” like logistic regression or SVMs, Naive Bayes asks “what's the probability this sample belongs to each class?” It's called “naive” because it makes a strong independence assumption that's often violated in practice, yet somehow it works remarkably well anyway. Let's build it from scratch and understand why this probabilistic approach is so effective.
Bayes' Theorem: The Foundation
The foundation of probabilistic reasoning
Everything starts with Bayes' theorem. All it's saying is that the probability of H given E equals the probability of E given H, times the prior probability of H, divided by the probability of E.
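In symbols:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$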
Update beliefs by combining prior knowledge with new evidence
For classification, we rewrite this as:
Probability of class given features
P(y|X) — Posterior
Probability of class y given features X. This is what we want to find.
P(X|y) — Likelihood
Probability of seeing features X if the class is y.
P(y) — Prior
Probability of class y before seeing any features. Just count class frequencies.
P(X) — Evidence
Probability of seeing features X. Same for all classes, so we can ignore it for comparison.
To classify a new sample, we compute P(y|X) for each class and pick the one with highest probability. Since P(X) is the same for all classes, we can simplify to:
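$$P(y \mid X) \propto P(X \mid y)\,P(y) \quad\Rightarrow\quad \hat{y} = \arg\max_{y}\, P(X \mid y)\,P(y)$$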
We just need to compute the likelihood and prior for each class
Interactive: Bayes' Theorem for Spam Detection
Adjust the probabilities to see how prior beliefs and evidence combine
What % of all emails are spam?
If it's spam, how likely does it contain “FREE”?
If it's not spam, how likely does it contain “FREE”?
Leaning spam, but not definitive.
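Under the hood, the widget is just plugging those three numbers into Bayes' theorem. Here's a minimal Python sketch of the same calculation, with made-up slider values:

```python
# A sketch of the calculation behind the interactive above,
# with made-up slider values (adjust to taste).
p_spam = 0.30              # prior: fraction of all emails that are spam
p_free_given_spam = 0.60   # likelihood: P("FREE" appears | spam)
p_free_given_ham = 0.05    # likelihood: P("FREE" appears | not spam)

# Evidence P("FREE") via the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' theorem: posterior probability the email is spam given it contains "FREE"
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(f"P(spam | 'FREE') = {p_spam_given_free:.2f}")   # ~0.84 with these numbers
```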
The “Naive” Independence Assumption
A simplifying assumption that works
Here's where the “naive” part comes in. Computing P(X|y) directly is hard when X has many features. If you have 10 binary features, there are 2¹⁰ = 1024 possible combinations. You'd need to estimate the probability of each combination for each class. Additionally, for a small to mid-sized dataset, each of those combinations would only have 1–2 examples at most.
The Naive Assumption
All features are conditionally independent given the class.
This is almost always false in reality. In spam detection, the presence of “free” and the presence of “money” in an email are clearly correlated. But the assumption makes the math tractable: instead of estimating one probability for every feature combination, we estimate one probability per feature.
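Formally, the likelihood factorizes into a product of per-feature probabilities:

$$P(X \mid y) = P(x_1, x_2, \ldots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$$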
So why does it work anyway?
- We only care about which class is most probable, not the exact probabilities
- Even if the independence assumption is violated, the ranking of classes is often still correct
- The assumption acts as a form of regularization that prevents overfitting
The General Algorithm
Simple yet powerful
For any Naive Bayes classifier, the algorithm is beautifully simple:
Training
1. Calculate P(y) for each class (just count frequencies)
2. Calculate P(xᵢ|y) for each feature i and class y
3. Store these probabilities

Prediction

1. For each class y, compute P(y) × ∏P(xᵢ|y)
2. Pick the class with the highest value (see the sketch after this list)
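Here's a minimal sketch of that skeleton in Python, with the per-feature likelihood left as a placeholder that each variant fills in differently (the function and argument names are just for illustration):

```python
def predict_one(x, classes, priors, likelihood):
    """Generic Naive Bayes prediction for a single sample.

    priors:     dict mapping class -> P(y), estimated from class frequencies
    likelihood: callable (value, feature_index, cls) -> P(x_i | y),
                supplied by whichever variant you are using
    """
    scores = {}
    for c in classes:
        score = priors[c]                      # start from the prior P(y)
        for i, value in enumerate(x):
            score *= likelihood(value, i, c)   # multiply in each P(x_i | y)
        scores[c] = score
    return max(scores, key=scores.get)         # class with the highest score
```

In practice the implementations below work with sums of log probabilities instead of this raw product, for reasons we'll get to shortly.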
How do we calculate P(xᵢ|y)? This depends on the type of features, which is why we have different variants of Naive Bayes.
Gaussian Naive Bayes
For continuous features
When features are continuous (like height, temperature, age), we assume they follow a Gaussian (normal) distribution within each class.
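The per-feature likelihood is the normal density with the class-specific mean and variance:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}}\exp\!\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$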
Gaussian PDF where μ is mean and σ² is variance of feature for class y
```python
def fit(self, X, y):
    """Train Gaussian NB: compute mean and variance per class."""
    X, y = np.array(X), np.array(y).flatten()
    self.classes_ = np.unique(y)
    n_features = X.shape[1]
    n_classes = len(self.classes_)

    # Initialize parameters
    self.theta_ = np.zeros((n_classes, n_features))   # means
    self.sigma_ = np.zeros((n_classes, n_features))   # variances
    self.class_prior_ = np.zeros(n_classes)

    for idx, c in enumerate(self.classes_):
        X_c = X[y == c]
        self.theta_[idx, :] = X_c.mean(axis=0)               # mean μ
        self.sigma_[idx, :] = X_c.var(axis=0)                # variance σ²
        self.class_prior_[idx] = X_c.shape[0] / X.shape[0]   # P(y)

    # Add smoothing to avoid division by zero
    self.epsilon_ = self.var_smoothing * np.var(X, axis=0).max()
    self.sigma_ += self.epsilon_
```

For prediction, we compute the log likelihood to avoid numerical underflow (multiplying many small probabilities leads to tiny numbers):
```python
def predict(self, X):
    """Predict using log probabilities for numerical stability."""
    X = np.array(X)
    log_probs = np.zeros((X.shape[0], len(self.classes_)))

    for idx in range(len(self.classes_)):
        # Log prior
        log_prior = np.log(self.class_prior_[idx])

        # Log of Gaussian PDF
        log_likelihood = -0.5 * np.sum(np.log(2.0 * np.pi * self.sigma_[idx, :]))
        log_likelihood -= 0.5 * np.sum(
            ((X - self.theta_[idx, :]) ** 2) / self.sigma_[idx, :],
            axis=1
        )

        log_probs[:, idx] = log_prior + log_likelihood

    return self.classes_[np.argmax(log_probs, axis=1)]
```

Interactive: Gaussian Naive Bayes Classification
Move the slider to classify a flower based on petal length using Gaussian distributions
Why Log Probabilities?
Multiplying hundreds of per-feature probabilities, each less than 1, quickly underflows to zero in floating point. Taking logs turns the product into a sum, and because log is monotonic, the class with the highest log probability is still the class with the highest probability.
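A toy illustration of the failure mode (just float64 arithmetic, nothing specific to Naive Bayes):

```python
import numpy as np

probs = np.full(1000, 0.01)    # 1000 small per-feature probabilities
print(np.prod(probs))          # 0.0 -- the product underflows float64
print(np.log(probs).sum())     # ~ -4605.17 -- the log-sum is perfectly representable
```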
Multinomial Naive Bayes
For count data
When features represent counts (like word frequencies in documents), we use the multinomial distribution. This is the go-to for text classification.
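The smoothed estimate for each feature's probability, with n the number of features (e.g. the vocabulary size), is:

$$P(x_i \mid y) = \frac{N_{yi} + \alpha}{N_y + \alpha n}$$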
N_yi = count of feature i in class y, N_y = total count in class y, α = smoothing
```python
def fit(self, X, y):
    """Train Multinomial NB with Laplace smoothing."""
    X, y = np.array(X), np.array(y).flatten()
    self.classes_ = np.unique(y)
    n_features = X.shape[1]

    self.feature_log_prob_ = np.zeros((len(self.classes_), n_features))
    self.class_prior_ = np.zeros(len(self.classes_))

    for idx, c in enumerate(self.classes_):
        X_c = X[y == c]

        # Feature counts with Laplace smoothing
        feature_count = X_c.sum(axis=0) + self.alpha
        total_count = feature_count.sum()

        # Store log probabilities for numerical stability
        self.feature_log_prob_[idx, :] = np.log(feature_count / total_count)
        self.class_prior_[idx] = X_c.shape[0] / X.shape[0]
```

Laplace Smoothing
The α parameter (usually α=1) prevents zero probabilities. Without smoothing, if a word never appears in spam emails during training, any email containing that word would have P(X|spam) = 0, making it impossible to classify as spam regardless of other strong spam indicators.
Example: Spam Detection
Ham: “meeting tomorrow”, “lunch plans”
P(“meeting”|spam) = (0+1)/(6+8) = 1/14 ≈ 0.07 (smoothed!)
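Prediction again works in log space: the score for each class is the log prior plus the count-weighted sum of log feature probabilities. A minimal sketch, assuming the `feature_log_prob_` and `class_prior_` attributes computed in `fit` above:

```python
def predict(self, X):
    """Predict by combining log priors with count-weighted log feature probabilities."""
    X = np.array(X)
    # shape (n_samples, n_classes); entry = sum_i count_i * log P(feature_i | y) + log P(y)
    log_probs = X @ self.feature_log_prob_.T + np.log(self.class_prior_)
    return self.classes_[np.argmax(log_probs, axis=1)]
```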
Bernoulli Naive Bayes
For binary features
When features are binary (present/absent), we use the Bernoulli distribution.
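Each feature then contributes a Bernoulli likelihood:

$$P(x_i \mid y) = p^{x_i}(1 - p)^{1 - x_i}$$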
p = P(feature present | class y)
This might look cryptic, but it's elegant. If feature is present (xᵢ=1), P = p. If feature is absent (xᵢ=0), P = 1−p. The formula handles both cases in one expression.
```python
def fit(self, X, y):
    """Train Bernoulli NB: estimate P(feature=1|class)."""
    X, y = np.array(X), np.array(y).flatten()
    self.classes_ = np.unique(y)
    n_features = X.shape[1]

    # Store both log P(x=1|y) and log P(x=0|y)
    self.feature_log_prob_ = np.zeros((len(self.classes_), 2, n_features))
    self.class_prior_ = np.zeros(len(self.classes_))

    for idx, c in enumerate(self.classes_):
        X_c = X[y == c]

        # Probability of feature being 1 (with smoothing)
        feature_count = X_c.sum(axis=0) + self.alpha
        total_count = X_c.shape[0] + 2 * self.alpha
        feature_prob = feature_count / total_count

        # Store log P(x=1|y) and log P(x=0|y)
        self.feature_log_prob_[idx, 0, :] = np.log(feature_prob)      # log P(1|y)
        self.feature_log_prob_[idx, 1, :] = np.log(1 - feature_prob)  # log P(0|y)
        self.class_prior_[idx] = X_c.shape[0] / X.shape[0]
```

Prediction uses both the presence and the absence of each feature:

```python
def predict(self, X):
    """Predict using both presence AND absence of features."""
    X = np.array(X)
    log_probs = np.zeros((X.shape[0], len(self.classes_)))

    for idx in range(len(self.classes_)):
        log_prior = np.log(self.class_prior_[idx])

        # Key difference: considers BOTH presence and absence
        log_likelihood = np.sum(
            X * self.feature_log_prob_[idx, 0, :] +        # when x=1
            (1 - X) * self.feature_log_prob_[idx, 1, :],   # when x=0
            axis=1
        )

        log_probs[:, idx] = log_prior + log_likelihood

    return self.classes_[np.argmax(log_probs, axis=1)]
```

Bernoulli vs Multinomial
That's the practical difference between the two, and it's visible in the predict code above: Bernoulli explicitly penalizes a class for features that are absent (the (1 − X) term), while Multinomial only accumulates evidence from the features that actually appear.
Which Variant to Use?
Matching variants to data
The specific variant of Naive Bayes you implement depends largely on the shape of your data. For example, it wouldn't make sense to use a Bernoulli distribution when you have continuous data like age or salary.
Interactive: Compare Naive Bayes Variants
Click each variant to see how it handles different data types
| Variant | Feature Type | Best For |
|---|---|---|
| Gaussian | Continuous | Iris, medical measurements, sensor data |
| Multinomial | Counts | Text classification, word frequencies |
| Bernoulli | Binary | Document classification, feature presence |
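If you'd rather not maintain a from-scratch implementation, scikit-learn ships the same three variants with a common fit/predict interface. A quick sketch with made-up toy data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 1, 0, 1])

# Continuous measurements -> Gaussian
X_cont = np.array([[5.1, 3.5], [6.2, 2.9], [4.9, 3.1], [6.7, 3.0]])
print(GaussianNB().fit(X_cont, y).predict([[6.0, 3.0]]))

# Word counts -> Multinomial
X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]])
print(MultinomialNB(alpha=1.0).fit(X_counts, y).predict([[1, 0, 1]]))

# Binary presence/absence -> Bernoulli
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB(alpha=1.0).fit(X_bin, y).predict([[1, 0, 1]]))
```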
Naive Bayes in Practice
Real-world considerations
Here are some things I've found that you should take note of before applying this algorithm.
⚠ Limitations
- Independence assumption often badly violated
- Sensitive to irrelevant features
- Can't capture feature interactions
- Probability estimates often poorly calibrated

✓ Strengths

- Excellent baseline: always try it first
- Works surprisingly well for text
- Handles minimal data without overfitting
- Real-time classification (extremely fast)
It's an excellent baseline
Always try Naive Bayes first. If your fancy deep learning model can't beat it, something's wrong.
It works surprisingly well for text
Despite obvious word dependencies, Naive Bayes is competitive with much more complex models for text classification. It's the default choice for spam filters.
Works with minimal data
When you have 50 training examples, Naive Bayes might be your only option that doesn't overfit.
Real-time classification
The speed makes it perfect for applications needing instant predictions on streaming data.
Feature selection matters
Since all features are multiplied together, irrelevant features can hurt performance. Remove noise.
The Bottom Line
Naive Bayes trades a strong, usually false independence assumption for a classifier that's fast, simple to implement, and surprisingly hard to beat as a baseline, especially on text. Match the variant to your feature types, keep irrelevant features out, and it will serve you well.
Happy coding, and don't be naive out there.