
ANOVA — Feel What the F-Statistic Really Means

Why can't you just repeat t-tests for three or more groups? Move the sliders to see how the ratio of between-group to within-group variance becomes the F-statistic — and feel when it crosses the rejection threshold.

I.07 / ANOVA


The chi-squared test quantified "category deviations." But how do we test whether means differ across three or more groups? Repeating t-tests inflates false positives — ANOVA compares all groups in a single test.

F = between-group variance ÷ within-group variance.
When groups differ a lot and individuals within each group are consistent, F grows large — suggesting real differences. Move the sliders to feel it.

Experiment Guide — Multiple Comparisons Problem
  1. Click ▶ 100 trials — runs 100 experiments in which 3 identical groups are compared with 3 pairwise t-tests.
  2. Check the red-dot ratio — theory predicts 14.3%. What do you get?
  3. Click ▶ 1000 trials — as trials accumulate, the rate converges toward 14.3%.

▶ Why You Can't Just Repeat t-Tests

First, experience why ANOVA is needed. With 3 identical groups (N(0,1), n=20), repeat 3 pairwise t-tests — how bad does the false-positive rate get?
[Live readouts: Trials · False positives · FP rate]

// Why 14.3%?

Pairwise comparisons among 3 groups: m = 3 (A-B, B-C, A-C).
Each t-test has a 5% false-positive rate. Treating the three tests as independent, the probability that all three come out correct is 0.95³ ≈ 0.857.
So P(at least one false positive) = 1 − 0.857 ≈ 14.3%.
It gets worse as the number of groups grows: 5 groups → m = 10 → 1 − 0.95¹⁰ ≈ 40%; 10 groups → m = 45 → over 90%.

ANOVA solves this by comparing all groups simultaneously in a single test.
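If you want to replay the simulation offline, here is a minimal sketch, assuming NumPy and SciPy are installed; the setup (three N(0,1) groups, n = 20, α = 0.05) mirrors the experiment above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 20, 1000
false_positives = 0

for _ in range(trials):
    # Three identical groups -- H0 is true, so any rejection is a false positive.
    a, b, c = rng.standard_normal((3, n))
    pairs = ((a, b), (b, c), (a, c))
    # The experiment "errs" if ANY of the three pairwise t-tests rejects.
    if any(stats.ttest_ind(x, y).pvalue < alpha for x, y in pairs):
        false_positives += 1

print(f"Empirical familywise FP rate: {false_positives / trials:.1%}")
print(f"Independence approximation:   {1 - 0.95 ** 3:.1%}")  # 14.3%
```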

The F-test (ANOVA) below is how we avoid the false-positive inflation you just saw in the simulation above.

Experiment Guide — Feel the F-Statistic
  1. Set the between-group difference to zero → F ≈ 1. All groups look like one population.
  2. Increase the between-group difference → F rises, p drops. Watch it cross the rejection threshold.
  3. Increase the within-group spread → same difference, but F drops. "Real differences can hide in noise."
  4. Increase the sample size → F rises. Larger samples detect smaller effects (statistical power). A code sketch of all four steps follows this list.
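As a rough code analogue of the four steps, here is a sketch assuming SciPy's f_oneway; effect, noise, and n are hypothetical knobs standing in for the sliders.

```python
import numpy as np
from scipy import stats

def run(effect=0.0, noise=1.0, n=20, seed=0):
    """One draw of three groups with means -effect, 0, +effect and SD `noise`."""
    rng = np.random.default_rng(seed)
    groups = [rng.normal(loc=mu * effect, scale=noise, size=n)
              for mu in (-1, 0, 1)]
    f, p = stats.f_oneway(*groups)
    print(f"effect={effect:4.1f} noise={noise:3.1f} n={n:3d}  F={f:6.2f}  p={p:.4f}")

run(effect=0.0)             # Step 1: no real difference -> F near 1
run(effect=0.5)             # Step 2: means spread apart -> F rises, p drops
run(effect=0.5, noise=3.0)  # Step 3: same difference, more noise -> F drops
run(effect=0.5, n=200)      # Step 4: larger samples -> F rises again
```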

▶ Between vs. Within — Feel the F-Statistic

[Live readouts: F · Between df · Within df · p · SSB · SSW · η² · Decision]

// ANOVA Table

[Live ANOVA table — columns: Source · SS · df · MS · F · p; rows: Between (B) · Within (W) · Total]

// What the F-statistic actually does

Suppose you want to compare test scores across three classes.
"The class averages are far apart" → maybe the teaching method matters.
"But the scores within each class also vary a lot" → could just be noise.

The F-statistic puts a number on that comparison:
between-group spread ÷ within-group spread — that's really all it is.
The bigger the numerator relative to the denominator, the more evidence that the groups genuinely differ.

Here's the formula. Don't worry about memorizing it right away — try dragging the graph above and watching how F reacts first.

F = MSB ÷ MSW, where MSB = SSB ÷ dfB and MSW = SSW ÷ dfW

· SSB (between-group) = how far each group mean is from the overall mean
· SSW (within-group) = how much data scatter inside each group
· Dividing by df gives MS — "variation per degree of freedom"
· F ≈ 1 → "no evidence of differences"; F large → "groups likely differ"
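To see that this really is all F is, here is a from-scratch decomposition on made-up data, checked against SciPy's f_oneway (a sketch; the group means 0, 0.3, 0.6 are arbitrary).

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
groups = [rng.normal(mu, 1.0, 20) for mu in (0.0, 0.3, 0.6)]

grand = np.mean(np.concatenate(groups))
ssb = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)   # between-group SS
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)        # within-group SS
df_b = len(groups) - 1
df_w = sum(len(g) for g in groups) - len(groups)

F_manual = (ssb / df_b) / (ssw / df_w)
F_scipy, p = f_oneway(*groups)
print(F_manual, F_scipy)   # the two agree
```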

How large is "large enough"? That depends on degrees of freedom and significance level α.
Check critical values in the Interactive Distribution Tables.
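If you'd rather compute than look up, SciPy produces the same critical values (a sketch; the df pairs are just examples, including the (2, 42) used in the walkthrough below):

```python
# f.ppf returns the F value that leaves `alpha` probability in the upper tail.
from scipy.stats import f

alpha = 0.05
for df_between, df_within in [(2, 27), (2, 42), (4, 45)]:
    crit = f.ppf(1 - alpha, df_between, df_within)
    print(f"F({df_between}, {df_within}) critical value: {crit:.2f}")
# F(2, 42) prints ~3.22, matching the walkthrough below.
```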

// Walk through it — test-score example

Three teaching methods A, B, C with 15 students each.

① Look at the means
 Group A = 65, Group B = 72, Group C = 73, Grand mean = 70

② Between-group SSB
 n × (group mean − grand mean)² summed up
 = 15×(65−70)² + 15×(72−70)² + 15×(73−70)² = 15×(25+4+9) = 570

③ Within-group SSW
 Total scatter of data around their own group means = say 2520

④ Degrees of freedom
 dfB = k−1 = 3−1 = 2
 dfW = N−k = 45−3 = 42

⑤ Compute MS and F
 MSB = 570÷2 = 285, MSW = 2520÷42 = 60
 F = 285÷60 = 4.75

⑥ Verdict
 The critical value of F(2, 42) at α = 0.05 is about 3.22.
 4.75 > 3.22 → Reject H₀ (at least one group differs).
 — But we still don't know which groups differ (→ post-hoc tests needed).
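The same walkthrough in code, assuming SciPy; note that SSW is the example's stipulated 2520 rather than a value computed from raw scores.

```python
from scipy.stats import f

n, k = 15, 3
means, grand = [65, 72, 73], 70

ssb = sum(n * (m - grand) ** 2 for m in means)  # step 2: 570
ssw = 2520                                      # step 3: given
df_b, df_w = k - 1, n * k - k                   # step 4: 2 and 42
msb, msw = ssb / df_b, ssw / df_w               # step 5: 285 and 60
F = msb / msw                                   # 4.75
crit = f.ppf(0.95, df_b, df_w)                  # step 6: ~3.22

print(F > crit)  # True -> reject H0; at least one group differs
```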

// Easy-to-trip-on points

"ANOVA tells you which groups differ" — not quite

ANOVA only tells you "there's a difference somewhere." To pin down which pairs, you need post-hoc tests like Tukey's HSD or Bonferroni correction.
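As a sketch of what that post-hoc step looks like, assuming SciPy ≥ 1.8 (for tukey_hsd) and hypothetical score arrays for the three groups:

```python
import numpy as np
from scipy.stats import tukey_hsd

rng = np.random.default_rng(1)
a = rng.normal(65, 8, 15)  # hypothetical Group A scores
b = rng.normal(72, 8, 15)  # hypothetical Group B scores
c = rng.normal(73, 8, 15)  # hypothetical Group C scores

# All pairwise comparisons with the familywise error rate held at 5%.
print(tukey_hsd(a, b, c))
```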

"Why not just run multiple t-tests?"

Check the simulation at the top of this page. With 3 groups and repeated t-tests, the false-positive rate jumps to about 14.3% even when no real difference exists. ANOVA keeps the overall α at 5%.

"Bigger F = bigger effect" — careful

A large sample size can inflate F even for a tiny real difference. To measure how large the effect actually is, use effect size (η²).
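η² falls straight out of the ANOVA table; using the worked example's numbers (SSB = 570, SSW = 2520):

```python
# Eta squared: the share of total variation attributable to group membership.
ssb, ssw = 570, 2520
eta_sq = ssb / (ssb + ssw)
print(f"eta^2 = {eta_sq:.3f}")  # 0.184: groups explain ~18% of the variance
```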

// Exam patterns

These come up frequently on stats exams and certification tests.

  • Fill in an ANOVA table — given SS and df, compute MS → F → compare with critical value. The worked example above is the exact template
  • Degrees of freedom — between = k−1, within = N−k, total = N−1
  • "Why not repeat t-tests?" — explain the multiple-comparisons problem. The 14.3% from the simulation is a concrete talking point
  • Assumptions — normality, homogeneity of variance, independence
UP NEXT — M.01 Correlation: catching a relationship with a line