
ANOVA — Feel What the F-Statistic Really Means

Run a hundred experiments and the reason you can't just repeat t-tests becomes visible: the false-positive rate climbs from 5% to well over 10%.

I.07 / ANOVA


The chi-squared test quantified "category deviations." But how do we test whether means differ across three or more groups? Repeating t-tests inflates false positives — ANOVA compares all groups in a single test.

ANOVA compares "how different the groups are" vs. "how spread out each group is."
When the difference outweighs the spread, we conclude the groups really differ. Move the sliders to feel it.

Experiment Guide — Multiple Comparisons Problem
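  1. Step 1: Click ▶ 100 trials. This runs 100 experiments in which 3 identical groups are compared with 3 pairwise t-tests.
  2. Step 2: Check the red-dot ratio. If the three tests were independent, theory would predict 1 − 0.95³ ≈ 14.3%. What do you get?
  3. Step 3: Click ▶ 1000 trials. As trials accumulate, the rate settles far above 5% but a little below 14.3% (around 12%). The three comparisons share groups, so they are not independent.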

▶ Why You Can't Just Repeat t-Tests

First, experience why ANOVA is needed. With 3 identical groups (N(0,1), n=20), repeat 3 pairwise t-tests — how bad does the false-positive rate get?
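Below is a minimal stdlib-Python sketch of that simulation. It assumes each pair is compared with Student's pooled-variance t-test at α = 0.05; we don't know exactly which t-test the widget runs. With n = 20 per group, each test has 38 degrees of freedom, so the two-sided critical value is t ≈ 2.024. A trial counts as a false positive when at least one of the three tests rejects.

```python
import math
import random

N_PER_GROUP = 20
T_CRIT = 2.024  # two-sided alpha = 0.05 critical value of t with 2*20 - 2 = 38 df


def mean(xs):
    return sum(xs) / len(xs)


def pooled_t(a, b):
    """Student's two-sample t statistic with pooled variance."""
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    sp2 = ss / (len(a) + len(b) - 2)  # pooled variance estimate
    return (ma - mb) / math.sqrt(sp2 * (1 / len(a) + 1 / len(b)))


def false_positive_rate(trials, seed=0):
    """Share of trials in which at least one of the 3 pairwise t-tests
    rejects, although all three groups come from the same N(0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        g = [[rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)] for _ in range(3)]
        pairs = ((0, 1), (0, 2), (1, 2))
        if any(abs(pooled_t(g[i], g[j])) > T_CRIT for i, j in pairs):
            hits += 1  # at least one "significant" pair: a false positive
    return hits / trials


rate = false_positive_rate(10_000)
print(f"simulated family-wise false-positive rate: {rate:.3f}")
print(f"independence estimate 1 - 0.95**3:        {1 - 0.95 ** 3:.3f}")
```

Expect the simulated rate to land roughly around 0.12. That is clearly above 0.05, and a bit under the 0.143 independence estimate, because every pair shares a group with the other two pairs.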

The F-test (ANOVA) below is how we avoid the false-positive inflation you just saw in the simulation above.

Experiment Guide — Feel the F-Statistic
  1. Step 1: Set the between-group difference to zero → F hovers around 1. All groups look like one population.
  2. Step 2: Increase the between-group difference → F rises and p drops. Watch it cross the rejection threshold.
  3. Step 3: Increase the within-group spread → same difference, but F drops. "Real differences can hide in noise."
  4. Step 4: With some between-group difference in place, increase the sample size → F rises. Larger samples detect smaller effects (statistical power).
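Here is a minimal sketch of the four steps. It assumes three groups with true means −shift, 0 and +shift, a common within-group standard deviation sd, and n observations per group. The names shift, sd and n are hypothetical stand-ins for the sliders. A single experiment's F is noisy, so the sketch averages F over many simulated experiments. Under H₀ the expected value of an F(d₁, d₂) variable is d₂/(d₂ − 2), which is close to 1.

```python
import random


def f_statistic(groups):
    """One-way ANOVA F = MSB / MSW."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ssb / (k - 1)) / (ssw / (n_total - k))


def average_f(shift, sd, n, reps=2000, seed=1):
    """Mean F over `reps` simulated experiments: 3 groups of size n with
    true means -shift, 0, +shift and within-group standard deviation sd."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        groups = [[rng.gauss(mu, sd) for _ in range(n)] for mu in (-shift, 0.0, shift)]
        total += f_statistic(groups)
    return total / reps


f_null = average_f(shift=0.0, sd=1.0, n=20)   # Step 1: no difference
f_diff = average_f(shift=0.5, sd=1.0, n=20)   # Step 2: add a difference
f_noisy = average_f(shift=0.5, sd=2.0, n=20)  # Step 3: same difference, more spread
f_big_n = average_f(shift=0.5, sd=1.0, n=80)  # Step 4: same difference, larger n
print(f"{f_null:.2f} {f_diff:.2f} {f_noisy:.2f} {f_big_n:.2f}")
```

The averages should follow the same ordering as the guide. The null case sits near 1, doubling the spread pulls F down, and quadrupling n pushes it up.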

▶ Between vs. Within — Feel the F-Statistic


// ANOVA Table

Source        SS          df     MS                 F          p
Between (B)   SSB         k−1    MSB = SSB/(k−1)    MSB/MSW    P(F(k−1, N−k) > F)
Within (W)    SSW         N−k    MSW = SSW/(N−k)
Total         SSB + SSW   N−1
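As a rough illustration of how the table is filled in, here is a stdlib sketch that builds the SS, df, MS and F columns from raw group data. The function name anova_table is our own. The p column is left out because the standard library has no F-distribution CDF. If SciPy is available, scipy.stats.f.sf(F, df_b, df_w) gives that p, and scipy.stats.f_oneway returns F and p directly.

```python
def anova_table(groups):
    """One-way ANOVA table as {source: {column: value}}."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))  # between
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)     # within
    df_b, df_w = k - 1, n_total - k
    msb, msw = ssb / df_b, ssw / df_w
    return {
        "Between": {"SS": ssb, "df": df_b, "MS": msb, "F": msb / msw},
        "Within": {"SS": ssw, "df": df_w, "MS": msw},
        "Total": {"SS": ssb + ssw, "df": n_total - 1},
    }


table = anova_table([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # toy data, for illustration only
print(f"{'Source':<9}{'SS':>8}{'df':>4}{'MS':>8}{'F':>8}")
for source, row in table.items():
    ms = f"{row['MS']:.2f}" if "MS" in row else ""
    f = f"{row['F']:.2f}" if "F" in row else ""
    print(f"{source:<9}{row['SS']:>8.2f}{row['df']:>4}{ms:>8}{f:>8}")
```

The toy data has group means 2, 5 and 8 and a grand mean of 5. That gives SSB = 54, SSW = 6, MSB = 27, MSW = 1 and F = 27.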

// What the F-statistic actually does

Suppose you want to compare test scores across three classes.
"The class averages are far apart" → maybe the teaching method matters.
"But the scores within each class also vary a lot" → could just be noise.

The F-statistic puts a number on that comparison:
between-group spread ÷ within-group spread — that's really all it is.
The bigger the numerator relative to the denominator, the more evidence that the groups genuinely differ.

Here's the formula: F = MSB ÷ MSW = (SSB ÷ dfB) ÷ (SSW ÷ dfW), with dfB = k − 1 and dfW = N − k. Don't worry about memorizing it right away. Try dragging the graph above and watching how F reacts first.

· SSB (between-group) = how far each group mean is from the overall mean
· SSW (within-group) = how much data scatter inside each group
· Dividing by df gives MS — "variation per degree of freedom"
· F ≈ 1 → "no evidence of differences"; F large → "groups likely differ"
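For reference, here are the same bullets written out as sums. The notation is assumed here, not taken from the page. Group i (of k groups) has n_i observations x_ij with mean x̄_i, x̄ is the grand mean, and N = n_1 + … + n_k.

```latex
\begin{aligned}
\mathrm{SSB} &= \sum_{i=1}^{k} n_i\,(\bar{x}_i - \bar{x})^2, &
\mathrm{SSW} &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2, \\
\mathrm{MSB} &= \frac{\mathrm{SSB}}{k-1}, &
\mathrm{MSW} &= \frac{\mathrm{SSW}}{N-k}, \\
F &= \frac{\mathrm{MSB}}{\mathrm{MSW}}.
\end{aligned}
```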

How large is "large enough"? That depends on degrees of freedom and significance level α.
Check critical values in the Interactive Distribution Tables.

// Walk through it — test-score example

Three teaching methods A, B, C with 15 students each.

① Look at the means
 Group A = 65, Group B = 72, Group C = 73, Grand mean = 70

② Between-group SSB
 n × (group mean − grand mean)², summed over the groups
 = 15×(65−70)² + 15×(72−70)² + 15×(73−70)² = 15×(25+4+9) = 570

③ Within-group SSW
 Sum of squared deviations of each score from its own group mean; suppose it comes to 2520

④ Degrees of freedom
 dfB = k−1 = 3−1 = 2
 dfW = N−k = 45−3 = 42

⑤ Compute MS and F
 MSB = 570÷2 = 285, MSW = 2520÷42 = 60
 F = 285÷60 = 4.75

⑥ Verdict
 The critical value of F(2, 42) at α = 0.05 is about 3.22.
 4.75 > 3.22 → Reject H₀ (at least one group differs).
 — But we still don't know which groups differ (→ post-hoc tests needed).
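The arithmetic in steps ① to ⑥ can be checked with a few lines of stdlib Python, taking SSW = 2520 as given, since that is the page's own assumed value. For the critical value and the p-value, the sketch uses a closed form that holds only when the numerator has 2 degrees of freedom: P(F(2, d) > f) = (1 + 2f/d)^(−d/2). For other numbers of groups you would need a general F-distribution routine such as scipy.stats.f.sf or scipy.stats.f.ppf.

```python
def f_from_summary(means, n_per_group, ssw):
    """SSB, MSB, MSW, F and dfW from equal-size group means and a given SSW."""
    k = len(means)
    n_total = k * n_per_group
    grand = sum(means) / k  # equal group sizes, so grand mean = mean of means
    ssb = sum(n_per_group * (m - grand) ** 2 for m in means)
    df_b, df_w = k - 1, n_total - k
    msb, msw = ssb / df_b, ssw / df_w
    return ssb, msb, msw, msb / msw, df_w


def f2_sf(f, df_w):
    """P(F > f) for F(2, df_w); this closed form is valid only for dfB = 2."""
    return (1 + 2 * f / df_w) ** (-df_w / 2)


def f2_critical(alpha, df_w):
    """Critical value c with P(F(2, df_w) > c) = alpha (inverse of f2_sf)."""
    return (alpha ** (-2 / df_w) - 1) * df_w / 2


ssb, msb, msw, f, df_w = f_from_summary([65, 72, 73], 15, ssw=2520)
print(ssb, msb, msw, f)                     # 570.0 285.0 60.0 4.75
print(round(f2_critical(0.05, df_w), 2))    # 3.22
print(round(f2_sf(f, df_w), 3))             # 0.014
```

The p-value of about 0.014 is below α = 0.05. That agrees with the verdict in ⑥, because 4.75 lies beyond the 3.22 cutoff.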

// Easy-to-trip-on points

"ANOVA tells you which groups differ" — not quite

ANOVA only tells you "there's a difference somewhere." To pin down which pairs differ, you need a post-hoc procedure such as Tukey's HSD or pairwise tests with a Bonferroni correction.
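The Bonferroni side is simple enough to sketch in a few lines. With k groups there are m = k(k − 1)/2 pairs. Testing each pair at α/m keeps the family-wise error at or below α by the union bound, whatever the dependence between the tests. The helper names are our own. The 1 − (1 − a)^m figure is only the rough independence estimate. Tukey's HSD needs the studentized range distribution, which is beyond a short stdlib sketch.

```python
def pair_count(k):
    """Number of distinct group pairs among k groups: k(k - 1) / 2."""
    return k * (k - 1) // 2


def bonferroni_level(alpha, k):
    """Per-comparison significance level for all pairwise tests among k groups."""
    return alpha / pair_count(k)


def independence_fwer(per_test_alpha, m):
    """Family-wise error if the m tests were independent: 1 - (1 - a)^m."""
    return 1 - (1 - per_test_alpha) ** m


for k in (3, 4, 5):
    m = pair_count(k)
    level = bonferroni_level(0.05, k)
    print(f"k={k}: {m:2d} pairs | uncorrected ~{independence_fwer(0.05, m):.3f} "
          f"| Bonferroni: test each at {level:.4f} -> ~{independence_fwer(level, m):.3f}")
```

The number of pairs grows quickly with k, and so does the uncorrected error estimate. The corrected estimate stays just under 0.05.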

"Why not just run multiple t-tests?"

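Check the simulation at the top of this page. With 3 groups and repeated t-tests, the false-positive rate climbs well above 5% even when no real difference exists. The rate would be 1 − 0.95³ ≈ 14.3% if the three tests were independent, and it is a little lower in practice (around 12%) because the tests share groups. ANOVA keeps the overall α at 5%.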

"Bigger F = bigger effect" — careful

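A large sample size can inflate F even when the real difference is tiny. To measure how large the effect actually is, use an effect size such as η² = SSB ÷ SST. This is the share of the total variation (SST = SSB + SSW) that the groups explain.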
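To see this concretely, here is a small sketch that reuses the worked example's group means (65, 72, 73). It assumes a common within-group variance of 60, matching that example's MSW, and grows the per-group n. Each group then contributes (n − 1)·60 to SSW. The function name and these summary-statistics inputs are illustrative assumptions.

```python
def f_and_eta_squared(means, n, within_var):
    """F and eta-squared for equal-size groups, from group means, per-group n
    and a common within-group variance (summary-statistics shortcut)."""
    k = len(means)
    grand = sum(means) / k
    ssb = n * sum((m - grand) ** 2 for m in means)
    ssw = k * (n - 1) * within_var         # each group contributes (n - 1) * s^2
    f = (ssb / (k - 1)) / (ssw / (k * n - k))
    eta_sq = ssb / (ssb + ssw)             # share of total variation explained by groups
    return f, eta_sq


for n in (15, 150, 1500):
    f, eta_sq = f_and_eta_squared([65, 72, 73], n, within_var=60)
    print(f"n={n:>4}: F={f:7.2f}  eta^2={eta_sq:.3f}")
```

At n = 15 this reproduces F = 4.75. Multiplying n by 100 multiplies F by 100, while η² stays near 0.17 to 0.18.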

// Shapes you'll meet again

The same structure shows up again and again whenever ANOVA is around.

  • The shape of an ANOVA table — once SS and df are lined up, MS and F fall into the next columns automatically. The worked example above traces this exact shape
  • How degrees of freedom slot in — between = k−1, within = N−k, total = N−1. These three always appear together
  • Why repeated t-tests inflate — the multiple-comparisons story. The 1 − 0.95³ ≈ 14.3% estimate from the simulation reappears in this same shape
  • The three assumptions in the background — normality, homogeneity of variance, independence. Under H₀, F follows the F-distribution pictured above only when these three are in place
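As a quick check that the pieces always fit together, the sketch below shows that both sums of squares and degrees of freedom add up: SSB + SSW = SST and (k − 1) + (N − k) = N − 1. It uses simulated data with unequal group sizes. The group means 0, 0.5 and 1 and the sizes 12, 15 and 9 are arbitrary choices for illustration.

```python
import random


def partition(groups):
    """Return (SSB, SSW, SST) and (dfB, dfW, dfT) for a one-way layout."""
    k = len(groups)
    data = [x for g in groups for x in g]
    n_total = len(data)
    grand = sum(data) / n_total
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    sst = sum((x - grand) ** 2 for x in data)  # total scatter around the grand mean
    return (ssb, ssw, sst), (k - 1, n_total - k, n_total - 1)


rng = random.Random(42)
groups = [[rng.gauss(mu, 1.0) for _ in range(size)]
          for mu, size in ((0.0, 12), (0.5, 15), (1.0, 9))]
(ssb, ssw, sst), (df_b, df_w, df_t) = partition(groups)
print(f"SSB + SSW = {ssb + ssw:.6f}, SST = {sst:.6f}")
print(f"dfB + dfW = {df_b + df_w}, dfT = {df_t}")
```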