Chi-Squared Test — Goodness-of-Fit & Independence

Whether a die is fair, whether two categorical variables are independent — the same formula answers both. Observed minus expected, compressed into one number.

I.06 / CHI-SQUARED TEST

Chi-Squared Test — Quantifying the Gap

So far we've tested means. But some data is purely categorical: survey choices, dice outcomes, disease × smoking. The chi-squared test quantifies "deviation from expectation" for these counts. The bigger the mismatch between observed and expected counts, the larger the χ² statistic.

Goodness-of-fit asks: "Does the observed category distribution match a theoretical one?" Classic example: is the die fair?
Test of independence asks: "Are two categorical variables independent?" Compute χ² = Σ (O−E)²/E across every cell of the contingency table.
Why divide by E? → A deviation of 2 from an expected 10 matters more than 2 from an expected 1,000. Dividing by E turns raw gaps into relative ones.
Both use a χ²-distributed statistic; the p-value is the right-tail area. df = k−1 for goodness-of-fit, (r−1)(c−1) for independence.
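To put numbers on the "why divide by E" point, here is a minimal Python sketch. The function name `cell_contribution` is made up for illustration; it only evaluates (O−E)²/E for a single cell.

```python
def cell_contribution(observed, expected):
    """Return (O - E)^2 / E, one cell's share of the chi-squared statistic."""
    return (observed - expected) ** 2 / expected


# Same raw gap of 2, very different expected counts.
small_expectation = cell_contribution(12, 10)      # gap of 2 on E = 10
large_expectation = cell_contribution(1002, 1000)  # gap of 2 on E = 1,000

print(small_expectation)  # 0.4
print(large_expectation)  # 0.004
```

The same gap of 2 contributes a hundred times more to χ² when only 10 were expected.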

Experiment Guide — Goodness-of-Fit

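Step 1: Select "Fair die" and ▶ auto-roll → as n grows, χ² typically stays small and the p-value typically stays above α. (Even a fair die triggers a false rejection with probability α at any given n.)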
Step 2: Switch to "Loaded" and auto-roll → χ² shoots up, p-value drops below α. Cheat detected.
Step 3: Lower α to 0.01 → stricter threshold. Small deviations won't reject.
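The experiment can also be approximated offline. The sketch below is not the widget's code: the loading weights, the sample size of 600, and the seed are all assumptions chosen for illustration, and the helper names are hypothetical. It simulates rolls with Python's standard `random` module and compares χ² against the df = 5, α = 0.05 critical value of 11.07.

```python
import random

FACES = [1, 2, 3, 4, 5, 6]
CRITICAL_5DF_ALPHA_05 = 11.07  # chi-squared critical value for df = 5, alpha = 0.05


def roll_counts(weights, n, seed):
    """Roll a six-sided die n times with the given face weights; return counts per face."""
    rng = random.Random(seed)
    rolls = rng.choices(FACES, weights=weights, k=n)
    return [rolls.count(face) for face in FACES]


def chi_squared_fair(counts):
    """Chi-squared statistic against the fair-die expectation n/6 per face."""
    expected = sum(counts) / len(counts)
    return sum((o - expected) ** 2 / expected for o in counts)


fair_weights = [1, 1, 1, 1, 1, 1]
loaded_weights = [1, 1, 1, 1, 1, 5]  # hypothetical loading: face 6 comes up half the time

for label, weights in [("fair", fair_weights), ("loaded", loaded_weights)]:
    counts = roll_counts(weights, n=600, seed=1)
    stat = chi_squared_fair(counts)
    verdict = "reject" if stat > CRITICAL_5DF_ALPHA_05 else "fail to reject"
    print(label, counts, round(stat, 2), verdict)
```

With a loading this heavy the statistic lands far above 11.07. A milder loading would need more rolls before the test picks it up.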

▶ ① Goodness-of-Fit — Is the Die Fair?

[Interactive widget: click a bar to add one roll (Shift+click to subtract). Readouts: significance α (default 0.05), rolls n, test statistic χ², df, p-value, decision.]

Experiment Guide — Independence Test

Step 1: Click only the top-left cell to create imbalance → χ² spikes, independence is rejected.
Step 2: ↺ Reset and click cells evenly → χ² stays small. No imbalance = independence is not rejected.
Step 3: Change α to see "how much imbalance is needed to reject independence."

▶ ② Test of Independence — Are Two Variables Independent?

[Interactive widget: click a cell to add +1 (Shift+click for −1). Readouts: significance α (default 0.05), total n, test statistic χ², df, p-value, decision.]

Interactive Distribution Tables — look up χ² critical values for any df

// Formula used here — Goodness of fit
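Written out with an explicit index over the k categories, the goodness-of-fit statistic reads as follows. The subscript notation is an assumption added here; the text itself writes the same thing as χ² = Σ (O−E)²/E with df = k−1.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad df = k - 1
```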

Worked example (120 rolls of a die)
• A fair die should give E = 20 per face
• Actual results: [25, 17, 15, 23, 22, 18]
• Face 1: (25−20)²/20 = 1.25
• Face 2: (17−20)²/20 = 0.45 … sum all → χ² = 3.80
• df = 6−1 = 5, critical value at α = 0.05 is 11.07. Since 3.80 < 11.07 → fail to reject. Consistent with a fair die
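The arithmetic of this worked example can be checked with a few lines of Python. This is a plain-stdlib sketch; the variable names are chosen for illustration, and the critical value 11.07 is taken from the example rather than computed.

```python
observed = [25, 17, 15, 23, 22, 18]   # counts for faces 1..6 from the example
n = sum(observed)                     # 120 rolls
expected = n / len(observed)          # 20 per face for a fair die

contributions = [(o - expected) ** 2 / expected for o in observed]
chi2 = sum(contributions)
df = len(observed) - 1
critical = 11.07                      # df = 5, alpha = 0.05, as quoted in the example

print([round(c, 2) for c in contributions])  # [1.25, 0.45, 1.25, 0.45, 0.2, 0.2]
print(round(chi2, 2), df)                    # 3.8 5
print("reject" if chi2 > critical else "fail to reject")  # fail to reject
```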

Each part
• O: observed frequency (what you actually counted)
• E: expected frequency (what the theory predicts)
• (O−E)²/E: raw (O−E)² grows with category size, so categories with large expected counts would dominate the sum. Dividing by E gives "relative departure from expectation"
• Sum across all categories → a single number measuring overall fit

// Formula used here — Independence test
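For an r × c table, the same statistic is summed over every cell, with the expected count built from the margins. The symbols R_i (row total), C_j (column total), and N (grand total) are notation assumed here for "row total × column total ÷ grand total".

```latex
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
df = (r - 1)(c - 1)
```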

Worked example (2×2 table: gender × smoking)
• 200 men (60 smokers), 300 women (60 smokers), 500 total
• If gender and smoking are independent, expected male smokers?
• E = 200 × 120/500 = 48 ← row total × column total ÷ grand total
• "Under independence we'd expect 48, but observed 60" → the larger the gap, the stronger the evidence against independence
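The example stops at one expected count. As a sketch, the rest of the calculation can be carried through in Python. The non-smoker counts (140 and 240) are derived from the stated totals, and the comparison value 3.841 is the standard df = 1, α = 0.05 critical value, which the example itself does not quote.

```python
# Rows: men, women. Columns: smokers, non-smokers.
observed = [[60, 140],
            [60, 240]]

row_totals = [sum(row) for row in observed]         # [200, 300]
col_totals = [sum(col) for col in zip(*observed)]   # [120, 380]
grand_total = sum(row_totals)                       # 500

# Expected count per cell: row total x column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected)            # [[48.0, 152.0], [72.0, 228.0]]
print(round(chi2, 2), df)  # 6.58 1
```

Since 6.58 exceeds 3.841, these counts would reject independence at α = 0.05.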

Why df = (r−1)(c−1)
• In a 2×2 table: once row and column totals are fixed, choosing one cell determines the rest → df = 1
• A 3×4 table: df = 2 × 3 = 6
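The "one cell determines the rest" argument can be made concrete. In this sketch, `fill_2x2` and `df_independence` are hypothetical helper names; the margins come from the gender × smoking example.

```python
def fill_2x2(row_totals, col_totals, top_left):
    """Given fixed margins and the top-left cell, return the full 2x2 table."""
    top_right = row_totals[0] - top_left
    bottom_left = col_totals[0] - top_left
    bottom_right = row_totals[1] - bottom_left
    return [[top_left, top_right], [bottom_left, bottom_right]]


def df_independence(rows, cols):
    """Degrees of freedom for an r x c contingency table."""
    return (rows - 1) * (cols - 1)


print(fill_2x2([200, 300], [120, 380], top_left=60))  # [[60, 140], [60, 240]]
print(df_independence(2, 2), df_independence(3, 4))   # 1 6
```

Only `top_left` is free to vary, which is why the 2×2 table has a single degree of freedom.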

Caution: expected counts below 5
• The χ² approximation becomes unreliable. Rule of thumb: if more than 20% of cells have E < 5, use an exact test for small frequencies (for example, Fisher's exact test on a 2×2 table) or merge categories
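A quick way to apply this rule of thumb is to count the small cells before running the test. The helper names below are hypothetical, and the "sparse" table is invented data (the same proportions as the gender × smoking example, scaled down to 25 people).

```python
def share_of_small_cells(expected, threshold=5):
    """Fraction of cells in a table of expected counts that fall below threshold."""
    cells = [e for row in expected for e in row]
    return sum(1 for e in cells if e < threshold) / len(cells)


def chi_squared_looks_reliable(expected, max_share=0.20):
    """Apply the rule of thumb: at most 20% of cells may have E < 5."""
    return share_of_small_cells(expected) <= max_share


comfortable = [[48.0, 152.0], [72.0, 228.0]]  # expected counts, gender x smoking example
sparse = [[2.4, 7.6], [3.6, 11.4]]            # hypothetical: same proportions, 25 people

print(chi_squared_looks_reliable(comfortable))  # True
print(chi_squared_looks_reliable(sparse))       # False
```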

// Goodness of fit vs. independence test — what's the difference?

Both use χ² = Σ(O−E)²/E, but the question is different:

Goodness of fit: "Is this die fair?" "Does this data follow a normal distribution?" → compares one variable to a theoretical distribution
Independence test: "Is gender associated with smoking?" → examines the relationship between two variables

Expected frequencies are computed differently:

Goodness of fit: directly from theory (e.g., n/6 for a fair die)
Independence: row total × column total ÷ grand total (back-calculated from the independence assumption)

// Why always a right-tail test?

χ² values are sums of squared deviations, so they're always ≥ 0. Larger values mean worse fit.

Large χ² → data doesn't match the theory → reject
Small χ² → good fit → fail to reject
The rejection region is always in the right tail. There's no left-tail rejection
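The right-tail area can be computed without a statistics library. The sketch below assumes an integer df of moderate size. It starts from the closed forms for df = 1 and df = 2 and steps upward with the standard recurrence for the upper regularized incomplete gamma function. The function name `chi2_right_tail` is made up for illustration.

```python
import math


def chi2_right_tail(x, df):
    """P(X >= x) for a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Starting value: closed form for df = 1 (odd df) or df = 2 (even df).
    if df % 2 == 1:
        a, tail = 0.5, math.erfc(math.sqrt(h))
    else:
        a, tail = 1.0, math.exp(-h)
    # Recurrence Q(a + 1, h) = Q(a, h) + h^a e^(-h) / Gamma(a + 1); each step adds 2 to df.
    while a < df / 2.0:
        tail += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return tail


print(round(chi2_right_tail(3.80, 5), 2))   # die example: p is about 0.58
print(round(chi2_right_tail(11.07, 5), 3))  # at the critical value: about 0.05
```

The p-value shrinks as χ² moves right, so "reject when p < α" and "reject when χ² exceeds the critical value" are the same decision.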

// Common misconceptions

❌ "Significant χ² test = causal relationship"

All it can say is "there's an association." "Smoking is associated with lung cancer" ≠ "smoking causes lung cancer." Confounding variables can't be ruled out by a χ² test alone.

❌ "Both tests use df = k−1"

Goodness of fit uses k−1, but the independence test uses (r−1)(c−1). For a 2×3 table, df = 2, not 5.

❌ "Smaller χ² is a better result"

A small χ² means the data fits the theory well — but that's "good" or "bad" depending on what you're testing. The test just measures agreement, not quality.

// Shapes you'll meet again

The same building blocks recur whenever χ² tests are involved.

df for independence: always (r−1)(c−1). For a 2×3 table that is (2−1)(3−1) = 2
Expected frequencies: "row total × column total ÷ grand total" is the expected count in each cell under independence
How χ² is built: each category's (O−E)²/E gets added up. "Square the gap, divide by what was expected, sum" is the same recipe for goodness-of-fit and independence
The "E < 5" branch: when expected counts are small, switch to an exact test for small frequencies or merge categories. This check applies every time χ² is used

Look up χ² critical values by df in the Interactive Distribution Tables

