Chi-Squared Test — Goodness-of-Fit & Independence
Whether a die is fair, whether two categorical variables are independent — the same formula answers both. Observed minus expected, compressed into one number.
Chi-Squared Test — Quantifying the Gap
Goodness-of-fit asks: "Does the observed category distribution match a theoretical one?" Classic example: is the die fair?
Test of independence asks: "Are two categorical variables independent?" Compute χ² = Σ (O−E)²/E across every cell of the contingency table.
Why divide by E? → A deviation of 2 from an expected 10 matters more than 2 from an expected 1,000. Dividing by E turns raw gaps into relative ones.
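A quick numeric check of that claim, as a minimal Python sketch (the helper name `contribution` is ours, not something the page defines):

```python
def contribution(observed, expected):
    """One cell's term in the chi-squared sum: (O - E)^2 / E."""
    return (observed - expected) ** 2 / expected

# The same raw gap of 2, measured against two different expectations.
small_e = contribution(12, 10)        # 4 / 10
large_e = contribution(1002, 1000)    # 4 / 1000

print(small_e, large_e)
```

The raw gap is identical, but the term for the small expectation is 100 times larger, which is exactly the "relative" weighting described above.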
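Under the null hypothesis, both statistics approximately follow a χ² distribution; the p-value is the right-tail area beyond the observed χ². df = k−1 for goodness-of-fit (k categories), (r−1)(c−1) for independence (r rows, c columns).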
- Step 1: Select "Fair die" and ▶ auto-roll → as n grows, χ² typically stays small and the p-value usually stays above α (a fair die is still rejected by chance about α of the time).
- Step 2: Switch to "Loaded" and auto-roll → χ² shoots up and the p-value drops below α. The loaded die is detected.
- Step 3: Lower α to 0.01 → a stricter threshold. Small deviations no longer lead to rejection.
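The auto-roll experiment can be imitated offline. This is only a sketch under stated assumptions: the page does not say how its "Loaded" die is weighted, so the `LOADED` weights below are made up, and `roll_counts` and `chi2_gof` are hypothetical helper names.

```python
import random

def chi2_gof(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E over the categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def roll_counts(n, weights, rng):
    """Roll a six-sided die n times and count how often each face appears."""
    counts = [0] * 6
    for face in rng.choices(range(6), weights=weights, k=n):
        counts[face] += 1
    return counts

CRITICAL_05 = 11.07            # df = 5, alpha = 0.05
FAIR = [1, 1, 1, 1, 1, 1]
LOADED = [1, 1, 1, 1, 1, 3]    # assumption: face 6 is three times as likely

rng = random.Random(0)         # fixed seed so reruns give the same rolls
results = {}
for n in (60, 600, 6000):
    for name, weights in (("fair", FAIR), ("loaded", LOADED)):
        counts = roll_counts(n, weights, rng)
        # Expected counts always come from the null hypothesis: a fair die.
        stat = chi2_gof(counts, [n / 6] * 6)
        results[(name, n)] = stat
        print(f"{name:6s} n={n:5d} chi2={stat:8.2f} reject={stat > CRITICAL_05}")
```

For the loaded die the statistic grows roughly in proportion to n, while for the fair die it keeps fluctuating around its degrees of freedom.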
▶ ① Goodness-of-Fit — Is the Die Fair?
- Step 1: Click only the top-left cell to create imbalance → χ² spikes, independence is rejected.
- Step 2: ↺ Reset and click cells evenly → χ² stays small. With no imbalance, independence is not rejected.
- Step 3: Change α to see "how much imbalance is needed to reject independence."
▶ ② Test of Independence — Are Two Variables Independent?
// Formula used here — Goodness of fit
χ² = Σ (O−E)²/E, summed over the k categories
Worked example (120 rolls of a die)
• A fair die should give E = 20 per face
• Actual results: [25, 17, 15, 23, 22, 18]
• Face 1: (25−20)²/20 = 1.25
• Face 2: (17−20)²/20 = 0.45 … sum all → χ² = 3.80
• df = 6−1 = 5, critical value at α = 0.05 is 11.07. Since 3.80 < 11.07 → fail to reject. Consistent with a fair die
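The arithmetic in this worked example can be reproduced in a few lines of Python. This is a minimal sketch using only the numbers quoted above; the variable names are ours.

```python
observed = [25, 17, 15, 23, 22, 18]          # counts for faces 1..6
n = sum(observed)                            # 120 rolls
expected = [n / len(observed)] * len(observed)   # 20 per face for a fair die

# One (O - E)^2 / E term per face, then the total.
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2 = sum(terms)

df = len(observed) - 1
CRITICAL_05 = 11.07                          # df = 5, alpha = 0.05
reject = chi2 > CRITICAL_05

print([round(t, 2) for t in terms])
print(f"chi2 = {chi2:.2f}, df = {df}, reject = {reject}")
```

The six terms are 1.25, 0.45, 1.25, 0.45, 0.20, and 0.20, which sum to the 3.80 quoted above.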
Each part
• O: observed frequency (what you actually counted)
• E: expected frequency (what the theory predicts)
• (O−E)²/E: raw (O−E)² would let categories with large expected counts dominate, because their gaps are naturally larger in absolute terms. Dividing by E gives "relative departure from expectation"
• Sum across all categories → a single number measuring overall fit
// Formula used here — Independence test
χ² = Σ (O−E)²/E, summed over all r×c cells, where E = row total × column total ÷ grand total
Worked example (2×2 table: gender × smoking)
• 200 men (60 smokers), 300 women (60 smokers), 500 total
• If gender and smoking are independent, expected male smokers?
• E = 200 × 120/500 = 48 ← row total × column total ÷ grand total
• "Under independence we'd expect 48, but observed 60" → the larger the gap, the stronger the evidence against independence
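The example stops at one expected count. A sketch that carries it through to the statistic follows. The non-smoker counts (140 and 240) are derived from the totals given above. The critical value 3.841 for df = 1 at α = 0.05 is a standard table value that the page does not quote. No continuity correction is applied here.

```python
# Rows: men, women. Columns: smokers, non-smokers.
observed = [[60, 140],
            [60, 240]]

row_totals = [sum(row) for row in observed]          # [200, 300]
col_totals = [sum(col) for col in zip(*observed)]    # [120, 380]
grand = sum(row_totals)                              # 500

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL_05_DF1 = 3.841
print(expected)
print(f"chi2 = {chi2:.2f}, df = {df}, reject = {chi2 > CRITICAL_05_DF1}")
```

Every cell is off by 12 in absolute terms, but the cells with smaller expected counts contribute more to the total.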
Why df = (r−1)(c−1)
• In a 2×2 table: once row and column totals are fixed, choosing one cell determines the rest → df = 1
• A 3×4 table: df = 2 × 3 = 6
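To see the "one cell determines the rest" argument concretely, here is a small sketch that rebuilds a 2×2 table from its margins and a single cell. The function names are hypothetical; the margins are those of the gender × smoking example.

```python
def fill_2x2(row_totals, col_totals, top_left):
    """Rebuild a 2x2 table from fixed margins and one freely chosen cell."""
    a = top_left
    b = row_totals[0] - a      # rest of row 1
    c = col_totals[0] - a      # rest of column 1
    d = row_totals[1] - c      # the last cell is forced as well
    return [[a, b], [c, d]]

def df_independence(rows, cols):
    """Degrees of freedom for an r x c independence test."""
    return (rows - 1) * (cols - 1)

table = fill_2x2(row_totals=[200, 300], col_totals=[120, 380], top_left=60)
print(table)                   # [[60, 140], [60, 240]]
print(df_independence(2, 2), df_independence(3, 4))
```

Only `top_left` was free to vary, which is what df = 1 means for a 2×2 table.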
Caution: expected counts below 5
• The χ² approximation becomes unreliable. Rule of thumb: if more than 20% of cells have E < 5, use an exact test (for example, Fisher's exact test for a 2×2 table) or merge categories
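The rule of thumb is easy to automate. The sketch below uses a made-up table (`small_table` is hypothetical data, not from the page) and flags it when more than 20% of the expected counts fall below 5.

```python
def expected_counts(observed):
    """Expected cell counts under independence."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

def share_below(expected, threshold=5):
    """Fraction of cells whose expected count is below the threshold."""
    cells = [e for row in expected for e in row]
    return sum(e < threshold for e in cells) / len(cells)

small_table = [[3, 7],
               [2, 28]]        # hypothetical counts, 40 observations

expected = expected_counts(small_table)
share = share_below(expected)
approximation_doubtful = share > 0.20

print(expected)
print(f"{share:.0%} of cells have E < 5 -> doubtful: {approximation_doubtful}")
```

Note that the check looks at expected counts, not observed ones: a cell can hold an observed 7 and still be fine or not depending on its margins.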
// Goodness of fit vs. independence test — what's the difference?
Both use χ² = Σ(O−E)²/E, but the question is different:
- Goodness of fit: "Is this die fair?" "Does this data follow a normal distribution?" → compares one variable to a theoretical distribution
- Independence test: "Is gender associated with smoking?" → examines the relationship between two variables
Expected frequencies are computed differently:
- Goodness of fit: directly from theory (e.g., n/6 for a fair die)
- Independence: row total × column total ÷ grand total (derived from the independence assumption, under which P(row and column) = P(row) × P(column))
// Why always a right-tail test?
χ² values are sums of squared deviations, so they're always ≥ 0. Larger values mean worse fit.
- Large χ² → data doesn't match the theory → reject
- Small χ² → good fit → fail to reject
- The rejection region is always in the right tail. There's no left-tail rejection
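The right-tail area itself can be computed without a statistics library. The sketch below is one possible implementation, not the page's own: it uses the standard recurrence for the regularized upper incomplete gamma function and handles integer df only. The name `chi2_sf` is ours.

```python
import math

def chi2_sf(stat, df):
    """Right-tail area P(X >= stat) for a chi-squared variable, integer df >= 1."""
    if stat <= 0:
        return 1.0
    x = stat / 2
    # Closed-form starting points: df = 2 gives exp(-x), df = 1 gives erfc(sqrt(x)).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))
    # Recurrence Q(a + 1, x) = Q(a, x) + x^a * exp(-x) / Gamma(a + 1),
    # applied until a reaches df / 2.
    while a < df / 2:
        q += x ** a * math.exp(-x) / math.gamma(a + 1)
        a += 1
    return q

p_fair_die = chi2_sf(3.80, 5)      # the 120-roll example
p_at_critical = chi2_sf(11.07, 5)  # should sit close to 0.05

print(f"p = {p_fair_die:.3f}, p at critical value = {p_at_critical:.3f}")
```

Comparing the p-value with α and comparing χ² with the critical value are the same decision: the critical value is simply the point where this right-tail area equals α.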
// Common misconceptions
"A significant χ² proves causation" → No. All it can say is "there's an association." "Smoking is associated with lung cancer" ≠ "smoking causes lung cancer." Confounding variables can't be ruled out by a χ² test alone.
"df is always the number of cells minus 1" → No. Goodness of fit uses k−1, but the independence test uses (r−1)(c−1). For a 2×3 table, df = 2, not 5.
"A small χ² is always good news" → Not necessarily. A small χ² means the data fits the theory well, but that's "good" or "bad" depending on what you're testing. The test measures agreement, not quality.
// Shapes you'll meet again
The same parts keep showing up whenever χ² tests are involved.
- The shape of df: independence tests always carry (r−1)(c−1). For a 2×3 table that lands at (2−1)(3−1) = 2
- The shape of expected frequencies: "row total × column total ÷ grand total" is the expected count under independence, and it appears in every independence test
- How χ² is built: each category's (O−E)²/E gets added up. "Square the gap, divide by what was expected, sum" is the same shape across goodness-of-fit and independence
- The "E < 5" branch: when expected counts are small, the path forks toward an exact test or toward merging categories; this check comes up whenever χ² is in play
Look up χ² critical values by df in the Interactive Distribution Tables