
Hypothesis Testing — Reject Regions & p-values

Hypothesis testing has the structure of a courtroom. H₀ is presumed innocent; conviction begins when the test statistic z lands in the rejection region, and z's sign shows which side it fell on.

I.04 / HYPOTHESIS TESTING

Hypothesis Testing — Reject or Fail to Reject

If a CI expresses uncertainty as a width, hypothesis testing turns it into a yes/no decision: could this data plausibly have arisen in the null world? If it's too unlikely, reject. Same distribution, same σ/√n; only the question changes.

Think of testing as a trial.
You start by assuming H₀ ("the drug has no effect" = "innocent"). Then if your computed test statistic z lands in the pre-chosen rejection region, you convict — that is, reject H₀.
Two panels below: ① geometry of z and rejection regions (two-sided, right, left), and ② false alarms (α) vs. misses (β).

Experiment Guide — try these in order
  1. Step 1: Panel ①: z=1.96, α=0.05, two-sided → right on the boundary. p≈0.05. The watershed.
    α (significance level) = the threshold for "too extreme to be coincidence." It sets the width of the rejection zone.
  2. Step 2: Push z to 2.5 → deep in the rejection zone, p-value shrinks. "Strong evidence."
  3. Step 3: Switch to "Right" test → same z=1.96 but rejection area is one-sided; p-value halves.
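The three steps above can be checked numerically. A minimal sketch using Python's standard-library NormalDist (the helper name `p_value` is ours, not the panel's):

```python
from statistics import NormalDist

def p_value(z, tail="two-sided"):
    """p-value for a z statistic under the standard normal null.

    tail: "two-sided", "right", or "left".
    """
    nd = NormalDist()  # standard normal N(0, 1)
    if tail == "two-sided":
        return 2 * (1 - nd.cdf(abs(z)))
    if tail == "right":
        return 1 - nd.cdf(z)
    return nd.cdf(z)  # left-tailed

# Step 1: z = 1.96, two-sided -> right on the alpha = 0.05 boundary
print(p_value(1.96))            # ~0.05
# Step 2: z = 2.5 -> deep in the rejection zone, p shrinks
print(p_value(2.5))             # ~0.012
# Step 3: same z = 1.96, right-tailed -> p halves
print(p_value(1.96, "right"))   # ~0.025
```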

▶ ① Basics: z-statistic & rejection region

(Panel ① readouts: test statistic z · critical value · p-value · decision.)
Experiment Guide — α, β & Power
  1. Step 1: Set effect size δ to 0 → H₁ overlaps H₀ completely and power bottoms out at its floor, α: with no real difference, the only rejections left are false alarms.
  2. Step 2: Slide δ from 2 to 3 → the purple (H₁) curve separates, power climbs.
  3. Step 3: Lower α to 0.01 → fewer false alarms, but more misses (β rises). Feel the trade-off.
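The same trade-off can be computed directly. A sketch assuming a one-sided (right) z-test with the effect size δ measured in standard-error units, matching the panel's shifted-curve geometry; the function name `power` is ours:

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal

def power(delta, alpha=0.05):
    """Power of a one-sided (right) z-test.

    delta: true mean shift in standard-error units.
    Reject when z > z_crit; under H1 the statistic is N(delta, 1).
    """
    z_crit = nd.inv_cdf(1 - alpha)
    return 1 - nd.cdf(z_crit - delta)

# Step 2: sliding delta from 2 to 3 separates the curves, power climbs
print(power(2.0))              # ~0.64
print(power(3.0))              # ~0.91
# Step 3: lowering alpha means fewer false alarms but more misses
print(1 - power(2.5, 0.05))    # beta at alpha = 0.05
print(1 - power(2.5, 0.01))    # beta rises at the stricter alpha = 0.01
```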

▶ ② Two kinds of errors: α, β, power

This panel is a hands-on playground for the α (Type I) / β (Type II) / power trade-off.
Come back whenever you want to feel the trade-off. For the deeper conceptual write-up (the 2×2 matrix, why α and β live in different worlds), see the "Type I & II in a 2×2 table" column.
Tip: drag horizontally on the chart to slide the critical boundary (α).

(Panel ② readouts: α / Type I error · critical value · β / Type II error · power 1−β.)
Interactive Distribution Tables — look up critical z/t values with live graph sync

// Formula used here

Left formula, z = (X̄ − μ₀) / (σ/√n), in plain English
• "If this drug truly has no effect (H₀: μ = μ₀), how many standard errors is our sample mean from μ₀?"

Each part
• X̄ − μ₀: the gap between observed and "no effect" — bigger means more suspicious
• ÷ σ/√n: converts to "how many SEs?" — 1 SE is normal, 3 SEs is extremely rare

Right formula: the two-sided p-value, p = 2·P(Z ≥ |z|)
• "Assuming H₀ is true, the probability of seeing a |z| this extreme or more" — that's what we want to know
• p-value ≤ α (significance level) → reject. "A result this extreme is too unlikely to be chance"
• p-value > α → fail to reject. "This could easily happen by chance"
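The decision rule is a single comparison. A tiny sketch (the helper name `decide` is ours):

```python
def decide(p, alpha):
    """Compare the p-value to the pre-chosen significance level."""
    return "reject H0" if p <= alpha else "fail to reject H0"

print(decide(0.03, 0.05))  # a result this extreme is too unlikely to be chance
print(decide(0.03, 0.01))  # at the stricter bar, this could still be chance
```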

⚠️ p-value ≠ "probability that H₀ is true"
• The p-value is computed assuming H₀ is true. It cannot tell you whether H₀ is actually true or false
• This is a common point of confusion

// The 5-step testing recipe — use this order every time

  1. State hypotheses: H₀: "no difference" vs. H₁: "there is a difference"
  2. Choose α before collecting data (e.g., 0.05)
  3. Compute the test statistic: z, t, χ², etc. — a single number summarizing "how far the data is from H₀"
  4. Find the p-value: look up the probability from the statistic
  5. Decide: p ≤ α → reject; p > α → fail to reject
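The five steps collapse into a few lines for the z case. A sketch assuming a two-sided test with known σ; the numbers in the example are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alpha=0.05):
    """The 5-step recipe as code, for a two-sided z-test with known sigma."""
    # Steps 1-2 happen before the data: H0: mu = mu0 vs H1: mu != mu0, alpha fixed.
    z = (xbar - mu0) / (sigma / sqrt(n))           # Step 3: test statistic
    p = 2 * (1 - NormalDist().cdf(abs(z)))         # Step 4: two-sided p-value
    decision = "reject H0" if p <= alpha else "fail to reject H0"  # Step 5
    return z, p, decision

# Hypothetical data: n = 25 measurements, xbar = 103, testing mu0 = 100, sigma = 10
z, p, decision = z_test(xbar=103, mu0=100, sigma=10, n=25)
print(f"z = {z:.2f}, p = {p:.4f} -> {decision}")
```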

// The courtroom analogy — going deeper

  • H₀ = "defendant is innocent"; H₁ = "defendant is guilty"
  • Rejecting H₀ = guilty verdict: "it's implausible that an innocent person would produce this evidence"
  • Failing to reject = insufficient evidence: not "proven innocent," just "not enough to convict"
  • That's why we say "fail to reject H₀" rather than "accept H₀"

// Connection to confidence intervals

μ₀ inside the 95% CI ⟺ fail to reject at α = 0.05. Same math, different lens: the CI shows "width," the test gives "yes/no." The conclusions always agree.
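The duality is easy to verify numerically. A sketch with hypothetical numbers (same z machinery, known σ, two-sided at matching α):

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

def ci_and_test(xbar, mu0, sigma, n, alpha=0.05):
    """Return (mu0 inside the (1-alpha) CI?, test fails to reject?)."""
    se = sigma / sqrt(n)
    z_crit = nd.inv_cdf(1 - alpha / 2)
    lo, hi = xbar - z_crit * se, xbar + z_crit * se   # the (1 - alpha) CI
    p = 2 * (1 - nd.cdf(abs((xbar - mu0) / se)))      # two-sided p-value
    return lo <= mu0 <= hi, p > alpha                 # always the same boolean

print(ci_and_test(103, 100, 10, 25))  # (True, True): inside CI, fail to reject
print(ci_and_test(105, 100, 10, 25))  # (False, False): outside CI, reject
```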

// The α-β trade-off

For a full breakdown of α, β, and power, see the "Type I & II in a 2×2 table" column.

// Common misconceptions

❌ "Small p-value = large effect"

The p-value measures how surprising the data is, not how big the effect is. With n = 1,000,000 a tiny difference can yield p < 0.001. Always check effect size separately.
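This is easy to demonstrate: the same tiny effect (0.5% of σ, hypothetical numbers) is "highly significant" at n = 1,000,000 and invisible at n = 100:

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

def two_sided_p(xbar, mu0, sigma, n):
    """Two-sided p-value of the z-test with known sigma."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    return 2 * (1 - nd.cdf(abs(z)))

# Tiny effect, huge sample: z = 5, so p is far below 0.001
print(two_sided_p(100.05, 100, sigma=10, n=1_000_000))
# Same effect, modest sample: z = 0.05, nowhere near significant
print(two_sided_p(100.05, 100, sigma=10, n=100))
```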

Misconceptions about errors and power ("not significant = no difference," "smaller α is always better," etc.) are collected in the FAQ of the "Type I & II in a 2×2 table" column.

// Shapes you'll meet again

Across hypothesis tests, the same cast keeps returning: comparing p with α, choosing one-sided vs. two-sided, and the 2×2 of error types.

  • The p-value-vs-α comparison: with p = 0.03 and α = 0.01, the relation 0.03 > 0.01 lands on "fail to reject." "Is the observed p below the chosen α or not?" is the recurring decision shape
  • One-sided vs. two-sided sorting: "does the drug raise the mean?" calls for one-sided; "does it change in either direction?" calls for two-sided. The question's orientation forks into this shape
  • The 2×2 of errors: rejecting a true H₀ is Type I, failing to reject a false H₀ is Type II. Wherever a test lives, this 2×2 table tags along
  • Where power sits: 1 − β = "the probability of detecting a real difference when one exists." It appears in this same shape, as the flip side of Type II error
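The whole 2×2 can be felt in simulation: run many z-tests in a world where H₀ is true, then in a world with a real shift, and count rejections. A sketch assuming a two-sided test at α = 0.05 with the true shift given in standard-error units:

```python
import random
from statistics import NormalDist

random.seed(0)
Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided boundary at alpha = 0.05

def reject_rate(true_delta, trials=20_000):
    """Fraction of simulated tests that reject when the true shift is true_delta SEs."""
    hits = 0
    for _ in range(trials):
        z = random.gauss(true_delta, 1)  # the test statistic in the true world
        if abs(z) > Z_CRIT:
            hits += 1
    return hits / trials

print(reject_rate(0.0))  # rejecting a true H0: Type I rate, close to alpha = 0.05
print(reject_rate(3.0))  # detecting a real shift: empirical power, 1 - beta
```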


UP NEXT — from means to proportions: I5 Proportion Test