Day 13 — Stratified Sampling: The Smart Way to Sample

Divide and conquer your sampling strategy for maximum precision.

Stratified sampling guarantees coverage of important subgroups and, when the strata differ a lot from one another, can cut variance substantially compared to simple random sampling (about 60% in the worked example below).

The Random Sampling Trap

Imagine you're conducting a health survey in a company of 1,000 employees:

900 office workers (90%)
100 executives (10%)

You randomly sample 100 people. Here's what can go wrong:

Unlucky Sample #1:

Office workers: 95 people

Executives: 5 people

Problem: Only 5 executives - can't say much about this group!

Unlucky Sample #2:

Office workers: 87 people

Executives: 13 people

Different from reality (90/10 split)!

Unlucky Sample #3:

Office workers: 100 people

Executives: 0 people

Complete miss on executive health!

The problem: Simple Random Sampling (SRS) is... well, random!

The solution: Stratified Sampling - sample smartly within groups!
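How likely are the unlucky samples above? The number of executives in a simple random sample drawn without replacement follows a hypergeometric distribution, so the probabilities can be computed exactly. Here is a minimal sketch using only the Python standard library; the helper name `hypergeom_pmf` is an illustrative choice, not something from the original post.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(exactly k of the K 'special' units land in an SRS of n drawn from N)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

N, K, n = 1000, 100, 100  # employees, executives, sample size

# Probability of seeing k executives for every possible k.
pmf = [hypergeom_pmf(k, N, K, n) for k in range(n + 1)]

p_none = pmf[0]                      # like Unlucky Sample #3: no executives
p_five_or_fewer = sum(pmf[:6])       # like Unlucky Sample #1, or worse
p_thirteen_or_more = sum(pmf[13:])   # like Unlucky Sample #2, or more lopsided

print(f"P(0 executives)   = {p_none:.2e}")
print(f"P(<= 5 executives) = {p_five_or_fewer:.3f}")
print(f"P(>= 13 executives) = {p_thirteen_or_more:.3f}")
```

At this sample size a complete miss is very rare, but lopsided samples like #1 and #2 are not negligible.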

What is Stratified Sampling?

Stratified Sampling means:

Divide population into non-overlapping groups (strata)
Sample from each stratum separately
Combine results with proper weighting
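The three steps above can be sketched in a few lines. This is a minimal illustration, assuming proportional allocation and a synthetic population drawn from normal distributions; the function name `stratified_sample_mean` and the data are hypothetical, not from the original post.

```python
import random
from statistics import mean

def stratified_sample_mean(strata, n, seed=0):
    """Estimate the population mean from a proportionally allocated stratified sample.

    strata: dict mapping stratum name -> list of all values in that stratum.
    Rounding means the stratum sample sizes may not add up to exactly n.
    """
    rng = random.Random(seed)
    N = sum(len(values) for values in strata.values())
    estimate = 0.0
    for values in strata.values():
        # Step 1: the strata arrive as separate, non-overlapping lists.
        weight = len(values) / N                          # W_h = N_h / N
        n_h = min(len(values), max(1, round(n * weight)))
        # Step 2: simple random sample inside the stratum.
        sample = rng.sample(values, n_h)
        # Step 3: combine stratum means using the population weights.
        estimate += weight * mean(sample)
    return estimate

# Synthetic population, for illustration only.
rng = random.Random(42)
population = {
    "office": [rng.gauss(50, 10) for _ in range(900)],
    "executive": [rng.gauss(90, 8) for _ in range(100)],
}
true_mean = mean(population["office"] + population["executive"])
est = stratified_sample_mean(population, n=100)
print(f"true mean = {true_mean:.2f}, stratified estimate = {est:.2f}")
```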

Visual Comparison

Simple Random Sampling (SRS):

Population: 90 office workers, 10 executives

Random sample of 10:

Picked: 10 office workers, 0 executives

Result: All office workers!

Stratified Sampling:

Population:

Stratum 1: 90 office workers

Stratum 2: 10 executives

Stratified sample of 10:

From Stratum 1: 9 people

From Stratum 2: 1 person

Result: Proper representation!

Why Stratify? Three Big Reasons

1. Guaranteed Coverage

Problem with SRS: Might miss rare but important groups

Example:

City population:

- Urban: 70%

- Suburban: 20%

- Rural: 10%

SRS of 100 might give:

Urban: 65, Suburban: 25, Rural: 10

OR

Urban: 75, Suburban: 22, Rural: 3  ← Rural underrepresented!

Stratified solution:

Explicitly sample from each:

Urban: 70 people (guaranteed)

Suburban: 20 people (guaranteed)

Rural: 10 people (guaranteed)

Coverage ensured!

2. Variance Reduction

The Math Intuition:

Variance comes from differences:

Between-stratum variance: How different are the groups?
Within-stratum variance: How different are people within each group?

The core idea: If strata are homogeneous (similar within) and differ from one another, stratified sampling has lower variance than SRS!

Visual:

POPULATION (high variance):

Health scores: 45, 48, 50, 52, 85, 87, 88, 90, 91, 92

Office (lower): 45, 48, 50, 52

Executives (higher): 85, 87, 88, 90, 91, 92

Within-stratum variance:

Office: σ² ≈ 6.7 (people similar)

Executive: σ² ≈ 5.8 (people similar)

But the between-stratum difference is HUGE (means of about 49 vs 89)!

SRS estimates are affected by this big gap, because the sample mean swings with how many executives happen to be drawn.

Stratified sampling fixes each group's share of the sample, so the gap no longer adds to the variance of the estimate!
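To check the within/between split on the ten health scores above, here is a small sketch using the standard library's population variance (`statistics.pvariance`). The variable names are illustrative.

```python
from statistics import mean, pvariance

office = [45, 48, 50, 52]
executive = [85, 87, 88, 90, 91, 92]
everyone = office + executive

N = len(everyone)
mu = mean(everyone)

within = 0.0   # weighted average of the variances inside each group
between = 0.0  # weighted spread of the group means around the overall mean
for group in (office, executive):
    w = len(group) / N
    within += w * pvariance(group)
    between += w * (mean(group) - mu) ** 2

total = pvariance(everyone)

print(f"office variance    = {pvariance(office):.2f}")
print(f"executive variance = {pvariance(executive):.2f}")
print(f"within  = {within:.2f}")
print(f"between = {between:.2f}")
print(f"total   = {total:.2f} (within + between = {within + between:.2f})")
```

Almost all of the total variance sits in the between-group term, which is exactly the part stratification removes.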

3. Domain Insights

SRS result:

"Average health score: 75"

Okay... but tells us nothing about groups!

Stratified result:

"Average health scores:

Office workers: 52 (95% CI: 50-54)

Executives: 88 (95% CI: 86-90)"

Rich insights about each segment!

The Math: How Much Better Is It?

Variance Formula

Simple Random Sampling variance:

Var(ȳ_SRS) = σ²/n × (N-n)/N

Where:

- σ² = overall population variance

- n = sample size

- N = population size

- (N-n)/N = finite population correction

Stratified Sampling variance:

Var(ȳ_strat) = Σ(Wₕ² × σₕ²/nₕ × (Nₕ-nₕ)/Nₕ)

Where:

- Wₕ = stratum h weight (Nₕ/N)

- σₕ² = variance within stratum h

- nₕ = sample size in stratum h

- Nₕ = population size in stratum h

The Variance Reduction (under proportional allocation):

Var(ȳ_SRS) - Var(ȳ_strat) ≈ (1/n) × (N-n)/N × Σ Wₕ(μₕ - μ)²

The sum Σ Wₕ(μₕ - μ)² is the between-stratum variance!

Translation: The more different your strata are, the bigger the variance reduction!
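Where does the between-stratum term come from? A short derivation, assuming proportional allocation (nₕ = n·Wₕ, so every stratum has the same sampling fraction n/N) and treating N-1 as N:

```latex
\begin{aligned}
\sigma^2 &= \sum_h W_h \sigma_h^2 + \sum_h W_h (\mu_h - \mu)^2
  && \text{(within + between)} \\
\operatorname{Var}(\bar{y}_{\text{SRS}})
  &= \frac{\sigma^2}{n} \cdot \frac{N - n}{N} \\
\operatorname{Var}(\bar{y}_{\text{strat}})
  &= \sum_h W_h^2 \cdot \frac{\sigma_h^2}{n W_h} \cdot \frac{N - n}{N}
   = \frac{1}{n} \cdot \frac{N - n}{N} \sum_h W_h \sigma_h^2 \\
\operatorname{Var}(\bar{y}_{\text{SRS}}) - \operatorname{Var}(\bar{y}_{\text{strat}})
  &= \frac{1}{n} \cdot \frac{N - n}{N} \sum_h W_h (\mu_h - \mu)^2
\end{aligned}
```

The within-stratum part cancels, and what is left is the between-stratum variance scaled by the same 1/n and finite population correction as every other term.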

Example Calculation

Population:

Stratum 1 (Office): N₁ = 900, μ₁ = 50, σ₁² = 100
Stratum 2 (Executive): N₂ = 100, μ₂ = 90, σ₂² = 64
Total: N = 1000

Sample: n = 100

Proportional allocation:

n₁ = 90 (90% of sample)
n₂ = 10 (10% of sample)

SRS Variance:

First, calculate overall variance:

μ = 0.9(50) + 0.1(90) = 45 + 9 = 54

σ² = 0.9(100 + (50-54)²) + 0.1(64 + (90-54)²)

= 0.9(100 + 16) + 0.1(64 + 1296)

= 0.9(116) + 0.1(1360)

= 104.4 + 136

= 240.4

Var(ȳ_SRS) = 240.4/100 × (1000-100)/1000

= 2.404 × 0.9

= 2.16

Standard error: √2.16 = 1.47

Stratified Variance:

W₁ = 900/1000 = 0.9

W₂ = 100/1000 = 0.1

Var(ȳ_strat) = 0.9² × (100/90) × (900-90)/900

+ 0.1² × (64/10) × (100-10)/100

= 0.81 × 1.11 × 0.9

+ 0.01 × 6.4 × 0.9

= 0.81 + 0.058

= 0.87

Standard error: √0.87 = 0.93

The Improvement:

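Variance reduction: 2.16 - 0.87 = 1.29 (60% reduction!)

Standard error:

SRS: 1.47

Stratified: 0.93

The stratified standard error is about 37% smaller (put the other way, the SRS standard error is about 58% larger)!

Translation: To get the same precision with SRS, you'd need roughly 2.5× as many samples (ignoring the finite population correction)!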
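The arithmetic above is easy to reproduce. The sketch below plugs the example's stratum sizes, means, and variances into the two variance formulas; the tuple layout is just one convenient way to hold the numbers.

```python
from math import sqrt

N, n = 1000, 100
# Each stratum: (N_h, mu_h, sigma2_h, n_h), taken from the example above.
strata = [
    (900, 50, 100, 90),   # office workers
    (100, 90, 64, 10),    # executives
]

# Overall mean and variance rebuilt from the stratum summaries.
mu = sum(N_h / N * mu_h for N_h, mu_h, _, _ in strata)
sigma2 = sum(N_h / N * (s2_h + (mu_h - mu) ** 2) for N_h, mu_h, s2_h, _ in strata)

# SRS variance with the finite population correction.
var_srs = sigma2 / n * (N - n) / N

# Stratified variance: sum over strata of W_h^2 * sigma_h^2 / n_h * fpc_h.
var_strat = sum(
    (N_h / N) ** 2 * s2_h / n_h * (N_h - n_h) / N_h
    for N_h, _, s2_h, n_h in strata
)

print(f"mu = {mu:.1f}, sigma^2 = {sigma2:.1f}")
print(f"SRS:        var = {var_srs:.2f}, SE = {sqrt(var_srs):.2f}")
print(f"Stratified: var = {var_strat:.2f}, SE = {sqrt(var_strat):.2f}")
print(f"Variance ratio SRS / stratified = {var_srs / var_strat:.2f}")
```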

Allocation Strategies: How Many Per Stratum?

Once you decide to stratify, how do you divide your sample across strata?

1. Proportional Allocation (Most Common)

Rule: Sample proportionally to stratum size

nₕ = n × (Nₕ/N)

Example:

Population: 900 office, 100 executive (1000 total)

Sample size: n = 100

Office sample: 100 × (900/1000) = 90

Executive sample: 100 × (100/1000) = 10

Pros:

Simple, intuitive
Self-weighting (no complex weights needed)
Represents population structure

Cons:

Small strata get small samples (might be imprecise)
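When n × (Nₕ/N) is not a whole number, the allocation has to be rounded without changing the total. One common way to do that is largest-remainder rounding, sketched below; the function name `proportional_allocation` and the three-stratum example are hypothetical.

```python
def proportional_allocation(stratum_sizes, n):
    """Split n across strata in proportion to size, using largest-remainder rounding."""
    N = sum(stratum_sizes.values())
    exact = {h: n * N_h / N for h, N_h in stratum_sizes.items()}
    alloc = {h: int(x) for h, x in exact.items()}   # round down first
    leftover = n - sum(alloc.values())
    # Give the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda h: exact[h] - alloc[h], reverse=True)
    for h in by_remainder[:leftover]:
        alloc[h] += 1
    return alloc

print(proportional_allocation({"office": 900, "executive": 100}, 100))
print(proportional_allocation({"urban": 700, "suburban": 200, "rural": 100}, 33))
```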

2. Equal Allocation

Rule: Same sample size for each stratum

nₕ = n / H

Where H = number of strata

Example:

Population: 900 office, 100 executive

Sample size: n = 100

Strata: H = 2

Office sample: 100/2 = 50

Executive sample: 100/2 = 50

Pros:

Good for comparing strata (equal precision)
Ensures small strata have enough data

Cons:

Oversamples small strata (need complex weights)
Less efficient for overall mean estimation
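The weighting point is worth seeing in numbers. With 50 office workers and 50 executives in the sample, a plain average treats executives as half the company. The sketch below assumes, purely for illustration, that the sample means equal the stratum means from the running example (50 and 90).

```python
N_h = {"office": 900, "executive": 100}
n_h = {"office": 50, "executive": 50}          # equal allocation, n = 100
# Hypothetical sample means, set to the stratum means of the running example.
ybar_h = {"office": 50.0, "executive": 90.0}

N = sum(N_h.values())
n = sum(n_h.values())

# Unweighted mean of the pooled sample: executives count for half of it.
naive = sum(n_h[h] * ybar_h[h] for h in n_h) / n

# Weighted by population share W_h = N_h / N.
weighted = sum(N_h[h] / N * ybar_h[h] for h in N_h)

# Equivalent per-respondent design weights: each person stands for N_h / n_h people.
design_weight = {h: N_h[h] / n_h[h] for h in N_h}

print(f"naive mean    = {naive:.1f}")
print(f"weighted mean = {weighted:.1f}")
print(f"design weights = {design_weight}")
```

The naive mean lands well above the true company-wide mean of 54, while the weighted estimate recovers it.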

3. Neyman Allocation (Optimal)

Rule: Allocate proportional to stratum size AND variance

nₕ = n × (Nₕ × σₕ) / Σ(Nₖ × σₖ)

Intuition: Sample more from:

Large strata (more people → more important)
High-variance strata (more diverse → need more samples)

Example:

Stratum 1: N₁ = 900, σ₁ = 10

Stratum 2: N₂ = 100, σ₂ = 8

Stratum 1 weight: 900 × 10 = 9,000

Stratum 2 weight: 100 × 8 = 800

Total weight: 9,800

Office sample: 100 × (9000/9800) = 91.8 ≈ 92

Executive sample: 100 × (800/9800) = 8.2 ≈ 8

Pros:

Mathematically optimal for the overall mean (minimizes its variance for a fixed total n!)
Accounts for both size and heterogeneity

Cons:

Requires knowing σₕ in advance (often unknown!)
Might still undersample important small strata
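The Neyman rule is a one-liner once Nₕ and σₕ are in hand. Here is a minimal sketch; `neyman_allocation` is an illustrative name, and in practice the σₕ values would have to come from a pilot study or an earlier survey.

```python
def neyman_allocation(strata, n):
    """strata: dict name -> (N_h, sigma_h). Returns unrounded stratum sample sizes."""
    products = {h: N_h * sigma_h for h, (N_h, sigma_h) in strata.items()}
    total = sum(products.values())
    return {h: n * p / total for h, p in products.items()}

alloc = neyman_allocation({"office": (900, 10), "executive": (100, 8)}, 100)
print({h: round(x, 1) for h, x in alloc.items()})
```

A production version would also round the results and cap each nₕ at Nₕ, which this sketch leaves out.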

4. Optimal Allocation with Cost

Rule: Account for different sampling costs per stratum

nₕ = n × (Nₕ × σₕ / √cₕ) / Σ(Nₖ × σₖ / √cₖ)

Where cₕ = cost to sample one unit from stratum h

Example:

Executives cost 5× more to survey (busy, need incentives)

c₁ = $10 (office worker)

c₂ = $50 (executive)

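This would reduce the executive sample further: with σ₁ = 10 and σ₂ = 8 from the Neyman example, the split becomes about 96 office workers and 4 executives!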

Use when: Budget constrained, different costs per stratum
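Here is the cost-adjusted rule applied to the same two strata, as a minimal sketch. It reuses σ₁ = 10 and σ₂ = 8 from the Neyman example together with the $10 and $50 unit costs; the function name `cost_optimal_allocation` is hypothetical.

```python
from math import sqrt

def cost_optimal_allocation(strata, n):
    """strata: dict name -> (N_h, sigma_h, c_h).

    Shares are proportional to N_h * sigma_h / sqrt(c_h); results are unrounded.
    """
    scores = {h: N_h * s_h / sqrt(c_h) for h, (N_h, s_h, c_h) in strata.items()}
    total = sum(scores.values())
    return {h: n * s / total for h, s in scores.items()}

alloc = cost_optimal_allocation(
    {"office": (900, 10, 10), "executive": (100, 8, 50)}, 100
)
print({h: round(x, 1) for h, x in alloc.items()})
```

When every stratum costs the same, the √cₕ terms cancel and the rule falls back to plain Neyman allocation.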

Visual: Variance vs Allocation

Let's see how variance changes with different allocations:

[Chart: variance (SE²) on the vertical axis for different allocation strategies. From highest to lowest: SRS, Equal, Proportional, Neyman (optimal). Lower is better!]

Takeaway: For estimating the overall mean at a fixed sample size, Neyman gives the lowest variance (if you know the stratum variances)!
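To put numbers on the comparison, the sketch below evaluates the stratified variance formula for each allocation in the running example (N₁ = 900, σ₁² = 100; N₂ = 100, σ₂² = 64; n = 100). The SRS line reuses the overall variance of 240.4 from the worked example.

```python
def var_stratified(strata, allocation):
    """strata: dict name -> (N_h, sigma2_h); allocation: dict name -> n_h."""
    N = sum(N_h for N_h, _ in strata.values())
    total = 0.0
    for h, (N_h, sigma2_h) in strata.items():
        n_h = allocation[h]
        total += (N_h / N) ** 2 * sigma2_h / n_h * (N_h - n_h) / N_h
    return total

strata = {"office": (900, 100), "executive": (100, 64)}
allocations = {
    "equal": {"office": 50, "executive": 50},
    "proportional": {"office": 90, "executive": 10},
    "neyman": {"office": 92, "executive": 8},
}

variances = {name: var_stratified(strata, a) for name, a in allocations.items()}
variances["srs"] = 240.4 / 100 * (1000 - 100) / 1000  # SRS with the same n

for name, v in sorted(variances.items(), key=lambda kv: kv[1]):
    print(f"{name:>12}: {v:.3f}")
```

In this example Neyman and proportional allocation come out nearly tied (about 0.86 vs 0.87), because the two strata have similar standard deviations (10 vs 8). Neyman pulls ahead when the within-stratum spreads differ more.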

Defining Strata: The Art and Science

Good strata are:

1. Mutually Exclusive

Each unit belongs to exactly one stratum

Bad: "Young", "Students"

(Young students counted twice!)

Good: "Student", "Non-Student"

2. Exhaustive

Every unit belongs to some stratum

Bad: "<30", "40-60", ">60"

(Missing the 30-39 age range!)

Good: "<30", "30-39", "40-60", ">60"

3. Homogeneous Within

Units within stratum are similar

Bad stratum: "People" (too diverse!)

Good stratum: "Female doctors aged 40-50"

4. Heterogeneous Between

Strata are different from each other

Bad: "Age 30-40", "Age 31-41"

(Too much overlap, not distinct!)

Good: "Age 18-30", "Age 31-50", "Age 51+"

5. Meaningful

Based on domain knowledge, not arbitrary

Bad: "First 500 rows", "Last 500 rows"

(Arbitrary split!)

Good: "Urban", "Suburban", "Rural"

(Meaningful demographic divisions)
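The first two criteria (mutually exclusive, exhaustive) are mechanical enough to test before any sampling happens. Here is a minimal sketch for integer age bins with inclusive endpoints; `check_strata` and the 0-120 age range are illustrative assumptions.

```python
def check_strata(bins, lo, hi):
    """bins: list of (low, high) integer ranges, inclusive on both ends.

    Returns (gaps, overlaps): ages in lo..hi covered by no bin, and by more than one.
    """
    gaps, overlaps = [], []
    for age in range(lo, hi + 1):
        hits = sum(low <= age <= high for low, high in bins)
        if hits == 0:
            gaps.append(age)      # not exhaustive
        elif hits > 1:
            overlaps.append(age)  # not mutually exclusive
    return gaps, overlaps

bad = [(0, 29), (40, 60), (61, 120)]             # ages 30-39 have no stratum
good = [(0, 29), (30, 39), (40, 60), (61, 120)]

print(check_strata(bad, 0, 120))
print(check_strata(good, 0, 120))
```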

Common Stratification Variables:

Demographics:

Age groups
Gender
Education level
Income brackets
Geographic region

Business:

Customer segments (high/medium/low value)
Product categories
Time periods (Q1, Q2, Q3, Q4)

Medical:

Disease severity (mild/moderate/severe)
Treatment type
Risk factors present/absent

Wrapping Up

Stratified sampling is the "divide and conquer" of sampling:

Key Concepts:

Strata = non-overlapping, exhaustive groups

Proportional allocation = sample proportionally (simple, self-weighting)

Neyman allocation = optimal (proportional to Nₕ × σₕ)

Variance reduction = can be large when strata differ (about 60% lower than SRS in our example)!

Coverage guarantee = ensures rare groups included

Domain insights = separate estimates per stratum

The Math Win:

Variance reduction ≈ (1/n) × (N-n)/N × Σ Wₕ(μₕ - μ)² (under proportional allocation)

Translation: The more different your strata,

the bigger the improvement!

Allocation Decision Tree:

Do you know σₕ for each stratum?
  Yes → Use Neyman (optimal!)
  No → Do you need equal precision per stratum?
    Yes → Use Equal allocation
    No → Use Proportional (simplest)

Real Impact:

In the worked example above, stratified sampling cut the variance by about 60% with the same sample size. That's like getting roughly 250 SRS samples' worth of precision for the price of 100 (ignoring the finite population correction)!

Where This Shows Up in Practice

Data Pipelines: Ensuring high-quality filtering and robust statistical metrics before feeding downstream ML models.
Production Anomaly Detection: Tracking system logs, performance latencies, or transaction volumes under heavy skew.
A/B Testing & Evaluation: Correctly partitioning user cohorts or comparing treatment outcomes without normal distribution assumptions.
