Sughosh P Dixit
2025-11-04 · 7 min read

Day 4 — Percentile Rank and Stratifications


Day 4 — Percentile Rank and Stratification (with Solved Examples) 📈🎯

Ranking and stratifying data for insights! 📊

💡 Note: This article uses technical terms and abbreviations. For definitions, check out the Key Terms & Glossary page.


🎯 Introduction

Percentile ranks provide a powerful way to normalize features onto a common scale, making it easy to combine multiple features and create meaningful stratifications for analysis and sampling.

TL;DR:

Percentile ranks turn any numeric feature into a simple score in [0,1] that says "what fraction of the data is at or below this value."

Because percentile ranks depend only on the ordering of the values, they are unchanged by positive rescaling and by any other strictly increasing transform.

If you combine several features' ranks using min (for AND-like behavior) or max (for OR-like behavior), you get a single score that you can cut into strata (e.g., quartiles/deciles) for sampling, prioritization, or analysis. 💡



🧮 Percentile Rank: A Simple [0,1] Scale

  • Given a feature X and a dataset of size n, the percentile rank of a value xᵢ is:

    📊 rankᵢ = Fₙ(xᵢ) = (1/n) × (# of values ≤ xᵢ)

  • This maps each observation to a number in (0, 1]: the smallest possible rank is 1/n, the largest value always gets rank 1, and 0 is never attained.

  • ✨ Properties you get for free:

    • 🔼 Monotonicity (isotonicity): If xᵢ ≤ xⱼ, then rankᵢ ≤ rankⱼ.

    • 🔁 Invariance to monotone transforms: If f is strictly increasing (e.g., a·x+b with a>0), ranks don't change.

💡 Why this matters: Different features may have different scales or units. Percentile ranks put everything onto the same comparable [0,1] scale, making multi-feature logic easy to reason about.
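As a minimal sketch of the definition above, the rank can be computed by sorting once and counting the values at or below each observation. The function name `percentile_ranks` and the sample heights are illustrative assumptions, not part of the original article.

```python
import math
from bisect import bisect_right

def percentile_ranks(values):
    """Return rank_i = (# of values <= x_i) / n for every x_i in values."""
    ordered = sorted(values)
    n = len(ordered)
    # bisect_right gives the number of sorted entries that are <= x
    return [bisect_right(ordered, x) / n for x in values]

# Hypothetical data with one tie (171.0 appears twice)
heights_cm = [150.0, 162.5, 171.0, 171.0, 188.0]
ranks = percentile_ranks(heights_cm)
print(ranks)  # [0.2, 0.4, 0.8, 0.8, 1.0]

# A strictly increasing transform (cm -> inches, then log) keeps the order,
# so the ranks should come out identical.
transformed = [math.log(h / 2.54) for h in heights_cm]
unchanged = percentile_ranks(transformed) == ranks
print(unchanged)  # True
```

Note that with this "count of values ≤ x" rule, tied values share the rank of the highest position in the tie.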



⚙️ Combining percentile ranks across features

If you have features A and B with ranks rA and rB:

  • 🔒 Conservative (AND-like) combination:

    rAND = min(rA, rB)

    → Combined score limited by the weaker (smaller) rank. Safe way to demand "both high."

  • 🌈 Liberal (OR-like) combination:

    rOR = max(rA, rB)

    → Combined score benefits from the stronger (larger) rank. Allows "either high."

These match Day 1's logic mapping:

🧩 AND ≈ min, OR ≈ max

They're simple, monotone, and explainable ✅
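The two combinations can be sketched in a couple of lines. The helper names `and_rank` and `or_rank` are assumptions chosen for illustration; the last two calls show that at the extremes 0 and 1 they reduce to ordinary Boolean AND and OR.

```python
def and_rank(*ranks):
    """Conservative score: limited by the smallest percentile rank."""
    return min(ranks)

def or_rank(*ranks):
    """Liberal score: driven by the largest percentile rank."""
    return max(ranks)

r_a, r_b = 0.90, 0.30          # hypothetical ranks for one observation
print(and_rank(r_a, r_b))      # 0.3  (held back by the weaker rank)
print(or_rank(r_a, r_b))       # 0.9  (lifted by the stronger rank)

# With inputs restricted to 0 and 1, min/max behave like AND/OR
print(and_rank(1, 0), or_rank(1, 0))  # 0 1
```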



🧭 Stratification from Combined Ranks

Once you have a single combined rank per observation (e.g., rAND), split the population into strata:

  • 🔢 Deciles: [0.0, 0.1, 0.2, …, 0.9, 1.0]

  • 🧮 Quartiles: [0.0, 0.25, 0.5, 0.75, 1.0]

  • ⚖️ Custom cuts: e.g., [0.0, 0.2, 0.5, 0.8, 1.0]

Use strata to:

  • 🎯 Draw balanced samples

  • 🚦 Prioritize reviews/interventions

  • 📊 Report performance metrics by difficulty bands

Because ranks are invariant under strictly increasing transforms, the strata do not move when a raw feature changes units or scale (for example, dollars to cents).
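One way to turn cut points into stratum labels is a binary search over the interior cuts. This is a sketch under the assumption that each stratum is a half-open interval [lower, upper), with the last stratum closed at 1.0; the name `assign_stratum` is hypothetical.

```python
from bisect import bisect_right

def assign_stratum(score, inner_cuts):
    """Return a 1-based stratum index for a score in [0, 1].

    inner_cuts lists only the interior cut points, in increasing order.
    A score equal to a cut point falls into the upper stratum.
    """
    return bisect_right(inner_cuts, score) + 1

quartile_cuts = [0.25, 0.5, 0.75]      # interior cuts of [0.0, 0.25, 0.5, 0.75, 1.0]
scores = [0.10, 0.25, 0.60, 1.00]      # hypothetical combined ranks
strata = [assign_stratum(s, quartile_cuts) for s in scores]
print(strata)  # [1, 2, 3, 4]
```

These cuts are applied on the rank scale, so the resulting strata need not contain equal numbers of observations.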



🧩 Solved example — From raw features to strata

| id | A  | B  |
|----|----|----|
| 1  | 10 | 5  |
| 2  | 7  | 3  |
| 3  | 12 | 9  |
| 4  | 15 | 4  |
| 5  | 8  | 8  |
| 6  | 20 | 6  |
| 7  | 9  | 2  |
| 8  | 18 | 10 |

Step 1️⃣: Compute percentile ranks per feature

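Use the empirical CDF: sort each feature, and the value at sorted position i (1-based) gets rank i/n, here with n = 8 and no ties. Each feature is sorted independently, so a row below does not correspond to a single id. Ranks are rounded to two decimals (for example, 1/8 = 0.125 is shown as 0.12).

| i | A  | rankA | B  | rankB |
|---|----|-------|----|-------|
| 1 | 7  | 0.12  | 2  | 0.12  |
| 2 | 8  | 0.25  | 3  | 0.25  |
| 3 | 9  | 0.38  | 4  | 0.38  |
| 4 | 10 | 0.50  | 5  | 0.50  |
| 5 | 12 | 0.62  | 6  | 0.62  |
| 6 | 15 | 0.75  | 8  | 0.75  |
| 7 | 18 | 0.88  | 9  | 0.88  |
| 8 | 20 | 1.00  | 10 | 1.00  |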

Step 2️⃣: Combine with min (AND-like) and max (OR-like)

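| id | A  | B  | rA   | rB   | rAND = min(rA,rB) | rOR = max(rA,rB) |
|----|----|----|------|------|-------------------|------------------|
| 1  | 10 | 5  | 0.50 | 0.50 | 0.50              | 0.50             |
| 2  | 7  | 3  | 0.12 | 0.25 | 0.12              | 0.25             |
| 3  | 12 | 9  | 0.62 | 0.88 | 0.62              | 0.88             |
| 4  | 15 | 4  | 0.75 | 0.38 | 0.38              | 0.75             |
| 5  | 8  | 8  | 0.25 | 0.75 | 0.25              | 0.75             |
| 6  | 20 | 6  | 1.00 | 0.62 | 0.62              | 1.00             |
| 7  | 9  | 2  | 0.38 | 0.12 | 0.12              | 0.38             |
| 8  | 18 | 10 | 0.88 | 1.00 | 0.88              | 1.00             |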

Step 3️⃣: Create strata from rAND

Cuts at 0.2, 0.5, 0.8 →

  • 🩵 Stratum 1: rAND < 0.2 → ids 2, 7

  • 💙 Stratum 2: 0.2 ≤ rAND < 0.5 → ids 4, 5

  • 💜 Stratum 3: 0.5 ≤ rAND < 0.8 → ids 1, 3, 6

  • 💛 Stratum 4: rAND ≥ 0.8 → id 8

📈 Using min is conservative: an observation only scores high if both A and B are high.

Using max is liberal: more points rise into higher strata. With the same cuts applied to rOR, Stratum 4 holds ids 3, 6, 8 and Stratum 1 is empty.
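The three steps of this example can be reproduced end to end with a short script. It is a sketch that uses only the eight rows from the table; the helper names (`ecdf_ranks`, `stratify`) are assumptions, and strata are taken as half-open intervals [lower, upper) as in Step 3.

```python
from bisect import bisect_right

# id -> (A, B), copied from the example table
rows = {1: (10, 5), 2: (7, 3), 3: (12, 9), 4: (15, 4),
        5: (8, 8), 6: (20, 6), 7: (9, 2), 8: (18, 10)}
ids = list(rows)

def ecdf_ranks(values):
    """Step 1: rank = (# of values <= x) / n."""
    ordered = sorted(values)
    return [bisect_right(ordered, v) / len(ordered) for v in values]

r_a = ecdf_ranks([rows[i][0] for i in ids])
r_b = ecdf_ranks([rows[i][1] for i in ids])

# Step 2: AND-like and OR-like combinations
r_and = [min(a, b) for a, b in zip(r_a, r_b)]
r_or = [max(a, b) for a, b in zip(r_a, r_b)]

# Step 3: cut the combined score at 0.2, 0.5, 0.8
cuts = [0.2, 0.5, 0.8]

def stratify(scores):
    strata = {1: [], 2: [], 3: [], 4: []}
    for i, score in zip(ids, scores):
        strata[bisect_right(cuts, score) + 1].append(i)
    return strata

and_strata = stratify(r_and)
or_strata = stratify(r_or)
print(and_strata)  # {1: [2, 7], 2: [4, 5], 3: [1, 3, 6], 4: [8]}
print(or_strata)   # {1: [], 2: [2, 7], 3: [1, 4, 5], 4: [3, 6, 8]}
```

The script works with unrounded ranks (0.125, 0.375, and so on), which is why it agrees with the table even though the table shows two decimals.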



💭 Why "min of ranks" is conservative

  • rAND = min(rA, rB) can never exceed either input.

  • Raising any rank can only lift (not drop) rAND.

  • Thus, requiring rAND ≥ t is exactly the same as requiring rA ≥ t and rB ≥ t: a high rAND means both inputs are high. ✅
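These properties can be checked by brute force on a small grid of rank values. The sketch below is a sanity check, not a proof; the function name `check_min_properties` and the 0.1 grid spacing are arbitrary assumptions.

```python
from itertools import product

def check_min_properties(grid):
    """Return True if min() behaves conservatively on every grid triple."""
    for r_a, r_b, t in product(grid, repeat=3):
        combined = min(r_a, r_b)
        # 1. The combined score never exceeds either input
        if combined > r_a or combined > r_b:
            return False
        # 2. Raising r_a to t (when t >= r_a) never lowers the combined score
        if t >= r_a and min(t, r_b) < combined:
            return False
        # 3. "combined >= t" holds exactly when both inputs are >= t
        if (combined >= t) != (r_a >= t and r_b >= t):
            return False
    return True

grid = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
holds = check_min_properties(grid)
print(holds)  # True
```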



🖼️ Visual ideas

  • 🧊 Two 2-D heatmaps:

    1️⃣ rA (x-axis = id, y-axis = rank) and rB similarly.

    2️⃣ The combined min(rA,rB) mesh — shows the "AND valley."

  • 📊 Simple bar chart: rAND per id colored by stratum.
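To prepare data for a heatmap of the combined score, one option is to tabulate min(rA, rB) on a coarse grid of rank values. The sketch below only builds and prints the grid as text; the grid spacing is an assumption, and the numbers could be handed to any plotting library.

```python
levels = [0.0, 0.25, 0.5, 0.75, 1.0]

# and_grid[j][i] = min(rA, rB) with rA = levels[i] and rB = levels[j]
and_grid = [[min(r_a, r_b) for r_a in levels] for r_b in levels]

# Print with rB decreasing so the origin sits at the bottom-left, like a plot
print("          " + "  ".join(f"rA={r_a:.2f}" for r_a in levels))
for r_b, row in zip(reversed(levels), reversed(and_grid)):
    print(f"rB={r_b:.2f}   " + "     ".join(f"{v:.2f}" for v in row))
```

High values appear only in the corner where both ranks are high, which is the pattern a heatmap of the AND-like score would show.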



🧠 Practical tips

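  • 📦 Compute ranks within each group (e.g., per region or per time period) when values should only be compared against their own group.

  • 🧾 Handle ties consistently. The ≤ definition above gives tied values the same rank (the highest position in the tie); average ranks are a common alternative. Pick one rule and use it for every feature.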

  • 🪜 Pre-decide strata cuts (deciles, quartiles, custom).

  • 🔄 Extend beyond two features:

    • rAND = min(r₁,…,rₖ)

    • rOR = max(r₁,…,rₖ)
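The tips above can be sketched together: average-rank tie handling, ranks computed within groups, and a k-way min. All names and the sample data (`sales`, `region`, `quality_rank`, `speed_rank`) are hypothetical and exist only to illustrate the idea.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

def average_tie_ranks(values):
    """Percentile ranks where tied values share the average of their positions."""
    ordered = sorted(values)
    n = len(ordered)
    ranks = []
    for v in values:
        below = bisect_left(ordered, v)         # values strictly less than v
        at_or_below = bisect_right(ordered, v)  # values less than or equal to v
        # Tied values occupy sorted positions below+1 .. at_or_below
        ranks.append((below + 1 + at_or_below) / (2 * n))
    return ranks

def ranks_within_groups(values, groups, rank_fn):
    """Apply rank_fn separately inside each group, preserving input order."""
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    out = [0.0] * len(values)
    for idxs in members.values():
        for i, r in zip(idxs, rank_fn([values[i] for i in idxs])):
            out[i] = r
    return out

sales = [10, 30, 30, 5, 50, 50]
region = ["east", "east", "east", "west", "west", "west"]
sales_rank = ranks_within_groups(sales, region, average_tie_ranks)
# east: 10 -> 1/3, the two 30s -> 5/6 each; west follows the same pattern

# k-way AND: take the min across as many rank lists as needed
quality_rank = [0.9, 0.4, 0.7, 0.2, 1.0, 0.6]
speed_rank = [0.5, 0.9, 0.8, 0.9, 0.3, 1.0]
r_and = [min(rs) for rs in zip(sales_rank, quality_rank, speed_rank)]
```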


🌟 Takeaway

Percentile ranks normalize features onto a common [0,1] scale.

Combining them with min (AND) or max (OR) gives an interpretable, monotone score ideal for sampling, prioritization, and reporting.

Simple ✅ Robust 🧩 Explainable 💡




Day 4 Complete! 🎉

This is Day 4 of my 30-day challenge documenting my Data Science journey at Oracle! Stay tuned for more insights and mathematical foundations of data science. 🚀

Next: Day 5 - Coming Tomorrow!