Sughosh P Dixit
2025-11-28 β€’ 8 min read

Day 28: Robust Imputation and Numeric Coercion


TL;DR

Quick summary

Understand how numeric coercion and NA handling affect data distributions. Learn the impact of different imputation strategies on mean, variance, and quantiles for threshold-based rule evaluation.


Day 28: Robust Imputation and Numeric Coercion πŸ”§πŸ“Š

Understand how data preprocessing choices affect your distributions and downstream thresholds.

Imputation and coercion choices directly impact statistical measures and threshold calibrationβ€”choose wisely.

Before rules can be evaluated, data must be clean. How you handle missing values and convert data types has profound effects on the distributions your thresholds are based on.

πŸ’‘ Note: This article uses technical terms and abbreviations. For definitions, check out the Key Terms & Glossary page.


The Problem: Missing Data and Type Mismatches 🎯

Scenario: Your fraud detection pipeline receives raw data with:

  • Missing values (NA, NULL, "")
  • Mixed types ("100", 100, "N/A")
  • Invalid entries ("error", -999)

Questions:

  1. How do you convert everything to numeric?
  2. What do you replace missing values with?
  3. How do these choices affect thresholds?

The answer matters more than you think! πŸ€”


Numeric Coercion πŸ”’

What is Coercion?

Coercion is converting data from one type to another.

Common scenarios:

  • String β†’ Number: "100" β†’ 100
  • Number β†’ Integer: 100.7 β†’ 100 or 101
  • Invalid β†’ NA: "error" β†’ NA

Coercion in Practice

def safe_numeric(value, default=None):
    """
    Safely convert to numeric with fallback.
    """
    try:
        return float(value)
    except (ValueError, TypeError):
        return default

Edge Cases

What happens with:

safe_numeric("100")      # β†’ 100.0
safe_numeric("$100")     # β†’ None (or clean first)
safe_numeric("")         # β†’ None
safe_numeric(None)       # β†’ None
safe_numeric("1e10")     # β†’ 10000000000.0
safe_numeric("NaN")      # β†’ nan (careful!)
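Because float("NaN") parses without raising, a stricter variant can map NaN (and infinities) back to the default. This is a sketch extending the helper above; safe_numeric_strict is a hypothetical name, not part of any library:

```python
import math

def safe_numeric_strict(value, default=None):
    """Convert to float, treating NaN and infinities as missing."""
    try:
        result = float(value)
    except (ValueError, TypeError):
        return default
    # float("NaN") and float("inf") parse successfully, so check explicitly
    if math.isnan(result) or math.isinf(result):
        return default
    return result
```

With this variant, safe_numeric_strict("NaN") returns the default instead of a silent nan that would poison downstream comparisons.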

Visual Example:

Coercion Examples

Coercion must handle edge cases gracefully to prevent downstream errors in rule evaluation.


Imputation Strategies 🩹

Strategy 1: Impute with Zero

Method: Replace NA with 0

df['amount'] = df['amount'].fillna(0)

When appropriate:

  • NA genuinely means "no activity"
  • Zero is a valid value in the domain
  • You want to flag non-activity

Risks:

  • Shifts distribution left
  • Inflates zero-count
  • Can dramatically change quantiles

Strategy 2: Impute with Mean

Method: Replace NA with the column mean

df['amount'] = df['amount'].fillna(df['amount'].mean())

When appropriate:

  • Missing at random (MAR)
  • Want to preserve mean
  • Large sample sizes

Risks:

  • Reduces variance
  • Can distort relationships
  • Ignores missingness pattern

Strategy 3: Impute with Median

Method: Replace NA with the column median

df['amount'] = df['amount'].fillna(df['amount'].median())

When appropriate:

  • Skewed distributions
  • Outliers present
  • Want robust central tendency

Risks:

  • Still reduces variance
  • Ignores missingness pattern
  • May create spike at median

Strategy 4: Keep as NA (Exclude)

Method: Leave missing and handle separately

df_clean = df.dropna(subset=['amount'])

When appropriate:

  • Missingness is informative
  • Separate rules for missing
  • Small fraction missing

Risks:

  • Reduces sample size
  • Selection bias if not MCAR

Visual Example:

Imputation Strategies

Impact on Distribution Statistics πŸ“ˆ

Mean

Original: ΞΌ = Ξ£x_i / n

After zero imputation:

  • Mean decreases (zeros pull it down)
  • Effect proportional to missing rate

After mean imputation:

  • Mean preserved (by design)
  • But artificial values added

Variance

Original: σ² = Ξ£(x_i - ΞΌ)Β² / n

After zero imputation:

  • Variance usually increases when the mean is large relative to the spread (zeros land far from the bulk of the data)
  • Creates bimodal distribution

After mean imputation:

  • Variance decreases
  • Imputed values have zero deviation

Formula for variance reduction:

σ²_new β‰ˆ σ²_old Γ— (1 - missing_rate)

Quantiles

Zero imputation effect:

Original: [10, 20, 30, 40, 50]  β†’ p50 = 30
With 20% zeros: [0, 10, 20, 30, 40, 50]  β†’ p50 = 25 (shifted!)

Median imputation effect:

Original: [10, 20, 30, 40, 50]  β†’ p50 = 30
With imputed 30s: [10, 20, 30, 30, 30, 40, 50]  β†’ p50 = 30 (preserved)
But p75 = 35 instead of 40 (distorted!)
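Both effects are easy to reproduce with NumPy (values below use NumPy's default linear interpolation):

```python
import numpy as np

original = [10, 20, 30, 40, 50]
with_zero = [0, 10, 20, 30, 40, 50]          # one zero imputed
with_median = [10, 20, 30, 30, 30, 40, 50]   # two medians imputed

print(np.percentile(original, 50))    # 30.0
print(np.percentile(with_zero, 50))   # 25.0 -- shifted down
print(np.percentile(with_median, 50)) # 30.0 -- preserved
print(np.percentile(original, 75))    # 40.0
print(np.percentile(with_median, 75)) # 35.0 -- distorted by the median spike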

Visual Example:

Distribution Impact

Different imputation methods shift distributions in different waysβ€”understand the trade-offs.


Rule Geometry Changes πŸ“

How fillna(0) Changes Rules

Before imputation:

Rule: IF amount > threshold THEN flag

Data: [10, 20, NA, 40, 50]
Threshold (p50 excluding NA): 30
Flagged: [40, 50] (2 values)

After fillna(0):

Data: [10, 20, 0, 40, 50]
Threshold (p50): 20  ← Lower!
Flagged: [40, 50] (still 2, but threshold changed)
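The same threshold shift in a minimal pandas sketch (pandas skips NaN by default when computing the median, which matches the "excluding NA" case):

```python
import numpy as np
import pandas as pd

amounts = pd.Series([10, 20, np.nan, 40, 50], name="amount")

threshold_before = amounts.median()            # 30.0 (NaN skipped)
threshold_after = amounts.fillna(0).median()   # 20.0 (zero pulls it down)

print(threshold_before, threshold_after)
print((amounts.fillna(0) > threshold_after).sum())  # 2 values still flagged
```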

Geometric Interpretation

In feature space:

  • Zero imputation adds points at the origin
  • This shifts decision boundaries
  • Thresholds based on quantiles move

Example: 2D space

Before: Points scattered in positive quadrant
After: Cluster of zeros at (0, 0)

Decision boundary must now separate:
- True zeros (valid)
- Imputed zeros (missing)
- Positive values

Visual Example:

Rule Geometry Changes

Histograms: Before and After πŸ“Š

Visual Comparison

Original distribution (no missing):

      β”‚    β–„β–„
      β”‚   β–ˆβ–ˆβ–ˆβ–ˆ
      β”‚  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
      β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
      β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
      └──────────────
       0   20   40   60

After zero imputation (20% missing):

      β”‚β–ˆ
      β”‚β–ˆ   β–„β–„
      β”‚β–ˆ  β–ˆβ–ˆβ–ˆβ–ˆ
      β”‚β–ˆ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
      β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
      └──────────────
       0   20   40   60
       ↑
      Spike!

After mean imputation:

      β”‚    β–ˆβ–„
      β”‚   β–ˆβ–ˆβ–ˆβ–ˆ
      β”‚  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
      β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
      β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
      └──────────────
       0   20   40   60
           ↑
         Spike at mean

Visual Example:

Histograms Before/After

Exercise: Quantifying Percentile Shift πŸŽ“

The Problem

Given: A skewed sample with 20% missing values.

Original data (no missing):

[5, 10, 15, 20, 25, 30, 50, 100, 150, 200]

Task: Quantify how replacing 2 random values with NA, then imputing with 0, shifts the 95th percentile.

Solution

Step 1: Original 95th Percentile

data = [5, 10, 15, 20, 25, 30, 50, 100, 150, 200]
p95_original = np.percentile(data, 95)
# p95_original ≈ 177.5 (linear interpolation between 150 and 200)

Step 2: Remove 2 Values (20% missing)

Suppose we remove indices 3 and 7 (values 20 and 100):

data_with_na = [5, 10, 15, NA, 25, 30, 50, NA, 150, 200]

Step 3: Impute with Zero

data_imputed = [5, 10, 15, 0, 25, 30, 50, 0, 150, 200]
Sorted: [0, 0, 5, 10, 15, 25, 30, 50, 150, 200]

Step 4: New 95th Percentile

p95_imputed = np.percentile(data_imputed, 95)
# p95_imputed ≈ 177.5 (unchanged — high percentiles barely affected)

Step 5: Effect on Lower Percentiles

p50_original = np.percentile(data, 50)  # β‰ˆ 27.5
p50_imputed = np.percentile(data_imputed, 50)  # β‰ˆ 20

Shift: 27.5 - 20 = 7.5 (27% reduction!)

Key Insights

  1. High percentiles (90th, 95th): Less affected by zero imputation
  2. Median (50th): Significantly shifted down
  3. Lower percentiles (10th, 25th): Pushed to zero
  4. Effect is proportional to: Missing rate and zero distance from median
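The whole exercise can be verified in a few lines (values assume NumPy's default linear interpolation):

```python
import numpy as np

original = np.array([5, 10, 15, 20, 25, 30, 50, 100, 150, 200])
imputed = np.array([5, 10, 15, 0, 25, 30, 50, 0, 150, 200])  # 20 and 100 zeroed

for p in (10, 25, 50, 95):
    print(f"p{p}: {np.percentile(original, p)} -> {np.percentile(imputed, p)}")
# p95 is unchanged at 177.5, while p50 drops from 27.5 to 20.0
```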

Visual Example:

Exercise Percentile Shift

Zero imputation primarily affects lower and middle percentiles, with diminishing effect on extreme upper percentiles.


Best Practices for Imputation βœ…

1. Understand Your Missing Data

# Analyze missing pattern
missing_rate = df.isna().mean()
missing_correlation = df.isna().corr()

2. Document Your Strategy

IMPUTATION_CONFIG = {
    'amount': {'method': 'zero', 'reason': 'NA means no transaction'},
    'score': {'method': 'median', 'reason': 'Preserve central tendency'},
    'count': {'method': 'exclude', 'reason': 'Analyze separately'},
}

3. Compare Before and After

def imputation_impact_report(original, imputed, percentiles=[25, 50, 75, 90, 95]):
    report = {}
    for p in percentiles:
        orig = np.percentile(original.dropna(), p)
        imp = np.percentile(imputed, p)
        report[f'p{p}'] = {'original': orig, 'imputed': imp, 'shift': imp - orig}
    return report

4. Consider Separate Rules for Missing

if pd.isna(value):
    return apply_missing_rule(event)
else:
    return apply_normal_rule(event, value)

5. Monitor Drift

Track imputation effects over time as missing patterns change.

6. Use Robust Statistics

When possible, use median-based methods that resist imputation artifacts.
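As one option (a sketch under the assumptions noted in comments, not the article's production pipeline), a median/MAD cutoff resists both outliers and a moderate number of imputed values far better than a mean/std cutoff:

```python
import numpy as np

def robust_threshold(values, k=3.0):
    """Flag cutoff at median + k * scaled MAD (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]      # drop missing outright
    med = np.median(values)
    mad = np.median(np.abs(values - med))   # median absolute deviation
    # 1.4826 makes MAD comparable to the std under normality
    return med + k * 1.4826 * mad

data = [10, 20, 30, 40, 50, np.nan, 5000]   # one outlier, one missing
print(robust_threshold(data))               # ~101.7, not dragged up by 5000
```

A mean + 3·std threshold on the same data would be dominated by the single 5000 outlier; the median/MAD version barely moves.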


Summary Table πŸ“‹

Strategy   Effect on Mean    Effect on Variance   Effect on Quantiles
Zero       ↓ Decreases       ↑ Increases          ↓ Lower percentiles drop
Mean       = Preserved       ↓ Decreases          ~ Middle compressed
Median     ~ Slight shift    ↓ Decreases          = Median preserved
Exclude    ? Depends         ? Depends            ? Depends on MCAR

Final Thoughts 🌟

Imputation and coercion are not neutral operationsβ€”they actively shape your data distribution:

  • Zero imputation creates spikes and shifts percentiles down
  • Mean/median imputation compresses variance
  • Exclusion may introduce selection bias

Key Takeaways:

βœ… Coercion must handle edge cases gracefully βœ… Zero imputation shifts lower percentiles significantly βœ… Mean imputation reduces variance artificially βœ… Median imputation is more robust but still affects distribution βœ… fillna(0) changes rule geometry by adding origin points βœ… Document and monitor your imputation choices

Clean data, clear thresholds! πŸ”§πŸŽ―

Tomorrow's Preview: Day 29 - Putting It All Together: Constructing a Stratified Audit Plan πŸ“‹πŸŽ―

