Sughosh P Dixit
2025-11-28 · 8 min read

Day 28: Robust Imputation and Numeric Coercion

TL;DR

Quick summary

Understand how numeric coercion and NA handling affect data distributions. Learn the impact of different imputation strategies on mean, variance, and quantiles for threshold-based rule evaluation.

Understand how data preprocessing choices affect your distributions and downstream thresholds.

Before rules can be evaluated, data must be clean. How you handle missing values and convert data types has profound effects on the distributions your thresholds are based on.

Note: This article uses technical terms and abbreviations. For definitions, check out the Key Terms & Glossary page.


The Problem: Missing Data and Type Mismatches

Scenario: Your fraud detection pipeline receives raw data with:

  • Missing values (NA, NULL, "")
  • Mixed types ("100", 100, "N/A")
  • Invalid entries ("error", -999)

Questions:

  1. How do you convert everything to numeric?
  2. What do you replace missing values with?
  3. How do these choices affect thresholds?

The answer matters more than you think!


Numeric Coercion

What is Coercion?

Coercion is converting data from one type to another.

Common scenarios:

  • String → Number: "100" → 100
  • Number → Integer: 100.7 → 100 (truncate) or 101 (round)
  • Invalid → NA: "error" → NA
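
For a whole column, the same three conversions can be sketched with pandas. This is a minimal illustration, assuming the raw values arrive as a pandas Series; the sample values are made up to mirror the scenario above.

```python
import pandas as pd

# Illustrative raw column mixing strings, numbers, and invalid entries
raw = pd.Series(["100", 100, "N/A", "error", "", None], dtype="object")

# errors="coerce" turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(raw, errors="coerce")

# Count how many entries failed to convert, so the loss is visible
n_failed = int(numeric.isna().sum())
print(numeric.tolist(), n_failed)
```

Counting the coerced entries before imputing makes it harder for a parsing problem to hide behind a later fillna.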

Coercion in Practice

def safe_numeric(value, default=None):
    """
    Safely convert to numeric with fallback.
    """
    try:
        return float(value)
    except (ValueError, TypeError):
        return default

Edge Cases

What happens with:

safe_numeric("100")      # → 100.0
safe_numeric("$100")     # → None (or clean first)
safe_numeric("")         # → None
safe_numeric(None)       # → None
safe_numeric("1e10")     # → 10000000000.0
safe_numeric("NaN")      # → nan (careful!)
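
The "NaN" case shows that float() happily parses some strings you may want to treat as missing. One possible stricter variant is sketched below. The name strict_numeric and the sentinel set are assumptions for illustration, with -999 taken from the invalid entries listed earlier.

```python
import math

# Hypothetical sentinel codes that should count as missing
SENTINELS = frozenset({-999.0})

def strict_numeric(value, default=None, sentinels=SENTINELS):
    """Convert to float, mapping NaN, infinities, and sentinels to default."""
    try:
        number = float(value)
    except (ValueError, TypeError):
        return default
    # float("NaN") and float("inf") parse fine, so filter them explicitly
    if not math.isfinite(number) or number in sentinels:
        return default
    return number

print(strict_numeric("NaN"), strict_numeric(-999), strict_numeric("100"))
```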

[Figure: Coercion Examples]

Imputation Strategies

Strategy 1: Impute with Zero

Method: Replace NA with 0

df['amount'] = df['amount'].fillna(0)

When appropriate:

  • NA genuinely means "no activity"
  • Zero is a valid value in the domain
  • You want to flag non-activity

Risks:

  • Shifts distribution left
  • Inflates zero-count
  • Can dramatically change quantiles

Strategy 2: Impute with Mean

Method: Replace NA with the column mean

df['amount'] = df['amount'].fillna(df['amount'].mean())

When appropriate:

  • Values are missing completely at random (MCAR)
  • Want to preserve mean
  • Large sample sizes

Risks:

  • Reduces variance
  • Can distort relationships
  • Ignores missingness pattern

Strategy 3: Impute with Median

Method: Replace NA with the column median

df['amount'] = df['amount'].fillna(df['amount'].median())

When appropriate:

  • Skewed distributions
  • Outliers present
  • Want robust central tendency

Risks:

  • Still reduces variance
  • Ignores missingness pattern
  • May create spike at median

Strategy 4: Keep as NA (Exclude)

Method: Leave missing and handle separately

df_clean = df.dropna(subset=['amount'])

When appropriate:

  • Missingness is informative
  • Separate rules for missing
  • Small fraction missing

Risks:

  • Reduces sample size
  • Selection bias if not MCAR
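
To see the four strategies side by side, here is a small sketch on a toy skewed sample. The data is invented for illustration; only the pandas calls are the ones used above.

```python
import numpy as np
import pandas as pd

# Toy skewed sample with one missing value (illustrative data)
amount = pd.Series([5.0, 10.0, np.nan, 20.0, 200.0])

strategies = {
    "zero": amount.fillna(0),
    "mean": amount.fillna(amount.mean()),      # mean of observed values
    "median": amount.fillna(amount.median()),  # median of observed values
    "exclude": amount.dropna(),
}

# One row per strategy: sample size, mean, standard deviation, median
summary = pd.DataFrame({
    name: {"n": len(s), "mean": s.mean(), "std": s.std(), "p50": s.median()}
    for name, s in strategies.items()
}).T
print(summary)
```

Even on five values, each strategy gives a different combination of mean, spread, and median.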

[Figure: Imputation Strategies]

Impact on Distribution Statistics

Mean

Original: μ = Σx_i / n

After zero imputation:

  • Mean decreases (zeros pull it down)
  • Effect proportional to missing rate

After mean imputation:

  • Mean preserved (by design)
  • But artificial values added
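
Both effects on the mean can be checked numerically. In the sketch below, p is the fraction of missing rows and the sample is a toy example: zero imputation scales the observed mean by (1 - p), while mean imputation leaves it unchanged.

```python
import numpy as np
import pandas as pd

amounts = pd.Series([10.0, 20.0, np.nan, 40.0, 50.0])  # toy sample

p = amounts.isna().mean()                 # missing rate: 0.2
mu_obs = amounts.mean()                   # pandas skips NaN: 30.0
mu_zero = amounts.fillna(0).mean()        # 24.0, i.e. (1 - p) * mu_obs
mu_mean = amounts.fillna(mu_obs).mean()   # 30.0, preserved by design

print(p, mu_obs, mu_zero, mu_mean)
```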

Variance

Original: σ² = Σ(x_i - μ)² / n

After zero imputation:

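  • Variance can move either way: with missing fraction p it becomes (1 - p)σ² + p(1 - p)μ², which exceeds σ² only when (1 - p)μ² > σ²
  • Creates a second mode at zero (bimodal distribution)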

After mean imputation:

  • Variance decreases
  • Imputed values have zero deviation

Formula for variance reduction:

σ²_new ≈ σ²_old × (1 - missing_rate)
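
A quick numerical check of the variance effects, using a toy sample and the population variance (ddof=0). Under that convention the mean-imputation factor (1 - p) is exact; with the sample variance (ddof=1) the factor becomes (n_observed - 1) / (n_total - 1), which is why the formula is written with ≈. The zero-imputation expression in the comments follows from the same definitions.

```python
import numpy as np

observed = np.array([10.0, 20.0, 40.0, 50.0])   # toy observed values
n_missing = 1
p = n_missing / (observed.size + n_missing)      # missing rate: 0.2

mu = observed.mean()
var_old = observed.var()                         # population variance (ddof=0)

mean_filled = np.append(observed, [mu] * n_missing)
zero_filled = np.append(observed, [0.0] * n_missing)

var_mean = mean_filled.var()   # (1 - p) * var_old
var_zero = zero_filled.var()   # (1 - p) * var_old + p * (1 - p) * mu**2

print(var_old, var_mean, var_zero)
```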

Quantiles

Zero imputation effect:

Original: [10, 20, 30, 40, 50]  → p50 = 30
With one zero added (1 of 6 values, about 17%): [0, 10, 20, 30, 40, 50]  → p50 = 25 (shifted!)

Median imputation effect:

Original: [10, 20, 30, 40, 50]  → p50 = 30
With imputed 30s: [10, 20, 30, 30, 30, 40, 50]  → p50 = 30 (preserved)
But p75 drops from 40 to 35 under linear interpolation (compressed toward the median!)
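
Exact quantile values depend on the interpolation scheme. The sketch below recomputes these examples with NumPy's default (linear) method, under which the upper quartile of the median-imputed sample is 35 rather than the original 40. Other schemes can give different numbers.

```python
import numpy as np

original = [10, 20, 30, 40, 50]
with_zero = original + [0]          # one missing value imputed as 0
with_median = original + [30, 30]   # two missing values imputed as the median

p50_orig = np.percentile(original, 50)      # 30.0
p50_zero = np.percentile(with_zero, 50)     # 25.0
p50_med = np.percentile(with_median, 50)    # 30.0

p75_orig = np.percentile(original, 75)      # 40.0
p75_med = np.percentile(with_median, 75)    # 35.0

print(p50_orig, p50_zero, p50_med, p75_orig, p75_med)
```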

[Figure: Distribution Impact]

Rule Geometry Changes

How fillna(0) Changes Rules

Before imputation:

Rule: IF amount > threshold THEN flag

Data: [10, 20, NA, 40, 50]
Threshold (p50 excluding NA): 30
Flagged: [40, 50] (2 values)

After fillna(0):

Data: [10, 20, 0, 40, 50]
Threshold (p50): 20  ← Lower!
Flagged: [40, 50] (still 2, but threshold changed)
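
The flagged set is identical in this tiny example, but the lower threshold matters for new data. A minimal sketch, where the probe value 25 is a hypothetical incoming transaction:

```python
import numpy as np
import pandas as pd

amount = pd.Series([10, 20, np.nan, 40, 50])

thr_excluding_na = amount.quantile(0.5)          # NaN skipped: 30.0
thr_zero_filled = amount.fillna(0).quantile(0.5) # 20.0

probe = 25  # hypothetical new transaction amount
flag_before = probe > thr_excluding_na   # False: below the original threshold
flag_after = probe > thr_zero_filled     # True: above the lowered threshold

print(thr_excluding_na, thr_zero_filled, flag_before, flag_after)
```

The same event is ignored under one preprocessing choice and flagged under the other.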

Geometric Interpretation

In feature space:

  • Zero imputation adds points at zero along each imputed feature's axis (at the origin when every feature is imputed)
  • This shifts decision boundaries
  • Thresholds based on quantiles move

Example: 2D space

Before: Points scattered in positive quadrant
After: Cluster of zeros at (0, 0)

Decision boundary must now separate:
- True zeros (valid)
- Imputed zeros (missing)
- Positive values
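
One common way to keep true zeros and imputed zeros apart is to record a missing indicator before filling. The sketch below assumes a pandas DataFrame; the column name amount_was_missing is invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": [0.0, np.nan, 25.0, np.nan, 40.0]})

# Record missingness BEFORE imputing, otherwise the information is lost
df["amount_was_missing"] = df["amount"].isna()
df["amount"] = df["amount"].fillna(0)

true_zeros = df[(df["amount"] == 0) & ~df["amount_was_missing"]]
imputed_zeros = df[df["amount_was_missing"]]

print(len(true_zeros), len(imputed_zeros))
```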

[Figure: Rule Geometry Changes]

Histograms: Before and After

Visual Comparison

Original distribution (no missing): histogram over the range 0 to 60.

After zero imputation (20% missing): same histogram, with a spike at 0.

After mean imputation: same histogram, with a spike at the mean.

[Figure: Histograms Before/After]

Exercise: Quantifying Percentile Shift

The Problem

Given: A skewed sample with 20% missing values.

Original data (no missing):

[5, 10, 15, 20, 25, 30, 50, 100, 150, 200]

Task: Quantify how replacing 2 random values with NA, then imputing with 0, shifts the 95th percentile.

Solution

Step 1: Original 95th Percentile

import numpy as np

data = [5, 10, 15, 20, 25, 30, 50, 100, 150, 200]
p95_original = np.percentile(data, 95)
# p95_original = 177.5 (interpolated: 150 + 0.55 × (200 - 150))

Step 2: Remove 2 Values (20% missing)

Suppose we remove indices 3 and 7 (values 20 and 100):

data_with_na = [5, 10, 15, np.nan, 25, 30, 50, np.nan, 150, 200]

Step 3: Impute with Zero

data_imputed = [5, 10, 15, 0, 25, 30, 50, 0, 150, 200]
# Sorted: [0, 0, 5, 10, 15, 25, 30, 50, 150, 200]

Step 4: New 95th Percentile

p95_imputed = np.percentile(data_imputed, 95)
# p95_imputed = 177.5 (unchanged here: the two largest values were not removed)

Step 5: Effect on Lower Percentiles

p50_original = np.percentile(data, 50)  # ≈ 27.5
p50_imputed = np.percentile(data_imputed, 50)  # ≈ 20

Shift: 27.5 - 20 = 7.5 (27% reduction!)

Key Insights

  1. High percentiles (90th, 95th): Less affected by zero imputation, as long as the missing values are not the largest ones
  2. Median (50th): Significantly shifted down
  3. Lower percentiles (10th, 25th): Pushed toward zero
  4. The effect grows with the missing rate and with the distance between zero and the median
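
The solution fixes one particular pair of missing indices. Since the task says "random", a natural extension is to enumerate every possible pair and look at the spread of 95th-percentile shifts. This is a sketch using NumPy's default interpolation, not part of the original solution.

```python
from itertools import combinations
import numpy as np

data = np.array([5, 10, 15, 20, 25, 30, 50, 100, 150, 200], dtype=float)
p95_original = np.percentile(data, 95)

# Shift in p95 for every way of zero-imputing 2 of the 10 values
shifts = {}
for pair in combinations(range(len(data)), 2):
    imputed = data.copy()
    imputed[list(pair)] = 0.0
    shifts[pair] = np.percentile(imputed, 95) - p95_original

unchanged = [pair for pair, s in shifts.items() if np.isclose(s, 0.0)]
worst_pair = min(shifts, key=shifts.get)

print(len(shifts), len(unchanged), worst_pair, shifts[worst_pair])
```

The 95th percentile only moves when one of the two largest values is among the missing ones, which is why high percentiles are less affected on average but not immune.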

[Figure: Exercise Percentile Shift]

Best Practices for Imputation

1. Understand Your Missing Data

# Analyze missing pattern
missing_rate = df.isna().mean()
missing_correlation = df.isna().corr()
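
As a rough illustration of what these two lines return, here is a sketch on a made-up frame where amount and score tend to be missing together. The explicit astype(int) is only there to make the numeric conversion of the boolean mask obvious.

```python
import numpy as np
import pandas as pd

# Illustrative frame: 'amount' and 'score' are often missing on the same rows
df = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, np.nan, 50.0, 60.0],
    "score": [0.1, np.nan, 0.3, np.nan, 0.5, np.nan],
    "count": [1, 2, 3, 4, 5, 6],
})

missing_rate = df.isna().mean()                       # fraction missing per column
missing_correlation = df.isna().astype(int).corr()    # do columns go missing together?

print(missing_rate)
print(missing_correlation)
```

A column with no missing values has a constant mask, so its correlations come out as NaN. A high correlation between two masks suggests the values are not missing independently.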

2. Document Your Strategy

IMPUTATION_CONFIG = {
    'amount': {'method': 'zero', 'reason': 'NA means no transaction'},
    'score': {'method': 'median', 'reason': 'Preserve central tendency'},
    'count': {'method': 'exclude', 'reason': 'Analyze separately'},
}
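
A documented config is most useful when the pipeline actually reads it. One way this could look is sketched below. The function apply_imputation and the sample frame are hypothetical; only the config structure comes from the article.

```python
import numpy as np
import pandas as pd

IMPUTATION_CONFIG = {
    "amount": {"method": "zero", "reason": "NA means no transaction"},
    "score": {"method": "median", "reason": "Preserve central tendency"},
    "count": {"method": "exclude", "reason": "Analyze separately"},
}

def apply_imputation(df, config):
    """Hypothetical helper: apply each column's documented method in order."""
    out = df.copy()
    for column, spec in config.items():
        method = spec["method"]
        if method == "zero":
            out[column] = out[column].fillna(0)
        elif method == "median":
            # Median is taken over the rows still present at this point
            out[column] = out[column].fillna(out[column].median())
        elif method == "exclude":
            out = out.dropna(subset=[column])
        else:
            raise ValueError(f"Unknown imputation method: {method!r}")
    return out

df = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 40.0],
    "score": [0.2, 0.4, np.nan, 0.8],
    "count": [1.0, 2.0, 3.0, np.nan],
})
clean = apply_imputation(df, IMPUTATION_CONFIG)
print(clean)
```

Raising on unknown methods keeps a typo in the config from silently skipping a column.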

3. Compare Before and After

import numpy as np

def imputation_impact_report(original, imputed, percentiles=(25, 50, 75, 90, 95)):
    """Compare percentiles of the observed values against the imputed series."""
    report = {}
    for p in percentiles:
        orig = np.percentile(original.dropna(), p)  # observed values only
        imp = np.percentile(imputed, p)
        report[f'p{p}'] = {'original': orig, 'imputed': imp, 'shift': imp - orig}
    return report

4. Consider Separate Rules for Missing

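import pandas as pd

def evaluate(event, value):
    # apply_missing_rule and apply_normal_rule are placeholders for your own rule functions
    if pd.isna(value):
        return apply_missing_rule(event)
    else:
        return apply_normal_rule(event, value)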

5. Monitor Drift

Track imputation effects over time as missing patterns change.
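
A minimal monitoring sketch, assuming events carry a batch label such as a week. The function names, the column names, and the 0.05 tolerance are illustrative choices, not recommendations.

```python
import numpy as np
import pandas as pd

def missing_rate_by_batch(df, column, batch_column):
    """Fraction of missing values in `column` for each batch."""
    return df[column].isna().groupby(df[batch_column]).mean()

def drifted_batches(rates, baseline, tolerance=0.05):
    """Batches whose missing rate differs from the baseline by more than tolerance."""
    return rates[(rates - baseline).abs() > tolerance].index.tolist()

df = pd.DataFrame({
    "week": ["w1"] * 4 + ["w2"] * 4 + ["w3"] * 4,
    "amount": [10, 20, np.nan, 40,
               10, np.nan, 30, 40,
               np.nan, np.nan, np.nan, 40],
})

rates = missing_rate_by_batch(df, "amount", "week")
alerts = drifted_batches(rates, baseline=rates.iloc[0])
print(rates.to_dict(), alerts)
```

When a batch is flagged, thresholds computed from imputed data in that period deserve a second look.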

6. Use Robust Statistics

When possible, use median-based methods that resist imputation artifacts.
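
One example of a median-based method is a threshold built from the median and the median absolute deviation (MAD), computed on observed values only. This is a sketch of that idea rather than a technique prescribed by this series. The multiplier k and the 1.4826 scaling (which makes the MAD comparable to a standard deviation for normal data) are conventional choices.

```python
import numpy as np

def mad_threshold(values, k=3.0):
    """Upper threshold: median + k * scaled MAD, ignoring missing values."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]                       # use observed values only
    median = np.median(x)
    mad = np.median(np.abs(x - median))       # median absolute deviation
    return median + k * 1.4826 * mad

data = [5, 10, 15, 20, 25, 30, 50, 100, 150, 200]
with_missing = data + [np.nan, np.nan]

print(mad_threshold(data), mad_threshold(with_missing))
```

Because missing values are dropped rather than filled, adding NaN entries leaves the threshold unchanged.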


Summary Table

Strategy | Effect on Mean  | Effect on Variance                     | Effect on Quantiles
Zero     | ↓ Decreases     | ↑ Often increases (mean far from zero) | ↓ Lower percentiles drop
Mean     | = Preserved     | ↓ Decreases                            | ~ Middle compressed
Median   | ~ Slight shift  | ↓ Decreases                            | = Median preserved
Exclude  | ? Depends       | ? Depends                              | ? Depends on MCAR

Final Thoughts

Imputation and coercion are not neutral operations. They actively shape your data distribution:

  • Zero imputation creates spikes and shifts percentiles down
  • Mean/median imputation compresses variance
  • Exclusion may introduce selection bias

Key Takeaways:

  • Coercion must handle edge cases gracefully
  • Zero imputation shifts lower percentiles significantly
  • Mean imputation reduces variance artificially
  • Median imputation is more robust but still affects distribution
  • fillna(0) changes rule geometry by adding origin points
  • Document and monitor your imputation choices

Clean data, clear thresholds!

Tomorrow's Preview: Day 29 - Putting It All Together: Constructing a Stratified Audit Plan

