Day 28: Robust Imputation and Numeric Coercion 🧹📊
Understand how data preprocessing choices affect your distributions and downstream thresholds.
Imputation and coercion choices directly impact statistical measures and threshold calibration, so choose wisely.
Before rules can be evaluated, data must be clean. How you handle missing values and convert data types has profound effects on the distributions your thresholds are based on.
💡 Note: This article uses technical terms and abbreviations. For definitions, check out the Key Terms & Glossary page.
The Problem: Missing Data and Type Mismatches 🎯
Scenario: Your fraud detection pipeline receives raw data with:
- Missing values (NA, NULL, "")
- Mixed types ("100", 100, "N/A")
- Invalid entries ("error", -999)
Questions:
- How do you convert everything to numeric?
- What do you replace missing values with?
- How do these choices affect thresholds?
The answer matters more than you think! 🤔
Numeric Coercion 🔢
What is Coercion?
Coercion is converting data from one type to another.
Common scenarios:
- String → Number: "100" → 100
- Number → Integer: 100.7 → 100 or 101
- Invalid → NA: "error" → NA
Coercion in Practice
def safe_numeric(value, default=None):
    """
    Safely convert to numeric with fallback.
    """
    try:
        return float(value)
    except (ValueError, TypeError):
        return default
Edge Cases
What happens with:
safe_numeric("100") # β 100.0
safe_numeric("$100") # β None (or clean first)
safe_numeric("") # β None
safe_numeric(None) # β None
safe_numeric("1e10") # β 10000000000.0
safe_numeric("NaN") # β nan (careful!)
Coercion must handle edge cases gracefully to prevent downstream errors in rule evaluation.
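In practice, raw strings often need cleaning before coercion, as the "$100" case above hints: currency symbols, thousands separators, and sentinel codes like "N/A" or -999. A minimal sketch building on safe_numeric; the SENTINELS set and the clean_numeric name are illustrative assumptions, not a standard API:

import math

# Hypothetical sentinel codes that upstream systems use to mean "missing"
SENTINELS = {"N/A", "NULL", "error", "-999"}

def clean_numeric(value, default=None):
    """Strip common formatting, map sentinels to the default, then coerce."""
    if value is None:
        return default
    text = str(value).strip()
    if text == "" or text in SENTINELS:
        return default
    # Drop currency symbols and thousands separators before coercing
    text = text.lstrip("$").replace(",", "")
    try:
        result = float(text)
    except (ValueError, TypeError):
        return default
    # float("NaN") parses successfully, so catch it explicitly
    return default if math.isnan(result) else result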
Imputation Strategies 🩹
Strategy 1: Impute with Zero
Method: Replace NA with 0
df['amount'] = df['amount'].fillna(0)
When appropriate:
- NA genuinely means "no activity"
- Zero is a valid value in the domain
- You want to flag non-activity
Risks:
- Shifts distribution left
- Inflates zero-count
- Can dramatically change quantiles
Strategy 2: Impute with Mean
Method: Replace NA with the column mean
df['amount'] = df['amount'].fillna(df['amount'].mean())
When appropriate:
- Missing at random (MAR)
- Want to preserve mean
- Large sample sizes
Risks:
- Reduces variance
- Can distort relationships
- Ignores missingness pattern
Strategy 3: Impute with Median
Method: Replace NA with the column median
df['amount'] = df['amount'].fillna(df['amount'].median())
When appropriate:
- Skewed distributions
- Outliers present
- Want robust central tendency
Risks:
- Still reduces variance
- Ignores missingness pattern
- May create spike at median
Strategy 4: Keep as NA (Exclude)
Method: Leave missing and handle separately
df_clean = df.dropna(subset=['amount'])
When appropriate:
- Missingness is informative
- Separate rules for missing
- Small fraction missing
Risks:
- Reduces sample size
- Selection bias if not MCAR
Impact on Distribution Statistics 📊
Mean
Original: μ = Σx_i / n
After zero imputation:
- Mean decreases (zeros pull it down)
- Effect proportional to missing rate
After mean imputation:
- Mean preserved (by design)
- But artificial values added
Variance
Original: σ² = Σ(x_i − μ)² / n
After zero imputation:
- Variance typically increases when the mean is well above zero (imputed zeros sit far from the bulk of the data)
- Creates a bimodal distribution
After mean imputation:
- Variance decreases
- Imputed values have zero deviation
Formula for variance reduction:
σ²_new ≈ σ²_old × (1 − missing_rate)
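A quick numerical check of this approximation on synthetic data (the exponential toy distribution and 20% missing rate are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=50, size=100_000)

# Mark 20% of values as missing, then impute with the observed mean
missing = rng.random(x.size) < 0.20
x_imputed = x.copy()
x_imputed[missing] = x[~missing].mean()

print(f"original variance:        {np.var(x):.1f}")
print(f"after mean imputation:    {np.var(x_imputed):.1f}")
print(f"predicted 0.8 x original: {0.8 * np.var(x):.1f}")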
Quantiles
Zero imputation effect:
Original: [10, 20, 30, 40, 50] → p50 = 30
With one imputed zero: [0, 10, 20, 30, 40, 50] → p50 = 25 (shifted!)
Median imputation effect:
Original: [10, 20, 30, 40, 50] → p50 = 30
With imputed 30s: [10, 20, 30, 30, 30, 40, 50] → p50 = 30 (preserved)
But p75 drops from 40 to 35 (distorted!)
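Both effects are easy to reproduce with np.percentile (which uses linear interpolation by default):

import numpy as np

original = np.array([10, 20, 30, 40, 50], dtype=float)
with_zero = np.append(original, 0)             # one imputed zero
with_medians = np.append(original, [30, 30])   # two imputed medians

print(np.percentile(original, 50))      # 30.0
print(np.percentile(with_zero, 50))     # 25.0 (median shifted down)
print(np.percentile(with_medians, 75))  # 35.0 (upper quartile pulled toward the median)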
Different imputation methods shift distributions in different ways; understand the trade-offs.
Rule Geometry Changes 📐
How fillna(0) Changes Rules
Before imputation:
Rule: IF amount > threshold THEN flag
Data: [10, 20, NA, 40, 50]
Threshold (p50 excluding NA): 30
Flagged: [40, 50] (2 values)
After fillna(0):
Data: [10, 20, 0, 40, 50]
Threshold (p50): 20 ← Lower!
Flagged: [40, 50] (still 2, but threshold changed)
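The same shift reproduced in pandas, whose quantile() skips NA by default:

import pandas as pd

amounts = pd.Series([10, 20, None, 40, 50], dtype=float)

threshold_excl = amounts.quantile(0.5)            # 30.0 (NA excluded)
threshold_zero = amounts.fillna(0).quantile(0.5)  # 20.0 (after zero imputation)
print(threshold_excl, threshold_zero)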
Geometric Interpretation
In feature space:
- Zero imputation adds points at the origin
- This shifts decision boundaries
- Thresholds based on quantiles move
Example: 2D space
Before: Points scattered in positive quadrant
After: Cluster of zeros at (0, 0)
Decision boundary must now separate:
- True zeros (valid)
- Imputed zeros (missing)
- Positive values
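One common way to keep these populations separable is to record missingness as its own feature before imputing. A minimal pandas sketch (the column names are assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame({'amount': [10.0, np.nan, 0.0, 40.0]})

# Preserve the missingness signal before it is erased by imputation
df['amount_missing'] = df['amount'].isna()
df['amount'] = df['amount'].fillna(0)

# Rules can now distinguish true zeros from imputed zeros
true_zeros = (df['amount'] == 0) & ~df['amount_missing']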
Histograms: Before and After 📊
Visual Comparison
Original distribution (no missing):
│      ██
│     ████
│    ██████
│   ████████
│ ██████████
└───────────────
 0   20   40   60
After zero imputation (20% missing):
██
██    ██
██   ████
██  ██████
██ ████████
└───────────────
 0   20   40   60
 ↑
 Spike!
After mean imputation:
│      ██
│     ████
│    ██████
│   ████████
│ ██████████
└───────────────
 0   20   40   60
        ↑
   Spike at mean
Exercise: Quantifying Percentile Shift 📝
The Problem
Given: A skewed sample from which 20% of the values will go missing.
Original data (no missing):
[5, 10, 15, 20, 25, 30, 50, 100, 150, 200]
Task: Quantify how replacing 2 random values with NA, then imputing with 0, shifts the 95th percentile.
Solution
Step 1: Original 95th Percentile
import numpy as np

data = [5, 10, 15, 20, 25, 30, 50, 100, 150, 200]
p95_original = np.percentile(data, 95)
# p95_original = 177.5 (linear interpolation between 150 and 200)
Step 2: Remove 2 Values (20% missing)
Suppose we remove indices 3 and 7 (values 20 and 100):
data_with_na = [5, 10, 15, NA, 25, 30, 50, NA, 150, 200]
Step 3: Impute with Zero
data_imputed = [5, 10, 15, 0, 25, 30, 50, 0, 150, 200]
Sorted: [0, 0, 5, 10, 15, 25, 30, 50, 150, 200]
Step 4: New 95th Percentile
p95_imputed = np.percentile(data_imputed, 95)
# p95_imputed = 177.5 (unchanged: the imputed zeros land below the upper tail)
Step 5: Effect on Lower Percentiles
p50_original = np.percentile(data, 50)         # = 27.5
p50_imputed = np.percentile(data_imputed, 50)  # = 20.0
Shift: 27.5 - 20 = 7.5 (27% reduction!)
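Putting the steps together as one runnable check across several percentiles:

import numpy as np

data = np.array([5, 10, 15, 20, 25, 30, 50, 100, 150, 200], dtype=float)
data_imputed = data.copy()
data_imputed[[3, 7]] = 0.0  # the two removed values, imputed with zero

for p in (10, 25, 50, 75, 95):
    orig, imp = np.percentile(data, p), np.percentile(data_imputed, p)
    print(f"p{p:>2}: {orig:6.2f} -> {imp:6.2f}  (shift {imp - orig:+.2f})")

# p95 stays at 177.50 while p50 drops from 27.50 to 20.00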
Key Insights
- High percentiles (90th, 95th): Less affected by zero imputation
- Median (50th): Significantly shifted down
- Lower percentiles (10th, 25th): Pushed to zero
- Effect scales with the missing rate and with how far zero sits from the original median
Zero imputation primarily affects lower and middle percentiles, with diminishing effect on extreme upper percentiles.
Best Practices for Imputation ✅
1. Understand Your Missing Data
# Analyze missing pattern
missing_rate = df.isna().mean()
missing_correlation = df.isna().corr()
2. Document Your Strategy
IMPUTATION_CONFIG = {
    'amount': {'method': 'zero', 'reason': 'NA means no transaction'},
    'score': {'method': 'median', 'reason': 'Preserve central tendency'},
    'count': {'method': 'exclude', 'reason': 'Analyze separately'},
}
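A hypothetical helper that applies such a config column by column (the method names mirror the keys above):

def apply_imputation(df, config):
    """Apply the documented strategy to each configured column."""
    df = df.copy()
    for column, spec in config.items():
        if spec['method'] == 'zero':
            df[column] = df[column].fillna(0)
        elif spec['method'] == 'median':
            df[column] = df[column].fillna(df[column].median())
        elif spec['method'] == 'exclude':
            df = df.dropna(subset=[column])
    return df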
3. Compare Before and After
import numpy as np

def imputation_impact_report(original, imputed, percentiles=(25, 50, 75, 90, 95)):
    """Compare key percentiles before and after imputation."""
    report = {}
    for p in percentiles:
        orig = np.percentile(original.dropna(), p)
        imp = np.percentile(imputed, p)
        report[f'p{p}'] = {'original': orig, 'imputed': imp, 'shift': imp - orig}
    return report
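Example usage against a pandas column (the column name is an assumption):

report = imputation_impact_report(df['amount'], df['amount'].fillna(0))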
4. Consider Separate Rules for Missing
if pd.isna(value):
    return apply_missing_rule(event)
else:
    return apply_normal_rule(event, value)
5. Monitor Drift
Track imputation effects over time as missing patterns change.
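A minimal sketch of such monitoring, assuming the frame has an event_date datetime column and using a 30-day rolling baseline:

# Daily missing rate for the 'amount' column
daily_missing = (
    df.assign(is_missing=df['amount'].isna())
      .groupby(df['event_date'].dt.date)['is_missing']
      .mean()
)

# Alert when the rate doubles relative to the trailing median
baseline = daily_missing.rolling(30, min_periods=7).median()
alerts = daily_missing[daily_missing > 2 * baseline]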
6. Use Robust Statistics
When possible, use median-based methods that resist imputation artifacts.
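For example, a median/MAD threshold is less sensitive to an imputation spike than a mean/std one. A sketch (the k = 3 cutoff is a common but arbitrary choice):

import numpy as np

def robust_threshold(values, k=3.0):
    """Threshold at k scaled-MADs above the median."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    return med + k * 1.4826 * mad  # 1.4826 makes MAD comparable to std under normality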
Summary Table 📋
| Strategy | Effect on Mean | Effect on Variance | Effect on Quantiles |
|---|---|---|---|
| Zero | ↓ Decreases | ↑ Increases | ↓ Lower percentiles drop |
| Mean | = Preserved | ↓ Decreases | ~ Middle compressed |
| Median | ~ Slight shift | ↓ Decreases | = Median preserved |
| Exclude | ? Depends | ? Depends | ? Depends on MCAR |
Final Thoughts 💭
Imputation and coercion are not neutral operations; they actively shape your data distribution:
- Zero imputation creates spikes and shifts percentiles down
- Mean/median imputation compresses variance
- Exclusion may introduce selection bias
Key Takeaways:
✅ Coercion must handle edge cases gracefully
✅ Zero imputation shifts lower percentiles significantly
✅ Mean imputation reduces variance artificially
✅ Median imputation is more robust but still affects the distribution
✅ fillna(0) changes rule geometry by adding points at the origin
✅ Document and monitor your imputation choices
Clean data, clear thresholds! 🧹🎯
Tomorrow's Preview: Day 29 - Putting It All Together: Constructing a Stratified Audit Plan 📋🎯