Sughosh P Dixit
2025-11-19 · 13 min read

Day 19: Precision, Recall, and F1 as Objectives


TL;DR

Precision, Recall, and F1 are fundamental evaluation metrics in classification. This post explores their mathematical foundations, trade-offs, and how threshold selection impacts these metrics—essential knowledge for building production ML systems.

Master the fundamental metrics that drive classification decisions. Understand the precision-recall trade-off and learn to optimize for your specific use case.

Precision, Recall, and F1 are the building blocks of classification evaluation. Understanding their trade-offs is crucial for building effective ML systems.

When building classification models, accuracy alone isn't enough. Precision, Recall, and F1 score reveal the true performance of your model—especially when dealing with imbalanced classes. Learn how to interpret these metrics and optimize thresholds for your specific objectives.

💡 Note: This article uses technical terms and abbreviations. For definitions, check out the Key Terms & Glossary page.


The Problem with Accuracy Alone 🚫

Scenario: You're building a fraud detection system.

Total transactions: 10,000
Fraudulent transactions: 100 (1%)
Legitimate transactions: 9,900 (99%)

Naive model: "Always predict legitimate"

Accuracy = 9,900 / 10,000 = 99% ✅

But wait! This model caught zero fraud cases! 💥

The problem: Accuracy is misleading when classes are imbalanced. We need metrics that focus on what matters: finding the positive cases (fraud) and not raising false alarms.
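To see this in numbers, here is a minimal sketch (assuming NumPy and scikit-learn, with counts taken from the scenario above) of how the "always legitimate" model scores:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 10,000 transactions: 100 fraudulent (label 1), 9,900 legitimate (label 0)
y_true = np.array([1] * 100 + [0] * 9_900)

# Naive model: always predict "legitimate"
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- catches zero fraud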


The Confusion Matrix: Foundation of All Metrics 📋

Before we dive into Precision, Recall, and F1, we need to understand the Confusion Matrix—the foundation that makes these metrics possible.

What is a Confusion Matrix?

A confusion matrix is a table that shows how well your model performs by comparing predicted vs actual values:

                    Predicted
                 Positive  Negative
Actual Positive    TP       FN
       Negative    FP       TN

Key Terms:

  • TP (True Positive): Correctly predicted positive cases ✅
  • TN (True Negative): Correctly predicted negative cases ✅
  • FP (False Positive): Incorrectly predicted as positive (Type I error) ❌
  • FN (False Negative): Incorrectly predicted as negative (Type II error) ❌

Visual Example:

Confusion Matrix Visualization

The confusion matrix breaks down your model's predictions into four categories, revealing where it succeeds and where it fails.
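If you prefer to compute these counts rather than read them off a table, here is a small sketch (assuming NumPy and scikit-learn; the labels are purely illustrative):

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows are actual, columns are predicted, with classes ordered [negative, positive]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=3, FP=1, FN=1, TN=3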


Precision: How Accurate Are Your Positive Predictions? 🎯

Definition: Precision measures the proportion of positive predictions that are actually correct.

Formula:

Precision = TP / (TP + FP)

Interpretation:

  • High Precision: When you predict positive, you're usually right ✅
  • Low Precision: Many of your positive predictions are wrong ❌

Example:

Model predictions:
- Predicted 50 fraud cases
- Actually fraudulent: 40
- False alarms: 10

Precision = 40 / (40 + 10) = 40 / 50 = 0.80 = 80%
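As a quick sanity check, the same calculation in a few lines of Python (counts taken from the example above):

tp, fp = 40, 10               # correctly flagged frauds, false alarms
precision = tp / (tp + fp)
print(f"Precision = {precision:.2f}")  # 0.80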

When to optimize for Precision:

  • Cost of false positives is high (e.g., blocking legitimate customers)
  • Limited resources to investigate positive predictions
  • Spam detection: Better to let some spam through than block legitimate emails

The Precision Question:

"Of all the cases I flagged as positive, how many were actually positive?"


Recall: How Many Positives Did You Find? 🔍

Definition: Recall (also called Sensitivity or True Positive Rate) measures the proportion of actual positives that you correctly identified.

Formula:

Recall = TP / (TP + FN)

Interpretation:

  • High Recall: You catch most of the positive cases ✅
  • Low Recall: You miss many positive cases ❌

Example:

Actual fraud cases: 100
Model found: 80
Missed: 20

Recall = 80 / (80 + 20) = 80 / 100 = 0.80 = 80%
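And the matching check for Recall (counts taken from the example above):

tp, fn = 80, 20               # frauds caught, frauds missed
recall = tp / (tp + fn)
print(f"Recall = {recall:.2f}")  # 0.80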

When to optimize for Recall:

  • Cost of false negatives is high (e.g., missing cancer diagnosis)
  • Finding all positives is critical (e.g., security threats)
  • Medical diagnosis: Better to have false alarms than miss real cases

The Recall Question:

"Of all the actual positive cases, how many did I find?"


The Precision-Recall Trade-Off ⚖️

The fundamental tension: with an imperfect model, you generally cannot maximize Precision and Recall at the same time; pushing one up usually pulls the other down.

Why? Because they measure different things:

  • Precision: Quality of your positive predictions
  • Recall: Coverage of actual positives

Visual Example:

Precision-Recall Trade-off

Common Scenarios:

Scenario 1: High Precision, Low Recall

Precision = 95% (very few false positives)
Recall = 30% (misses 70% of positives)

Use case: Email spam filter
- Better to let some spam through
- Don't want to block important emails

Scenario 2: Low Precision, High Recall

Precision = 40% (many false positives)
Recall = 95% (catches almost all positives)

Use case: Medical screening
- Better to have false alarms
- Can't afford to miss real cases

Scenario 3: Balanced

Precision = 75%
Recall = 75%

Use case: General classification
- Good balance for most applications

The precision-recall trade-off forces you to choose based on your specific use case. There's no one-size-fits-all solution.


F1 Score: The Harmonic Mean 🎵

Definition: F1 score is the harmonic mean of Precision and Recall, providing a single metric that balances both.

Formula:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Why Harmonic Mean?

  • Harmonic mean penalizes extreme values
  • If either Precision or Recall is low, F1 will be low
  • Encourages balanced performance

Example:

Precision = 0.80
Recall = 0.60

F1 = 2 × (0.80 × 0.60) / (0.80 + 0.60)
   = 2 × 0.48 / 1.40
   = 0.96 / 1.40
   = 0.686 ≈ 69%
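A quick verification of this example, plus a comparison with the arithmetic mean to show why the harmonic mean is the stricter choice:

precision, recall = 0.80, 0.60

f1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 = {f1:.3f}")          # 0.686 -- pulled toward the weaker metric

print((precision + recall) / 2)  # 0.70  -- the arithmetic mean hides the gap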

When to use F1:

  • Balanced objective: You care about both Precision and Recall equally
  • Class imbalance: F1 is more informative than accuracy
  • Comparing models: Single metric for easy comparison

F1 Score Properties:

  • Range: 0 to 1 (higher is better)
  • F1 = 1: Perfect Precision and Recall
  • F1 = 0: Either Precision or Recall is 0

Threshold Selection: Moving the Precision-Recall Curve 📈

Key Insight: Changing the classification threshold moves you along the Precision-Recall curve.

How Thresholds Work

Most classification models output a probability score (0 to 1). You then choose a threshold to convert probabilities to binary predictions:

If probability >= threshold: Predict Positive
If probability < threshold: Predict Negative

Example:

Model outputs probabilities:
- Transaction A: 0.95 → Predict Fraud (high confidence)
- Transaction B: 0.65 → Predict Fraud (medium confidence)
- Transaction C: 0.30 → Predict Legitimate (low confidence)
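In code, the thresholding step is a one-liner; a minimal sketch using the illustrative scores above (assuming NumPy):

import numpy as np

probabilities = np.array([0.95, 0.65, 0.30])  # scores for transactions A, B, C
threshold = 0.5

predictions = (probabilities >= threshold).astype(int)
print(predictions)  # [1 1 0] -> A and B flagged as fraud, C treated as legitimate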

The Threshold Effect

Low Threshold (e.g., 0.3):

  • More positive predictions
  • Higher Recall (finds more positives)
  • Lower Precision (more false positives)

High Threshold (e.g., 0.8):

  • Fewer positive predictions
  • Lower Recall (misses more positives)
  • Higher Precision (fewer false positives)

Visual Example:

Precision-Recall Curve

Exercise: Threshold Selection

Let's sweep a range of thresholds and see how Precision and Recall respond:

# Example: Fraud detection model
# Lower threshold = more predictions = higher recall, lower precision
# Higher threshold = fewer predictions = lower recall, higher precision
import numpy as np

# Illustrative inputs -- in practice these come from your model and test set
probabilities = np.array([0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10])
actual = np.array([1, 1, 0, 1, 0, 0, 1, 0])

thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
results = []

for threshold in thresholds:
    # Predict positive if probability >= threshold
    predictions = (probabilities >= threshold).astype(int)

    # Count confusion-matrix cells
    tp = np.sum((predictions == 1) & (actual == 1))
    fp = np.sum((predictions == 1) & (actual == 0))
    fn = np.sum((predictions == 0) & (actual == 1))

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    results.append({
        'threshold': threshold,
        'precision': precision,
        'recall': recall,
        'f1': f1
    })

for row in results:
    print(row)

Expected Pattern:

  • As threshold increases → Precision increases, Recall decreases
  • As threshold decreases → Precision decreases, Recall increases
  • F1 peaks at some intermediate threshold

Threshold selection is like adjusting a dial—turn it one way for precision, the other for recall. The PR curve shows you all possible trade-offs.


Class Imbalance Effects: Why Accuracy Fails 📉

The Problem: When classes are imbalanced, accuracy becomes a misleading metric.

Example: Rare Disease Detection

Total patients: 10,000
Disease present: 50 (0.5%)
Disease absent: 9,950 (99.5%)

Model A: Always predict "no disease"

TP = 0, TN = 9,950, FP = 0, FN = 50
Accuracy = (0 + 9,950) / 10,000 = 99.5% ✅
Precision = 0 / (0 + 0) = undefined
Recall = 0 / (0 + 50) = 0% ❌
F1 = 0 ❌

Model B: Actually detects disease

TP = 40, TN = 9,900, FP = 50, FN = 10
Accuracy = (40 + 9,900) / 10,000 = 99.4% ✅
Precision = 40 / (40 + 50) = 44.4%
Recall = 40 / (40 + 10) = 80% ✅
F1 = 2 × (0.444 × 0.80) / (0.444 + 0.80) = 57.1%
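Here is a small sketch that reproduces Model B's numbers with scikit-learn (the array layout is an assumption; only the four cell counts matter):

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 50 diseased patients (40 caught, 10 missed) and 9,950 healthy (50 false alarms)
y_true = np.array([1] * 50 + [0] * 9_950)
y_pred = np.array([1] * 40 + [0] * 10 + [1] * 50 + [0] * 9_900)

print(accuracy_score(y_true, y_pred))   # 0.994
print(precision_score(y_true, y_pred))  # ~0.444
print(recall_score(y_true, y_pred))     # 0.80
print(f1_score(y_true, y_pred))         # ~0.571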

Key Insight:

  • Model A has higher accuracy but is useless (finds no disease)
  • Model B has slightly lower accuracy but is actually useful (finds 80% of cases)

Why Precision, Recall, and F1 Matter:

  • They focus on the positive class (the rare, important class)
  • They reveal model performance that accuracy hides
  • They guide threshold selection for imbalanced problems

PR Curves vs Single F1 Point 📊

Precision-Recall (PR) Curve

A PR curve plots Precision (y-axis) vs Recall (x-axis) for different threshold values.

What it shows:

  • All possible trade-offs between Precision and Recall
  • Area Under Curve (AUC-PR): Overall model performance
  • Optimal threshold: Point closest to (1, 1) or highest F1

Visual Example:

PR Curve with F1 Points

Single F1 Point

A single F1 score represents one point on the PR curve (at a specific threshold).

Limitations:

  • Doesn't show the full trade-off space
  • Depends on threshold selection
  • May not reflect model's true potential

When to use each:

  • PR Curve: Model development, threshold selection, comparing models
  • Single F1: Production monitoring, reporting, when threshold is fixed
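In practice you rarely sweep thresholds by hand; scikit-learn's precision_recall_curve traces the whole curve for you. A minimal sketch (the scores and labels are illustrative placeholders):

import numpy as np
from sklearn.metrics import precision_recall_curve, auc

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
print(f"AUC-PR = {auc(recall, precision):.3f}")

# A single F1 score is just one point on this curve; find the F1-maximizing threshold
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])  # the last precision/recall pair has no threshold
print(f"Best threshold ≈ {thresholds[best]:.2f} with F1 = {f1[best]:.3f}")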

Real-World Application: Overlay Analysis and Point of Productivity 🏭

Overlay Analysis

Concept: Overlay analysis picks thresholds that optimize Precision, Recall, or F1 based on business objectives.

Process:

  1. Generate PR curve for your model
  2. Identify business constraints (e.g., max false positive rate)
  3. Select threshold that maximizes objective within constraints
  4. Deploy model with selected threshold

Example:

import numpy as np


def overlay_analysis(probabilities, actual, max_fp_rate=0.05):
    """
    Find optimal threshold using overlay analysis

    Constraints:
    - False positive rate <= max_fp_rate
    - Maximize F1 score
    """
    thresholds = np.linspace(0, 1, 100)
    best_threshold = 0.5
    best_f1 = 0

    for threshold in thresholds:
        predictions = (probabilities >= threshold)
        tp = np.sum((predictions == 1) & (actual == 1))
        fp = np.sum((predictions == 1) & (actual == 0))
        fn = np.sum((predictions == 0) & (actual == 1))
        tn = np.sum((predictions == 0) & (actual == 0))

        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0

        # Only consider thresholds that satisfy the false-positive constraint
        if fpr <= max_fp_rate:
            precision = tp / (tp + fp) if (tp + fp) > 0 else 0
            recall = tp / (tp + fn) if (tp + fn) > 0 else 0
            f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

            if f1 > best_f1:
                best_f1 = f1
                best_threshold = threshold

    return best_threshold, best_f1
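Hypothetical usage with illustrative scores and labels (the 5% false-positive cap is just an example constraint):

import numpy as np

probabilities = np.array([0.92, 0.80, 0.75, 0.60, 0.45, 0.30, 0.20, 0.10])
actual = np.array([1, 1, 0, 1, 0, 0, 0, 0])

best_threshold, best_f1 = overlay_analysis(probabilities, actual, max_fp_rate=0.05)
print(f"Selected threshold ≈ {best_threshold:.2f}, F1 = {best_f1:.3f}")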

Point of Productivity

Concept: Point of Productivity computes per-rule F1 score to identify which rules contribute most to overall performance.

Use Case: Rule-based systems where you want to:

  • Identify high-performing rules
  • Remove low-performing rules
  • Optimize rule combinations

Example:

import numpy as np


def point_of_productivity(rules, actual):
    """
    Compute Precision, Recall, and F1 for each rule.

    `rules` maps a rule name to that rule's binary predictions (NumPy array);
    `actual` holds the true binary labels.
    """
    rule_f1_scores = {}

    for rule_name, rule_predictions in rules.items():
        tp = np.sum((rule_predictions == 1) & (actual == 1))
        fp = np.sum((rule_predictions == 1) & (actual == 0))
        fn = np.sum((rule_predictions == 0) & (actual == 1))

        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        rule_f1_scores[rule_name] = {
            'precision': precision,
            'recall': recall,
            'f1': f1
        }

    return rule_f1_scores
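Hypothetical usage with two toy rules scored against the same labels (rule names and arrays are made up for illustration):

import numpy as np

actual = np.array([1, 0, 1, 1, 0, 0, 1, 0])
rules = {
    'rule_high_amount': np.array([1, 0, 1, 0, 0, 1, 1, 0]),
    'rule_new_account': np.array([1, 1, 1, 1, 0, 0, 1, 0]),
}

for name, scores in point_of_productivity(rules, actual).items():
    print(name, scores)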

Best Practices for Using Precision, Recall, and F1 ✅

1. Choose Metrics Based on Use Case

High Precision:

  • Spam detection
  • Content moderation
  • Quality control

High Recall:

  • Medical diagnosis
  • Security threats
  • Fraud detection (when cost of missing is high)

Balanced (F1):

  • General classification
  • When both errors matter equally

2. Always Report Multiple Metrics

Don't rely on a single metric. Report:

  • Precision
  • Recall
  • F1
  • Confusion Matrix (for full picture)

3. Use PR Curves for Threshold Selection

  • Plot PR curve for your model
  • Identify business constraints
  • Select threshold that optimizes your objective
  • Document your threshold selection rationale

4. Consider Class Imbalance

  • Accuracy is misleading with imbalanced classes
  • Use Precision, Recall, and F1 instead
  • Consider weighted F1 for multi-class problems
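For the multi-class case, a minimal sketch of weighted F1 with scikit-learn (labels are illustrative; 'weighted' averages per-class F1 by class support):

import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 2])

print(f1_score(y_true, y_pred, average='weighted'))  # support-weighted mean of per-class F1
print(f1_score(y_true, y_pred, average='macro'))     # unweighted mean, for comparison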

5. Monitor Metrics in Production

  • Track Precision, Recall, and F1 over time
  • Set up alerts for metric degradation
  • Retrain when metrics drop below thresholds

Summary Table 📋

Metric      Formula           Focus                             Use When
Precision   TP / (TP + FP)    Quality of positive predictions   False positives are costly
Recall      TP / (TP + FN)    Coverage of actual positives      False negatives are costly
F1 Score    2PR / (P + R)     Balance of Precision and Recall   Need balanced performance

Final Thoughts 🌟

Precision, Recall, and F1 are fundamental metrics that reveal the true performance of classification models—especially when dealing with imbalanced classes. Understanding their trade-offs and how threshold selection affects them is crucial for building production ML systems.

Key Takeaways:

  • Precision measures how accurate your positive predictions are ✅
  • Recall measures how many positives you find ✅
  • F1 balances both metrics ✅
  • Threshold selection moves you along the PR curve ✅
  • Class imbalance makes accuracy misleading—use Precision, Recall, and F1 instead ✅

Your model's performance depends on your objectives. Choose your metrics wisely! 🎯

Tomorrow's Preview: Day 20 - Confidence intervals for proportions (Wilson score, Clopper-Pearson), where we'll learn how to put error bars around percentages! 📊🎯


How to teach computers to read and evaluate expressions step by step — by tokenizing text, enforcing operator precedence, and converting rules to postfix (RPN) form for speed, clarity and consistency.