Data Science

30 articles about Data Science

Sughosh Dixit, 2025-11-01

Day 1 — Boolean Logic to Numbers: AND as min, OR as max

Extending Boolean rules to graded (0–1) degrees of truth by replacing AND with the minimum operator (the Gödel t-norm) and OR with the maximum operator (its dual t-conorm).

10 min read

Sughosh Dixit, 2025-11-10

Day 10 — Isolation Forest: Finding Outliers by Getting Lost in the Woods

Isolation Forest isolates anomalies by randomly partitioning feature space; points that require only a few splits are suspicious. This guide walks through the intuition, math, tuning tips, and practical tooling.

10 min read

Sughosh Dixit, 2025-11-11

Day 11 — Kernel Density Estimation: Smoothing Out the Bumps

Kernel Density Estimation transforms discrete data into smooth, continuous distributions by placing a 'hill' at each data point. Master bandwidth selection, understand the bias-variance tradeoff, and learn when KDE beats histograms for comparing groups.

14 min read

Sughosh Dixit, 2025-11-12

Day 12 — Binning and Deciles: Taming Continuous Chaos

Transform overwhelming continuous data into digestible insights with binning. Master equal-width and equal-frequency binning, understand deciles (10 equal-frequency bins), and create powerful cross-tabs and heatmaps to reveal patterns in your data.

16 min read

Sughosh Dixit, 2025-11-13

Day 13 — Stratified Sampling: The Smart Way to Sample

Stratified sampling divides your population into groups and samples from each separately, guaranteeing coverage of important subgroups and dramatically reducing variance. Learn proportional, equal, and Neyman allocation strategies to maximize precision.

17 min read

Sughosh Dixit, 2025-11-14

Day 14 — Hypergeometric Distribution & Sample Size: Finding Needles in Haystacks

When detecting rare events in finite populations, normal approximations fail spectacularly. Learn how the hypergeometric distribution solves the rare positive detection problem, calculating exact sample sizes for quality control, fraud detection, and rare disease screening.

15 min read

Sughosh Dixit, 2025-11-15

Day 15 — Percentiles as Thresholds: Drawing Lines in the Sand

Percentiles provide powerful, interpretable thresholds for decision-making without distributional assumptions. Learn how to use percentiles as cutoffs for loan approvals, performance rankings, and anomaly detection—turning any feature into a ranked decision rule.

19 min read

Sughosh Dixit, 2025-11-16

Day 16: Knee/Elbow Detection - Finding the Sweet Spot

Elbow detection finds the sweet spot where marginal returns drop sharply—the perfect stopping point for resource allocation, customer targeting, and clustering. Learn how second derivatives reveal where 'more' becomes 'enough'.

17 min read

Sughosh Dixit, 2025-11-17

Day 17 — Robust Ratios and Division by Zero

Ratios are powerful but dangerous — division by values near zero can make them explode! This post shows how to design stable ratio features, guard against tiny denominators, and choose a principled epsilon using robust measures like the Median Absolute Deviation (MAD).

12 min read

Sughosh Dixit, 2025-11-18

Day 18: Time and Recurrence Math - When Calendars Attack Your Data

Time is messy, calendars are inconsistent, and recurrence patterns hide traps! This post explores the common pitfalls of time-based data analysis—from unequal months to shifting weekly patterns—and provides robust Python code to handle them.

5 min read

Sughosh Dixit, 2025-11-19

Day 19: Precision, Recall, and F1 as Objectives

Precision, Recall, and F1 are fundamental evaluation metrics in classification. This post explores their mathematical foundations, trade-offs, and how threshold selection impacts these metrics—essential knowledge for building production ML systems.

13 min read

Sughosh Dixit, 2025-11-02

Day 2 — Expressions as Algebra: Tokens, Precedence, and Infix → Postfix

How to teach computers to read and evaluate expressions step by step — by tokenizing text, enforcing operator precedence, and converting rules to postfix (RPN) form for speed, clarity and consistency.

7 min read

Sughosh Dixit, 2025-11-20

Day 20: Two-Feature Decision Surfaces from Rule Expressions

Visualize how rule expressions create decision surfaces in two-dimensional feature space. Learn to understand half-space intersections, orthogonal partitions, and how threshold changes affect classification regions.

10 min read

Sughosh Dixit, 2025-11-21

Day 21: Contingency Tables and Bin-Wise Uplift

Learn to quantify uplift and effectiveness across bins and segments using contingency tables. Understand cell counts, rates, marginalization, and how to avoid Simpson's paradox when analyzing bin-wise trends.

10 min read

Sughosh Dixit, 2025-11-22

Day 22: Set Theory and Venn Diagrams for Comparisons

Learn to measure overlap between sets using set theory fundamentals. Understand cardinalities, intersections, unions, and the Jaccard index: essential tools for comparing versions, thresholds, and captured events.

12 min read

Sughosh Dixit, 2025-11-23

Day 23: Label Post-Processing: Partitioning Flagged vs Passed Mathematically

Learn to view event tagging as rule-based classification. Understand indicator functions, piecewise partitions, and priority-level conditioning—essential tools for mathematically partitioning events into Flagged and Passed categories.

11 min read

Sughosh Dixit, 2025-11-24

Day 24: Risk Segmentation - Priority Tiers as Priors and Costs

Learn to interpret priority tiers as prior beliefs or cost weights. Understand cost-sensitive thresholding, Bayes optimal decision rules, and how per-tier thresholds change labeling geometry through iso-cost analysis.

11 min read

Sughosh Dixit, 2025-11-25

Day 25: Configuration Pairing Logic and Equivalence Classes

Learn to pair semantically complementary configuration sets like Premium/Standard and Verified/Unverified. Understand equivalence relations, pairing consistency, and how mapping functions ensure aligned parameters across pairs.

10 min read

Sughosh Dixit, 2025-11-26

Day 26: From Rules to Fuzzy Logic - Why Min-Max Matters

Compare min/max logic with product t-norm and Łukasiewicz variants. Understand t-norm families, boundary behaviors, and why min/max yields conservative idempotent aggregation for rule strength evaluation.

9 min read

Sughosh Dixit, 2025-11-27

Day 27: Quantile Stability, Ties, and Small Samples

Master practical considerations for computing empirical quantiles. Understand how ties, discrete samples, and different interpolation schemes affect quantile estimates and threshold repeatability.

9 min read

Sughosh Dixit, 2025-11-28

Day 28: Robust Imputation and Numeric Coercion

Understand how numeric coercion and NA handling affect data distributions. Learn the impact of different imputation strategies on mean, variance, and quantiles for threshold-based rule evaluation.

8 min read

Sughosh Dixit, 2025-11-29

Day 29: Putting It All Together - Constructing a Stratified Audit Plan

Synthesize everything from quantile thresholds to strata to sample sizes. Learn to construct a complete stratified audit plan with cutoffs, sample sizes, and investigation workflows.

9 min read

Sughosh Dixit, 2025-11-03

Day 3 — Percentiles and Quantiles: Understanding Data Distributions

Master percentiles and quantiles—simple yet powerful tools to describe data distributions. From the empirical CDF to interpolation methods, learn how these robust measures help in thresholding, outlier detection, and monitoring.

8 min read

Sughosh Dixit, 2025-11-30

Day 30: A Mathematical Blueprint for Robust Decision Frameworks

A comprehensive mathematical summary mapping nonparametric statistics, robust measures, sampling theory, decision metrics, set operations, and fuzzy aggregation to their pipeline implementations.

11 min read

Sughosh Dixit, 2025-11-04

Day 4 — Percentile Rank and Stratifications

Percentile ranks turn any numeric feature into a simple score in [0,1] that says 'what fraction of the data is at or below this value.' Learn how to combine ranks with min/max and create strata for sampling, prioritization, or analysis.

7 min read

Sughosh Dixit, 2025-11-05

Day 5 — Robust Location and Scale: Median & MAD (Simple Guide + Worked Example)

The mean and standard deviation (SD) can be swayed by outliers like reeds in the wind — a single extreme value can pull them off course. The median and MAD (Median Absolute Deviation), on the other hand, are sturdy rocks in the statistical stream. They resist distortion and give reliable 'center' and 'spread' estimates, even when your data are skewed or heavy-tailed.

6 min read

Sughosh Dixit, 2025-11-06

Day 6 — Distribution Shape: Skewness and Kurtosis (Simple Guide + Visuals)

Skewness tells you if data lean left or right (asymmetry). Kurtosis tells you how heavy the tails are (how many extremes you see). Two datasets can share the same mean and variance but look completely different — shape features reveal the hidden story.

7 min read

Sughosh Dixit, 2025-11-07

Day 7 — Boxplots, IQR, and Tukey Fences

Boxplots are the simplest visual way to spot outliers. They rely on the IQR (Interquartile Range) — the middle 50% of your data — and build 'fences' around it. Points outside these fences are suspected outliers. It's simple, robust, and doesn't assume your data are Normal.

6 min read

Sughosh Dixit, 2025-11-08

Day 8 — Adjusted Boxplots & Medcouple

Adjusted boxplots combine Tukey fences with the medcouple skewness measure so long tails do not trigger false outliers.

5 min read

Sughosh Dixit, 2025-11-09

Day 9 — Z-Scores vs Robust Z-Scores

Compare classical z-scores built on mean and standard deviation with robust z-scores powered by the median and MAD to see why robustness matters when data gets messy.

6 min read