30 articles about Data Science
Extend Boolean rules to graded (0–1) degrees of truth by replacing AND with the minimum operator (the Gödel t-norm) and OR with the maximum operator (its dual t-conorm).
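For a quick feel of the idea, here is a minimal sketch (not code from the article) that grades two rule conditions in [0, 1] and combines them with min and max; the truth values are made up.

```python
import numpy as np

# Hypothetical graded truth values in [0, 1] for two rule conditions.
a = np.array([0.9, 0.4, 0.7])
b = np.array([0.6, 0.8, 0.2])

fuzzy_and = np.minimum(a, b)  # graded AND: Gödel t-norm
fuzzy_or = np.maximum(a, b)   # graded OR: dual t-conorm
print(fuzzy_and, fuzzy_or)
```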

Isolation Forest isolates anomalies by randomly partitioning feature space; points that require only a few splits are suspicious. This guide walks through the intuition, math, tuning tips, and practical tooling.
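A minimal sketch of the mechanism with scikit-learn's IsolationForest; the synthetic data and the contamination value are illustrative, not from the article.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),   # bulk of the data
               rng.normal(6, 1, (5, 2))])    # a few far-away points

# contamination is the assumed share of anomalies; tune it for your data
iso = IsolationForest(n_estimators=200, contamination=0.03, random_state=0)
labels = iso.fit_predict(X)        # -1 = anomaly, 1 = inlier
scores = iso.decision_function(X)  # lower score = fewer splits needed to isolate
print((labels == -1).sum(), "points flagged")
```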

Kernel Density Estimation turns a discrete sample into a smooth, continuous density estimate by placing a 'hill' at each data point. Master bandwidth selection, understand the bias-variance tradeoff, and learn when KDE beats histograms for comparing groups.
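As a hedged illustration, SciPy's gaussian_kde places one Gaussian 'hill' per observation; the bw_method value below is an arbitrary choice for demonstration, not a recommendation.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
sample = rng.normal(0, 1, 500)

# bw_method scales the bandwidth: smaller = wigglier, larger = smoother
kde = gaussian_kde(sample, bw_method=0.3)
grid = np.linspace(-4, 4, 200)
density = kde(grid)   # smooth estimate built from one 'hill' per data point
print(round(density.max(), 3))
```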

Transform overwhelming continuous data into digestible insights with binning. Master equal-width and equal-frequency binning, understand deciles (10 equal-frequency bins), and create powerful cross-tabs and heatmaps to reveal patterns in your data.
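A small pandas sketch (synthetic data, illustrative column names) of equal-width bins, equal-frequency deciles, and a bin-wise rate table.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"income": rng.lognormal(10, 0.5, 1000),
                   "defaulted": rng.integers(0, 2, 1000)})

df["income_eqwidth"] = pd.cut(df["income"], bins=5)   # equal-width bins
df["income_decile"] = pd.qcut(df["income"], q=10)     # equal-frequency deciles (deciles)
rates = pd.crosstab(df["income_decile"], df["defaulted"], normalize="index")
print(rates)
```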

Stratified sampling divides your population into groups and samples from each separately, guaranteeing coverage of important subgroups and dramatically reducing variance. Learn proportional, equal, and Neyman allocation strategies to maximize precision.
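A minimal sketch of proportional allocation with pandas; the segment shares and the 5% sampling fraction are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
pop = pd.DataFrame({"segment": rng.choice(["A", "B", "C"], size=10_000, p=[0.7, 0.2, 0.1]),
                    "value": rng.normal(size=10_000)})

# Proportional allocation: every stratum is sampled at the same fraction,
# so each subgroup is guaranteed to appear in the sample.
sample = pop.groupby("segment").sample(frac=0.05, random_state=0)
print(sample["segment"].value_counts())
```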

When detecting rare events in finite populations, normal approximations fail spectacularly. Learn how the hypergeometric distribution solves the rare positive detection problem, calculating exact sample sizes for quality control, fraud detection, and rare disease screening.
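A sketch of the 'at least one positive' calculation with SciPy's hypergeom; the population size, number of rare positives, and target confidence are made-up inputs.

```python
from scipy.stats import hypergeom

N, K, confidence = 10_000, 50, 0.95   # population, assumed rare positives, target detection prob

def min_sample_size(N, K, confidence):
    """Smallest sample size n such that P(at least one positive) >= confidence."""
    for n in range(1, N + 1):
        # P(at least one positive) = 1 - P(zero positives in a sample of size n)
        if 1 - hypergeom.pmf(0, N, K, n) >= confidence:
            return n
    return N

print(min_sample_size(N, K, confidence))
```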

Percentiles provide powerful, interpretable thresholds for decision-making without distributional assumptions. Learn how to use percentiles as cutoffs for loan approvals, performance rankings, and anomaly detection—turning any feature into a ranked decision rule.
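A tiny sketch: turn a score into a decision rule by cutting at the 95th percentile; the percentile and the synthetic scores are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
risk_score = rng.gamma(2.0, 1.0, 5000)

# Flag anything above the 95th percentile -- no distributional assumption needed.
cutoff = np.percentile(risk_score, 95)
flagged = risk_score > cutoff
print(round(cutoff, 3), "fraction flagged:", flagged.mean())
```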

Elbow detection finds the sweet spot where marginal returns drop sharply—the perfect stopping point for resource allocation, customer targeting, and clustering. Learn how second derivatives reveal where 'more' becomes 'enough'.
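A minimal second-difference sketch on a made-up diminishing-returns curve; real curves usually need smoothing before this works reliably.

```python
import numpy as np

# Hypothetical diminishing-returns curve, e.g. cumulative gain vs. effort.
gain = np.array([10.0, 35.0, 55.0, 70.0, 73.0, 75.0, 76.0, 77.0])

first = np.diff(gain)             # marginal gain per extra unit of effort
second = np.diff(first)           # change in marginal gain (discrete 2nd derivative)
elbow = np.argmin(second) + 1     # most negative curvature = sharpest bend
print("elbow at index", elbow, "with gain", gain[elbow])
```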

Ratios are powerful but dangerous — division by values near zero can make them explode! This post shows how to design stable ratio features, guard against tiny denominators, and choose a principled epsilon using robust measures like the Median Absolute Deviation (MAD).
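A hedged sketch of the idea, assuming a nonnegative denominator (e.g. counts or amounts); the 0.1 scaling factor on the MAD is an arbitrary illustration, not a recommendation.

```python
import numpy as np

def safe_ratio(num, den, k=0.1):
    """Ratio with the denominator floored at k * MAD(den) to avoid blow-ups."""
    num, den = np.asarray(num, float), np.asarray(den, float)
    mad = np.median(np.abs(den - np.median(den)))          # robust spread of the denominator
    eps = k * mad if mad > 0 else np.finfo(float).eps      # fallback when MAD is zero
    return num / np.maximum(den, eps)

print(safe_ratio([5.0, 3.0], [0.0001, 10.0]))   # tiny denominator no longer explodes
```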

Time is messy, calendars are inconsistent, and recurrence patterns hide traps! This post explores the common pitfalls of time-based data analysis—from unequal months to shifting weekly patterns—and provides robust Python code to handle them.

Precision, Recall, and F1 are fundamental evaluation metrics in classification. This post explores their mathematical foundations, trade-offs, and how threshold selection impacts these metrics—essential knowledge for building production ML systems.
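A short sketch showing how the same scores yield different precision, recall, and F1 as the threshold moves; labels and scores are synthetic.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

rng = np.random.default_rng(5)
y_true = rng.integers(0, 2, 1000)
y_prob = np.clip(0.6 * y_true + rng.uniform(0, 0.7, 1000), 0, 1)  # noisy scores

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    print(threshold,
          round(precision_score(y_true, y_pred), 3),
          round(recall_score(y_true, y_pred), 3),
          round(f1_score(y_true, y_pred), 3))
```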

How to teach computers to read and evaluate expressions step by step — by tokenizing text, enforcing operator precedence, and converting rules to postfix (RPN) form for speed, clarity and consistency.
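A compact sketch of the pipeline, tokenizing a numeric expression, converting it to RPN with the shunting-yard algorithm, and evaluating it with a stack; it handles only +, -, *, / and parentheses.

```python
import operator

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def tokenize(text):
    """Split an expression into numbers, operators, and parentheses."""
    tokens, num = [], ""
    for ch in text.replace(" ", ""):
        if ch.isdigit() or ch == ".":
            num += ch
        else:
            if num:
                tokens.append(num)
                num = ""
            tokens.append(ch)
    if num:
        tokens.append(num)
    return tokens

def to_rpn(tokens):
    """Shunting-yard: emit higher-or-equal precedence operators before pushing a new one."""
    output, stack = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            while stack and stack[-1] in PRECEDENCE and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]:
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()
        else:
            output.append(tok)
    return output + stack[::-1]

def eval_rpn(rpn):
    """Evaluate a postfix token list with a value stack."""
    stack = []
    for tok in rpn:
        if tok in OPS:
            b, a = stack.pop(), stack.pop()
            stack.append(OPS[tok](a, b))
        else:
            stack.append(float(tok))
    return stack[0]

print(eval_rpn(to_rpn(tokenize("3 + 4 * (2 - 1)"))))  # 7.0
```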

Visualize how rule expressions create decision surfaces in two-dimensional feature space. Learn to understand half-space intersections, orthogonal partitions, and how threshold changes affect classification regions.

Learn to quantify uplift and effectiveness across bins and segments using contingency tables. Understand cell counts, rates, marginalization, and how to avoid Simpson's paradox when analyzing bin-wise trends.
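A minimal pandas sketch of cell counts, marginals, and per-bin rates with crosstab, on tiny made-up data.

```python
import pandas as pd

df = pd.DataFrame({"bin": ["low", "low", "high", "high", "high", "low"],
                   "flagged": [0, 1, 1, 1, 0, 0]})

table = pd.crosstab(df["bin"], df["flagged"], margins=True)       # counts plus marginals
rates = pd.crosstab(df["bin"], df["flagged"], normalize="index")  # per-bin rates
print(table, rates, sep="\n\n")
```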

Learn to measure overlap between sets using set theory fundamentals. Understand cardinalities, intersections, unions, and the Jaccard index—essential tools for comparing versions, thresholds, and events captured.
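A tiny sketch of the overlap measures on two made-up sets of event IDs.

```python
flagged_v1 = {"evt_1", "evt_2", "evt_5", "evt_9"}
flagged_v2 = {"evt_2", "evt_5", "evt_7"}

intersection = flagged_v1 & flagged_v2
union = flagged_v1 | flagged_v2
jaccard = len(intersection) / len(union)   # 0 = disjoint, 1 = identical
print(len(intersection), len(union), round(jaccard, 3))
```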

Learn to view event tagging as rule-based classification. Understand indicator functions, piecewise partitions, and priority-level conditioning—essential tools for mathematically partitioning events into Flagged and Passed categories.
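A small sketch of indicator-style tagging with an assumed per-priority threshold table; the column names and cutoffs are illustrative only.

```python
import numpy as np
import pandas as pd

events = pd.DataFrame({"score": [0.2, 0.8, 0.55, 0.9],
                       "priority": ["low", "high", "low", "high"]})

# Indicator function: 1 if the rule fires, 0 otherwise; the cutoff depends on priority.
thresholds = {"low": 0.7, "high": 0.5}   # assumed per-tier cutoffs
indicator = events.apply(lambda r: int(r["score"] >= thresholds[r["priority"]]), axis=1)
events["label"] = np.where(indicator == 1, "Flagged", "Passed")
print(events)
```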

Learn to interpret priority tiers as prior beliefs or cost weights. Understand cost-sensitive thresholding, Bayes optimal decision rules, and how per-tier thresholds change labeling geometry through iso-cost analysis.
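A sketch of the standard cost-sensitive cutoff: flag when p * c_fn >= (1 - p) * c_fp, i.e. p >= c_fp / (c_fp + c_fn). The per-tier costs below are made up.

```python
# (false-positive cost, false-negative cost) per tier -- assumed values for illustration
tier_costs = {"high": (1.0, 20.0), "medium": (1.0, 5.0), "low": (1.0, 1.0)}

for tier, (c_fp, c_fn) in tier_costs.items():
    threshold = c_fp / (c_fp + c_fn)   # Bayes-optimal cutoff for a calibrated score
    print(f"{tier}: flag when score >= {threshold:.3f}")
```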

Learn to pair semantically complementary configuration sets like Premium/Standard and Verified/Unverified. Understand equivalence relations, pairing consistency, and how mapping functions ensure aligned parameters across pairs.

Compare min/max logic with product t-norm and Łukasiewicz variants. Understand t-norm families, boundary behaviors, and why min/max yields conservative idempotent aggregation for rule strength evaluation.
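A minimal sketch comparing the three t-norms on the same inputs; note that only min returns 0.7 for 0.7 AND 0.7, i.e. it is idempotent.

```python
import numpy as np

def goedel(a, b): return np.minimum(a, b)                   # Gödel / min t-norm (idempotent)
def product(a, b): return a * b                             # product t-norm
def lukasiewicz(a, b): return np.maximum(a + b - 1.0, 0.0)  # Łukasiewicz t-norm

a, b = 0.7, 0.7
print(goedel(a, b), product(a, b), lukasiewicz(a, b))   # 0.7, 0.49, 0.4
```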

Master practical considerations for computing empirical quantiles. Understand how ties, discrete samples, and different interpolation schemes affect quantile estimates and threshold repeatability.
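A quick sketch of how the interpolation choice moves a quantile on small, tied data; the `method` argument requires a recent NumPy (older versions call it `interpolation`).

```python
import numpy as np

x = np.array([1, 2, 2, 2, 3, 7, 9])   # small sample with ties

# Different interpolation rules can give different thresholds on small or tied data.
for method in ("linear", "lower", "higher", "nearest", "midpoint"):
    print(method, np.quantile(x, 0.9, method=method))
```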

Understand how numeric coercion and NA handling affect data distributions. Learn the impact of different imputation strategies on mean, variance, and quantiles for threshold-based rule evaluation.
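A small pandas sketch showing how coercion plus different imputation choices shift the mean and a quantile; the raw strings are made up.

```python
import pandas as pd

raw = pd.Series(["10", "12.5", "n/a", "15", "", "80"])
values = pd.to_numeric(raw, errors="coerce")   # non-numeric strings become NaN

print("drop NA:    ", values.dropna().mean(), values.dropna().quantile(0.9))
print("fill zero:  ", values.fillna(0).mean(), values.fillna(0).quantile(0.9))
print("fill median:", values.fillna(values.median()).mean(),
      values.fillna(values.median()).quantile(0.9))
```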

Synthesize everything from quantile thresholds to strata to sample sizes. Learn to construct a complete stratified audit plan with cutoffs, sample sizes, and investigation workflows.

Master percentiles and quantiles—simple yet powerful tools to describe data distributions. From the empirical CDF to interpolation methods, learn how these robust measures help in thresholding, outlier detection, and monitoring.
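A minimal empirical-CDF sketch, 'what fraction of observations are at or below this value', on made-up data.

```python
import numpy as np

def ecdf(sample, value):
    """Empirical CDF: fraction of observations at or below `value`."""
    sample = np.sort(np.asarray(sample, float))
    return np.searchsorted(sample, value, side="right") / len(sample)

data = [3, 7, 7, 10, 15, 21]
print(ecdf(data, 7), ecdf(data, 14))   # 0.5, ~0.667
```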

A comprehensive mathematical summary mapping nonparametric statistics, robust measures, sampling theory, decision metrics, set operations, and fuzzy aggregation to their pipeline implementations.

Percentile ranks turn any numeric feature into a simple score in [0,1] that says 'what fraction of the data is at or below this value.' Learn how to combine ranks with min/max and create strata for sampling, prioritization, or analysis.
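A tiny pandas sketch: percentile ranks via rank(pct=True), then cut into strata; the bin edges and stratum labels are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"amount": [120, 5, 300, 45, 999, 60]})

# rank(pct=True) gives 'fraction of data at or below this value' in (0, 1].
df["pct_rank"] = df["amount"].rank(pct=True)
df["stratum"] = pd.cut(df["pct_rank"], bins=[0, 0.5, 0.9, 1.0],
                       labels=["bottom_half", "middle", "top_decile"])
print(df)
```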

The mean and standard deviation (SD) can be swayed by outliers like reeds in the wind — a single extreme value can pull them off course. The median and MAD (Median Absolute Deviation), on the other hand, are sturdy rocks in the statistical stream. They resist distortion and give reliable 'center' and 'spread' estimates, even when your data are skewed or heavy-tailed.
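A quick sketch of the effect: append one extreme value and compare how much each summary moves.

```python
import numpy as np

clean = np.array([10.0, 11, 9, 10, 12, 10, 11])
dirty = np.append(clean, 500)   # one extreme value

for name, x in (("clean", clean), ("dirty", dirty)):
    mad = np.median(np.abs(x - np.median(x)))   # Median Absolute Deviation
    print(name, "mean:", round(x.mean(), 1), "sd:", round(x.std(), 1),
          "median:", np.median(x), "MAD:", mad)
```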

Skewness tells you if data lean left or right (asymmetry). Kurtosis tells you how heavy the tails are (how many extremes you see). Two datasets can share the same mean and variance but look completely different — shape features reveal the hidden story.
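A small SciPy sketch comparing a symmetric and a right-skewed sample; kurtosis here is excess kurtosis, so roughly 0 for a Normal.

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(7)
symmetric = rng.normal(0, 1, 10_000)
right_skewed = rng.lognormal(0, 1, 10_000)

for name, x in (("normal", symmetric), ("lognormal", right_skewed)):
    print(name, "skew:", round(skew(x), 2), "excess kurtosis:", round(kurtosis(x), 2))
```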

Boxplots are the simplest visual way to spot outliers. They rely on the IQR (Interquartile Range) — the middle 50% of your data — and build 'fences' around it. Points outside these fences are suspected outliers. It's simple, robust, and doesn't assume your data are Normal.
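A minimal sketch of Tukey's fences on made-up data.

```python
import numpy as np

x = np.array([12, 14, 15, 15, 16, 18, 19, 21, 22, 60])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                                     # middle 50% of the data
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # Tukey fences
print("fences:", lower, upper, "suspected outliers:", x[(x < lower) | (x > upper)])
```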

Adjusted boxplots combine Tukey fences with the medcouple skewness measure so long tails do not trigger false outlier flags.
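A hedged sketch using statsmodels' medcouple with the Hubert–Vandervieren style exponential fence adjustment; treat the exact exponents as an assumption to check against the article or the original paper.

```python
import numpy as np
from statsmodels.stats.stattools import medcouple

rng = np.random.default_rng(8)
x = rng.lognormal(0, 0.8, 500)   # right-skewed data with a long tail

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
mc = medcouple(x)                # robust skewness measure in [-1, 1]

# Skew-adjusted fences: stretch the fence on the long-tail side, shrink the other.
if mc >= 0:
    lower, upper = q1 - 1.5 * np.exp(-4 * mc) * iqr, q3 + 1.5 * np.exp(3 * mc) * iqr
else:
    lower, upper = q1 - 1.5 * np.exp(-3 * mc) * iqr, q3 + 1.5 * np.exp(4 * mc) * iqr

print(round(mc, 3), round(lower, 3), round(upper, 3),
      "flagged:", ((x < lower) | (x > upper)).sum())
```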

Compare classical z-scores built on mean and standard deviation with robust z-scores powered by the median and MAD to see why robustness matters when data gets messy.
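A tiny comparison on data with one gross outlier; the 0.6745 factor rescales the MAD so it is comparable to the SD under normality.

```python
import numpy as np

x = np.array([10.0, 11, 9, 10, 12, 10, 11, 500])   # one gross outlier

z_classic = (x - x.mean()) / x.std()                 # the outlier inflates mean and SD,
mad = np.median(np.abs(x - np.median(x)))            # masking its own z-score
z_robust = 0.6745 * (x - np.median(x)) / mad         # robust version exposes it clearly

print(np.round(z_classic, 2))
print(np.round(z_robust, 2))
```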
