30 articles about Data Science
Extend Boolean rules to graded (0–1) degrees of truth by replacing AND with the minimum operator (the Gödel t-norm) and OR with the maximum operator (its dual t-conorm).
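For a quick feel of the idea, here is a minimal sketch (not code from the article) that grades two rule conditions in [0, 1] and combines them with min and max; the truth values are made up.

```python
import numpy as np

# Hypothetical graded truth values in [0, 1] for two rule conditions.
a = np.array([0.9, 0.4, 0.7])
b = np.array([0.6, 0.8, 0.2])

fuzzy_and = np.minimum(a, b)  # graded AND: Gödel t-norm
fuzzy_or = np.maximum(a, b)   # graded OR: dual t-conorm
print(fuzzy_and, fuzzy_or)
```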

Isolation Forest isolates anomalies by randomly partitioning feature space; points that require only a few splits are suspicious. This guide walks through the intuition, math, tuning tips, and practical tooling.
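A minimal sketch of the mechanism with scikit-learn's IsolationForest; the synthetic data and the contamination value are illustrative, not from the article.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),   # bulk of the data
               rng.normal(6, 1, (5, 2))])    # a few far-away points

# contamination is the assumed share of anomalies; tune it for your data
iso = IsolationForest(n_estimators=200, contamination=0.03, random_state=0)
labels = iso.fit_predict(X)        # -1 = anomaly, 1 = inlier
scores = iso.decision_function(X)  # lower score = fewer splits needed to isolate
print((labels == -1).sum(), "points flagged")
```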

Kernel Density Estimation turns a discrete sample into a smooth, continuous density estimate by placing a 'hill' at each data point. Master bandwidth selection, understand the bias-variance tradeoff, and learn when KDE beats histograms for comparing groups.
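As a hedged illustration, SciPy's gaussian_kde places one Gaussian 'hill' per observation; the bw_method value below is an arbitrary choice for demonstration, not a recommendation.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
sample = rng.normal(0, 1, 500)

# bw_method scales the bandwidth: smaller = wigglier, larger = smoother
kde = gaussian_kde(sample, bw_method=0.3)
grid = np.linspace(-4, 4, 200)
density = kde(grid)   # smooth estimate built from one 'hill' per data point
print(round(density.max(), 3))
```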

Transform overwhelming continuous data into digestible insights with binning. Master equal-width and equal-frequency binning, understand deciles (10 equal-frequency bins), and create powerful cross-tabs and heatmaps to reveal patterns in your data.
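A small pandas sketch (synthetic data, illustrative column names) of equal-width bins, equal-frequency deciles, and a bin-wise rate table.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"income": rng.lognormal(10, 0.5, 1000),
                   "defaulted": rng.integers(0, 2, 1000)})

df["income_eqwidth"] = pd.cut(df["income"], bins=5)   # equal-width bins
df["income_decile"] = pd.qcut(df["income"], q=10)     # equal-frequency deciles (deciles)
rates = pd.crosstab(df["income_decile"], df["defaulted"], normalize="index")
print(rates)
```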

Stratified sampling divides your population into groups and samples from each separately, guaranteeing coverage of important subgroups and dramatically reducing variance. Learn proportional, equal, and Neyman allocation strategies to maximize precision.
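A minimal sketch of proportional allocation with pandas; the segment shares and the 5% sampling fraction are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
pop = pd.DataFrame({"segment": rng.choice(["A", "B", "C"], size=10_000, p=[0.7, 0.2, 0.1]),
                    "value": rng.normal(size=10_000)})

# Proportional allocation: every stratum is sampled at the same fraction,
# so each subgroup is guaranteed to appear in the sample.
sample = pop.groupby("segment").sample(frac=0.05, random_state=0)
print(sample["segment"].value_counts())
```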

When detecting rare events in finite populations, normal approximations fail spectacularly. Learn how the hypergeometric distribution solves the rare positive detection problem, calculating exact sample sizes for quality control, fraud detection, and rare disease screening.
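A sketch of the 'at least one positive' calculation with SciPy's hypergeom; the population size, number of rare positives, and target confidence are made-up inputs.

```python
from scipy.stats import hypergeom

N, K, confidence = 10_000, 50, 0.95   # population, assumed rare positives, target detection prob

def min_sample_size(N, K, confidence):
    """Smallest sample size n such that P(at least one positive) >= confidence."""
    for n in range(1, N + 1):
        # P(at least one positive) = 1 - P(zero positives in a sample of size n)
        if 1 - hypergeom.pmf(0, N, K, n) >= confidence:
            return n
    return N

print(min_sample_size(N, K, confidence))
```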

Percentiles provide powerful, interpretable thresholds for decision-making without distributional assumptions. Learn how to use percentiles as cutoffs for loan approvals, performance rankings, and anomaly detection—turning any feature into a ranked decision rule.
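A tiny sketch: turn a score into a decision rule by cutting at the 95th percentile; the percentile and the synthetic scores are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
risk_score = rng.gamma(2.0, 1.0, 5000)

# Flag anything above the 95th percentile -- no distributional assumption needed.
cutoff = np.percentile(risk_score, 95)
flagged = risk_score > cutoff
print(round(cutoff, 3), "fraction flagged:", flagged.mean())
```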

Elbow detection finds the sweet spot where marginal returns drop sharply—the perfect stopping point for resource allocation, customer targeting, and clustering. Learn how second derivatives reveal where 'more' becomes 'enough'.
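A minimal second-difference sketch on a made-up diminishing-returns curve; real curves usually need smoothing before this works reliably.

```python
import numpy as np

# Hypothetical diminishing-returns curve, e.g. cumulative gain vs. effort.
gain = np.array([10.0, 35.0, 55.0, 70.0, 73.0, 75.0, 76.0, 77.0])

first = np.diff(gain)             # marginal gain per extra unit of effort
second = np.diff(first)           # change in marginal gain (discrete 2nd derivative)
elbow = np.argmin(second) + 1     # most negative curvature = sharpest bend
print("elbow at index", elbow, "with gain", gain[elbow])
```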

Ratios are powerful but dangerous — division by values near zero can make them explode! This post shows how to design stable ratio features, guard against tiny denominators, and choose a principled epsilon using robust measures like the Median Absolute Deviation (MAD).
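A hedged sketch of the idea, assuming a nonnegative denominator (e.g. counts or amounts); the 0.1 scaling factor on the MAD is an arbitrary illustration, not a recommendation.

```python
import numpy as np

def safe_ratio(num, den, k=0.1):
    """Ratio with the denominator floored at k * MAD(den) to avoid blow-ups."""
    num, den = np.asarray(num, float), np.asarray(den, float)
    mad = np.median(np.abs(den - np.median(den)))          # robust spread of the denominator
    eps = k * mad if mad > 0 else np.finfo(float).eps      # fallback when MAD is zero
    return num / np.maximum(den, eps)

print(safe_ratio([5.0, 3.0], [0.0001, 10.0]))   # tiny denominator no longer explodes
```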

Time is messy, calendars are inconsistent, and recurrence patterns hide traps! This post explores the common pitfalls of time-based data analysis—from unequal months to shifting weekly patterns—and provides robust Python code to handle them.

Precision, Recall, and F1 are fundamental evaluation metrics in classification. This post explores their mathematical foundations, trade-offs, and how threshold selection impacts these metrics—essential knowledge for building production ML systems.
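A short sketch showing how the same scores yield different precision, recall, and F1 as the threshold moves; labels and scores are synthetic.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

rng = np.random.default_rng(5)
y_true = rng.integers(0, 2, 1000)
y_prob = np.clip(0.6 * y_true + rng.uniform(0, 0.7, 1000), 0, 1)  # noisy scores

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    print(threshold,
          round(precision_score(y_true, y_pred), 3),
          round(recall_score(y_true, y_pred), 3),
          round(f1_score(y_true, y_pred), 3))
```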

How to teach computers to read and evaluate expressions step by step — by tokenizing text, enforcing operator precedence, and converting rules to postfix (RPN) form for speed, clarity and consistency.
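A compact sketch of the pipeline, tokenizing a numeric expression, converting it to RPN with the shunting-yard algorithm, and evaluating it with a stack; it handles only +, -, *, / and parentheses.

```python
import operator

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def tokenize(text):
    """Split an expression into numbers, operators, and parentheses."""
    tokens, num = [], ""
    for ch in text.replace(" ", ""):
        if ch.isdigit() or ch == ".":
            num += ch
        else:
            if num:
                tokens.append(num)
                num = ""
            tokens.append(ch)
    if num:
        tokens.append(num)
    return tokens

def to_rpn(tokens):
    """Shunting-yard: emit higher-or-equal precedence operators before pushing a new one."""
    output, stack = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            while stack and stack[-1] in PRECEDENCE and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]:
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()
        else:
            output.append(tok)
    return output + stack[::-1]

def eval_rpn(rpn):
    """Evaluate a postfix token list with a value stack."""
    stack = []
    for tok in rpn:
        if tok in OPS:
            b, a = stack.pop(), stack.pop()
            stack.append(OPS[tok](a, b))
        else:
            stack.append(float(tok))
    return stack[0]

print(eval_rpn(to_rpn(tokenize("3 + 4 * (2 - 1)"))))  # 7.0
```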

Visualize how rule expressions create decision surfaces in two-dimensional feature space. Learn to understand half-space intersections, orthogonal partitions, and how threshold changes affect classification regions.

Learn to quantify uplift and effectiveness across bins and segments using contingency tables. Understand cell counts, rates, marginalization, and how to avoid Simpson's paradox when analyzing bin-wise trends.
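A minimal pandas sketch of cell counts, marginals, and per-bin rates with crosstab, on tiny made-up data.

```python
import pandas as pd

df = pd.DataFrame({"bin": ["low", "low", "high", "high", "high", "low"],
                   "flagged": [0, 1, 1, 1, 0, 0]})

table = pd.crosstab(df["bin"], df["flagged"], margins=True)       # counts plus marginals
rates = pd.crosstab(df["bin"], df["flagged"], normalize="index")  # per-bin rates
print(table, rates, sep="\n\n")
```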

Learn to measure overlap between sets using set theory fundamentals. Understand cardinalities, intersections, unions, and the Jaccard index—essential tools for comparing versions, thresholds, and events captured.
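A tiny sketch of the overlap measures on two made-up sets of event IDs.

```python
flagged_v1 = {"evt_1", "evt_2", "evt_5", "evt_9"}
flagged_v2 = {"evt_2", "evt_5", "evt_7"}

intersection = flagged_v1 & flagged_v2
union = flagged_v1 | flagged_v2
jaccard = len(intersection) / len(union)   # 0 = disjoint, 1 = identical
print(len(intersection), len(union), round(jaccard, 3))
```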

Learn to view event tagging as rule-based classification. Understand indicator functions, piecewise partitions, and priority-level conditioning—essential tools for mathematically partitioning events into Flagged and Passed categories.
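A small sketch of indicator-style tagging with an assumed per-priority threshold table; the column names and cutoffs are illustrative only.

```python
import numpy as np
import pandas as pd

events = pd.DataFrame({"score": [0.2, 0.8, 0.55, 0.9],
                       "priority": ["low", "high", "low", "high"]})

# Indicator function: 1 if the rule fires, 0 otherwise; the cutoff depends on priority.
thresholds = {"low": 0.7, "high": 0.5}   # assumed per-tier cutoffs
indicator = events.apply(lambda r: int(r["score"] >= thresholds[r["priority"]]), axis=1)
events["label"] = np.where(indicator == 1, "Flagged", "Passed")
print(events)
```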

Learn to interpret priority tiers as prior beliefs or cost weights. Understand cost-sensitive thresholding, Bayes optimal decision rules, and how per-tier thresholds change labeling geometry through iso-cost analysis.
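A sketch of the standard cost-sensitive cutoff: flag when p * c_fn >= (1 - p) * c_fp, i.e. p >= c_fp / (c_fp + c_fn). The per-tier costs below are made up.

```python
# (false-positive cost, false-negative cost) per tier -- assumed values for illustration
tier_costs = {"high": (1.0, 20.0), "medium": (1.0, 5.0), "low": (1.0, 1.0)}

for tier, (c_fp, c_fn) in tier_costs.items():
    threshold = c_fp / (c_fp + c_fn)   # Bayes-optimal cutoff for a calibrated score
    print(f"{tier}: flag when score >= {threshold:.3f}")
```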

Learn to pair semantically complementary configuration sets like Premium/Standard and Verified/Unverified. Understand equivalence relations, pairing consistency, and how mapping functions ensure aligned parameters across pairs.

Compare min/max logic with product t-norm and Łukasiewicz variants. Understand t-norm families, boundary behaviors, and why min/max yields conservative idempotent aggregation for rule strength evaluation.
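A minimal sketch comparing the three t-norms on the same inputs; note that only min returns 0.7 for 0.7 AND 0.7, i.e. it is idempotent.

```python
import numpy as np

def goedel(a, b): return np.minimum(a, b)                   # Gödel / min t-norm (idempotent)
def product(a, b): return a * b                             # product t-norm
def lukasiewicz(a, b): return np.maximum(a + b - 1.0, 0.0)  # Łukasiewicz t-norm

a, b = 0.7, 0.7
print(goedel(a, b), product(a, b), lukasiewicz(a, b))   # 0.7, 0.49, 0.4
```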

Master practical considerations for computing empirical quantiles. Understand how ties, discrete samples, and different interpolation schemes affect quantile estimates and threshold repeatability.
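A quick sketch of how the interpolation choice moves a quantile on small, tied data; the `method` argument requires a recent NumPy (older versions call it `interpolation`).

```python
import numpy as np

x = np.array([1, 2, 2, 2, 3, 7, 9])   # small sample with ties

# Different interpolation rules can give different thresholds on small or tied data.
for method in ("linear", "lower", "higher", "nearest", "midpoint"):
    print(method, np.quantile(x, 0.9, method=method))
```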

Understand how numeric coercion and NA handling affect data distributions. Learn the impact of different imputation strategies on mean, variance, and quantiles for threshold-based rule evaluation.
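A small pandas sketch showing how coercion plus different imputation choices shift the mean and a quantile; the raw strings are made up.

```python
import pandas as pd

raw = pd.Series(["10", "12.5", "n/a", "15", "", "80"])
values = pd.to_numeric(raw, errors="coerce")   # non-numeric strings become NaN

print("drop NA:    ", values.dropna().mean(), values.dropna().quantile(0.9))
print("fill zero:  ", values.fillna(0).mean(), values.fillna(0).quantile(0.9))
print("fill median:", values.fillna(values.median()).mean(),
      values.fillna(values.median()).quantile(0.9))
```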

Synthesize everything from quantile thresholds to strata to sample sizes. Learn to construct a complete stratified audit plan with cutoffs, sample sizes, and investigation workflows.

Master percentiles and quantiles—simple yet powerful tools to describe data distributions. From the empirical CDF to interpolation methods, learn how these robust measures help in thresholding, outlier detection, and monitoring.
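A minimal empirical-CDF sketch, 'what fraction of observations are at or below this value', on made-up data.

```python
import numpy as np

def ecdf(sample, value):
    """Empirical CDF: fraction of observations at or below `value`."""
    sample = np.sort(np.asarray(sample, float))
    return np.searchsorted(sample, value, side="right") / len(sample)

data = [3, 7, 7, 10, 15, 21]
print(ecdf(data, 7), ecdf(data, 14))   # 0.5, ~0.667
```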

A comprehensive mathematical summary mapping nonparametric statistics, robust measures, sampling theory, decision metrics, set operations, and fuzzy aggregation to their pipeline implementations.

Percentile ranks turn any numeric feature into a simple score in [0,1] that says 'what fraction of the data is at or below this value.' Learn how to combine ranks with min/max and create strata for sampling, prioritization, or analysis.
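A tiny pandas sketch: percentile ranks via rank(pct=True), then cut into strata; the bin edges and stratum labels are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"amount": [120, 5, 300, 45, 999, 60]})

# rank(pct=True) gives 'fraction of data at or below this value' in (0, 1].
df["pct_rank"] = df["amount"].rank(pct=True)
df["stratum"] = pd.cut(df["pct_rank"], bins=[0, 0.5, 0.9, 1.0],
                       labels=["bottom_half", "middle", "top_decile"])
print(df)
```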

The mean and standard deviation (SD) can be swayed by outliers like reeds in the wind — a single extreme value can pull them off course. The median and MAD (Median Absolute Deviation), on the other hand, are sturdy rocks in the statistical stream. They resist distortion and give reliable 'center' and 'spread' estimates, even when your data are skewed or heavy-tailed.
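A quick sketch of the effect: append one extreme value and compare how much each summary moves.

```python
import numpy as np

clean = np.array([10.0, 11, 9, 10, 12, 10, 11])
dirty = np.append(clean, 500)   # one extreme value

for name, x in (("clean", clean), ("dirty", dirty)):
    mad = np.median(np.abs(x - np.median(x)))   # Median Absolute Deviation
    print(name, "mean:", round(x.mean(), 1), "sd:", round(x.std(), 1),
          "median:", np.median(x), "MAD:", mad)
```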

Skewness tells you if data lean left or right (asymmetry). Kurtosis tells you how heavy the tails are (how many extremes you see). Two datasets can share the same mean and variance but look completely different — shape features reveal the hidden story.
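A small SciPy sketch comparing a symmetric and a right-skewed sample; kurtosis here is excess kurtosis, so roughly 0 for a Normal.

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(7)
symmetric = rng.normal(0, 1, 10_000)
right_skewed = rng.lognormal(0, 1, 10_000)

for name, x in (("normal", symmetric), ("lognormal", right_skewed)):
    print(name, "skew:", round(skew(x), 2), "excess kurtosis:", round(kurtosis(x), 2))
```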

Boxplots are the simplest visual way to spot outliers. They rely on the IQR (Interquartile Range) — the middle 50% of your data — and build 'fences' around it. Points outside these fences are suspected outliers. It's simple, robust, and doesn't assume your data are Normal.
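A minimal sketch of Tukey's fences on made-up data.

```python
import numpy as np

x = np.array([12, 14, 15, 15, 16, 18, 19, 21, 22, 60])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                                     # middle 50% of the data
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # Tukey fences
print("fences:", lower, upper, "suspected outliers:", x[(x < lower) | (x > upper)])
```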

Adjusted boxplots combine Tukey fences with the medcouple skewness measure so long tails do not trigger false outlier flags.
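A hedged sketch using statsmodels' medcouple with the Hubert–Vandervieren style exponential fence adjustment; treat the exact exponents as an assumption to check against the article or the original paper.

```python
import numpy as np
from statsmodels.stats.stattools import medcouple

rng = np.random.default_rng(8)
x = rng.lognormal(0, 0.8, 500)   # right-skewed data with a long tail

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
mc = medcouple(x)                # robust skewness measure in [-1, 1]

# Skew-adjusted fences: stretch the fence on the long-tail side, shrink the other.
if mc >= 0:
    lower, upper = q1 - 1.5 * np.exp(-4 * mc) * iqr, q3 + 1.5 * np.exp(3 * mc) * iqr
else:
    lower, upper = q1 - 1.5 * np.exp(-3 * mc) * iqr, q3 + 1.5 * np.exp(4 * mc) * iqr

print(round(mc, 3), round(lower, 3), round(upper, 3),
      "flagged:", ((x < lower) | (x > upper)).sum())
```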

Compare classical z-scores built on mean and standard deviation with robust z-scores powered by the median and MAD to see why robustness matters when data gets messy.
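A tiny comparison on data with one gross outlier; the 0.6745 factor rescales the MAD so it is comparable to the SD under normality.

```python
import numpy as np

x = np.array([10.0, 11, 9, 10, 12, 10, 11, 500])   # one gross outlier

z_classic = (x - x.mean()) / x.std()                 # the outlier inflates mean and SD,
mad = np.median(np.abs(x - np.median(x)))            # masking its own z-score
z_robust = 0.6745 * (x - np.median(x)) / mad         # robust version exposes it clearly

print(np.round(z_classic, 2))
print(np.round(z_robust, 2))
```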
