Foundations of Data Science & Causal ML: A Mathematical Journey
© 2025 Caio Velasco. All rights reserved.
This is my structured learning roadmap to prepare for research in causal machine learning with mathematical rigor.
Here you can find both mathematical foundations and applications.
Journey Phases
Why I’m Interested in Causal ML
Goal: Show the reader the contrast between two “Data Cultures”.
- Statistics and econometrics build models on top of stochastic mechanisms (data-generating processes), aiming for explanation and inference.
- Machine learning often ignores mechanisms and focuses on prediction accuracy, generalization, and performance.
- Causal ML is a synthesis: it combines ML’s flexibility with statistics’ concern for identification, adding a causal lens to reason about interventions and counterfactuals.
Phase 1 – Logic & Set Theory
Goal: Build comfort with the language of mathematics.
- Proof techniques: direct, contrapositive, contradiction, induction.
- Sets and families of sets, Cartesian products, power sets.
- Functions: injective, surjective, bijective.
- Relations: equivalence relations, partial orders.
- Cardinality: countable vs. uncountable sets.
Theory Output:
- Sentential logic (Velleman Ch. 1)
- Predicate logic & quantifiers (Ch. 2)
- Proof techniques: direct, contrapositive, contradiction, induction (Ch. 3)
- Sets, relations, functions (Ch. 4–6)
- Countability and infinity (selected parts of Ch. 9; Tao’s appendix)
Application Project:
- SQL/database operations as set theory (joins, unions, intersections).
- Prove/discuss equivalences, e.g., idempotency: applying SELECT DISTINCT twice yields the same result as applying it once (see the sketch below).
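A minimal Python sketch of this correspondence, using the standard-library sqlite3 module on two tiny tables whose names and values are made up for illustration: UNION and INTERSECT behave like set union and intersection, and DISTINCT is idempotent.

```python
# Sketch: SQL set operations mirrored by Python sets (illustrative tables only).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (x INTEGER)")
cur.execute("CREATE TABLE b (x INTEGER)")
cur.executemany("INSERT INTO a VALUES (?)", [(1,), (2,), (2,), (3,)])
cur.executemany("INSERT INTO b VALUES (?)", [(2,), (3,), (4,)])

A = {row[0] for row in cur.execute("SELECT x FROM a")}
B = {row[0] for row in cur.execute("SELECT x FROM b")}

# UNION / INTERSECT act as set union / intersection (duplicates removed).
union_sql = {r[0] for r in cur.execute("SELECT x FROM a UNION SELECT x FROM b")}
inter_sql = {r[0] for r in cur.execute("SELECT x FROM a INTERSECT SELECT x FROM b")}
assert union_sql == A | B and inter_sql == A & B

# Idempotency of DISTINCT: applying it twice adds nothing beyond applying it once.
once = list(cur.execute("SELECT DISTINCT x FROM a"))
twice = list(cur.execute("SELECT DISTINCT x FROM (SELECT DISTINCT x FROM a)"))
assert sorted(once) == sorted(twice)
```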
References:
- Velleman – How to Prove It
- Enderton – Set Theory
Phase 2 – Real Analysis
Goal: Rigorous calculus and convergence, revisiting classical calculus concepts with proofs.
- Sequences, series, limits.
- Continuity, compactness, connectedness.
- Differentiation: Mean Value Theorem, Taylor expansion.
- Riemann integration (rigorous foundation).
- Uniform convergence.
Applied Calculus Lens:
- Multivariable calculus: partial derivatives, gradients, Jacobians, Hessians.
- Convexity and optimization.
- Taylor expansions for approximation.
- Fundamental Theorem of Calculus as link to probability expectations.
Theory Output: ε–δ proofs, compactness in ℝ, uniform convergence examples.
Application Project: Gradient descent convergence demo; connect convexity to logistic regression loss.
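A minimal NumPy sketch of the demo: gradient descent on the (convex) logistic-regression loss for synthetic data. The data-generating weights, step size, and iteration count are illustrative assumptions, not prescribed choices.

```python
# Sketch: gradient descent converging on the convex logistic loss (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])                  # illustrative "true" weights
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def loss_and_grad(w):
    p = 1.0 / (1.0 + np.exp(-X @ w))                 # predicted probabilities
    p = np.clip(p, 1e-12, 1 - 1e-12)                 # numerical safety for the log
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad = X.T @ (p - y) / n                         # gradient of the average log-loss
    return loss, grad

w, lr = np.zeros(d), 0.5
for t in range(201):
    loss, grad = loss_and_grad(w)
    w -= lr * grad
    if t % 50 == 0:
        print(f"iter {t:3d}  loss {loss:.4f}  ||grad|| {np.linalg.norm(grad):.4f}")
```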
References:
- Rudin – Principles of Mathematical Analysis (Baby Rudin)
- Tao – Analysis I
Phase 3 – Linear Algebra
Goal: Move beyond computation to proofs and structure.
- Vector spaces, subspaces, linear independence, bases, dimension.
- Linear transformations and matrices.
- Inner product spaces, orthogonality, Gram–Schmidt.
- Determinants, eigenvalues, eigenvectors, diagonalization.
- Spectral theorem, singular value decomposition.
- Matrix norms and conditioning.
Theory Output: Proofs of rank–nullity theorem, spectral theorem for symmetric matrices, SVD existence.
Application Project: PCA from first principles — prove orthogonal diagonalization, then implement PCA via SVD.
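A minimal sketch of the implementation half of this project (the data here is synthetic and purely illustrative): PCA computed via the SVD of the centered data matrix, with a sanity check against the eigendecomposition of the sample covariance.

```python
# Sketch: PCA via SVD on centered data, checked against the covariance eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated synthetic features

Xc = X - X.mean(axis=0)                   # center each column
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt                           # rows: orthonormal principal directions
explained_var = S**2 / (X.shape[0] - 1)   # eigenvalues of the sample covariance
scores = Xc @ Vt.T                        # projections onto the principal directions

# Sanity check: SVD-based variances match the eigendecomposition of the covariance.
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
assert np.allclose(explained_var, eigvals)
```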
References:
- Axler – Linear Algebra Done Right
- Friedberg, Insel & Spence – Linear Algebra
- Trefethen & Bau – Numerical Linear Algebra (for computational aspects)
Phase 4 – Functional Analysis & Hilbert Spaces
Goal: Develop the tools to handle infinite-dimensional vector spaces, operators, and kernels.
- Normed vector spaces, Banach spaces.
- Hilbert spaces, orthogonality, projections.
- Bounded linear operators.
- Reproducing Kernel Hilbert Spaces (RKHS).
Theory Output: Prove projection theorem in Hilbert spaces, examples of bounded/unbounded operators, RKHS construction.
Application Project: Kernelized regression and SVMs — connect functional analysis with machine learning models.
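A minimal sketch of the regression half, assuming an RBF kernel on synthetic 1-D data; the bandwidth and ridge penalty are illustrative choices. By the representer theorem, the RKHS solution is a finite kernel expansion over the training points, which is what the linear solve below computes.

```python
# Sketch: kernel ridge regression with an RBF kernel (the RKHS view of regularization).
import numpy as np

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

def rbf_kernel(A, B, gamma=0.5):
    # k(a, b) = exp(-gamma * ||a - b||^2), a reproducing kernel of an RKHS
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

lam = 1e-2                                             # ridge penalty (Hilbert-space norm)
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # f = sum_i alpha_i k(x_i, .)

X_test = np.linspace(-3, 3, 200).reshape(-1, 1)
y_pred = rbf_kernel(X_test, X) @ alpha                 # evaluate the fitted function
```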
References:
- Kreyszig – Introductory Functional Analysis with Applications
- Conway – A Course in Functional Analysis
- Berlinet & Thomas-Agnan – Reproducing Kernel Hilbert Spaces in Probability and Statistics
Phase 5 – Topology & Measure Theory
Goal: Learn the structures that underlie probability theory.
- Metric spaces, open/closed sets.
- Compactness and product spaces.
- σ-algebras, measurable functions.
- Lebesgue measure and integration.
- Convergence theorems: MCT, DCT.
Theory Output: Worked examples of σ-algebras, Lebesgue integral, and convergence theorems.
Application Project: Fraud detection via Monte Carlo — rare events and measure-zero sets in anomaly detection.
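A minimal sketch of the rare-event piece, using a Gaussian tail probability as a stand-in for an anomaly score threshold (the threshold and sample size are illustrative): naive Monte Carlo rarely hits the event, while importance sampling reweights draws from a shifted distribution to estimate it reliably.

```python
# Sketch: estimating the rare-event probability P(X > 4) for X ~ N(0, 1).
import math
import numpy as np

rng = np.random.default_rng(0)
t, n = 4.0, 100_000
true_p = 0.5 * math.erfc(t / math.sqrt(2))      # exact tail probability, ~3.17e-5

# Naive Monte Carlo: only a handful of samples ever land in the rare region.
x = rng.normal(size=n)
naive = np.mean(x > t)

# Importance sampling: draw from N(t, 1) and reweight by the likelihood ratio
# phi(z) / phi(z - t) = exp(t^2 / 2 - t * z).
z = rng.normal(loc=t, scale=1.0, size=n)
weights = np.exp(t**2 / 2 - t * z)
is_est = np.mean((z > t) * weights)

print(f"true {true_p:.2e}  naive MC {naive:.2e}  importance sampling {is_est:.2e}")
```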
References:
- Munkres – Topology
- Schilling – Measures, Integrals and Martingales
Phase 6 – Probability
Goal: Define probability rigorously à la Kolmogorov.
- Probability spaces and random variables as measurable functions.
- Distributions, independence, product measures.
- Conditional expectation as L² projection.
- Laws of large numbers, central limit theorem.
- Intro to martingales.
Theory Output: Probability space construction, LLN/CLT proofs, conditional expectation as projection.
Application Project: A/B testing simulation — CLT and confidence intervals for conversion rates.
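A minimal simulation sketch, with conversion rates and sample sizes chosen purely for illustration: the CLT justifies the normal-approximation confidence interval for the difference in conversion rates.

```python
# Sketch: simulated A/B test with a CLT-based confidence interval for the lift.
import numpy as np

rng = np.random.default_rng(0)
n_a, n_b = 5000, 5000
p_a, p_b = 0.10, 0.12                        # illustrative "true" conversion rates

conv_a = rng.binomial(1, p_a, size=n_a)
conv_b = rng.binomial(1, p_b, size=n_b)

phat_a, phat_b = conv_a.mean(), conv_b.mean()
diff = phat_b - phat_a
se = np.sqrt(phat_a * (1 - phat_a) / n_a + phat_b * (1 - phat_b) / n_b)

z = 1.96                                     # 95% two-sided normal quantile
ci = (diff - z * se, diff + z * se)
print(f"estimated lift {diff:.4f}, 95% CI ({ci[0]:.4f}, {ci[1]:.4f})")
```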
References:
- Durrett – Probability: Theory and Examples
- Klenke – Probability Theory
Phase 7 – Mathematical Statistics
Goal: Connect probability → inference.
- Point estimation: MLE, method of moments.
- Properties: unbiasedness, consistency, efficiency.
- Hypothesis testing and likelihood ratio tests.
- Asymptotic results: convergence in probability/distribution, delta method.
Theory Output: Consistency of MLE, hypothesis testing framework, asymptotic normality proofs.
Application Project: Logistic regression for churn prediction — prove Bernoulli MLE consistency, simulate its convergence, and apply the model to a real dataset.
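A minimal sketch of the simulation step for the simplest case (the true parameter and sample sizes are illustrative): the Bernoulli MLE is the sample mean, and its error shrinks as n grows, which is the consistency one proves before moving to the logistic-regression MLE.

```python
# Sketch: Bernoulli MLE (the sample mean) converging to the true parameter.
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.3
for n in [10, 100, 1_000, 10_000, 100_000]:
    sample = rng.binomial(1, p_true, size=n)
    p_hat = sample.mean()                    # MLE of the Bernoulli(p) parameter
    print(f"n={n:>6}  p_hat={p_hat:.4f}  |error|={abs(p_hat - p_true):.4f}")
```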
References:
- Casella & Berger – Statistical Inference
Phase 8 – Statistical Learning Theory
Goal: Bridge inference and prediction by introducing the statistical foundations of machine learning.
Show how algorithms that learn from data can be rigorously analyzed using probability and measure theory — without assuming a fixed data-generating model.
- Learning as an inference problem: from data pairs of inputs and outputs, we learn a rule that performs well on unseen data.
- Loss and risk: evaluate models by how well predictions match reality, on average.
- Empirical Risk Minimization (ERM): choose the model that best fits the observed data.
- Generalization: ensure models trained on one sample also perform well on new data.
- Capacity control: limit model flexibility (via VC dimension or similar) to avoid overfitting.
- Structural Risk Minimization (SRM): balance fit and simplicity with a complexity penalty.
- Regularization: constrain solutions to remain smooth or stable, as in Hilbert-space norms.
- Bias–variance trade-off: find the sweet spot between underfitting and overfitting.
- Bridge: Breiman’s “two cultures” meet in Vapnik’s framework, unifying inference and prediction.
Theory Output:
- Derive a simple generalization bound using Hoeffding’s inequality and the union bound (sketched after this list).
- Compute VC dimension for linear separators.
- Prove that L2 regularization corresponds to minimizing empirical risk under a bounded-norm constraint.
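A compact sketch of the first derivation, under the standard assumptions of a finite hypothesis class and a loss bounded in [0, 1] (the notation is the usual one, not fixed by this roadmap):

```latex
% Hoeffding, for a fixed h: P(|R(h) - \hat{R}_n(h)| > \epsilon) \le 2 e^{-2 n \epsilon^2}.
% Union bound over the finite class \mathcal{H}:
\[
  P\Bigl(\exists\, h \in \mathcal{H}: \bigl|R(h) - \hat{R}_n(h)\bigr| > \epsilon\Bigr)
  \;\le\; \sum_{h \in \mathcal{H}} 2 e^{-2 n \epsilon^2}
  \;=\; 2\,|\mathcal{H}|\, e^{-2 n \epsilon^2}.
\]
% Setting the right-hand side equal to \delta and solving for \epsilon gives, with
% probability at least 1 - \delta, simultaneously for all h \in \mathcal{H}:
\[
  R(h) \;\le\; \hat{R}_n(h) + \sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2n}}.
\]
```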
Application Project:
- Simulate empirical vs. generalization error on synthetic data.
- Train classifiers with varying complexity (polynomial degree or kernel width) and plot training/test errors (see the sketch after this list).
- Show visually how regularization reduces overfitting and aligns with theoretical bounds.
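A minimal sketch of the complexity sweep on synthetic 1-D data (the degrees, noise level, and split are illustrative): training error keeps falling as capacity grows, while test error eventually rises — the empirical face of the bias–variance trade-off.

```python
# Sketch: training vs. test error as polynomial degree (model capacity) grows.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=60)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=60)
x_tr, x_te = x[:30], x[30:]
y_tr, y_te = y[:30], y[30:]

for degree in [1, 3, 5, 9, 12]:
    coeffs = np.polyfit(x_tr, y_tr, degree)           # least-squares polynomial fit
    mse_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    mse_te = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {degree:2d}  train MSE {mse_tr:.3f}  test MSE {mse_te:.3f}")
```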
References:
- Vapnik – The Nature of Statistical Learning Theory
- Bousquet, Boucheron & Lugosi – Introduction to Statistical Learning Theory
- Hastie, Tibshirani & Friedman – The Elements of Statistical Learning
- Shalev-Shwartz & Ben-David – Understanding Machine Learning: From Theory to Algorithms
- Breiman – Statistical Modeling: The Two Cultures
Phase 9 – Causality
Goal: Enter causal inference with strong mathematical foundations.
- Pearl’s Structural Causal Models & do-calculus.
- Rubin’s potential outcomes framework.
- Invariant causal prediction (Peters, Janzing, Schölkopf).
- Identifiability proofs.
- Axiomatic frameworks (Park & Muandet).
Theory Output: Worked proofs of identifiability, back-door/front-door criteria, do-calculus rules.
Application Project: Uplift modeling for churn retention, or a reproduction of Chernozhukov’s Double Machine Learning estimator in Python.
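A minimal sketch of the partialling-out idea behind the second option, on simulated data and assuming scikit-learn is available. The data-generating process, the random-forest nuisance models, and the 2-fold cross-fitting are illustrative choices, not a faithful reproduction of the full Chernozhukov et al. estimator.

```python
# Sketch: cross-fitted partialling-out (DML-style) estimate of a treatment effect.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p, theta = 2000, 5, 0.5                    # theta: true treatment effect
X = rng.normal(size=(n, p))
D = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(size=n)       # confounded treatment
Y = theta * D + np.cos(X[:, 0]) + X[:, 2] ** 2 + rng.normal(size=n)

res_D, res_Y = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    m_hat = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], D[train])
    g_hat = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], Y[train])
    res_D[test] = D[test] - m_hat.predict(X[test])             # treatment residuals
    res_Y[test] = Y[test] - g_hat.predict(X[test])             # outcome residuals

theta_hat = np.sum(res_D * res_Y) / np.sum(res_D ** 2)         # final-stage OLS on residuals
print(f"true theta {theta:.2f}, estimated {theta_hat:.3f}")
```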
References:
- Pearl – Causality
- Peters, Janzing, Schölkopf – Elements of Causal Inference
- Chernozhukov et al. – Causal Machine Learning papers