Sampling Distribution
In probability, we usually study the distribution of a random variable.
For example, when rolling a fair die, we may ask:
What is the probability that the outcome is 4?
Here, the object of interest is the distribution of a single random variable.
From Random Variables to Statistics
In statistical inference, the focus shifts.
Instead of studying a single observation, we study statistics computed from samples.
A useful way to think about this shift is:
When you roll one die, the natural question is about the distribution of the outcome.
When you roll several dice and compute a statistic, the question becomes the distribution of that statistic.
Definition
Suppose we collect data as a random sample, i.e., independent draws from the same distribution:
\[ X_1, X_2, \dots, X_n \sim F \]
and compute a statistic:
\[ T = T(X_1, \dots, X_n) \]
The sampling distribution is the probability distribution of \(T\).
Interpretation
The sampling distribution answers the question:
“How would this statistic vary if we collected many different datasets from the same data-generating process?”
It is a theoretical object defined by repeated sampling from the Data Generating Process (DGP).
In practice:
- We observe only one dataset
- But we reason about many possible datasets
Example
Suppose we roll a die four times and compute the sample mean.
Each set of four rolls produces a different value:
- Sample 1 → mean = 3.25
- Sample 2 → mean = 4.50
- Sample 3 → mean = 2.75
If we repeated this experiment many times, we would obtain many possible values.
The distribution of those values is the sampling distribution of the sample mean.
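The repeated-experiment idea can be simulated directly. A minimal sketch (assuming NumPy is available): roll a fair die four times, record the mean, and repeat many times; the collection of means approximates the sampling distribution of the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Repeat the experiment many times: roll a fair die 4 times, take the mean
n_rolls = 4
n_experiments = 100_000
rolls = rng.integers(1, 7, size=(n_experiments, n_rolls))
sample_means = rolls.mean(axis=1)

# The histogram of sample_means approximates the sampling distribution.
# Its center is near the die's mean (3.5) and its spread is near
# sigma / sqrt(4) ≈ 1.708 / 2 ≈ 0.854.
print(sample_means.mean())
print(sample_means.std(ddof=1))
```

Each entry of `sample_means` plays the role of one "Sample k → mean = …" line above, just repeated a hundred thousand times.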
Two Ways to Access the Sampling Distribution
The sampling distribution is usually not directly observable.
There are two main approaches to understanding it.
1) Classical (Analytical Approach)
In classical statistics, we derive properties of the sampling distribution using mathematical theory.
A fundamental result is the Central Limit Theorem (CLT):
- For large samples, the sample mean is approximately normally distributed
- Its standard deviation, the standard error, is \(\sigma/\sqrt{n}\): it shrinks as the sample size grows
This provides analytical formulas for:
- standard errors
- confidence intervals
- hypothesis tests
However, this approach:
- relies on assumptions
- works mainly for simple statistics (e.g., the mean)
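The CLT claim can be checked numerically. A small sketch (NumPy assumed): draw many datasets from a deliberately skewed population, an exponential with mean and standard deviation 1, and compare the simulated spread of the sample means to the analytical standard error \(\sigma/\sqrt{n}\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Skewed population: exponential with scale 1, so sigma = 1
n = 50
n_datasets = 50_000
data = rng.exponential(scale=1.0, size=(n_datasets, n))
means = data.mean(axis=1)

# CLT prediction: means ~ approximately Normal(1, sigma / sqrt(n))
analytic_se = 1.0 / np.sqrt(n)
simulated_se = means.std(ddof=1)
print(analytic_se, simulated_se)  # the two agree closely
```

Even though each individual observation is far from normal, the distribution of the mean is already well described by the analytical formula at \(n = 50\).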
2) Computational Approach (Bootstrap)
In many practical cases, the sampling distribution is difficult or impossible to derive analytically.
The bootstrap provides an alternative:
- Resample the data
- Recompute the statistic
- Use the resulting values to approximate its distribution
The bootstrap approximates the sampling distribution by simulation rather than derivation.
More generally:
The bootstrap is a computational shortcut to approximate the same object that classical statistics defines theoretically.
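The three bootstrap steps above can be sketched in a few lines. This example uses the sample median, a statistic whose sampling distribution is awkward to derive analytically; the data and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# One observed dataset -- the only one we actually have
data = rng.exponential(scale=1.0, size=200)

# Bootstrap: resample with replacement, recompute the statistic
n_boot = 10_000
boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(n_boot)
])

# boot_medians approximates the sampling distribution of the median
se_median = boot_medians.std(ddof=1)          # bootstrap standard error
ci = np.percentile(boot_medians, [2.5, 97.5])  # percentile 95% interval
print(se_median, ci)
```

Note that only the observed dataset is resampled; no new draws from the data-generating process are needed, which is exactly what makes the method practical.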
Why Sampling Distributions Matter
Sampling distributions allow us to:
- Quantify uncertainty
- Compute standard errors
- Build confidence intervals
- Perform hypothesis testing
They are central to both:
- classical statistical inference
- modern data science workflows
Where Sampling Distributions Appear in Data Science
Although the concept originates in classical statistics, it appears frequently in modern applications:
Model evaluation
Metrics such as accuracy, AUC, precision, or recall are statistics.
Their variability across datasets reflects their sampling distribution.
Cross-validation
Repeated training and evaluation across folds approximates the sampling distribution of model performance.
Bootstrap methods
Resampling approximates the sampling distribution of a statistic.
Uncertainty quantification
Confidence intervals and standard errors rely on sampling distributions.
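To make the cross-validation point concrete, here is a toy sketch (all data and the trivial threshold "model" are invented for illustration): evaluating the same rule on different folds yields different accuracies, and that fold-to-fold spread is a rough window onto the metric's sampling variability.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic binary classification data: label tends to be 1 when x is large
x = rng.normal(size=500)
y = (x + rng.normal(scale=0.8, size=500) > 0).astype(int)

# Manual 5-fold split; evaluate a fixed threshold classifier on each fold
k = 5
folds = np.array_split(rng.permutation(500), k)
accuracies = np.array([
    ((x[fold] > 0).astype(int) == y[fold]).mean()  # predict 1 iff x > 0
    for fold in folds
])

# Each fold gives a different accuracy: a glimpse of its sampling distribution
print(np.round(accuracies, 3))
print(accuracies.std(ddof=1))
```

In practice the model would be retrained on the remaining folds each time; the variability across folds is the part that carries over.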
Key Idea
The sampling distribution is the central object of statistical inference.
- Classical statistics approximates it analytically
- Bootstrap approximates it computationally
One-line Summary
The sampling distribution describes how a statistic varies across repeated samples from the same data-generating process, and it can be studied either analytically or approximated computationally.