Sampling Distribution
In probability, we usually study the distribution of a random variable.
For example, when rolling a fair die, we may ask:
What is the probability that the outcome is 4?
Here, the object of interest is the distribution of a single random variable.
From Random Variables to Statistics
In statistical inference, the focus shifts.
Instead of studying a single observation, we study statistics computed from samples.
A useful way to think about this shift is:
When you roll one die, the natural question is about the distribution of the outcome.
When you roll several dice and compute a statistic, the question becomes the distribution of that statistic.
Definition
Suppose we collect data as a random sample, i.e., independent draws from the same distribution:
\[ X_1, X_2, \dots, X_n \sim F \]
and compute a statistic:
\[ T = T(X_1, \dots, X_n) \]
The sampling distribution is the probability distribution of \(T\).
Interpretation
The sampling distribution answers the question:
“How would this statistic vary if we collected many different datasets from the same data-generating process?”
It is a theoretical object defined by repeated sampling from the Data Generating Process (DGP).
In practice:
- We observe only one dataset
- But we reason about many possible datasets
Example
Suppose we roll a die four times and compute the sample mean.
Each set of four rolls produces a different value:
- Sample 1 → mean = 3.25
- Sample 2 → mean = 4.50
- Sample 3 → mean = 2.75
If we repeated this experiment many times, we would obtain many possible values.
The distribution of those values is the sampling distribution of the sample mean.
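The repeated-experiment idea can be simulated directly. A minimal sketch (assuming NumPy is available): roll a fair die four times, record the mean, and repeat many times; the collection of means approximates the sampling distribution of the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Repeat the experiment many times: roll a fair die 4 times, take the mean
n_rolls = 4
n_experiments = 100_000
rolls = rng.integers(1, 7, size=(n_experiments, n_rolls))
sample_means = rolls.mean(axis=1)

# The histogram of sample_means approximates the sampling distribution.
# Its center is near the die's mean (3.5) and its spread is near
# sigma / sqrt(4) ≈ 1.708 / 2 ≈ 0.854.
print(sample_means.mean())
print(sample_means.std(ddof=1))
```

Each entry of `sample_means` plays the role of one "Sample k → mean = …" line above, just repeated a hundred thousand times.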
Two Ways to Access the Sampling Distribution
The sampling distribution is usually not directly observable.
There are two main approaches to understanding it.
1) Classical (Analytical Approach)
In classical statistics, we derive properties of the sampling distribution using mathematical theory.
A fundamental result is the Central Limit Theorem (CLT):
- For large samples, the sample mean is approximately normally distributed
- Its standard deviation, the standard error, is \(\sigma/\sqrt{n}\): it shrinks as the sample size grows
This provides analytical formulas for:
- standard errors
- confidence intervals
- hypothesis tests
However, this approach:
- relies on assumptions
- works mainly for simple statistics (e.g., the mean)
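The CLT claim can be checked numerically. A small sketch (NumPy assumed): draw many datasets from a deliberately skewed population, an exponential with mean and standard deviation 1, and compare the simulated spread of the sample means to the analytical standard error \(\sigma/\sqrt{n}\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Skewed population: exponential with scale 1, so sigma = 1
n = 50
n_datasets = 50_000
data = rng.exponential(scale=1.0, size=(n_datasets, n))
means = data.mean(axis=1)

# CLT prediction: means ~ approximately Normal(1, sigma / sqrt(n))
analytic_se = 1.0 / np.sqrt(n)
simulated_se = means.std(ddof=1)
print(analytic_se, simulated_se)  # the two agree closely
```

Even though each individual observation is far from normal, the distribution of the mean is already well described by the analytical formula at \(n = 50\).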
2) Computational Approach (Bootstrap)
In many practical cases, the sampling distribution is difficult or impossible to derive analytically.
The bootstrap provides an alternative:
- Resample the data
- Recompute the statistic
- Use the resulting values to approximate its distribution
The bootstrap approximates the sampling distribution by simulation rather than derivation.
More generally:
The bootstrap is a computational shortcut to approximate the same object that classical statistics defines theoretically.
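The three bootstrap steps above can be sketched in a few lines. This example uses the sample median, a statistic whose sampling distribution is awkward to derive analytically; the data and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# One observed dataset -- the only one we actually have
data = rng.exponential(scale=1.0, size=200)

# Bootstrap: resample with replacement, recompute the statistic
n_boot = 10_000
boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(n_boot)
])

# boot_medians approximates the sampling distribution of the median
se_median = boot_medians.std(ddof=1)          # bootstrap standard error
ci = np.percentile(boot_medians, [2.5, 97.5])  # percentile 95% interval
print(se_median, ci)
```

Note that only the observed dataset is resampled; no new draws from the data-generating process are needed, which is exactly what makes the method practical.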
Why Sampling Distributions Matter
Sampling distributions allow us to:
- Quantify uncertainty
- Compute standard errors
- Build confidence intervals
- Perform hypothesis testing
They are central to both:
- classical statistical inference
- modern data science workflows
Where Sampling Distributions Appear in Data Science
Although the concept originates in classical statistics, it appears frequently in modern applications:
Model evaluation
Metrics such as accuracy, AUC, precision, or recall are statistics.
Their variability across datasets reflects their sampling distribution.
Cross-validation
Repeated training and evaluation across folds approximates the sampling distribution of model performance.
Bootstrap methods
Resampling approximates the sampling distribution of a statistic.
Uncertainty quantification
Confidence intervals and standard errors rely on sampling distributions.
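To make the cross-validation point concrete, here is a toy sketch (all data and the trivial threshold "model" are invented for illustration): evaluating the same rule on different folds yields different accuracies, and that fold-to-fold spread is a rough window onto the metric's sampling variability.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic binary classification data: label tends to be 1 when x is large
x = rng.normal(size=500)
y = (x + rng.normal(scale=0.8, size=500) > 0).astype(int)

# Manual 5-fold split; evaluate a fixed threshold classifier on each fold
k = 5
folds = np.array_split(rng.permutation(500), k)
accuracies = np.array([
    ((x[fold] > 0).astype(int) == y[fold]).mean()  # predict 1 iff x > 0
    for fold in folds
])

# Each fold gives a different accuracy: a glimpse of its sampling distribution
print(np.round(accuracies, 3))
print(accuracies.std(ddof=1))
```

In practice the model would be retrained on the remaining folds each time; the variability across folds is the part that carries over.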
Key Idea
The sampling distribution is the central object of statistical inference.
- Classical statistics approximates it analytically
- Bootstrap approximates it computationally
One-line Summary
The sampling distribution describes how a statistic varies across repeated samples from the same data-generating process, and it can be studied either analytically or approximated computationally.