# Empirical Distribution
The empirical distribution is the probability distribution obtained by assigning probabilities according to the observed frequencies in the data.
In other words: the probability of a value is the proportion of observations equal to that value.
## From Data to Probabilities
Suppose we observe a sample:
80, 82, 85, 85, 90
We start by counting how often each value appears:
- 80 → 1 time
- 82 → 1 time
- 85 → 2 times
- 90 → 1 time
Total observations: \(n = 5\)
Now we convert counts into probabilities:
\[ P(X = x) = \frac{\text{count of } x}{n} \]
So:
| Value | Probability |
|---|---|
| 80 | 1/5 |
| 82 | 1/5 |
| 85 | 2/5 |
| 90 | 1/5 |
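The counting-and-dividing step above can be sketched with the Python standard library (the sample and variable names are just the worked example, not part of any library):

```python
from collections import Counter
from fractions import Fraction

sample = [80, 82, 85, 85, 90]
n = len(sample)

# Count how often each value appears, then divide each count by n.
counts = Counter(sample)
empirical = {x: Fraction(c, n) for x, c in sorted(counts.items())}

for x, p in empirical.items():
    print(f"P(X = {x}) = {p}")  # 85 gets 2/5, every other value 1/5
```

The probabilities sum to 1 by construction, because the counts sum to \(n\).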
## Interpretation
The empirical distribution answers:
“If I randomly pick one observation from my dataset, what is the probability of each value?”
You can think of it as:
- Writing each observation on a ticket
- Putting all tickets in a bag
- Drawing one at random
Since 85 appears twice, it has twice the chance of being selected.
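The ticket-in-a-bag picture can be checked by simulation; this is a sketch using Python's standard library, with the sample from the worked example above:

```python
import random

sample = [80, 82, 85, 85, 90]

random.seed(0)  # fixed seed so the simulation is reproducible
draws = [random.choice(sample) for _ in range(100_000)]

# 85 occupies two of the five "tickets", so its relative
# frequency across many draws should be close to 2/5 = 0.4.
print(draws.count(85) / len(draws))
```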
## Why Counting Works
When we count, we are implicitly:
- Assigning equal weight to each observation
- Treating each observation as equally likely
This is equivalent to assuming a uniform distribution over the observed data points.
Frequencies then emerge naturally when values repeat.
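Giving each observation weight \(1/n\) and then merging repeats reproduces the table above; a minimal sketch:

```python
from fractions import Fraction

sample = [80, 82, 85, 85, 90]
n = len(sample)

# Assign every observation the same weight 1/n; repeated values
# accumulate their weights, so 85 ends up with 2/5.
weights: dict[int, Fraction] = {}
for x in sample:
    weights[x] = weights.get(x, Fraction(0)) + Fraction(1, n)

for x in sorted(weights):
    print(x, weights[x])
```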
## Connection to the Data Generating Process
In practice, data are generated by an unknown Data Generating Process (DGP), which defines a true distribution \(F\).
We observe an independent, identically distributed sample:
\[ X_1, X_2, \dots, X_n \sim F \]
The distribution \(F\) describes the mechanism that generates the data.
However, in practice:
- \(F\) is unknown
- We only observe a finite dataset
So while \(F\) defines the problem, we cannot work with it directly.
## The Plug-in Principle
In statistics, many quantities of interest depend on the unknown distribution \(F\).
For example:
- the mean
- the variance
- the sampling distribution of a statistic
The plug-in principle provides a simple idea:
To estimate a quantity defined in terms of \(F\), replace \(F\) with an estimate.
In the simplest case, we use the empirical distribution \(\hat{F}\).
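Written as a cumulative distribution function, \(\hat{F}\) has the standard nonparametric form, placing mass \(1/n\) on each observed point:

\[ \hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le x\} \]

That is, \(\hat{F}(x)\) is simply the fraction of observations less than or equal to \(x\).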
So instead of working with \(F\), we work with:
\[ F \;\; \rightarrow \;\; \hat{F} \]
Importantly:
The plug-in principle does not remove the role of \(F\) — it approximates it using the data.
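As a concrete instance of the principle, the plug-in mean and variance are the expectations taken under \(\hat{F}\); a sketch using the worked sample from earlier:

```python
sample = [80, 82, 85, 85, 90]
n = len(sample)

# Plug-in mean: E[X] under F-hat is just the sample average.
mean_hat = sum(sample) / n

# Plug-in variance: Var(X) under F-hat divides by n (not n - 1),
# because F-hat puts mass exactly 1/n on each observation.
var_hat = sum((x - mean_hat) ** 2 for x in sample) / n

print(mean_hat, var_hat)  # mean ≈ 84.4, variance ≈ 11.44
```

Note that the plug-in variance is the biased \(1/n\) version, not the usual \(1/(n-1)\) sample variance; that small bias is one consequence of substituting \(\hat{F}\) for \(F\).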
## Why This Matters
The empirical distribution is not just a way to summarize data.
It is a data-driven approximation of the true distribution.
In this sense, the empirical distribution acts as a proxy for the unknown data-generating process.
This allows us to:
- compute estimates
- approximate expectations
- simulate new data
without making parametric assumptions.
## Connection to Bootstrap
The bootstrap is a direct application of the plug-in principle.
- The true sampling process depends on \(F\)
- We replace \(F\) with \(\hat{F}\)
- We simulate new samples from \(\hat{F}\)
In practice, this means:
Resampling the observed data is equivalent to sampling from the empirical distribution.
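The resampling step can be sketched in a few lines; here the statistic is the sample mean, and the number of bootstrap replicates \(B\) is an arbitrary illustrative choice:

```python
import random

random.seed(42)  # reproducible resampling
sample = [80, 82, 85, 85, 90]
n = len(sample)

def mean(xs):
    return sum(xs) / len(xs)

# Each bootstrap sample draws n values from F-hat, which is the
# same as resampling the observed data with replacement.
B = 5_000
boot_means = [mean(random.choices(sample, k=n)) for _ in range(B)]

# The spread of the bootstrap means estimates the standard error
# of the sample mean under the unknown F.
grand = mean(boot_means)
se_hat = (sum((m - grand) ** 2 for m in boot_means) / (B - 1)) ** 0.5
print(se_hat)
```

With this sample, the bootstrap standard error comes out near \(\sqrt{\hat{\sigma}^2 / n} \approx 1.5\), matching the plug-in calculation.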
## Key Idea
The empirical distribution is:
- A nonparametric estimate of the unknown distribution
- Defined entirely by the data
- A proxy for the data-generating process
- The foundation for resampling methods like the bootstrap
## One-line Summary
The empirical distribution approximates the unknown data-generating process by assigning probability mass to observed data points, enabling computation and simulation through the plug-in principle.