Empirical Distribution

A simple, from-scratch explanation of the empirical distribution as probabilities derived from observed frequencies in data.

The empirical distribution is the probability distribution obtained by assigning probabilities according to the observed frequencies in the data.

In other words:

Probability of a value = its relative frequency (how often it appears) in the dataset


From Data to Probabilities

Suppose we observe a sample:

80, 82, 85, 85, 90

We start by counting how often each value appears:

  • 80 → 1 time
  • 82 → 1 time
  • 85 → 2 times
  • 90 → 1 time

Total observations = 5

Now we convert counts into probabilities:

\[ P(X = x) = \frac{\text{count of } x}{n} \]

So:

Flavor Score    Probability
------------    -----------
80              1/5
82              1/5
85              2/5
90              1/5
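The counting steps above can be sketched directly in Python (a minimal sketch using only the standard library; the sample values match the example):

```python
from collections import Counter
from fractions import Fraction

sample = [80, 82, 85, 85, 90]
n = len(sample)

# Count how often each value appears, then divide each count by n
counts = Counter(sample)
empirical = {x: Fraction(c, n) for x, c in counts.items()}

for x in sorted(empirical):
    print(x, empirical[x])
```

Using `Fraction` keeps the probabilities exact (1/5, 2/5) instead of rounding them to floats.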

Interpretation

The empirical distribution answers:

“If I randomly pick one observation from my dataset, what is the probability of each value?”

You can think of it as:

  • Writing each observation on a ticket
  • Putting all tickets in a bag
  • Drawing one at random

Since 85 appears twice, it has twice the chance of being selected.


Why Counting Works

When we count, we are implicitly:

  • Assigning equal weight to each observation
  • Treating each observation as equally likely

This is equivalent to assuming a uniform distribution over the observed data points.

Frequencies then emerge naturally when values repeat.
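One way to see this equivalence: give every observation weight \(1/n\) and let repeated values accumulate their weights. The accumulated weights are exactly the relative-frequency table from before (a minimal sketch):

```python
from fractions import Fraction

sample = [80, 82, 85, 85, 90]
n = len(sample)

# Start from a uniform weight of 1/n per observation...
weights = {}
for x in sample:
    weights[x] = weights.get(x, Fraction(0)) + Fraction(1, n)

# ...and repeated values accumulate weight into frequencies.
print(weights[85])  # 2/5
print(weights[80])  # 1/5
```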


Connection to the Data Generating Process

In practice, data are generated by an unknown Data Generating Process (DGP), which defines a true distribution \(F\).

We observe a sample:

\[ X_1, X_2, \dots, X_n \sim F \]

The distribution \(F\) describes the mechanism that generates the data.

However, in practice:

  • \(F\) is unknown
  • We only observe a finite dataset

So while \(F\) defines the problem, we cannot work with it directly.
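Although \(F\) is unknown in practice, the relationship can be illustrated by pretending we know it: draw a sample from a known distribution and compare the empirical frequencies with the true probabilities (a sketch; the choice of a fair six-sided die as \(F\) is purely for illustration):

```python
import random
from collections import Counter

random.seed(1)

# Pretend the DGP is a fair six-sided die: P(X = k) = 1/6 for k = 1..6
n = 10_000
sample = [random.randint(1, 6) for _ in range(n)]

empirical = {k: c / n for k, c in Counter(sample).items()}
for k in sorted(empirical):
    print(k, round(empirical[k], 3))  # each close to 1/6 ≈ 0.167
```

With a large sample, each empirical probability lands close to the true value 1/6, which is why \(\hat{F}\) is a useful stand-in for \(F\).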


The Plug-in Principle

In statistics, many quantities of interest depend on the unknown distribution \(F\).

For example:

  • the mean
  • the variance
  • the sampling distribution of a statistic

The plug-in principle provides a simple idea:

To estimate a quantity defined in terms of \(F\), replace \(F\) with an estimate.

In the simplest case, we use the empirical distribution \(\hat{F}\), which places probability mass \(1/n\) on each observed data point.

So instead of working with \(F\), we work with:

\[ F \;\; \rightarrow \;\; \hat{F} \]

Importantly:

The plug-in principle does not remove the role of \(F\) — it approximates it using the data.
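Concretely, a plug-in estimate is the quantity of interest computed under \(\hat{F}\) instead of \(F\). For the mean and variance, this gives the sample average and the \(1/n\) (biased) sample variance (a minimal sketch on the earlier data):

```python
sample = [80, 82, 85, 85, 90]
n = len(sample)

# Plug-in mean: E_{F-hat}[X] = sum over x of x * P-hat(X = x) = sample average
mean_hat = sum(sample) / n

# Plug-in variance: E_{F-hat}[(X - mean)^2]; note the 1/n factor, not 1/(n-1)
var_hat = sum((x - mean_hat) ** 2 for x in sample) / n

print(mean_hat)  # 84.4
print(var_hat)
```

The \(1/n\) factor falls out automatically because \(\hat{F}\) puts mass \(1/n\) on each observation; the familiar \(1/(n-1)\) correction is a separate bias adjustment, not part of the plug-in recipe.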


Why This Matters

The empirical distribution is not just a way to summarize data.

It is a data-driven approximation of the true distribution.

In this sense, the empirical distribution acts as a proxy for the unknown data-generating process.

This allows us to:

  • compute estimates
  • approximate expectations
  • simulate new data

without making parametric assumptions.


Connection to Bootstrap

The bootstrap is a direct application of the plug-in principle.

  • The true sampling process depends on \(F\)
  • We replace \(F\) with \(\hat{F}\)
  • We simulate new samples from \(\hat{F}\)

In practice, this means:

Resampling the observed data is equivalent to sampling from the empirical distribution.
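Since sampling from \(\hat{F}\) means drawing from the observed data with replacement, the bootstrap reduces to resampling. A minimal sketch, estimating the variability of the sample mean (the seed and number of resamples are arbitrary choices):

```python
import random

random.seed(2)
sample = [80, 82, 85, 85, 90]
n = len(sample)

boot_means = []
for _ in range(2_000):
    # Sampling from F-hat = drawing n values with replacement from the data
    resample = random.choices(sample, k=n)
    boot_means.append(sum(resample) / n)

# The spread of the resampled means approximates the sampling
# variability of the mean under the unknown F
grand_mean = sum(boot_means) / len(boot_means)
se_hat = (sum((m - grand_mean) ** 2 for m in boot_means) / len(boot_means)) ** 0.5
print(round(se_hat, 2))
```

For comparison, the theoretical standard error under \(\hat{F}\) is \(\sqrt{11.44/5} \approx 1.51\), and the bootstrap estimate lands close to it.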


Key Idea

The empirical distribution is:

  • A nonparametric estimate of the unknown distribution
  • Defined entirely by the data
  • A proxy for the data-generating process
  • The foundation for resampling methods like the bootstrap

One-line Summary

The empirical distribution approximates the unknown data-generating process by assigning probability mass to observed data points, enabling computation and simulation through the plug-in principle.