Random Sample

A collection of independent and identically distributed random variables drawn from the same data-generating process.

Statistical analysis begins with a random phenomenon, where each observation arises from a repeatable process.

Random Phenomenon and Regularity

We consider processes that can be repeated under similar conditions:

  • Rolling two dice
  • Measuring the quality of a cup of coffee
  • Observing user behavior

Each repetition produces an outcome.

When this process is repeated many times, patterns begin to emerge:

  • Outcomes follow stable frequencies
  • Regularities appear in the data

👉 This is known as statistical regularity.

Example (dice):

Suppose we roll two dice many times and record the sum.

  • After a few rolls → results look irregular (e.g., 5, 9, 3, 11)
  • After many rolls → a pattern emerges

For example, after 10,000 rolls, we might observe:

Sum    Relative frequency
2      ~2.8%
3      ~5.6%
7      ~16.7%
12     ~2.8%

We see that:

Some outcomes occur more frequently than others in a stable way.

In particular, the sum 7 appears most often, with relative frequency close to \(1/6\), because 6 of the 36 equally likely ordered pairs of faces add up to 7.

This stable pattern is what allows us to model the process using a probability distribution.
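
The frequencies quoted in the dice example can be checked with a short simulation. The following is a minimal Python sketch, assuming two fair six-sided dice. The function names `exact_sum_probabilities` and `simulate_sum_frequencies` and the seed value are illustrative choices, not part of the original example.

```python
import random
from collections import Counter
from fractions import Fraction
from itertools import product


def exact_sum_probabilities():
    """Exact distribution of the sum of two fair dice, found by
    enumerating all 36 equally likely ordered pairs of faces."""
    counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    return {s: Fraction(c, 36) for s, c in sorted(counts.items())}


def simulate_sum_frequencies(n_rolls, seed=0):
    """Relative frequency of each sum over n_rolls simulated rolls."""
    rng = random.Random(seed)  # arbitrary seed, used only for reproducibility
    counts = Counter(
        rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls)
    )
    return {s: counts[s] / n_rolls for s in range(2, 13)}


if __name__ == "__main__":
    exact = exact_sum_probabilities()
    for n in (20, 10_000):
        freqs = simulate_sum_frequencies(n)
        print(f"n = {n}")
        for s in (2, 3, 7, 12):
            print(f"  sum {s:2d}: observed {freqs[s]:.3f}, "
                  f"exact {float(exact[s]):.3f}")
```

With a small number of rolls the observed frequencies can be far from the exact values. With many rolls they should settle close to them, which is the statistical regularity described above.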


The Data Generating Process (DGP)

We represent this process using a probability model:

\[ X \sim F \]

  • \(X\) represents the outcome of a single observation
  • \(F\) represents the distribution governing the process

👉 The Data Generating Process (DGP) is a mathematical abstraction of the mechanism that generates the data.


Definition of a Random Sample

To model repeated data collection, we define a sequence of random variables:

\[ X_1, X_2, \dots, X_n \sim F \]

such that:

  • Each \(X_i\) has the same distribution \(F\)
  • The variables are independent

This is called a random sample, or:

\(X_1, \dots, X_n\) are independent and identically distributed (i.i.d.)
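
The two conditions can also be written as a single statement about the joint distribution of the sample. As a sketch for real-valued observations, and assuming we also write \(F\) for the cumulative distribution function (the text above uses \(F\) only as a name for the distribution):

\[ P(X_1 \le x_1, \dots, X_n \le x_n) = \prod_{i=1}^{n} P(X_i \le x_i) = \prod_{i=1}^{n} F(x_i) \]

The first equality uses independence; the second uses the fact that every \(X_i\) has the same distribution \(F\).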


Interpretation

A random sample represents repeated draws from the same data-generating process.

Under the i.i.d. assumption:

  • Each \(X_i\) corresponds to one observation
  • All \(X_i\) have the same distribution \(F\) (identically distributed)
  • Knowing the value of one observation tells us nothing about the others (independent)

Importantly:

The random sample is a theoretical construct, not the observed data.


From Random Sample to Data

What we actually observe is a realization of the random sample:

\[ x_1, x_2, \dots, x_n \]

where each \(x_i\) is a realization of \(X_i\).

This realization forms the dataset: the observed values produced by one run of the data-generating process.
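
The distinction between the random sample and its realization can be sketched in code. In the minimal Python example below, the function plays the role of the random sample \(X_1, \dots, X_n\), and each call returns one realization \(x_1, \dots, x_n\). The DGP used here (the sum of two fair dice), the function name `draw_realization`, and the seeds are assumptions made for illustration.

```python
import random


def draw_realization(n, seed):
    """Return one realization (x_1, ..., x_n) of an i.i.d. sample
    of two-dice sums."""
    rng = random.Random(seed)
    # Every draw uses the same mechanism (identically distributed)
    # and fresh random numbers (independent of earlier draws).
    return [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n)]


if __name__ == "__main__":
    # The same process, run twice, yields two datasets that will
    # typically differ from each other.
    print(draw_realization(5, seed=1))
    print(draw_realization(5, seed=2))
```

The process itself does not change between calls; only the observed values do. This is why conclusions are drawn about \(F\) rather than about one particular dataset.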


Connection to Data Science

In many applications, each observation contains multiple features. In this case:

  • Each \(X_i\) is a random vector
  • Each \(x_i\) is a row in the dataset

For example, in a coffee dataset:

\[ X_i = (\text{rating}, \text{acidity}, \text{body}) \]

Then:

  • Rows → realizations of \(X_i\)
  • Columns → components of \(X_i\)
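
The row and column correspondence can be illustrated with a small simulated dataset. This is a minimal Python sketch: the `Coffee` type, the `draw_coffee` function, the 0 to 10 scale, and the uniform distributions are all hypothetical stand-ins, since the real data-generating process for coffee ratings is not specified here.

```python
import random
from typing import NamedTuple


class Coffee(NamedTuple):
    """One observation: the components of the random vector X_i."""
    rating: float
    acidity: float
    body: float


def draw_coffee(rng):
    """One realization x_i of X_i (hypothetical DGP, for illustration only)."""
    return Coffee(
        rating=round(rng.uniform(0, 10), 1),
        acidity=round(rng.uniform(0, 10), 1),
        body=round(rng.uniform(0, 10), 1),
    )


rng = random.Random(0)  # arbitrary seed, used only for reproducibility

# Rows: realizations x_1, ..., x_4 of the random vectors X_1, ..., X_4.
dataset = [draw_coffee(rng) for _ in range(4)]

# Columns: one component of X_i collected across all rows.
ratings, acidities, bodies = (list(col) for col in zip(*dataset))

for row in dataset:
    print(row)
print("ratings column:", ratings)
```

The i.i.d. assumption concerns the rows: different rows are independent of each other. Components within a single row are allowed to be dependent, even though this sketch draws them independently for simplicity.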

Key Idea

A random sample:

  • Formalizes repeated data collection
  • Connects data to the underlying process
  • Provides the foundation for statistical inference

One-line Summary

A random sample is a collection of independent and identically distributed random variables representing repeated draws from the same data-generating process.