Random Sample

A collection of independent and identically distributed random variables drawn from the same data-generating process.

Statistical analysis begins with a random phenomenon, where each observation arises from a repeatable process.

Random Phenomenon and Regularity

We consider processes that can be repeated under similar conditions:

Rolling two dice
Measuring the quality of a cup of coffee
Observing user behavior

Each repetition produces an outcome.

When this process is repeated many times, patterns begin to emerge:

The relative frequency of each outcome settles near a stable value
Regularities appear in the data

👉 This is known as statistical regularity

Example (dice):

Suppose we roll two dice many times and record the sum.

After a few rolls → results look irregular (e.g., 5, 9, 3, 11)
After many rolls → a pattern emerges

For example, after 10,000 rolls, we might observe:

Sum	Frequency
2	~2.8%
3	~5.6%
7	~16.7%
12	~2.8%

We see that:

Some outcomes occur more frequently than others in a stable way.

In particular, the sum 7 appears most often, with relative frequency close to \(1/6\): for two fair dice, 6 of the 36 equally likely ordered outcomes add up to 7.

This stable pattern is what allows us to model the process using a probability distribution.
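The emergence of stable frequencies can be checked with a small simulation. The sketch below assumes fair six-sided dice and uses Python's standard `random` module with a fixed seed. The function names `roll_two_dice_sum` and `relative_frequencies` are illustrative choices, not part of the original example.

```python
import random
from collections import Counter


def roll_two_dice_sum(rng):
    """Return the sum of two fair six-sided dice (fairness is an assumption)."""
    return rng.randint(1, 6) + rng.randint(1, 6)


def relative_frequencies(n_rolls, seed=0):
    """Roll two dice n_rolls times and return {sum: relative frequency}."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    counts = Counter(roll_two_dice_sum(rng) for _ in range(n_rolls))
    # Include every possible sum, even those that never occurred.
    return {s: counts[s] / n_rolls for s in range(2, 13)}


# Compare the relative frequency of a 7 for short and long runs.
for n in (10, 100, 10_000):
    freqs = relative_frequencies(n)
    print(f"n={n:>6}  freq(7)={freqs[7]:.3f}  (1/6 is about {1/6:.3f})")
```

With only a handful of rolls the frequency of 7 can be far from \(1/6\); as the number of rolls grows it typically settles close to that value, which is the regularity described above.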

The Data Generating Process (DGP)

We represent this process using a probability model:

\[ X \sim F \]

\(X\) represents the outcome of a single observation
\(F\) represents the distribution governing the process

👉 The Data Generating Process (DGP) is a mathematical abstraction of the mechanism that generates the data.
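For the two-dice example, \(F\) can be written down exactly if we assume both dice are fair. The following sketch enumerates the 36 equally likely ordered outcomes and collects the probability of each sum. The helper name `two_dice_sum_pmf` is hypothetical, and the fairness assumption is ours rather than something the data can guarantee.

```python
from fractions import Fraction
from itertools import product


def two_dice_sum_pmf():
    """Exact probability of each sum of two fair dice, by enumeration."""
    pmf = {}
    for a, b in product(range(1, 7), repeat=2):
        # Each ordered pair (a, b) has probability 1/36 under fairness.
        pmf[a + b] = pmf.get(a + b, Fraction(0)) + Fraction(1, 36)
    return pmf


pmf = two_dice_sum_pmf()
for s in sorted(pmf):
    print(s, pmf[s], f"{float(pmf[s]):.1%}")
```

The exact values for the sums 2, 3, 7 and 12 are 1/36, 1/18, 1/6 and 1/36, which match the approximate percentages in the table above.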

Definition of a Random Sample

To model repeated data collection, we define a sequence of random variables:

\[ X_1, X_2, \dots, X_n \sim F \]

such that:

Each \(X_i\) has the same distribution \(F\)
The variables are independent

Such a sequence is called a random sample. Equivalently, we say:

\(X_1, \dots, X_n\) are independent and identically distributed (i.i.d.)
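One standard way to state both conditions at once is through the joint distribution function, reading \(F\) as the cumulative distribution function of a single observation:

\[ P(X_1 \le x_1, \dots, X_n \le x_n) = \prod_{i=1}^{n} F(x_i) \]

The product form expresses independence, and the fact that every factor uses the same \(F\) expresses identical distribution.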

Interpretation

A random sample represents repeated draws from the same data-generating process.

Concretely, the i.i.d. assumption means:

Each \(X_i\) corresponds to one observation
All \(X_i\) have the same distribution \(F\) (identically distributed)
Knowing the value of one observation tells us nothing about any other (independent)

Importantly:

The random sample is a theoretical construct, not the observed data.

From Random Sample to Data

What we actually observe is a realization of the random sample:

\[ x_1, x_2, \dots, x_n \]

where each \(x_i\) is a realization of \(X_i\).

This realization is the dataset: the fixed values obtained once the data-generating process has produced its outcomes.
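The distinction between the random sample and its realization can be made concrete in code. In the minimal sketch below, the function `draw_sample` (a hypothetical name) plays the role of the random sample \(X_1, \dots, X_n\), again assuming fair dice, while each list it returns is one realization \(x_1, \dots, x_n\).

```python
import random


def draw_sample(n, seed):
    """Return one realization x_1, ..., x_n of n i.i.d. two-dice sums."""
    rng = random.Random(seed)
    # Each element is produced by the same mechanism, independently of the rest.
    return [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n)]


# Same data-generating process, two separate runs: two datasets.
dataset_a = draw_sample(5, seed=1)
dataset_b = draw_sample(5, seed=2)
print("realization A:", dataset_a)
print("realization B:", dataset_b)
```

The two printed lists will usually differ even though they come from the same process. The process and its distribution \(F\) are fixed; the dataset is whatever one particular run happened to produce.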

Connection to Data Science

In many applications, each observation contains multiple features. In this case:

Each \(X_i\) is a random vector
Each \(x_i\) is a row in the dataset

For example, in a coffee dataset:

\[ X_i = (\text{rating}, \text{acidity}, \text{body}) \]

Then:

Rows → realizations of \(X_i\)
Columns → components of \(X_i\)
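The row and column picture can be sketched as follows. The distributions used here are invented purely for illustration (uniform scores for acidity and body, and a rating that depends on both plus noise), and the names `CoffeeObservation` and `draw_coffee` are hypothetical.

```python
import random
from dataclasses import astuple, dataclass


@dataclass
class CoffeeObservation:
    rating: float
    acidity: float
    body: float


def draw_coffee(rng):
    """One draw of the random vector X_i from a made-up process."""
    acidity = rng.uniform(0, 10)
    body = rng.uniform(0, 10)
    # Components of the same vector may depend on each other.
    rating = 0.5 * acidity + 0.5 * body + rng.gauss(0, 1)
    return CoffeeObservation(rating, acidity, body)


rng = random.Random(0)
rows = [draw_coffee(rng) for _ in range(4)]  # each row: one realization x_i
columns = list(zip(*(astuple(r) for r in rows)))  # each column: one component

for row in rows:
    print(row)
```

Note that the i.i.d. assumption applies across rows, not across columns: in this sketch the rating is deliberately dependent on acidity and body within a row, while different rows are generated independently by the same mechanism.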

Key Idea

A random sample:

Formalizes repeated data collection
Connects data to the underlying process
Provides the foundation for statistical inference

One-line Summary

A random sample is a collection of independent and identically distributed random variables representing repeated draws from the same data-generating process.