Random Sample
Statistical analysis begins with a random phenomenon, where each observation arises from a repeatable process.
Random Phenomenon and Regularity
We consider processes that can be repeated under similar conditions:
- Rolling two dice
- Measuring the quality of a coffee
- Observing user behavior
Each repetition produces an outcome.
When this process is repeated many times, patterns begin to emerge:
- Outcomes follow stable frequencies
- Regularities appear in the data
This is known as statistical regularity.
Example (dice):
Suppose we roll two dice many times and record the sum.
- After a few rolls → results look irregular (e.g., 5, 9, 3, 11)
- After many rolls → a pattern emerges
For example, after 10,000 rolls, we might observe relative frequencies close to the following (selected sums shown):
| Sum | Relative frequency |
|---|---|
| 2 | ~2.8% |
| 3 | ~5.6% |
| 7 | ~16.7% |
| 12 | ~2.8% |
We see that some outcomes occur more frequently than others, and that their relative frequencies are stable.
In particular, the sum 7 appears most often, with frequency close to \(1/6\).
This stable pattern is what allows us to model the process using a probability distribution.
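The emergence of stable frequencies can be checked with a short simulation. The following is a minimal sketch, assuming fair six-sided dice and Python's standard pseudo-random generator; the function name `simulate_dice_sums` and the seed are illustrative choices, not part of the original notes.

```python
import random
from collections import Counter


def simulate_dice_sums(n_rolls, seed=0):
    """Roll two fair dice n_rolls times; return the relative frequency of each sum."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    counts = Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls))
    return {s: counts[s] / n_rolls for s in range(2, 13)}


# Exact probabilities for comparison: count the (a, b) pairs giving each sum, out of 36.
exact = {
    s: sum(1 for a in range(1, 7) for b in range(1, 7) if a + b == s) / 36
    for s in range(2, 13)
}

few = simulate_dice_sums(20)       # few rolls: frequencies are erratic
many = simulate_dice_sums(10_000)  # many rolls: frequencies settle near `exact`

for s in range(2, 13):
    print(f"sum={s:2d}  20 rolls: {few[s]:.3f}  10,000 rolls: {many[s]:.3f}  exact: {exact[s]:.3f}")
```

With only 20 rolls the frequencies typically bear little resemblance to the exact values, while with 10,000 rolls they tend to agree to about two decimal places.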
The Data Generating Process (DGP)
We represent this process using a probability model:
\[ X \sim F \]
- \(X\) represents the outcome of a single observation
- \(F\) represents the distribution governing the process
The Data Generating Process (DGP) is a mathematical abstraction of the mechanism that generates the data.
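To make \(F\) concrete for the dice example, assume both dice are fair, so all 36 ordered pairs of faces are equally likely. Under that assumption, the distribution of the sum \(X\) can be written as:

\[ P(X = k) = \frac{6 - |k - 7|}{36}, \qquad k = 2, 3, \dots, 12 \]

This gives \(P(X = 2) = P(X = 12) = 1/36 \approx 2.8\%\), \(P(X = 3) = 2/36 \approx 5.6\%\), and \(P(X = 7) = 6/36 \approx 16.7\%\), matching the long-run frequencies in the table above.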
Definition of a Random Sample
To model repeated data collection, we define a sequence of random variables:
\[ X_1, X_2, \dots, X_n \sim F \]
such that:
- Each \(X_i\) has the same distribution \(F\)
- The variables are independent
This is called a random sample, or:
\(X_1, \dots, X_n\) are independent and identically distributed (i.i.d.)
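The two conditions together determine the joint distribution of the whole sample. As a worked step, treating \(F\) as the common cumulative distribution function:

\[ P(X_1 \le x_1, \dots, X_n \le x_n) = \prod_{i=1}^{n} P(X_i \le x_i) = \prod_{i=1}^{n} F(x_i) \]

The first equality uses independence and the second uses identical distribution. In the dice example, assuming fair dice, the probability that the first two recorded sums are both 7 is \(\tfrac{1}{6} \cdot \tfrac{1}{6} = \tfrac{1}{36}\).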
Interpretation
A random sample represents repeated draws from the same data-generating process.
Formally, this is the i.i.d. assumption, which means:
- Each \(X_i\) corresponds to one observation
- All \(X_i\) have the same distribution \(F\) (identically distributed)
- No observation influences another (independent)
Importantly:
The random sample is a theoretical construct, not the observed data.
From Random Sample to Data
What we actually observe is a realization of the random sample:
\[ x_1, x_2, \dots, x_n \]
where each \(x_i\) is a realization of \(X_i\).
This realization is the dataset: the concrete values produced when the data-generating process generates the random sample.
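The distinction between the random sample and its realization can be illustrated in code. In this sketch, the hypothetical function `draw_sample` plays the role of the random sample \(X_1, \dots, X_n\) (a recipe for producing data), and each call with a given seed returns one realization \(x_1, \dots, x_n\). The two-dice DGP and the seeds are assumptions chosen for illustration.

```python
import random


def draw_sample(n, seed):
    """Return one realization (x_1, ..., x_n) of a random sample of two-dice sums."""
    rng = random.Random(seed)
    # Each draw uses the same mechanism (identically distributed) and
    # does not depend on earlier draws (independent).
    return [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n)]


# Same random sample (same DGP, same n), two different realizations.
dataset_a = draw_sample(5, seed=1)
dataset_b = draw_sample(5, seed=2)

print("realization A:", dataset_a)
print("realization B:", dataset_b)
```

Both lists are datasets of the same kind; in general they contain different numbers, because each is only one possible outcome of the same underlying process.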
Connection to Data Science
In many applications, each observation contains multiple features. In this case:
- Each \(X_i\) is a random vector
- Each \(x_i\) is a row in the dataset
For example, in a coffee dataset:
\[ X_i = (\text{rating}_i, \text{acidity}_i, \text{body}_i) \]
Then:
- Rows → realizations of \(X_i\)
- Columns → components of \(X_i\)
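The row and column correspondence can be sketched with a small synthetic dataset. Everything numeric here is made up for illustration: the Gaussian distributions, their means and spreads, and the helper name `draw_coffee_observation` are assumptions, not properties of any real coffee data. For simplicity the three components are drawn independently of each other, although the i.i.d. assumption only concerns independence across observations, not within one.

```python
import random

FEATURES = ("rating", "acidity", "body")


def draw_coffee_observation(rng):
    """Return one realization x_i of the random vector X_i (hypothetical distributions)."""
    return (
        rng.gauss(7.5, 0.5),  # rating: made-up mean and standard deviation
        rng.gauss(7.0, 0.6),  # acidity: made-up mean and standard deviation
        rng.gauss(7.2, 0.4),  # body: made-up mean and standard deviation
    )


rng = random.Random(42)
dataset = [draw_coffee_observation(rng) for _ in range(4)]  # n = 4 observations

# Rows: dataset[i] is the realization of one random vector X_i.
first_row = dataset[0]

# Columns: the j-th component collected across all observations.
columns = {name: [row[j] for row in dataset] for j, name in enumerate(FEATURES)}

print("first row:", first_row)
print("rating column:", columns["rating"])
```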
Key Idea
A random sample:
- Formalizes repeated data collection
- Connects data to the underlying process
- Provides the foundation for statistical inference
One-line Summary
A random sample is a collection of independent and identically distributed random variables representing repeated draws from the same data-generating process.