Data Generating Process (DGP)
The data generating process (DGP) is the stochastic mechanism operated by Nature that gives rise to the observable data.
In statistical learning and econometrics, we assume that this mechanism can be represented by a probability law, namely the joint probability distribution
\[ P(X,Y) \]
of the random variables involved in the problem, where:
- \(X\) = vector of features (inputs)
- \(Y\) = target variable (output)
This probability law describes the probabilistic structure governing how the data are generated.
Equivalently, we express that the random vector \((X,Y)\) follows this distribution by writing
\[ (X,Y) \sim P(X,Y) \]
Therefore, each row (or observation) in a dataset is a realization (a draw) from the true, unknown DGP. This means that if a certain value of a random variable (e.g., \(X\) = type of coffee) is more common in the population, it will appear more often in the dataset.
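To make the idea of a "draw from the DGP" concrete, here is a small simulation sketch. The coffee types and their probabilities are invented for illustration: we play the role of Nature by fixing a known distribution, draw many realizations from it, and check that the more probable values indeed appear more often in the sample:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "true" DGP for X = type of coffee (probabilities are invented)
coffee_types = ["espresso", "filter", "cappuccino"]
true_probs = [0.6, 0.3, 0.1]

# Each element of the sample is one realization (draw) of X
sample = rng.choice(coffee_types, size=10_000, p=true_probs)

# Empirical frequencies in the sample approximate the true probabilities
for coffee, p in zip(coffee_types, true_probs):
    freq = np.mean(sample == coffee)
    print(f"{coffee}: true P = {p:.2f}, sample frequency = {freq:.2f}")
```

In real problems the roles are reversed: we only see the sample, and the probabilities are the unknown quantity we reason about.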
Important points:
- The DGP exists theoretically
- We do not know it
- We only observe samples generated from it
Below, we show a few ideas to make the DGP concept more concrete.
Realizations of the random variables
The distribution \(P(X,Y)\) describes the random variables \(X\) and \(Y\).
When we write
\[ P(X = x, Y = y) \]
we are evaluating the same probability law at a specific realization \((x,y)\) of those random variables.
In other words:
| Expression | Meaning |
|---|---|
| \(P(X,Y)\) | the joint probability law governing the variables |
| \(P(X=x,Y=y)\) | the probability of observing the realization \((x,y)\) |
Example
Suppose
- \(X\) = days since last purchase
- \(Y\) = customer reactivates (1 = yes, 0 = no)
Then
\[ P(X=10, Y=1) \]
represents the joint probability that the days since the last purchase equal 10 and the customer reactivates. Note that this is a joint probability, not the conditional probability \(P(Y=1 \mid X=10)\) of reactivating given 10 days of inactivity.
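Continuing the example, the sketch below draws a large sample from an assumed DGP (the distribution of days and the reactivation probabilities are invented for illustration) and estimates the joint probability \(P(X=10, Y=1)\) by its empirical frequency:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical DGP (all probabilities invented for illustration):
# X = days since last purchase, taking values 10, 30, or 90
days = rng.choice([10, 30, 90], size=100_000, p=[0.5, 0.3, 0.2])

# Reactivation probability P(Y=1 | X=x) decreases with inactivity
p_react = np.where(days == 10, 0.8, np.where(days == 30, 0.4, 0.1))
react = rng.random(100_000) < p_react

# Empirical estimate of the joint probability P(X=10, Y=1);
# the true value under this DGP is 0.5 * 0.8 = 0.40
joint_hat = np.mean((days == 10) & react)
print(f"P(X=10, Y=1) estimate: {joint_hat:.3f}")
```

With a large enough sample, the empirical frequency lands close to the joint probability fixed by the DGP.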
Random Sample from the DGP
A dataset is assumed to be a random sample from the DGP:
\[ (X_1,Y_1),\dots,(X_n,Y_n) \overset{\text{iid}}{\sim} P(X,Y) \]
This means the observations are:
- Independent → one observation does not influence another
- Identically distributed → all observations follow the same probability law \(P(X,Y)\)
Each row in the dataset is therefore a realization of the random vector \((X,Y)\) generated by Nature’s stochastic mechanism.
Understanding this assumption is crucial: most statistical learning methods rely on the idea that the observed dataset is a representative sample from the same underlying DGP.
Real-world example 1: Connection to Training and Test Sets
In machine learning we typically split the dataset into training and test sets.
Both sets are assumed to be samples from the same DGP:
- Training set → used to fit the model
- Test set → used to evaluate how the model performs on unseen data
Ideally,
\[ \text{Training set} \sim P(X,Y) \]
\[ \text{Test set} \sim P(X,Y) \]
The test set therefore acts as a proxy for future observations generated by the same stochastic mechanism.
If the training and test sets come from different distributions, model evaluation can become misleading.
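The sketch below illustrates this point with an invented DGP: \(X\) is Gaussian and \(Y\) follows a logistic model. When the test set is carved out of the same sample, accuracy reflects performance on future data; when the test data follow a different law (here, the \(X\)–\(Y\) relationship is reversed), the evaluation becomes misleading:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def draw(n, slope=2.0):
    """Draw n observations from a hypothetical DGP (invented for illustration)."""
    x = rng.normal(size=(n, 1))
    p = 1 / (1 + np.exp(-slope * x))        # P(Y=1 | X=x), logistic model
    y = (rng.random(n) < p.ravel()).astype(int)
    return x, y

X, y = draw(5_000)                           # one sample from the DGP
# Training and test sets are two parts of the same sample -> same DGP
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
print("accuracy, test set from the same DGP:", model.score(X_te, y_te))

# If the test data come from a *different* DGP (relationship reversed),
# the score no longer says anything useful about future performance
X_other, y_other = draw(2_000, slope=-2.0)
print("accuracy, test set from another DGP:", model.score(X_other, y_other))
```

The first score is an honest proxy for new draws from \(P(X,Y)\); the second collapses because the iid-from-the-same-DGP assumption is violated.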
Real-world example 2: Observations from the DGP
Suppose we are modeling whether a customer reactivates a coffee subscription.
| customer_id | country | days_since_last_purchase | reactivated |
|---|---|---|---|
| 101 | Brazil | 15 | 1 |
| 102 | USA | 240 | 0 |
| 103 | Brazil | 40 | 1 |
| 104 | Colombia | 180 | 0 |
| 105 | USA | 20 | 1 |
Here,
\[ X = (\text{country}, \text{days\_since\_last\_purchase}) \]
\[ Y = \text{reactivated} \]
Each row is a realization drawn from the unknown joint probability law \(P(X,Y)\), which governs how the business data are generated.
The dataset is therefore not the phenomenon itself, but rather a finite sample generated by the underlying stochastic mechanism.
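As a small sketch, the table above can be loaded as a dataset, and any statistic computed from it is then an estimate of a quantity defined by the unknown DGP, not the quantity itself:

```python
import pandas as pd

# The sample from the table above: five realizations of (X, Y)
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105],
    "country": ["Brazil", "USA", "Brazil", "Colombia", "USA"],
    "days_since_last_purchase": [15, 240, 40, 180, 20],
    "reactivated": [1, 0, 1, 0, 1],
})

# Sample statistics estimate DGP quantities: the sample mean of Y
# estimates P(Y=1), and per-country means estimate P(Y=1 | country)
print("estimate of P(Y=1):", df["reactivated"].mean())   # 3/5 = 0.6
print(df.groupby("country")["reactivated"].mean())
```

With only five rows these estimates are very noisy; a larger sample from the same DGP would pin them down more precisely.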
Real-world example 3: Cross-Validation and the DGP
Cross-validation also relies on the assumption that the dataset is a random sample from the same DGP.
Formally,
\[ (X_1,Y_1),\dots,(X_n,Y_n) \overset{\text{iid}}{\sim} P(X,Y) \]
Because of this assumption, we treat different subsets of the data as if they were new observations generated by the same probability law.
Dataset
| id | days_since_last_purchase | reactivated |
|---|---|---|
| 1 | 10 | 1 |
| 2 | 200 | 0 |
| 3 | 40 | 1 |
| 4 | 300 | 0 |
| 5 | 20 | 1 |
| 6 | 100 | 0 |
Each row satisfies
\[ (X_i,Y_i) \sim P(X,Y) \]
meaning every observation is a realization generated by the same DGP.
3-fold cross-validation split
The numbers below refer to the row identifiers (id) of the dataset above, indicating which observations are used in each training and validation split.
| Fold | Training data (ids) | Validation data (ids) |
|---|---|---|
| 1 | 3,4,5,6 | 1,2 |
| 2 | 1,2,5,6 | 3,4 |
| 3 | 1,2,3,4 | 5,6 |
For example, in Fold 1:
- the model is trained on observations 3,4,5,6
- it is evaluated on observations 1,2
At each iteration:
- the model is trained on the training subset
- its performance is evaluated on the validation subset
Because all observations are assumed to come from the same DGP, each validation fold acts as a proxy for new data generated by Nature.
Cross-validation therefore repeats the train/test logic multiple times, providing a more stable estimate of model performance.
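The 3-fold split above can be sketched with scikit-learn. `KFold` without shuffling assigns contiguous blocks of rows to the validation folds in order, which reproduces exactly the fold table shown in the text:

```python
import numpy as np
from sklearn.model_selection import KFold

# The six observations from the dataset above
ids = np.array([1, 2, 3, 4, 5, 6])
X = np.array([[10], [200], [40], [300], [20], [100]])  # days_since_last_purchase
y = np.array([1, 0, 1, 0, 1, 0])                        # reactivated

# Each validation fold stands in for "new" draws from the same P(X, Y);
# in practice a model is fit on train_idx and scored on val_idx each time
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=3).split(X), start=1):
    print(f"Fold {fold}: train on ids {ids[train_idx]}, "
          f"validate on ids {ids[val_idx]}")
```

Averaging the per-fold validation scores then yields the more stable performance estimate mentioned above.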