Data Generating Process (DGP)
The data generating process (DGP) is the stochastic mechanism operated by Nature that gives rise to the observable data.
In statistical learning and econometrics, we assume that this mechanism can be represented by a probability law, namely the joint probability distribution
\[ P(X,Y) \]
of the random variables involved in the problem, where:
- \(X\) = vector of features (inputs)
- \(Y\) = target variable (output)
This probability law describes the probabilistic structure governing how the data are generated.
Equivalently, we express that the random vector \((X,Y)\) follows this distribution by writing
\[ (X,Y) \sim P(X,Y) \]
Therefore, each row (or observation) in a dataset is a realization (a draw) from the true, unknown DGP. This means that if a certain value of a random variable (e.g., \(X\) = type of coffee) is more common in the population, it will appear more often in the dataset.
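To make the idea of a "draw from the DGP" concrete, here is a small simulation sketch. The coffee types and their probabilities are invented for illustration: we play the role of Nature by fixing a known distribution, draw many realizations from it, and check that the more probable values indeed appear more often in the sample:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "true" DGP for X = type of coffee (probabilities are invented)
coffee_types = ["espresso", "filter", "cappuccino"]
true_probs = [0.6, 0.3, 0.1]

# Each element of the sample is one realization (draw) of X
sample = rng.choice(coffee_types, size=10_000, p=true_probs)

# Empirical frequencies in the sample approximate the true probabilities
for coffee, p in zip(coffee_types, true_probs):
    freq = np.mean(sample == coffee)
    print(f"{coffee}: true P = {p:.2f}, sample frequency = {freq:.2f}")
```

In real problems the roles are reversed: we only see the sample, and the probabilities are the unknown quantity we reason about.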
Important points:
- The DGP exists theoretically
- We do not know it
- We only observe samples generated from it
Below, we show a few ideas to make the DGP concept more concrete.
Realizations of the random variables
The distribution \(P(X,Y)\) describes the random variables \(X\) and \(Y\).
When we write
\[ P(X = x, Y = y) \]
we are evaluating the same probability law at a specific realization \((x,y)\) of those random variables.
In other words:
| Expression | Meaning |
|---|---|
| \(P(X,Y)\) | the joint probability law governing the variables |
| \(P(X=x,Y=y)\) | the probability of observing the realization \((x,y)\) |
Example
Suppose
- \(X\) = days since last purchase
- \(Y\) = customer reactivates (1 = yes, 0 = no)
Then
\[ P(X=10, Y=1) \]
represents the joint probability that the days since the last purchase equal 10 and the customer reactivates. Note that this is a joint probability, not the conditional probability \(P(Y=1 \mid X=10)\) of reactivating given 10 days of inactivity.
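Continuing the example, the sketch below draws a large sample from an assumed DGP (the distribution of days and the reactivation probabilities are invented for illustration) and estimates the joint probability \(P(X=10, Y=1)\) by its empirical frequency:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical DGP (all probabilities invented for illustration):
# X = days since last purchase, taking values 10, 30, or 90
days = rng.choice([10, 30, 90], size=100_000, p=[0.5, 0.3, 0.2])

# Reactivation probability P(Y=1 | X=x) decreases with inactivity
p_react = np.where(days == 10, 0.8, np.where(days == 30, 0.4, 0.1))
react = rng.random(100_000) < p_react

# Empirical estimate of the joint probability P(X=10, Y=1);
# the true value under this DGP is 0.5 * 0.8 = 0.40
joint_hat = np.mean((days == 10) & react)
print(f"P(X=10, Y=1) estimate: {joint_hat:.3f}")
```

With a large enough sample, the empirical frequency lands close to the joint probability fixed by the DGP.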
Random Sample from the DGP
A dataset is assumed to be a random sample from the DGP:
\[ (X_1,Y_1),\dots,(X_n,Y_n) \overset{\text{iid}}{\sim} P(X,Y) \]
This means the observations are:
- Independent → one observation does not influence another
- Identically distributed → all observations follow the same probability law \(P(X,Y)\)
Each row in the dataset is therefore a realization of the random vector \((X,Y)\) generated by Nature’s stochastic mechanism.
Understanding this assumption is crucial: most statistical learning methods rely on the idea that the observed dataset is a representative sample from the same underlying DGP.
Real-world example 1: Connection to Training and Test Sets
In machine learning we typically split the dataset into training and test sets.
Both sets are assumed to be samples from the same DGP:
- Training set → used to fit the model
- Test set → used to evaluate how the model performs on unseen data
Ideally,
\[ \text{Training set} \sim P(X,Y) \]
\[ \text{Test set} \sim P(X,Y) \]
The test set therefore acts as a proxy for future observations generated by the same stochastic mechanism.
If the training and test sets come from different distributions, model evaluation can become misleading.
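The sketch below illustrates this point with an invented DGP: \(X\) is Gaussian and \(Y\) follows a logistic model. When the test set is carved out of the same sample, accuracy reflects performance on future data; when the test data follow a different law (here, the \(X\)–\(Y\) relationship is reversed), the evaluation becomes misleading:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def draw(n, slope=2.0):
    """Draw n observations from a hypothetical DGP (invented for illustration)."""
    x = rng.normal(size=(n, 1))
    p = 1 / (1 + np.exp(-slope * x))        # P(Y=1 | X=x), logistic model
    y = (rng.random(n) < p.ravel()).astype(int)
    return x, y

X, y = draw(5_000)                           # one sample from the DGP
# Training and test sets are two parts of the same sample -> same DGP
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
print("accuracy, test set from the same DGP:", model.score(X_te, y_te))

# If the test data come from a *different* DGP (relationship reversed),
# the score no longer says anything useful about future performance
X_other, y_other = draw(2_000, slope=-2.0)
print("accuracy, test set from another DGP:", model.score(X_other, y_other))
```

The first score is an honest proxy for new draws from \(P(X,Y)\); the second collapses because the iid-from-the-same-DGP assumption is violated.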
Real-world example 2: Observations from the DGP
Suppose we are modeling whether a customer reactivates a coffee subscription.
| customer_id | country | days_since_last_purchase | reactivated |
|---|---|---|---|
| 101 | Brazil | 15 | 1 |
| 102 | USA | 240 | 0 |
| 103 | Brazil | 40 | 1 |
| 104 | Colombia | 180 | 0 |
| 105 | USA | 20 | 1 |
Here,
\[ X = (\text{country}, \text{days\_since\_last\_purchase}) \]
\[ Y = \text{reactivated} \]
Each row is a realization drawn from the unknown joint probability law \(P(X,Y)\), which governs how the business data are generated.
The dataset is therefore not the phenomenon itself, but rather a finite sample generated by the underlying stochastic mechanism.
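As a small sketch, the table above can be loaded as a dataset, and any statistic computed from it is then an estimate of a quantity defined by the unknown DGP, not the quantity itself:

```python
import pandas as pd

# The sample from the table above: five realizations of (X, Y)
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105],
    "country": ["Brazil", "USA", "Brazil", "Colombia", "USA"],
    "days_since_last_purchase": [15, 240, 40, 180, 20],
    "reactivated": [1, 0, 1, 0, 1],
})

# Sample statistics estimate DGP quantities: the sample mean of Y
# estimates P(Y=1), and per-country means estimate P(Y=1 | country)
print("estimate of P(Y=1):", df["reactivated"].mean())   # 3/5 = 0.6
print(df.groupby("country")["reactivated"].mean())
```

With only five rows these estimates are very noisy; a larger sample from the same DGP would pin them down more precisely.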
Real-world example 3: Cross-Validation and the DGP
Cross-validation also relies on the assumption that the dataset is a random sample from the same DGP.
Formally,
\[ (X_1,Y_1),\dots,(X_n,Y_n) \overset{\text{iid}}{\sim} P(X,Y) \]
Because of this assumption, we treat different subsets of the data as if they were new observations generated by the same probability law.
Dataset
| id | days_since_last_purchase | reactivated |
|---|---|---|
| 1 | 10 | 1 |
| 2 | 200 | 0 |
| 3 | 40 | 1 |
| 4 | 300 | 0 |
| 5 | 20 | 1 |
| 6 | 100 | 0 |
Each row satisfies
\[ (X_i,Y_i) \sim P(X,Y) \]
meaning every observation is a realization generated by the same DGP.
3-fold cross-validation split
The numbers below refer to the row identifiers (id) of the dataset above, indicating which observations are used in each training and validation split.
| Fold | Training data (ids) | Validation data (ids) |
|---|---|---|
| 1 | 3,4,5,6 | 1,2 |
| 2 | 1,2,5,6 | 3,4 |
| 3 | 1,2,3,4 | 5,6 |
For example, in Fold 1:
- the model is trained on observations 3,4,5,6
- it is evaluated on observations 1,2
At each iteration:
- the model is trained on the training subset
- its performance is evaluated on the validation subset
Because all observations are assumed to come from the same DGP, each validation fold acts as a proxy for new data generated by Nature.
Cross-validation therefore repeats the train/test logic multiple times, providing a more stable estimate of model performance.
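The 3-fold split above can be sketched with scikit-learn. `KFold` without shuffling assigns contiguous blocks of rows to the validation folds in order, which reproduces exactly the fold table shown in the text:

```python
import numpy as np
from sklearn.model_selection import KFold

# The six observations from the dataset above
ids = np.array([1, 2, 3, 4, 5, 6])
X = np.array([[10], [200], [40], [300], [20], [100]])  # days_since_last_purchase
y = np.array([1, 0, 1, 0, 1, 0])                        # reactivated

# Each validation fold stands in for "new" draws from the same P(X, Y);
# in practice a model is fit on train_idx and scored on val_idx each time
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=3).split(X), start=1):
    print(f"Fold {fold}: train on ids {ids[train_idx]}, "
          f"validate on ids {ids[val_idx]}")
```

Averaging the per-fold validation scores then yields the more stable performance estimate mentioned above.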