Caio Velasco

I’m a Mechanical Engineer from Brazil with a Master’s in Economics & Public Policy from UCLA. In 2021, I left a PhD in Economics in the Netherlands to take care of my family during the pandemic and transitioned fully into Data Science.

My journey blends engineering, economics, and data, with experience spanning Business Analysis, Consulting, Data & Analytics Engineering, and Data Science across the US, UK, Spain, and Brazil. I thrive at the intersection of advanced analytics and real-world impact.

Along the way, I’ve been honored with awards and scholarships from Yale University, UCLA, General Electric Foundation, Lemann Foundation, and The Club of Rome. Earlier in my career, I helped build Stone Payments (NASDAQ: STNE) and founded MePrepara, an online math prep platform with 140+ videos that helped low-income Brazilian students prepare for GRE/GMAT exams.

I bring not only strong technical skills (Python, SQL, dbt, Snowflake/Redshift, AWS, Looker, Advanced Mathematics, Statistics, Econometrics, and Machine Learning) but also entrepreneurial drive, teaching ability, and leadership. Education changed my life, and I aim to use data and technology to create the same opportunities for others.

Projects

See all projects below!

Data & Analytics Engineering


Data Observability for Raw Stripe Data in S3 with CI/CD

Check it out here!

This project provides a lightweight observability layer for raw Stripe data landing in S3 from Meltano ingestion. The goal is to give immediate confidence in the raw layer before any downstream transformations or analytics.
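To give a flavor of what such checks can look like, here is a minimal sketch (not the repository's actual code) that validates the freshness and file sizes of the raw objects with boto3; the bucket and prefix names are placeholders.

```python
from datetime import datetime, timedelta, timezone

import boto3  # assumes AWS credentials are available in the environment

# Hypothetical bucket/prefix where Meltano lands raw Stripe objects.
BUCKET = "my-raw-bucket"
PREFIX = "stripe/charges/"


def check_raw_layer(bucket: str, prefix: str, max_age_hours: int = 24) -> None:
    """Fail fast if the raw layer looks empty, stale, or contains zero-byte files."""
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    objects = response.get("Contents", [])

    assert objects, f"No raw files found under s3://{bucket}/{prefix}"
    assert all(obj["Size"] > 0 for obj in objects), "Found zero-byte raw files"

    newest = max(obj["LastModified"] for obj in objects)
    age = datetime.now(timezone.utc) - newest
    assert age < timedelta(hours=max_age_hours), f"Raw data is stale ({age} old)"


if __name__ == "__main__":
    check_raw_layer(BUCKET, PREFIX)
```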



Implementing an SCD Type 2 dimension from a CDC source using Snowflake's Stored Procedures and Data Quality Checks.

Check it out here!

This task involves implementing a Slowly Changing Dimension (SCD) Type 2 to track changes to a product’s status over time within Snowflake. The source for this dimension is a Change Data Capture (CDC) stream that logs all data modification events (DML operations) from a transactional system. The main goal is to maintain historical records of product status changes, based on an ordered and deduplicated stream of changes, ensuring idempotency, with basic data quality checks.
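The project implements this logic inside Snowflake via a stored procedure; purely as an illustration of the SCD Type 2 mechanics, here is a small pandas sketch on a toy CDC stream with hypothetical column names.

```python
import pandas as pd

# Toy CDC stream: one row per DML event on the products table (hypothetical columns).
cdc = pd.DataFrame({
    "product_id": [1, 1, 1, 2],
    "status":     ["active", "active", "discontinued", "active"],
    "changed_at": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-03-01", "2024-02-01"]),
})

# 1. Order and deduplicate the stream so re-running the load stays idempotent.
cdc = (cdc.sort_values(["product_id", "changed_at"])
          .drop_duplicates(["product_id", "status", "changed_at"]))

# 2. Keep only rows where the status actually changed.
changes = cdc[cdc.groupby("product_id")["status"].shift() != cdc["status"]].copy()

# 3. Build SCD Type 2 validity windows: valid_from / valid_to / is_current.
changes["valid_from"] = changes["changed_at"]
changes["valid_to"] = changes.groupby("product_id")["changed_at"].shift(-1)
changes["is_current"] = changes["valid_to"].isna()

print(changes[["product_id", "status", "valid_from", "valid_to", "is_current"]])
```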


Lead Quality Process (Reading Parquet and CSVs from S3 -> Transforming with Object-Oriented Design -> Postgres (Bronze, Silver, Gold Layers)).

Check it out here!

This project uses a Dockerized environment to extract both Parquet and CSV data from S3 buckets, then load and transform it in PostgreSQL, following the Medallion Architecture.
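As a rough illustration of the object-oriented design (not the project's actual classes), here is a minimal sketch that reads Parquet/CSV from S3 with pandas and loads it into a Medallion schema in Postgres; the bucket, credentials, and table names are placeholders, and s3fs is assumed to be installed.

```python
import pandas as pd
from sqlalchemy import create_engine


class S3Extractor:
    """Reads raw lead files (CSV or Parquet) straight from S3 via pandas/s3fs."""

    def __init__(self, bucket: str):
        self.bucket = bucket

    def read(self, key: str) -> pd.DataFrame:
        path = f"s3://{self.bucket}/{key}"
        return pd.read_parquet(path) if key.endswith(".parquet") else pd.read_csv(path)


class PostgresLoader:
    """Writes DataFrames into the bronze/silver/gold schemas of the Medallion layout."""

    def __init__(self, dsn: str):
        self.engine = create_engine(dsn)

    def load(self, df: pd.DataFrame, schema: str, table: str) -> None:
        df.to_sql(table, self.engine, schema=schema, if_exists="replace", index=False)


# Hypothetical usage: land raw leads into the bronze schema.
extractor = S3Extractor("my-leads-bucket")
loader = PostgresLoader("postgresql://user:password@localhost:5432/leads")
loader.load(extractor.read("raw/leads.parquet"), schema="bronze", table="leads")
```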


Part 1 of 2 - Leveraging dbt-DuckDB to perform an Ingestion Step (Postgres -> AWS S3 Bucket (Parquet)).

Check it out here!

This project uses a Dockerized environment to extract data from Postgres (as if it were data in “Production”). It then converts the data into Parquet files and saves them to an AWS S3 bucket. I used my AWS Free Tier account and the dbt-DuckDB adapter to extend dbt’s core function (the Transformation step) into an ingestion tool.
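Conceptually, the adapter lets DuckDB read straight from Postgres and write Parquet to S3. The sketch below shows that same flow directly through the DuckDB Python API (credentials, database, and bucket names are placeholders), which is roughly what happens under the hood of the dbt models.

```python
import duckdb

con = duckdb.connect()

# Extensions: postgres lets DuckDB query a live Postgres database,
# httpfs lets it write directly to S3.
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("INSTALL httpfs; LOAD httpfs;")

# Placeholder credentials for the "production" Postgres and the S3 bucket.
con.execute("SET s3_region='us-east-1';")
con.execute("SET s3_access_key_id='YOUR_KEY';")
con.execute("SET s3_secret_access_key='YOUR_SECRET';")
con.execute("ATTACH 'dbname=prod user=app password=secret host=localhost' AS pg (TYPE POSTGRES);")

# Extract a source table and land it in S3 as Parquet.
con.execute("""
    COPY (SELECT * FROM pg.public.orders)
    TO 's3://my-ingestion-bucket/raw/orders.parquet' (FORMAT PARQUET);
""")
```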


Part 2 of 2 - Leveraging dbt-Snowflake to perform a Transformation Step (Parquet in S3 -> Snowflake External Tables -> Transformation in Snowflake via dbt).

Check it out here!

This project uses a Dockerized environment to read Parquet files stored in S3 buckets. External tables were created in Snowflake following Snowflake’s Storage Integration and External Stage procedures. Then, dbt performs the Transformation step and materializes dimension and fact tables in the Silver layer and aggregated tables in the Gold schema, following the Medallion Architecture and Kimball’s Dimensional Modeling.
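For context, the Snowflake setup boils down to three statements: a storage integration, an external stage, and an external table. Here is an illustrative sketch (placeholder names and credentials, not the project's actual code) that runs them through the Snowflake Python connector; it assumes the target database and bronze schema already exist.

```python
import snowflake.connector

# Placeholder credentials; in practice these would live in environment variables.
con = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="transforming", database="analytics",
)
cur = con.cursor()

# 1. Storage integration: lets Snowflake assume an AWS role to read the bucket.
cur.execute("""
    CREATE STORAGE INTEGRATION IF NOT EXISTS s3_int
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = 'S3'
      ENABLED = TRUE
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-reader'
      STORAGE_ALLOWED_LOCATIONS = ('s3://my-ingestion-bucket/raw/');
""")

# 2. External stage pointing at the Parquet files produced in Part 1.
cur.execute("""
    CREATE STAGE IF NOT EXISTS bronze.raw_stage
      URL = 's3://my-ingestion-bucket/raw/'
      STORAGE_INTEGRATION = s3_int
      FILE_FORMAT = (TYPE = PARQUET);
""")

# 3. External table that dbt models can then select from and transform.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS bronze.orders_ext
      LOCATION = @bronze.raw_stage
      FILE_FORMAT = (TYPE = PARQUET);
""")
```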


ETL (Medallion Architecture and Kimball Dimensional Modeling) for Machine Learning (Churn Prediction), with Dockerized Postgres, Jupyter Notebook, and Python.

Check it out here!

I built an ETL pipeline using Python functions to perform each ETL step. This project runs within a Dockerized environment, using PostgreSQL as the database and Jupyter Notebook as a quick way to interact with the data and materialize schemas and tables. The ETL process followed the Medallion Architecture (bronze, silver, and gold schemas) and Kimball’s Dimensional Modeling (Star Schema).
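Here is a minimal sketch of that function-based approach (placeholder connection string, file, and column names; not the project's actual code).

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Placeholder connection string for the Dockerized Postgres instance.
engine = create_engine("postgresql://user:password@localhost:5432/churn")


def create_schemas() -> None:
    """Materialize the Medallion schemas (done from the notebook in the project)."""
    with engine.begin() as conn:
        for schema in ("bronze", "silver", "gold"):
            conn.execute(text(f"CREATE SCHEMA IF NOT EXISTS {schema};"))


def extract(csv_path: str) -> pd.DataFrame:
    """Extract: read a raw source file."""
    return pd.read_csv(csv_path)


def load_bronze(df: pd.DataFrame, table: str) -> None:
    """Load the raw data as-is into the bronze schema."""
    df.to_sql(table, engine, schema="bronze", if_exists="replace", index=False)


def transform_to_silver() -> None:
    """Transform: build a cleaned star-schema dimension in the silver schema."""
    with engine.begin() as conn:
        conn.execute(text("""
            CREATE TABLE IF NOT EXISTS silver.dim_customer AS
            SELECT DISTINCT customer_id, plan, signup_date
            FROM bronze.customers;
        """))


create_schemas()
load_bronze(extract("data/customers.csv"), table="customers")
transform_to_silver()
```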


Migrating ETL (Medallion Architecture and Kimball Dimensional Modeling) to dbt, with Dockerized Postgres, Jupyter Notebook, and dbt.

Check it out here!

I expanded a previous project to mimic a scenario where we want to migrate Python ETL processes to dbt, within a Dockerized environment. The data is extracted from multiple CSV files, and both the Transformation and Loading steps are done against PostgreSQL via dbt. The ETL process followed the Medallion Architecture (bronze, silver, and gold schemas) and Kimball’s Dimensional Modeling (Star Schema).
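Assuming dbt-core 1.5 or later, the migrated models can also be triggered from Python inside the same Dockerized environment; this is only an illustrative snippet, not part of the project.

```python
from dbt.cli.main import dbtRunner

# Run the silver-layer models from Python, equivalent to `dbt run --select silver`
# on the command line. Handy when orchestrating dbt from the notebook.
runner = dbtRunner()
result = runner.invoke(["run", "--select", "silver"])
print(result.success)
```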


ETL Pipeline from Crypto API to Tableau (CSV), with Dockerized Postgres, Jupyter Notebook, and Python.

Check it out here!

This ETL pipeline uses Python functions to perform ETL steps, extracting from an external API and transforming the data to be saved as CSV files for later use by Tableau or any other visualization tool. This project runs within a Dockerized environment, using PostgreSQL as a database and Jupyter Notebook as a quick way to interact with the data.
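As an illustration of the shape of this pipeline (not the project's actual code), here is a minimal sketch that pulls prices from a public endpoint and writes a Tableau-ready CSV; the API, parameters, and column names are assumptions.

```python
import pandas as pd
import requests

# Public CoinGecko endpoint used only as an illustration; the project's
# actual source API may differ.
URL = "https://api.coingecko.com/api/v3/coins/markets"


def extract(vs_currency: str = "usd") -> pd.DataFrame:
    """Extract: pull current market data for the top coins."""
    resp = requests.get(URL, params={"vs_currency": vs_currency, "per_page": 50}, timeout=30)
    resp.raise_for_status()
    return pd.DataFrame(resp.json())


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: keep only the columns the dashboard needs."""
    return df[["id", "symbol", "current_price", "market_cap", "last_updated"]]


def load(df: pd.DataFrame, path: str = "crypto_prices.csv") -> None:
    """Load: write a CSV that Tableau (or any BI tool) can read."""
    df.to_csv(path, index=False)


load(transform(extract()))
```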

Applied Data Science and Machine Learning


Causal Inference (Propensity Score Matching & Difference-in-Differences): Measuring the Effect of a New Recommendation System on an E-Commerce Marketplace

Check it out here!
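For readers unfamiliar with the technique, here is a self-contained sketch of propensity score matching on synthetic data (all column names and numbers are made up; this is not the notebook's code).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Tiny synthetic dataset purely for illustration: buyers exposed (or not) to the
# new recommendation system, with pre-treatment covariates and an outcome (spend).
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "past_orders": rng.poisson(3, n),
    "tenure_months": rng.integers(1, 48, n),
})
df["treated"] = rng.binomial(1, 1 / (1 + np.exp(-(0.2 * df["past_orders"] - 1))))
df["spend"] = 50 + 4 * df["past_orders"] + 10 * df["treated"] + rng.normal(0, 10, n)

# 1. Estimate propensity scores: P(treated | covariates).
covariates = ["past_orders", "tenure_months"]
ps_model = LogisticRegression(max_iter=1000).fit(df[covariates], df["treated"])
df["pscore"] = ps_model.predict_proba(df[covariates])[:, 1]

# 2. 1-to-1 nearest-neighbour matching of controls to treated units on the score.
treated, control = df[df["treated"] == 1], df[df["treated"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
_, idx = nn.kneighbors(treated[["pscore"]])
matched_control = control.iloc[idx.ravel()]

# 3. Compare outcomes on the matched sample (the notebook then adds a DiD step).
print(treated["spend"].mean() - matched_control["spend"].mean())
```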


Causal Inference (Difference-in-Differences): Measuring the Effect of a New Customer-Satisfaction Program on an Airline Company

Check it out here!
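Likewise, a minimal difference-in-differences sketch on a synthetic customer panel (illustrative only; the coefficient on the interaction term is the DiD estimate).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Tiny synthetic panel purely for illustration: one row per customer per period,
# with a treatment flag, a post-launch flag, and a satisfaction score.
rng = np.random.default_rng(0)
n = 500
panel = pd.DataFrame({
    "customer_id": np.repeat(np.arange(n // 2), 2),
    "treated": np.repeat(rng.integers(0, 2, n // 2), 2),
    "post": np.tile([0, 1], n // 2),
})
panel["satisfaction"] = (
    60 + 5 * panel["treated"] + 2 * panel["post"]
    + 3 * panel["treated"] * panel["post"]        # "true" program effect = 3
    + rng.normal(0, 5, n)
)

# The coefficient on treated:post is the difference-in-differences estimate
# of the customer-satisfaction program's effect, with errors clustered by customer.
did = smf.ols("satisfaction ~ treated + post + treated:post", data=panel).fit(
    cov_type="cluster", cov_kwds={"groups": panel["customer_id"]}
)
print(did.params["treated:post"])
```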


Data Analytics with Python (Best Practices)


Data Cleaning - Preparing Categorical Data for Modeling

Check it out here!

When datasets are large, training and prediction can become slow and memory-hungry. We want to make sure the data is stored efficiently without having to change the size of the dataset.
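A minimal example of the idea, using the pandas categorical dtype on made-up data.

```python
import numpy as np
import pandas as pd

# Toy example: a large column with only a few distinct values (hypothetical data).
df = pd.DataFrame({"plan": np.random.choice(["free", "basic", "premium"], size=1_000_000)})

before = df["plan"].memory_usage(deep=True)
df["plan"] = df["plan"].astype("category")   # store integer codes + a small lookup table
after = df["plan"].memory_usage(deep=True)

print(f"object: {before / 1e6:.1f} MB -> category: {after / 1e6:.1f} MB")
```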


Data Cleaning - Parsing Date and Time Zone for Modeling

Check it out here!

Best practices for cleaning dates, times, and time zones.
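A small illustrative example (assuming pandas): parse mixed offsets into UTC and convert to a local time zone only at presentation time.

```python
import pandas as pd

# Hypothetical raw timestamps with mixed offsets, as they often arrive from source systems.
raw = pd.Series(["2024-03-01 10:00:00+00:00", "2024-03-01 07:05:00-03:00"])

# Parse everything into a single timezone-aware dtype (UTC)...
ts_utc = pd.to_datetime(raw, utc=True)

# ...and only convert to a local time zone for display or reporting.
ts_local = ts_utc.dt.tz_convert("America/Sao_Paulo")
print(ts_local)
```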


Data Analysis and Inferential Statistics with Python

Check it out here!


Sharing Knowledge

As a hobby, I play football competitively (as a forward); it’s my passion. I have played in amateur leagues in Brazil, the USA, and the Netherlands. I also have a strong passion for teaching and educating others. A personal characteristic I am proud of is the ability to transform very complex subjects into intuitive topics for any audience.

I find happiness in the little things in life, and I have learned a lot from every mistake I have made so far (and still do).


Statistics & Data Science

I have a passion for teaching, and I have been trained by amazing professors at top-notch universities around the globe.

Therefore, I have started to write a book that belongs to a (future) course I call “An Intuitive Course in Probability (and Statistics), for Data Science”. The idea is to provide strong intuition for every major concept while keeping the mathematical formalization and rigor very close. I had this idea after taking a Probability Theory course from MIT; I am a fan. It will be available in both English and Portuguese.

Please check the English version here and the Portuguese version aqui!

It’s a work in progress, so you may find only part of Chapter 1 for now.

Math

Sometimes, I try to contribute to interesting communities. You can check out an example below.