Flight Delay Prediction at Scale
Scalable ML pipeline on Databricks integrating 28M flights + 131M weather records with blocked time-series cross-validation.

The Problem
Context & Challenge
Airlines and passengers need accurate delay predictions, but flight data is massive, class-imbalanced, and time-ordered. The challenge was to build a scalable ML pipeline that avoids temporal leakage, so that the model is never validated on flights that precede its training data.
The Approach
Architecture & Implementation
Built a distributed pipeline on Databricks/Spark integrating 28M flight records with 131M NOAA weather observations. Engineered 221 features, including airport PageRank centrality. Used 5-fold blocked time-series cross-validation, with SMOTE and class rebalancing to address class imbalance.
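The blocked time-series split can be illustrated with a minimal sketch in plain Python. It assumes rows are already sorted by time and that each block is cut into an earlier training portion and a later validation portion. The function name, the `train_frac` parameter, and the 80/20 cut are hypothetical choices for illustration, not the project's actual configuration.

```python
from typing import List, Tuple


def blocked_time_series_splits(
    n_samples: int, n_folds: int = 5, train_frac: float = 0.8
) -> List[Tuple[range, range]]:
    """Split time-ordered row indices into contiguous, non-overlapping blocks.

    Within each block the earliest rows are used for training and the
    latest rows for validation, so validation never precedes training.
    The 80/20 default is an assumption, not a value from the project.
    """
    if n_folds < 1 or n_samples < 2 * n_folds:
        raise ValueError("need at least two rows per fold")
    if not 0.0 < train_frac < 1.0:
        raise ValueError("train_frac must be strictly between 0 and 1")

    splits = []
    for fold in range(n_folds):
        # Block boundaries; integer arithmetic spreads any remainder evenly.
        start = fold * n_samples // n_folds
        stop = (fold + 1) * n_samples // n_folds
        block_len = stop - start
        # Keep at least one row on each side of the cut.
        n_train = min(max(int(block_len * train_frac), 1), block_len - 1)
        cut = start + n_train
        splits.append((range(start, cut), range(cut, stop)))
    return splits


# Toy usage: 100 time-ordered rows split into 5 blocks.
splits = blocked_time_series_splits(n_samples=100, n_folds=5)
for train_idx, val_idx in splits:
    print(f"train rows {train_idx.start}..{train_idx.stop - 1}, "
          f"validation rows {val_idx.start}..{val_idx.stop - 1}")
```

Under this kind of scheme, resampling such as SMOTE would normally be fitted on the training portion of each block only, so that synthetic minority examples never draw on validation rows.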
The Results
Impact & Metrics
Achieved a 54.6% F1 score on 3-class prediction (Early/On-Time/Delayed) using a 6-layer MLP. The pipeline processed 160M+ combined records in 17-30 hours on a 5-10 node cluster.
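For readers unfamiliar with multi-class F1, the sketch below shows one way such a score can be computed for the three classes. The summary does not state which averaging was used, so the macro average here is an assumption, and the labels in the toy example are made up purely for illustration.

```python
from typing import Sequence

CLASSES = ("Early", "On-Time", "Delayed")


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 for one class, treating that class as positive and all others as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str] = CLASSES) -> float:
    """Unweighted mean of per-class F1 (macro averaging, assumed here)."""
    return sum(per_class_f1(y_true, y_pred, label) for label in labels) / len(labels)


# Made-up labels for illustration only; not project data.
y_true = ["Early", "On-Time", "Delayed", "Delayed", "On-Time", "Early"]
y_pred = ["Early", "On-Time", "Delayed", "On-Time", "On-Time", "Delayed"]
print(f"macro F1 on toy labels: {macro_f1(y_true, y_pred):.3f}")
```

Macro averaging weights each class equally, which matters on imbalanced data: a model that ignores a rare class is penalized even if overall accuracy stays high.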