Flight Delay Prediction at Scale
Scalable ML pipeline on Databricks integrating 28M flights + 131M weather records with blocked time-series cross-validation.

The Problem
Massive Imbalanced Data
Airlines and passengers need accurate delay predictions, but flight data is massive, imbalanced, and temporally sensitive. The challenge was to build a scalable ML pipeline that avoids temporal leakage, where a model is trained on information that would not have been available at prediction time.
The Approach
Distributed Spark Pipeline
Built a distributed pipeline on Databricks/Spark integrating 28M flight records + 131M NOAA weather observations. Engineered 221 features including airport PageRank centrality. Used blocked time-series cross-validation (5 folds) with SMOTE and class rebalancing.
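A minimal sketch of how blocked time-series cross-validation can be set up, written in plain Python for illustration rather than as the project's Spark code. It assumes rows are already sorted by time. The function name `blocked_time_series_splits` and the `train_frac` parameter are hypothetical. Each fold is a contiguous block of time, and within a block the earlier rows train the model while the later rows validate it, so validation data never precedes training data.

```python
from typing import Iterator, List, Tuple


def blocked_time_series_splits(
    n_samples: int, n_folds: int = 5, train_frac: float = 0.8
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, val_idx) pairs for rows sorted by time.

    The index range is cut into n_folds contiguous blocks. Inside each
    block, the earliest rows are used for training and the remaining
    (later) rows for validation. train_frac is an illustrative choice.
    """
    if n_folds < 1 or n_samples < 2 * n_folds:
        raise ValueError("need at least two rows per fold")
    if not 0.0 < train_frac < 1.0:
        raise ValueError("train_frac must be strictly between 0 and 1")

    block_size = n_samples // n_folds
    for k in range(n_folds):
        start = k * block_size
        # The last block absorbs any leftover rows.
        end = n_samples if k == n_folds - 1 else start + block_size
        size = end - start
        # Keep at least one row on each side of the cut.
        n_train = min(max(1, int(size * train_frac)), size - 1)
        cut = start + n_train
        yield list(range(start, cut)), list(range(cut, end))


if __name__ == "__main__":
    for fold, (train_idx, val_idx) in enumerate(blocked_time_series_splits(100)):
        print(fold, train_idx[0], train_idx[-1], val_idx[0], val_idx[-1])
```

With this kind of split, oversampling such as SMOTE would normally be applied only to the training rows of each fold, so that synthetic samples never leak into the validation rows.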
The Results
28M Records Processed
Achieved a 54.6% F1 score on three-class (Early/On-Time/Delayed) prediction with a 6-layer MLP. The pipeline processed 28M+ records in 17–30 hours on a 5–10 node cluster. Probability recalibration improved F1 by ~6% across 9 models by correcting the bias introduced by oversampling, so that predicted probabilities match the true class rates.
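One common way to correct oversampling bias is a prior-shift adjustment: each predicted class probability is multiplied by the ratio of the true class rate to the class rate seen in training, and the result is renormalized. The sketch below illustrates that idea and is not necessarily the exact recalibration the project used. The function name `recalibrate` and the class rates in the usage example are hypothetical.

```python
from typing import List, Sequence


def recalibrate(
    probs: Sequence[float],
    train_priors: Sequence[float],
    true_priors: Sequence[float],
) -> List[float]:
    """Adjust class probabilities from a model trained on rebalanced data.

    probs:        model output for one example, one value per class
    train_priors: class rates in the (oversampled) training set
    true_priors:  class rates in the original, unbalanced data
    """
    if not (len(probs) == len(train_priors) == len(true_priors)):
        raise ValueError("all inputs must have one entry per class")
    if any(s <= 0 for s in train_priors):
        raise ValueError("training priors must be positive")

    # Reweight each class by how much oversampling inflated or deflated it.
    adjusted = [p * t / s for p, t, s in zip(probs, true_priors, train_priors)]
    total = sum(adjusted)
    if total <= 0:
        raise ValueError("adjusted probabilities sum to zero")
    return [a / total for a in adjusted]


if __name__ == "__main__":
    # Hypothetical rates for (Early, On-Time, Delayed); not from the project.
    balanced = [1 / 3, 1 / 3, 1 / 3]
    true_rates = [0.2, 0.6, 0.2]
    print(recalibrate([0.30, 0.30, 0.40], balanced, true_rates))
```

A model that is maximally uncertain on balanced training data outputs equal probabilities for every class. After this adjustment those probabilities fall back to the true class rates, which is the behavior one would want from a recalibration step.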