Machine Learning · Data Engineering

Flight Delay Prediction at Scale

Scalable ML pipeline on Databricks integrating 28M flights + 131M weather records with blocked time-series cross-validation.

The Problem

Massive Imbalanced Data

Airlines and passengers need accurate delay predictions, but flight data is massive, imbalanced, and temporally sensitive. The challenge was building a scalable ML pipeline that avoids time leakage.

The Approach

Distributed Spark Pipeline

Built a distributed pipeline on Databricks/Spark integrating 28M flight records + 131M NOAA weather observations. Engineered 221 features including airport PageRank centrality. Used blocked time-series cross-validation (5 folds) with SMOTE and class rebalancing.
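As a rough illustration of blocked time-series cross-validation, the sketch below splits time-ordered row indices into 5 contiguous blocks and, within each block, trains on the earlier rows and validates on the later ones, so no fold sees data from its own future. The function name `blocked_time_series_folds`, the 80/20 split inside each block, and the use of plain row indices are assumptions made for illustration; the project's actual fold boundaries are not stated here.

```python
from typing import List, Tuple


def blocked_time_series_folds(
    n_rows: int, n_folds: int = 5, train_frac: float = 0.8
) -> List[Tuple[range, range]]:
    """Return (train, validation) index ranges for blocked time-series CV.

    Rows are assumed to be sorted by time. The timeline is cut into
    `n_folds` contiguous blocks; inside each block the earliest
    `train_frac` of rows is used for training and the rest for validation.
    """
    if n_folds < 1 or n_rows < 2 * n_folds:
        raise ValueError("need at least two rows per fold")
    if not 0.0 < train_frac < 1.0:
        raise ValueError("train_frac must be strictly between 0 and 1")

    block_size = n_rows // n_folds
    folds = []
    for k in range(n_folds):
        start = k * block_size
        # The last block absorbs any remainder rows.
        end = n_rows if k == n_folds - 1 else start + block_size
        cut = start + int((end - start) * train_frac)
        # Keep at least one row on each side of the cut.
        cut = min(max(cut, start + 1), end - 1)
        folds.append((range(start, cut), range(cut, end)))
    return folds


if __name__ == "__main__":
    for train_idx, val_idx in blocked_time_series_folds(100):
        print(f"train {train_idx.start}-{train_idx.stop - 1}, "
              f"validate {val_idx.start}-{val_idx.stop - 1}")
```

In a Spark setting the same idea would typically be applied to a timestamp column rather than row indices, and any resampling such as SMOTE would be fitted on the training part of each fold only, so that synthetic rows never leak into validation.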

Technologies & Methods

Python, PySpark, Apache Spark, Databricks, MLP Neural Networks, Blocked Time-Series Cross-Validation, Feature Engineering, PageRank, Logistic Regression, Random Forest, Hyperparameter Tuning (Grid Search/Optuna), Probability Calibration, SMOTE, Over/Under Sampling

The Results

28M Records Processed

Achieved 54.6% F1 on 3-class (Early/On-Time/Delayed) prediction with a 6-layer MLP, using a pipeline that processed 28M+ records in 17–30 hours on a 5–10 node cluster. Probability recalibration improved F1 by ~6% across 9 models by correcting oversampling bias and matching true class rates.
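One common way to correct oversampling bias is a prior-shift adjustment: each predicted class probability is rescaled by the ratio of the true class rate to the class rate seen during (resampled) training, then renormalized. The sketch below shows that adjustment for a single prediction. It is an assumed stand-in, not necessarily the recalibration method used in this project, and the function name `recalibrate_for_priors` and the example class rates are hypothetical.

```python
from typing import List, Sequence


def recalibrate_for_priors(
    probs: Sequence[float],
    train_priors: Sequence[float],
    true_priors: Sequence[float],
) -> List[float]:
    """Adjust class probabilities from a model trained on resampled data.

    probs:        model output for one example (one value per class)
    train_priors: class rates in the resampled training set
    true_priors:  class rates in the original, unbalanced data
    """
    if not (len(probs) == len(train_priors) == len(true_priors)):
        raise ValueError("all inputs must have one entry per class")
    if any(p <= 0 for p in train_priors):
        raise ValueError("training priors must be positive")

    # Reweight each class by how much resampling inflated or deflated it.
    scaled = [p * t / s for p, s, t in zip(probs, train_priors, true_priors)]
    total = sum(scaled)
    if total == 0:
        raise ValueError("probabilities sum to zero after reweighting")
    # Renormalize so the adjusted probabilities sum to 1.
    return [x / total for x in scaled]


if __name__ == "__main__":
    # Hypothetical 3-class example (Early / On-Time / Delayed): the model was
    # trained on balanced classes, but the true rates are uneven.
    balanced = [1 / 3, 1 / 3, 1 / 3]
    true_rates = [0.2, 0.6, 0.2]
    print(recalibrate_for_priors([0.5, 0.3, 0.2], balanced, true_rates))
```

An uninformative prediction (equal probability for every class) maps back to the true class rates, which is the behavior expected from "matching true class rates".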

