Formula 1 Winner Predictor

Personal Project | Data Science

An end-to-end machine learning system designed to predict the finishing position of F1 drivers by combining historical race data, real-time qualifying results, and weather forecasts.

Technologies

Python, Pandas, Scikit-learn, Random Forest, KNN, SMOTE, StratifiedShuffleSplit, Ergast API, Visual Crossing API, Data Preprocessing, Feature Engineering, Grid Search, RandomizedSearchCV, Weather Data Integration, Classification Modeling, Model Evaluation

Overview

The Formula 1 Predictor is a data-driven classification pipeline built to forecast race outcomes for F1 drivers. The system integrates multiple structured datasets, including historical race performance, driver stats, constructor data, circuit-specific results, and detailed weather forecasts, to model the complexity of race-day dynamics. It features a stratified training process, multi-source feature engineering, and ensemble-based classification for high-accuracy predictions.

Challenge

Accurately predicting F1 race outcomes requires integrating highly diverse datasets, from the Ergast racing database and Visual Crossing weather API to real-time qualifying data. Challenges included cleaning and synchronizing these sources, engineering relevant features, handling class imbalance in categorical outcomes, and modeling the nonlinear relationships between environmental, driver, and constructor factors. Ensuring fair evaluation and meaningful class representation was essential for producing reliable predictions.
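As a rough illustration of the synchronization problem, the sketch below attaches one weather record to each race entry by matching on date and circuit. It is a minimal standard-library version, not the project's actual merge code: the field names (`race_date`, `circuit`, `location`, `temp_c`, `precip_mm`), the helper `attach_weather`, and all sample rows are hypothetical placeholders and do not reflect the real Ergast or Visual Crossing schemas.

```python
from datetime import date

# Synthetic placeholder rows; names and values are illustrative only.
race_rows = [
    {"race_date": date(2023, 1, 1), "circuit": "circuit_x", "driver": "driver_a", "grid": 1},
    {"race_date": date(2023, 1, 1), "circuit": "circuit_x", "driver": "driver_b", "grid": 2},
    {"race_date": date(2023, 1, 15), "circuit": "circuit_y", "driver": "driver_a", "grid": 3},
]
weather_rows = [
    {"date": date(2023, 1, 1), "location": "circuit_x", "temp_c": 19.0, "precip_mm": 0.4},
]


def attach_weather(race_rows, weather_rows):
    """Left-join weather onto race rows by (date, circuit).

    Race rows without a matching weather record keep their data and
    get None for the weather fields, so gaps stay visible downstream.
    """
    weather_by_key = {(w["date"], w["location"]): w for w in weather_rows}
    merged = []
    for row in race_rows:
        match = weather_by_key.get((row["race_date"], row["circuit"]))
        merged.append({
            **row,
            "temp_c": match["temp_c"] if match else None,
            "precip_mm": match["precip_mm"] if match else None,
        })
    return merged


merged = attach_weather(race_rows, weather_rows)
```

Keeping unmatched rows with explicit `None` values, rather than dropping them, makes missing forecasts easy to count and impute before training.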

Solution

I built a complete preprocessing and modeling pipeline in Python. The dataset was constructed by merging and transforming data from Ergast and Visual Crossing, including driver and constructor attributes, race metadata, circuit-specific history, and forecasted race-day conditions. I engineered categorical outcome bins (win/top 5/top 10/outside top 10) and used RandomForest with grid and randomized hyperparameter search. I handled class imbalance with SMOTE, stratified splits, and class weighting. Qualifying times were converted to numerical formats, and redundant features were dropped to improve model clarity. KNN served as a baseline. Final evaluation included classification reports and confusion matrices across all result categories.
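Three of the steps above can be sketched in plain Python: outcome binning, qualifying-time conversion, and class weighting. This is a sketch under stated assumptions, not the project's code. The bin boundaries (positions 2 to 5 as "top 5", 6 to 10 as "top 10") are one plausible reading of the four categories, the `M:SS.mmm` lap-time format is assumed, and the weighting formula follows the common "balanced" heuristic, n_samples / (n_classes * class_count). All function names are hypothetical.

```python
from collections import Counter


def bin_position(position):
    """Map a 1-based finishing position to one of four outcome classes.

    Assumed boundaries: 1 -> win, 2-5 -> top 5, 6-10 -> top 10, else outside.
    """
    if position == 1:
        return "win"
    if position <= 5:
        return "top 5"
    if position <= 10:
        return "top 10"
    return "outside top 10"


def qualifying_to_seconds(lap_time):
    """Convert a lap time such as '1:27.236' to seconds.

    Assumes an 'M:SS.mmm' string; a bare 'SS.mmm' is also accepted.
    Returns None when the time is missing (empty string or None).
    """
    if not lap_time:
        return None
    minutes, _, seconds = lap_time.rpartition(":")
    return (int(minutes) if minutes else 0) * 60 + float(seconds)


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# A hypothetical 20-car field in which every driver is classified.
labels = [bin_position(p) for p in range(1, 21)]
weights = balanced_class_weights(labels)
```

In a 20-car field only one driver wins while ten finish outside the top 10, so the "win" class receives the largest weight. That imbalance is what resampling and class weighting are meant to offset.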

Results

The RandomForest classifier achieved strong classification performance and outperformed the KNN baseline. The model reliably predicted winners and top finishers using multi-domain data. The system is modular and reproducible, and it is ready for deployment in racing analytics, betting, or broadcast augmentation. Including real-time qualifying and weather data significantly improved short-term prediction accuracy.

Documentation

Formula 1 Predictor Presentation

Technical presentation on the Formula 1 Predictor system architecture and implementation details