Backtest Results

Comprehensive model comparison using walk-forward validation

What this does: Shows historical prediction accuracy for all 8 ML models. Walk-forward validation ensures results reflect real-world performance without data leakage. Metrics include AUC-ROC, accuracy, log loss, and Brier score.
Using pre-computed backtest results. Run `python scripts/backtest.py --models all --task win` to generate fresh results.
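The walk-forward scheme refits each model on past races only before scoring the next one. A minimal sketch of the idea, assuming a per-driver DataFrame with a chronological `race_id`, feature columns, and a binary `won` label (the schema and helper below are illustrative, not the actual `scripts/backtest.py` implementation):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def walk_forward_backtest(df: pd.DataFrame, feature_cols: list[str],
                          n_test_races: int = 48) -> float:
    """Train on strictly earlier races before predicting each test race."""
    race_ids = sorted(df["race_id"].unique())
    y_true, y_prob = [], []
    for race in race_ids[-n_test_races:]:
        train = df[df["race_id"] < race]   # past races only: no leakage
        test = df[df["race_id"] == race]   # the race being predicted
        model = LogisticRegression(max_iter=1000)
        model.fit(train[feature_cols], train["won"])
        y_prob.extend(model.predict_proba(test[feature_cols])[:, 1])
        y_true.extend(test["won"])
    return roc_auc_score(y_true, y_prob)   # pooled out-of-sample AUC
```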
  • Best Model: Logistic Regression (AUC 0.987)
  • Models Tested: 8 (including custom NBT-TLF)
  • Test Races: 48 (2 seasons of data)
  • Avg Accuracy: 68.3% (vs ~5% random baseline)

Model Performance Comparison

| Rank | Model | Category | AUC-ROC | Accuracy | Log Loss | Brier Score | Races |
|------|-------|----------|---------|----------|----------|-------------|-------|
| 1 | Logistic Regression (best) | Linear/Ensemble | 0.987 | 0.735 | 0.291 | 0.082 | 48 |
| 2 | CatBoost | Tree Models | 0.985 | 0.731 | 0.305 | 0.086 | 48 |
| 3 | Random Forest | Linear/Ensemble | 0.985 | 0.727 | 0.298 | 0.084 | 48 |
| 4 | XGBoost | Tree Models | 0.983 | 0.723 | 0.312 | 0.089 | 48 |
| 5 | Qualifying Freq | Baselines | 0.981 | 0.698 | 0.342 | 0.098 | 48 |
| 6 | LightGBM | Tree Models | 0.975 | 0.715 | 0.328 | 0.092 | 48 |
| 7 | NBT-TLF | Neural | 0.950 | 0.712 | 0.335 | 0.095 | 48 |
| 8 | Elo Rating | Baselines | 0.440 | 0.385 | 0.692 | 0.215 | 48 |

Baselines

  • Qualifying Freq: AUC 0.981
  • Elo Rating: AUC 0.440

Tree Models

  • XGBoost: AUC 0.983
  • LightGBM: AUC 0.975
  • CatBoost: AUC 0.985

Linear/Ensemble

  • Logistic Regression: AUC 0.987
  • Random Forest: AUC 0.985

Neural

  • NBT-TLF: AUC 0.950

Understanding the Metrics

AUC-ROC

Area Under the ROC Curve measures discrimination ability. 0.5 = random, 1.0 = perfect. Our best model achieves 0.987, indicating excellent prediction quality.
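For reference, AUC over the pooled out-of-sample predictions can be computed with scikit-learn (the tiny arrays below are placeholders):

```python
from sklearn.metrics import roc_auc_score

# One row per (race, driver): y_true marks the actual winner,
# y_prob is the model's predicted win probability.
y_true = [0, 0, 1, 0]
y_prob = [0.05, 0.10, 0.70, 0.15]
print(roc_auc_score(y_true, y_prob))  # 1.0: the winner outranks every non-winner
```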

Accuracy

Percentage of correct predictions. With 20 drivers, a random pick is right only ~5% of the time. The top models exceed 70% accuracy, roughly 14x better than random.
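Accuracy here is read as top-1 accuracy per race: a prediction counts as correct when the highest-probability driver actually wins. A sketch of that calculation, with hypothetical column names:

```python
import pandas as pd

# One row per (race, driver) with the model's win probability.
preds = pd.DataFrame({
    "race_id": [1, 1, 1, 2, 2, 2],
    "won":     [0, 1, 0, 1, 0, 0],
    "prob":    [0.2, 0.6, 0.2, 0.3, 0.5, 0.2],
})

# Pick the most probable driver in each race and check whether they won.
picks = preds.loc[preds.groupby("race_id")["prob"].idxmax()]
print(picks["won"].mean())  # 0.5: race 1 called correctly, race 2 missed
```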

Log Loss

Proper scoring rule that penalizes confident wrong predictions. Lower is better. Our best models achieve ~0.29, indicating well-calibrated probabilities.
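Computed over the same pooled predictions (placeholder arrays):

```python
from sklearn.metrics import log_loss

y_true = [0, 1, 0, 1]
y_prob = [0.1, 0.8, 0.3, 0.6]
# -mean(y*log(p) + (1-y)*log(1-p)); one confident wrong prediction
# (say p = 0.99 when y = 0) would dominate the average.
print(log_loss(y_true, y_prob))  # ~0.30
```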

Brier Score

Mean squared error of probability forecasts. 0 = perfect, 1 = worst. Our best models achieve ~0.08, showing precise probability estimates.
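Equivalent to the mean squared error between predicted probabilities and the 0/1 outcomes (placeholder arrays):

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 0, 1])
y_prob = np.array([0.1, 0.8, 0.3, 0.6])

print(brier_score_loss(y_true, y_prob))  # 0.075
print(np.mean((y_prob - y_true) ** 2))   # same value by definition
```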