# Backtest Results

Comprehensive model comparison using walk-forward validation.
**What this does:** Shows historical prediction accuracy for all 8 ML models. Walk-forward validation ensures results reflect real-world performance without data leakage. Metrics include AUC-ROC, accuracy, log loss, and Brier score.

Using pre-computed backtest results. Run `python scripts/backtest.py --models all --task win` to generate fresh results.
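The walk-forward protocol can be sketched as an expanding-window loop: for each test race, the model is refit on all races strictly before it, so no future information leaks into training. The data shapes and model below are illustrative stand-ins; the repo's `scripts/backtest.py` is the authoritative implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data: each "race" is a block of 20 driver rows
# with 5 features and a binary "won" label (hypothetical shapes).
rng = np.random.default_rng(0)
n_races, drivers = 10, 20
X = rng.normal(size=(n_races * drivers, 5))
y = rng.integers(0, 2, size=n_races * drivers)

aucs = []
for race in range(5, n_races):  # warm-up period: first 5 races are train-only
    train = slice(0, race * drivers)                    # everything before this race
    test = slice(race * drivers, (race + 1) * drivers)  # this race only
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    p = model.predict_proba(X[test])[:, 1]
    if len(set(y[test])) > 1:  # AUC is undefined if only one class appears
        aucs.append(roc_auc_score(y[test], p))

print(f"mean walk-forward AUC: {np.mean(aucs):.3f}")
```

The key property is that `train` always ends exactly where `test` begins, mimicking real deployment where each race is predicted before its outcome is known.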
- **Best Model:** Logistic Regression (AUC 0.987)
- **Models Tested:** 8 (including the custom NBT-TLF)
- **Test Races:** 48 (2 seasons of data)
- **Avg Accuracy:** 68.3% (vs. ~5% random baseline)
## Model Performance Comparison
| Rank | Model | Category | AUC-ROC | Accuracy | Log Loss | Brier Score | Races |
|---|---|---|---|---|---|---|---|
| 1 | Logistic Regression | Linear/Ensemble | 0.987 | 0.735 | 0.291 | 0.082 | 48 |
| 2 | CatBoost | Tree Models | 0.985 | 0.731 | 0.305 | 0.086 | 48 |
| 3 | Random Forest | Linear/Ensemble | 0.985 | 0.727 | 0.298 | 0.084 | 48 |
| 4 | XGBoost | Tree Models | 0.983 | 0.723 | 0.312 | 0.089 | 48 |
| 5 | Qualifying Freq | Baselines | 0.981 | 0.698 | 0.342 | 0.098 | 48 |
| 6 | LightGBM | Tree Models | 0.975 | 0.715 | 0.328 | 0.092 | 48 |
| 7 | NBT-TLF | Neural | 0.950 | 0.712 | 0.335 | 0.095 | 48 |
| 8 | Elo Rating | Baselines | 0.440 | 0.385 | 0.692 | 0.215 | 48 |
### AUC-ROC by Category

**Baselines**
- Qualifying Freq: 0.981
- Elo Rating: 0.440

**Tree Models**
- XGBoost: 0.983
- LightGBM: 0.975
- CatBoost: 0.985

**Linear/Ensemble**
- Logistic Regression: 0.987
- Random Forest: 0.985

**Neural**
- NBT-TLF: 0.950
## Understanding the Metrics
### AUC-ROC

Area under the ROC curve measures discrimination ability: 0.5 is random, 1.0 is perfect. Our best model achieves 0.987, indicating excellent prediction quality.
### Accuracy

Percentage of correct predictions. With 20 drivers per race, the random baseline is ~5%. Our best models exceed 70% accuracy, roughly 14x better than random.
### Log Loss

A proper scoring rule that heavily penalizes confident wrong predictions. Lower is better. Our best models achieve ~0.29, indicating well-calibrated probabilities.
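A toy illustration of why log loss punishes overconfidence (hand-rolled helper with made-up probabilities, matching the standard definition used by libraries like scikit-learn):

```python
import math

# Log loss: -mean(y*log(p) + (1-y)*log(1-p)).
# A confident wrong forecast (p=0.95 for a non-winner) costs far more
# than an unsure one (p=0.5) on the same outcomes.
def log_loss(y_true, p):
    return -sum(y * math.log(q) + (1 - y) * math.log(1 - q)
                for y, q in zip(y_true, p)) / len(y_true)

print(round(log_loss([1, 0], [0.9, 0.95]), 3))  # confidently wrong -> 1.551
print(round(log_loss([1, 0], [0.9, 0.5]), 3))   # merely unsure    -> 0.399
```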
### Brier Score

Mean squared error of probability forecasts: 0 is perfect, 1 is worst. Our best models achieve ~0.08, showing precise probability estimates.
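A minimal sketch of the Brier calculation, using made-up win probabilities for a hypothetical four-driver field:

```python
# Brier score: mean squared error between predicted win probabilities
# and the 0/1 outcomes. Smaller is better; a perfect forecast scores 0.
preds = [0.70, 0.15, 0.10, 0.05]  # predicted win probabilities (illustrative)
outcome = [1, 0, 0, 0]            # driver 1 actually won
brier = sum((p - y) ** 2 for p, y in zip(preds, outcome)) / len(preds)
print(round(brier, 3))  # -> 0.031
```

Note the near-correct forecast still pays a small penalty for the 0.30 of probability it placed on the losing drivers.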