# Backtest Results

Comprehensive model comparison using walk-forward validation.
**What this does:** Shows historical prediction accuracy for all 8 ML models. Walk-forward validation ensures results reflect real-world performance without data leakage. Metrics include AUC-ROC, accuracy, log loss, and Brier score.

Using pre-computed backtest results. Run `python scripts/backtest.py --models all --task win` to generate fresh results.
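The walk-forward protocol can be sketched as an expanding-window loop: for each test race, the model is refit on all races strictly before it, so no future information leaks into training. The data shapes and model below are illustrative stand-ins; the repo's `scripts/backtest.py` is the authoritative implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data: each "race" is a block of 20 driver rows
# with 5 features and a binary "won" label (hypothetical shapes).
rng = np.random.default_rng(0)
n_races, drivers = 10, 20
X = rng.normal(size=(n_races * drivers, 5))
y = rng.integers(0, 2, size=n_races * drivers)

aucs = []
for race in range(5, n_races):  # warm-up period: first 5 races are train-only
    train = slice(0, race * drivers)                    # everything before this race
    test = slice(race * drivers, (race + 1) * drivers)  # this race only
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    p = model.predict_proba(X[test])[:, 1]
    if len(set(y[test])) > 1:  # AUC is undefined if only one class appears
        aucs.append(roc_auc_score(y[test], p))

print(f"mean walk-forward AUC: {np.mean(aucs):.3f}")
```

The key property is that `train` always ends exactly where `test` begins, mimicking real deployment where each race is predicted before its outcome is known.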
- **Best Model:** Logistic Regression (AUC 0.987)
- **Models Tested:** 8 (including the custom NBT-TLF)
- **Test Races:** 48 (2 seasons of data)
- **Avg Accuracy:** 68.3% (vs. ~5% random baseline)
## Model Performance Comparison
| Rank | Model | Category | AUC-ROC | Accuracy | Log Loss | Brier Score | Races |
|---|---|---|---|---|---|---|---|
| 1 | Logistic Regression | Linear/Ensemble | 0.987 | 0.735 | 0.291 | 0.082 | 48 |
| 2 | CatBoost | Tree Models | 0.985 | 0.731 | 0.305 | 0.086 | 48 |
| 3 | Random Forest | Linear/Ensemble | 0.985 | 0.727 | 0.298 | 0.084 | 48 |
| 4 | XGBoost | Tree Models | 0.983 | 0.723 | 0.312 | 0.089 | 48 |
| 5 | Qualifying Freq | Baselines | 0.981 | 0.698 | 0.342 | 0.098 | 48 |
| 6 | LightGBM | Tree Models | 0.975 | 0.715 | 0.328 | 0.092 | 48 |
| 7 | NBT-TLF | Neural | 0.950 | 0.712 | 0.335 | 0.095 | 48 |
| 8 | Elo Rating | Baselines | 0.440 | 0.385 | 0.692 | 0.215 | 48 |
### AUC-ROC by Category

**Baselines**
- Qualifying Freq: 0.981
- Elo Rating: 0.440

**Tree Models**
- XGBoost: 0.983
- LightGBM: 0.975
- CatBoost: 0.985

**Linear/Ensemble**
- Logistic Regression: 0.987
- Random Forest: 0.985

**Neural**
- NBT-TLF: 0.950
## Understanding the Metrics
### AUC-ROC

Area under the ROC curve measures discrimination ability: 0.5 is random, 1.0 is perfect. Our best model achieves 0.987, indicating excellent prediction quality.
### Accuracy

Percentage of correct predictions. With 20 drivers per race, the random baseline is ~5%. Our best models exceed 70% accuracy, roughly 14x better than random.
### Log Loss

A proper scoring rule that heavily penalizes confident wrong predictions. Lower is better. Our best models achieve ~0.29, indicating well-calibrated probabilities.
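A toy illustration of why log loss punishes overconfidence (hand-rolled helper with made-up probabilities, matching the standard definition used by libraries like scikit-learn):

```python
import math

# Log loss: -mean(y*log(p) + (1-y)*log(1-p)).
# A confident wrong forecast (p=0.95 for a non-winner) costs far more
# than an unsure one (p=0.5) on the same outcomes.
def log_loss(y_true, p):
    return -sum(y * math.log(q) + (1 - y) * math.log(1 - q)
                for y, q in zip(y_true, p)) / len(y_true)

print(round(log_loss([1, 0], [0.9, 0.95]), 3))  # confidently wrong -> 1.551
print(round(log_loss([1, 0], [0.9, 0.5]), 3))   # merely unsure    -> 0.399
```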
### Brier Score

Mean squared error of probability forecasts: 0 is perfect, 1 is worst. Our best models achieve ~0.08, showing precise probability estimates.
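A minimal sketch of the Brier calculation, using made-up win probabilities for a hypothetical four-driver field:

```python
# Brier score: mean squared error between predicted win probabilities
# and the 0/1 outcomes. Smaller is better; a perfect forecast scores 0.
preds = [0.70, 0.15, 0.10, 0.05]  # predicted win probabilities (illustrative)
outcome = [1, 0, 0, 0]            # driver 1 actually won
brier = sum((p - y) ** 2 for p, y in zip(preds, outcome)) / len(preds)
print(round(brier, 3))  # -> 0.031
```

Note the near-correct forecast still pays a small penalty for the 0.30 of probability it placed on the losing drivers.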