Phipps Eval Runs
All evaluation runs across models, quants, and hardware
Quality Over Time — Best Run per Model
Each point = one eval run · Hover for details · Only the best run per quant is shown, for clarity
All Eval Runs
Click a row for per-test breakdown · API = cloud · MLX = local Apple Silicon · Dense / MoE = architecture · pre-fix = buggy skip scoring
Per-Test Breakdown
Legend: 5.0 · 4.5+ · 4.0 · 3.5 · 3.0 · 2.0 · 1.0 · Skipped