Phipps Model Evaluation Scorecard
v5 — 20 tests, 3 runs, 9 models
Last run: 2026-02-26 | v5 harness (3-run averaging, LLM-as-judge) | Bessemer-graded
composite = 0.4×quality + 0.2×speed + 0.2×value + 0.2×reliability
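The weighting above can be sketched in a few lines of Python. This is a minimal illustration, not the harness's actual code: the names `WEIGHTS`, `composite`, and `rank` are hypothetical, and it assumes each component has already been normalized to a common scale (0 to 1 here), which the scorecard does not specify.

```python
# Weights taken from the scorecard's composite formula.
WEIGHTS = {"quality": 0.4, "speed": 0.2, "value": 0.2, "reliability": 0.2}


def composite(scores):
    """Weighted sum of the four components.

    Assumption: every component is already on the same scale (e.g. 0-1).
    The scorecard does not say how speed, value, or reliability are normalized.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def rank(models):
    """Return model names sorted by composite score, highest first."""
    return sorted(models, key=lambda name: composite(models[name]), reverse=True)


if __name__ == "__main__":
    # Hypothetical models and values, for illustration only.
    demo = {
        "model-a": {"quality": 0.9, "speed": 0.4, "value": 0.5, "reliability": 1.0},
        "model-b": {"quality": 0.7, "speed": 0.9, "value": 0.9, "reliability": 0.9},
    }
    for name in rank(demo):
        print(name, round(composite(demo[name]), 3))
```

Because quality carries twice the weight of any other component, a model with a modest quality lead can still rank below one that is markedly faster, cheaper, and more reliable, as the two hypothetical entries show.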
Leaderboard
Sorted by composite score · 40% quality + 20% speed + 20% value + 20% reliability
Spark Cluster — Quality & Latency Ranking
Local-runnable models only · Quality scores from cloud API (quantized local scores TBD) · Cluster: Mac Mini 24GB + Minisforum 128GB
Value Frontier — Quality vs. Cost (log scale)
Speed vs. Quality Tradeoff
Test Breakdown — Per-Test Score Heatmap (20 tests x 9 models)
5.0 — Perfect
4.5+ — Excellent
4.0 — Strong
3.5 — Good
3.0 — Average
2.0 — Weak
1.0 — Poor
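One way to read the legend above is as a set of lower bounds on the 1 to 5 score scale. The sketch below makes that assumption explicit: the legend itself only marks "4.5+" as a threshold, so treating every entry as a lower bound is an interpretation, and the names `BANDS` and `band` are hypothetical.

```python
# Legend entries from the heatmap, highest first.
# Assumption: each value is the lower bound of its band.
BANDS = [
    (5.0, "Perfect"),
    (4.5, "Excellent"),
    (4.0, "Strong"),
    (3.5, "Good"),
    (3.0, "Average"),
    (2.0, "Weak"),
    (1.0, "Poor"),
]


def band(score):
    """Return the label of the highest band whose lower bound the score meets."""
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    # Scores below the lowest listed bound fall into the bottom band.
    return "Poor"


if __name__ == "__main__":
    for s in (5.0, 4.7, 3.9, 2.9, 0.5):
        print(s, band(s))
```

Under this reading, a 3-run average such as 3.9 lands in "Good" rather than "Strong", since it does not reach the 4.0 bound.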
Test Breakdown — Stacked Score Bars