Phipps Model Evaluation Scorecard

v5 — 20 tests, 3 runs, 9 models
Last run: 2026-02-26  |  v5 harness (3-run averaging, LLM-as-judge)  |  Bessemer-graded
composite = 0.4×quality + 0.2×speed + 0.2×value + 0.2×reliability
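As a minimal sketch of the weighting above, the composite can be computed as a plain weighted sum. This assumes each of the four component scores has already been normalized to a common scale; the scorecard does not say how that normalization is done. The names `WEIGHTS` and `composite` are illustrative and not part of the harness.

```python
# Weights taken from the scorecard's composite formula.
WEIGHTS = {"quality": 0.4, "speed": 0.2, "value": 0.2, "reliability": 0.2}


def composite(scores):
    """Weighted sum of component scores.

    Assumption: every component is already on the same scale
    (for example 0-1 or 0-5); normalization is not shown here.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical component scores, for illustration only.
example = {"quality": 1.0, "speed": 0.5, "value": 0.5, "reliability": 1.0}
print(composite(example))
```

Because the weights sum to 1.0, the composite stays on the same scale as its inputs.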

Leaderboard

Sorted by composite score  ·  40% quality + 20% speed + 20% value + 20% reliability

Spark Cluster — Quality & Latency Ranking

Local-runnable models only  ·  Quality scores from cloud API (quantized local scores TBD)  ·  Cluster: Mac Mini 24GB + Minisforum 128GB

Value Frontier — Quality vs. Cost (log scale)

Speed vs. Quality Tradeoff

Test Breakdown — Per-Test Score Heatmap (20 tests × 9 models)

5.0 = Perfect  ·  4.5+ = Excellent  ·  4.0 = Strong  ·  3.5 = Good  ·  3.0 = Average  ·  2.0 = Weak  ·  1.0 = Poor
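One way to read the score bands in the legend above is as lower bounds, so that a score takes the label of the highest band it reaches. The sketch below encodes that reading. It is an assumption: the scorecard does not state how scores between the listed values are bucketed, nor how scores below 1.0 are labeled. `BANDS` and `label_for` are hypothetical names.

```python
# Legend values interpreted as inclusive lower bounds (an assumption).
BANDS = [
    (5.0, "Perfect"),
    (4.5, "Excellent"),
    (4.0, "Strong"),
    (3.5, "Good"),
    (3.0, "Average"),
    (2.0, "Weak"),
    (1.0, "Poor"),
]


def label_for(score):
    """Return the label of the highest band whose floor the score reaches."""
    for floor, name in BANDS:
        if score >= floor:
            return name
    # Assumption: anything below the lowest listed band is also "Poor".
    return "Poor"


print(label_for(4.7))
```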

Test Breakdown — Stacked Score Bars