Phipps Model Evaluation Scorecard

v5 — 20 tests, 3 runs, 7 local models (Studio + Mini)
Last run: 2026-03-02  |  v5 harness (3-run averaging, LLM-as-judge)  |  Bessemer-graded
composite = 0.4×quality + 0.2×speed + 0.2×value + 0.2×reliability
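
As a minimal sketch of the composite formula above combined with the 3-run averaging: the function names, the dictionary layout, and the sample numbers are illustrative assumptions, not taken from the harness. It also assumes the four component scores are already normalized to a common scale before weighting, which the scorecard does not state.

```python
from statistics import mean

# Weights as stated in the scorecard header.
WEIGHTS = {"quality": 0.4, "speed": 0.2, "value": 0.2, "reliability": 0.2}


def average_runs(runs):
    """Average each component across runs (assumed: one dict per run)."""
    return {key: mean(run[key] for run in runs) for key in WEIGHTS}


def composite(scores):
    """Weighted sum: 0.4*quality + 0.2*speed + 0.2*value + 0.2*reliability."""
    return sum(weight * scores[key] for key, weight in WEIGHTS.items())


# Made-up numbers for a hypothetical model, three runs, same scale for all components.
example_runs = [
    {"quality": 4.0, "speed": 3.0, "value": 5.0, "reliability": 4.0},
    {"quality": 4.5, "speed": 3.0, "value": 5.0, "reliability": 5.0},
    {"quality": 3.5, "speed": 3.0, "value": 5.0, "reliability": 3.0},
]

averaged = average_runs(example_runs)
print(f"composite = {composite(averaged):.2f}")
```

Because the weights sum to 1.0, the composite stays on the same scale as its inputs whenever all four components share one scale.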

Leaderboard

Sorted by composite score  ·  40% quality + 20% speed + 20% value + 20% reliability

Hardware Summary — Model Architecture & Sizing

Local-runnable models only  ·  Quality scores from cloud API (quantized local scores TBD)  ·  Cluster: Mac Mini M4 Pro 24GB + Mac Studio M3 Ultra 256GB

Quant Size vs. Quality

Speed vs. Quality Tradeoff

Test Breakdown — Per-Test Score Heatmap (20 tests × 7 models)

Score legend: 5.0 Perfect  ·  4.5+ Excellent  ·  4.0 Strong  ·  3.5 Good  ·  3.0 Average  ·  2.0 Weak  ·  1.0 Poor
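
The legend can be read as a lookup from a per-test score to a label. The sketch below assumes each listed value is the lower bound of its band and that anything under 1.0 still counts as Poor; the scorecard does not say how in-between scores are binned, so the thresholds and the name `band_label` are hypothetical.

```python
# (lower bound, label) pairs from the legend, highest first.
# Treating each value as a lower bound is an assumption.
BANDS = [
    (5.0, "Perfect"),
    (4.5, "Excellent"),
    (4.0, "Strong"),
    (3.5, "Good"),
    (3.0, "Average"),
    (2.0, "Weak"),
    (1.0, "Poor"),
]


def band_label(score):
    """Return the label of the first band whose lower bound the score reaches."""
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    return "Poor"  # assumed fallback for scores below 1.0


for sample in (5.0, 4.7, 3.2, 0.5):
    print(sample, band_label(sample))
```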

Test Breakdown — Stacked Score Bars