Phipps Model Evaluation Scorecard
v5 — 20 PE-domain tests, 3-run averaging, LLM-as-judge scoring
Last run: 2026-03-06 | Mac Studio M3 Ultra 96GB | Bessemer-graded
composite = 0.4×quality + 0.2×speed + 0.2×value + 0.2×reliability
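As a minimal sketch of how the composite above could be computed, assuming all four component scores are already normalized to the same scale (the scorecard does not state the scale, and the names `WEIGHTS` and `composite` are hypothetical, not taken from the evaluation harness):

```python
# Weights come from the scorecard formula; everything else is illustrative.
WEIGHTS = {"quality": 0.4, "speed": 0.2, "value": 0.2, "reliability": 0.2}


def composite(scores):
    """Return the weighted composite of the four component scores.

    Assumes every component is on the same scale (an assumption; the
    scorecard does not say how speed, value, or reliability are normalized).
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


if __name__ == "__main__":
    # Made-up component values, for illustration only.
    example = {"quality": 4.0, "speed": 3.0, "value": 5.0, "reliability": 4.5}
    print(round(composite(example), 2))
```

Because the weights sum to 1.0, the composite stays on the same scale as its inputs, so a model scoring the maximum on every component also gets the maximum composite.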
Leaderboard
Sorted by composite score · 40% quality + 20% speed + 20% value + 20% reliability
Tried & Deleted
Evaluated and removed from disk — kept for reference
Hardware Summary — Model Architecture & Sizing
Local-runnable models only · Quality scores from cloud API (quantized local scores TBD) · Cluster: Mac Mini M4 Pro 24GB + Mac Studio M3 Ultra 256GB
Quant Size vs. Quality
Speed vs. Quality Tradeoff
Test Breakdown — Per-Test Score Heatmap (20 tests × 5 models)
5.0 — Perfect
4.5+ — Excellent
4.0 — Strong
3.5 — Good
3.0 — Average
2.0 — Weak
1.0 — Poor
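The legend above can be read as a lookup from a per-test score to a label. The sketch below assumes each label applies from its listed score up to the next band (so 2.9 would count as "Weak"). The legend itself only lists the anchor points, so the thresholds, the 1.0 to 5.0 range check, and the names `BANDS` and `band_label` are assumptions made for illustration.

```python
# Anchor scores and labels as listed in the heatmap legend, highest first.
BANDS = [
    (5.0, "Perfect"),
    (4.5, "Excellent"),
    (4.0, "Strong"),
    (3.5, "Good"),
    (3.0, "Average"),
    (2.0, "Weak"),
    (1.0, "Poor"),
]


def band_label(score):
    """Map a 1.0-5.0 score to the legend label of the highest band it reaches.

    Treating each anchor as a lower threshold is an assumption; the
    dashboard may bucket in-between scores differently.
    """
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"score out of range: {score}")
    for threshold, label in BANDS:
        if score >= threshold:
            return label


if __name__ == "__main__":
    for s in (5.0, 4.7, 3.5, 2.9, 1.0):
        print(s, band_label(s))
```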
Test Breakdown — Stacked Score Bars