Phipps Model Evaluation Scorecard
PE-Domain Quality Assessment: 12 Models, 12 Tests
Last run: 2026-02-26 | v4 harness (fair token budgets) | Bessemer-graded
Composite = 70% Quality + 15% Speed (penalty >20s) + 10% Cost (penalty >$0.01) + 5% Reliability
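The weighting above can be sketched as a plain weighted sum. This is a minimal illustration, not the harness's actual code: the function names, the 0-5 sub-score scale, and the shape of the penalty curve (full marks up to the threshold, then scaled by threshold/value) are all assumptions. Only the four weights and the two thresholds (20s, $0.01) come from the scorecard.

```python
# Weights taken from the scorecard header.
WEIGHTS = {"quality": 0.70, "speed": 0.15, "cost": 0.10, "reliability": 0.05}


def threshold_score(value, threshold, max_score=5.0):
    """Hypothetical penalty curve (assumption, not from the scorecard).

    Full marks at or below the threshold; above it, the score shrinks
    in proportion to threshold / value.
    """
    if value <= threshold:
        return max_score
    return max_score * threshold / value


def composite(quality, speed, cost, reliability):
    """Weighted sum of four sub-scores that share one scale (assumed 0-5)."""
    return (
        WEIGHTS["quality"] * quality
        + WEIGHTS["speed"] * speed
        + WEIGHTS["cost"] * cost
        + WEIGHTS["reliability"] * reliability
    )


# Example with made-up inputs: 40s latency against the 20s threshold,
# $0.005 per call against the $0.01 threshold.
speed_score = threshold_score(40.0, 20.0)
cost_score = threshold_score(0.005, 0.01)
example = composite(quality=4.0, speed=speed_score, cost=cost_score, reliability=5.0)
```

Because quality carries 70% of the weight, a slow or expensive model can still rank well under this scheme as long as its quality score is high.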
Leaderboard
Sorted by composite score · 70% quality + 15% speed (penalty >20s) + 10% cost (penalty >$0.01) + 5% reliability
Value Frontier: Quality vs. Cost (log scale)
Speed vs. Quality Tradeoff
Test Breakdown: All Models (composite ≥ 3.0)
Test Details