Phipps Model Evaluation Scorecard

PE-Domain Quality Assessment: 12 Models, 12 Tests
Last run: 2026-02-26  |  v4 harness (fair token budgets)  |  Bessemer-graded
Composite = 70% Quality + 15% Speed (penalty >20s) + 10% Cost (penalty >$0.01) + 5% Reliability
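As a rough illustration of how the weights above might combine, here is a minimal Python sketch. The scorecard gives only the weights and the penalty thresholds (20 s, $0.01). Everything else is an assumption made for illustration: the 0-1 scale for each component, the linear penalty shape, the `worst_latency_s` and `worst_cost_usd` cutoffs, and the function names `penalized` and `composite`. The actual harness may normalize and penalize differently.

```python
def penalized(value, threshold, worst):
    """Score 1.0 at or below `threshold`, falling linearly to 0.0 at `worst`.

    The linear shape and the `worst` cutoff are assumptions; the scorecard
    only says a penalty applies above the threshold.
    """
    if value <= threshold:
        return 1.0
    if value >= worst:
        return 0.0
    return 1.0 - (value - threshold) / (worst - threshold)


def composite(quality, latency_s, cost_usd, reliability,
              worst_latency_s=120.0, worst_cost_usd=0.10):
    """Weighted composite on an assumed 0-1 scale.

    `quality` and `reliability` are assumed to be already normalized to 0-1.
    Weights (70/15/10/5) and thresholds (20 s, $0.01) come from the scorecard.
    """
    speed = penalized(latency_s, 20.0, worst_latency_s)
    cost = penalized(cost_usd, 0.01, worst_cost_usd)
    return 0.70 * quality + 0.15 * speed + 0.10 * cost + 0.05 * reliability
```

Under these assumptions, a hypothetical model with quality 0.8, a 70 s latency, a cost of $0.01, and full reliability would score about 0.785: the speed component is halved, while the cost component is untouched because it sits exactly at the threshold.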

Leaderboard

Sorted by composite score

Value Frontier: Quality vs. Cost (log scale)

Speed vs. Quality Tradeoff

Test Breakdown: All Models (composite ≥ 3.0)

Test Details