Phipps Model Evaluation Scorecard
PE-Domain Quality Assessment — 19 Models, 12 Tests, 63 Criteria
Last run: 2026-02-25 | v3 harness (12 tests) | Bessemer-graded
Leaderboard
Value Frontier — Quality vs. Cost (log scale)
PE Skills — v3 Head-to-Head
Test Breakdown — Full-Suite Models
Test Details