Phipps Eval

PE Energy AI Benchmark
Value Leaderboard — Adjusted Score
Composite Score vs Est. Monthly Cost
Composite Score vs Speed (tokens/sec)
Model Scores
Category Breakdown
Weighted contribution to composite score (PE 32% · Conv 15% · Tool 15% · Voice 15% · Edge 23%)
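The weighting above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scoring code: it assumes the composite is a plain weighted sum of per-category scores that all share one scale, and the function names and example scores are hypothetical. Only the five weights come from the page.

```python
# Category weights as listed on the page (they sum to 100%).
WEIGHTS = {"PE": 0.32, "Conv": 0.15, "Tool": 0.15, "Voice": 0.15, "Edge": 0.23}


def weighted_contributions(category_scores):
    """Return each category's weighted contribution to the composite.

    Assumption: every category score is on the same scale as the composite.
    """
    missing = set(WEIGHTS) - set(category_scores)
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    return {name: WEIGHTS[name] * category_scores[name] for name in WEIGHTS}


def composite_score(category_scores):
    """Sum the weighted contributions into a single composite score."""
    return sum(weighted_contributions(category_scores).values())


if __name__ == "__main__":
    # Hypothetical per-category scores, for illustration only.
    example = {"PE": 4.0, "Conv": 3.0, "Tool": 5.0, "Voice": 4.0, "Edge": 2.0}
    for name, value in weighted_contributions(example).items():
        print(f"{name}: {value:.2f}")
    print(f"Composite: {composite_score(example):.2f}")
```

Under this assumption, PE and Edge together carry 55% of the composite, so a model that is weak in either category needs strong scores in the other three to compensate.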
Per-Test Heatmap
Strengths & Weaknesses
Excluded Models
Models evaluated but removed from the active scorecard
Model                   Composite  Reason
Claude Opus 4.6         4.46       Too expensive for production ($5.00/$25.00 per MTok)
Claude Sonnet 4.6       4.33       Too expensive for production
GPT-5.4                 4.05       Dominated by GPT-5.4 Mini on quality-per-dollar
Kimi K2.5               3.93       Removed from active evaluation
MiniMax M2.7            4.18       Superseded by M2.7 HighSpeed
Gemini 2.5 Flash        3.80       Superseded by Gemini 3.1 Flash Lite
Gemini 3 Flash Preview  3.81       Superseded by Gemini 3.1 Flash Lite
Step 3.5 Flash          3.34       Removed from active evaluation
GLM-5 Turbo             3.99       Redundant with GLM-5 at higher cost
GLM-5                   3.93       $73/mo, mid-pack quality
DeepSeek R1             3.65       Reasoning tokens inflate cost without PE benefit
Llama 4 Maverick        3.48       Lowest composite score