Phipps Eval Runs

All evaluation runs across models, quants, and hardware

Quality Over Time — Best Run per Model

Each point = one eval run · Hover for details · Only the best run per quant is shown for clarity
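The "best run per quant" filter can be sketched as follows. This is a minimal illustration, not the dashboard's actual code: the `EvalRun` record, its field names, the sample values, and the assumption that "best" means the highest overall score are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    # Hypothetical record shape; the real run schema is not shown in the source.
    model: str
    quant: str
    score: float  # assumed overall quality score, higher is better


def best_run_per_quant(runs):
    """Keep only the highest-scoring run for each (model, quant) pair."""
    best = {}
    for run in runs:
        key = (run.model, run.quant)
        if key not in best or run.score > best[key].score:
            best[key] = run
    return list(best.values())


# Made-up sample data, for illustration only.
runs = [
    EvalRun("model-a", "q4", 3.5),
    EvalRun("model-a", "q4", 4.0),
    EvalRun("model-a", "q8", 4.5),
    EvalRun("model-b", "q4", 3.0),
]
best = best_run_per_quant(runs)
```

Grouping on the `(model, quant)` pair keeps one point per quantization of each model, so repeated runs of the same configuration do not clutter the chart.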

All Eval Runs

Click a row for per-test breakdown · API = cloud · MLX = local Apple Silicon · Dense / MoE = architecture · pre-fix = scored before the skip-scoring bug was fixed

Per-Test Breakdown

Score legend: 5.0 · 4.5+ · 4.0 · 3.5 · 3.0 · 2.0 · 1.0 · Skipped