AI benchmarks
Sortable leaderboards for the benchmarks that frontier labs report. Pick a benchmark to see every model's score.
MMLU-Pro
33 models scored
Leader: Gemini 3.1 Pro 93.8%
ARC-AGI-2
1 model scored
Leader: Gemini 3.1 Pro 77.1%
SWE-bench Verified
14 models scored
Leader: GPT-5.3 Codex 82.4%
GPQA Diamond
13 models scored
Leader: Gemini 3.1 Pro 84.2%
LiveCodeBench
7 models scored
Leader: GPT-5.3 Codex 84.2%
HumanEval
20 models scored
Leader: GPT-5.3 Codex 96.8%