AI Flash Report

AI benchmarks

Sortable leaderboards for the benchmarks that frontier labs report. Select a benchmark to see every model's score.

MMLU-Pro

33 models scored

Leader: Gemini 3.1 Pro (93.8%)

ARC-AGI-2

1 model scored

Leader: Gemini 3.1 Pro (77.1%)

SWE-bench Verified

14 models scored

Leader: GPT-5.3 Codex (82.4%)

GPQA Diamond

13 models scored

Leader: Gemini 3.1 Pro (84.2%)

LiveCodeBench

7 models scored

Leader: GPT-5.3 Codex (84.2%)

HumanEval

20 models scored

Leader: GPT-5.3 Codex (96.8%)