AI Flash Report

GPQA Diamond leaderboard

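GPQA Diamond is the hardest subset of the GPQA benchmark: graduate-level multiple-choice science questions (biology, physics, and chemistry) designed to be "Google-proof", meaning they stay hard to answer even with unrestricted web search.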

13 models ranked, highest score first.

#    Model               Company        Score
1    Gemini 3.1 Pro      Google         84.2%
2    Claude Opus 4.5     Anthropic      82.4%
3    GPT-5.2             OpenAI         80.1%
4    Claude Opus 4.1     Anthropic      79.1%
5    Gemini 3 Pro        Google         78.5%
6    Claude Sonnet 4.6   Anthropic      78.4%
7    GPT-5.1             OpenAI         77.8%
8    GPT-5               OpenAI         74.2%
9    Kimi K2             Moonshot AI    74.1%
10   Claude Sonnet 4     Anthropic      74.0%
11   DeepSeek V3.2       DeepSeek       68.4%
12   Claude Sonnet 3.7   Anthropic      68.3%
13   Claude 3.5 Sonnet   Anthropic      59.4%