GPQA Diamond leaderboard
GPQA Diamond is the hardest tier of the GPQA benchmark: graduate-level science questions designed to be Google-proof.
13 models ranked, highest score first.
| # | Model | Company | Score |
|---|---|---|---|
| 1 | Gemini 3.1 Pro | Google | 84.2% |
| 2 | Claude Opus 4.5 | Anthropic | 82.4% |
| 3 | GPT-5.2 | OpenAI | 80.1% |
| 4 | Claude Opus 4.1 | Anthropic | 79.1% |
| 5 | Gemini 3 Pro | Google | 78.5% |
| 6 | Claude Sonnet 4.6 | Anthropic | 78.4% |
| 7 | GPT-5.1 | OpenAI | 77.8% |
| 8 | GPT-5 | OpenAI | 74.2% |
| 9 | Kimi K2 | Moonshot AI | 74.1% |
| 10 | Claude Sonnet 4 | Anthropic | 74.0% |
| 11 | DeepSeek V3.2 | DeepSeek | 68.4% |
| 12 | Claude Sonnet 3.7 | Anthropic | 68.3% |
| 13 | Claude 3.5 Sonnet | Anthropic | 59.4% |