AI Flash Report

GPQA Diamond leaderboard

GPQA Diamond is the hardest tier of the GPQA benchmark: graduate-level science questions written to be "Google-proof", i.e. hard to answer correctly even with unrestricted web search.

101 models ranked, highest score first.

# | Model | Company | Score
1 | Gemini 3.1 Pro Preview | Google | 94.1%
2 | GPT-5.5 | OpenAI | 93.5%
3 | MiniMax-M3 | MiniMax | 92.9%
4 | Qwen3.7 Max | Alibaba | 92.3%
5 | Claude Opus 4.8 | Anthropic | 92.0%
6 | GPT-5.4 | OpenAI | 92.0%
7 | GPT-5.3 Codex | OpenAI | 91.5%
8 | Claude Opus 4.7 | Anthropic | 91.4%
9 | Kimi K2.6 | Kimi | 91.1%
10 | Grok 4.20 0309 v2 | xAI | 91.1%
11 | GPT-5.2 | OpenAI | 90.3%
12 | Grok 4.3 | xAI | 90.1%
13 | Qwen3.7 Plus | Alibaba | 90.0%
14 | DeepSeek V4 Flash | DeepSeek | 89.4%
15 | Qwen3.5 397B A17B | Alibaba | 89.3%
16 | DeepSeek V4 Pro | DeepSeek | 88.8%
17 | Qwen3.6 Max Preview | Alibaba | 88.8%
18 | Grok 4.20 0309 | xAI | 88.5%
19 | Muse Spark | Meta | 88.4%
20 | Qwen3.6 Plus | Alibaba | 88.2%
21 | GPT-5.4 mini | OpenAI | 87.5%
22 | MiniMax-M2.7 | MiniMax | 87.4%
23 | GPT-5.1 | OpenAI | 87.3%
24 | MiMo-V2-Pro | Xiaomi | 87.0%
25 | GLM-5.1 | Z AI | 86.8%
26 | Nemotron 3 Ultra 550B A55B | NVIDIA | 86.7%
27 | Hy3-preview | Tencent | 86.7%
28 | MiMo-V2.5-Pro | Xiaomi | 86.6%
29 | Claude Opus 4.5 | Anthropic | 86.6%
30 | Qwen3.5 27B | Alibaba | 85.8%
31 | Ring-2.6-1T | InclusionAI | 85.7%
32 | Gemma 4 31B | Google | 85.7%
33 | Qwen3.5 122B A10B | Alibaba | 85.7%
34 | MiMo-V2-Omni-0327 | Xiaomi | 85.5%
35 | MiMo-V2.5 | Xiaomi | 84.9%
36 | GLM-5-Turbo | Z AI | 84.7%
37 | GPT-5.5 Instant | OpenAI | 84.6%
38 | Qwen3.5 35B A3B | Alibaba | 84.5%
39 | Qwen3.6 27B | Alibaba | 84.2%
40 | Gemini 3.1 Pro | Google | 84.2%
41 | Qwen3.6 35B A3B | Alibaba | 84.1%
42 | DeepSeek V3.2 | DeepSeek | 84.0%
43 | JT-35B-Flash | China Mobile | 82.9%
44 | Gemini 3.5 Flash | Google | 82.8%
45 | MiMo-V2-Omni | Xiaomi | 82.8%
46 | Step 3.5 Flash 2603 | StepFun | 82.6%
47 | Qwen3.5 Omni Plus | Alibaba | 82.6%
48 | Gemini 3.1 Flash-Lite Preview | Google | 82.2%
49 | GPT-5.4 nano | OpenAI | 81.7%
50 | Step 3.7 Flash | StepFun | 80.9%
51 | GLM 5V Turbo | Z AI | 80.9%
52 | Qwen3.5 9B | Alibaba | 80.6%
53 | NVIDIA Nemotron 3 Super 120B A12B | NVIDIA | 80.0%
54 | Claude Sonnet 4.6 | Anthropic | 79.7%
55 | EXAONE 4.5 33B | LG AI Research | 79.4%
56 | Gemma 4 26B A4B | Google | 79.2%
57 | Claude Opus 4.1 | Anthropic | 79.1%
58 | Gemini 2.5 Flash | Google | 79.0%
59 | Gemini 3 Pro | Google | 78.5%
60 | Qwen3.5 4B | Alibaba | 77.1%
61 | Mercury 2 | Inception | 77.0%
62 | Mistral Small 4 | Mistral | 76.9%
63 | Nemotron Cascade 2 30B A3B | NVIDIA | 75.8%
64 | Gemma 4 12B | Google | 75.3%
65 | Ling-2.6-1T | InclusionAI | 75.2%
66 | Trinity Large Thinking | Arcee AI | 75.2%
67 | Mistral Medium 3.5 | Mistral | 74.8%
68 | Qwen3.5 Omni Flash | Alibaba | 74.2%
69 | Kimi K2 | Moonshot AI | 74.1%
70 | Claude Sonnet 4 | Anthropic | 74.0%
71 | Sarvam 105B | Sarvam | 73.8%
72 | Solar Pro 3 | Upstage | 72.4%
73 | Claude Sonnet 3.7 | Anthropic | 68.3%
74 | Mistral Large 3 | Mistral | 68.0%
75 | JT-MINI | China Mobile | 67.6%
76 | GPT-5 | OpenAI | 67.3%
77 | Gemini 2.0 Flash | Google | 63.6%
78 | Sarvam 30B | Sarvam | 63.3%
79 | Ling 2.6 Flash | InclusionAI | 59.3%
80 | Gemma 4 E4B | Google | 57.6%
81 | Claude 3.5 Sonnet | Anthropic | 56.0%
82 | DeepSeek-V3 | DeepSeek | 55.7%
83 | NVIDIA Nemotron 3 Nano 4B | NVIDIA | 51.3%
84 | Grok-2 | xAI | 51.0%
85 | Claude 3 Opus | Anthropic | 48.9%
86 | Granite 4.1 30B | IBM | 48.1%
87 | LFM2 24B A2B | Liquid AI | 47.4%
88 | Nemotron 3 Nano Omni 30B A3B Reasoning | NVIDIA | 46.9%
89 | Qwen3.5 2B | Alibaba | 45.6%
90 | Granite 4.1 8B | IBM | 43.3%
91 | Gemma 4 E2B | Google | 43.3%
92 | Claude 3 Sonnet | Anthropic | 40.0%
93 | Claude 3 Haiku | Anthropic | 37.4%
94 | Gemini 1.5 Pro | Google | 37.1%
95 | Mistral Large | Mistral | 35.1%
96 | Claude 2.1 | Anthropic | 31.9%
97 | Granite 4.1 3B | IBM | 31.4%
98 | MiniCPM-V 4.6 1.3B | OpenBMB | 30.5%
99 | Tiny Aya Global | Cohere | 30.5%
100 | MiniCPM5-1B | OpenBMB | 27.8%
101 | Qwen3.5 0.8B | Alibaba | 11.1%
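
The table above numbers rows consecutively even where scores tie (for example, rows 5 and 6 both show 92.0%, and rows 9 and 10 both show 91.1%). As a rough illustration only, the sketch below shows one way tied scores could share a rank, using "competition" ranking (1, 1, 3, ...). The function name `competition_ranks` and its `first_rank` parameter are hypothetical, and nothing here is claimed to be how this leaderboard is actually computed. The sample rows are copied from the table.

```python
from typing import List, Tuple

# A few (model, score) pairs copied from rows 5-11 of the table; scores are percentages.
rows: List[Tuple[str, float]] = [
    ("Claude Opus 4.8", 92.0),
    ("GPT-5.4", 92.0),
    ("GPT-5.3 Codex", 91.5),
    ("Claude Opus 4.7", 91.4),
    ("Kimi K2.6", 91.1),
    ("Grok 4.20 0309 v2", 91.1),
    ("GPT-5.2", 90.3),
]


def competition_ranks(
    rows: List[Tuple[str, float]], first_rank: int = 1
) -> List[Tuple[int, str, float]]:
    """Sort by score (highest first) and give tied scores the same rank.

    Hypothetical helper: `first_rank` is the rank assigned to the top row,
    so a slice of a larger table can keep its original numbering.
    """
    # sorted() is stable, so tied rows keep their input order.
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    for i, (model, score) in enumerate(ordered):
        if i > 0 and score == ordered[i - 1][1]:
            rank = ranked[-1][0]      # tie: reuse the previous row's rank
        else:
            rank = first_rank + i     # otherwise rank follows the position
        ranked.append((rank, model, score))
    return ranked


if __name__ == "__main__":
    for rank, model, score in competition_ranks(rows, first_rank=5):
        print(f"{rank:>3}  {model:<20} {score:.1f}%")
```

Under this scheme the two 92.0% models would both be ranked 5 and the next model 7, whereas the table lists them as 5 and 6. Either convention is defensible; the difference only matters when comparing ranks rather than scores.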