AI Flash Report

GPQA Diamond leaderboard

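GPQA Diamond is the hardest tier of the GPQA benchmark, a set of graduate-level multiple-choice science questions designed to be "Google-proof", meaning they are hard to answer correctly even with unrestricted web access.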

101 models ranked, highest score first.

Rank | Model | Company | Score
1 | GPT-5.5 | OpenAI | 93.5%
2 | MiniMax-M3 | MiniMax | 92.9%
3 | Claude Fable 5 | Anthropic | 92.6%
4 | Qwen3.7 Max | Alibaba | 92.3%
5 | Claude Opus 4.8 | Anthropic | 92.0%
6 | GPT-5.4 | OpenAI | 92.0%
7 | GPT-5.3 Codex | OpenAI | 91.5%
8 | Claude Opus 4.7 | Anthropic | 91.4%
9 | Kimi K2.6 | Kimi | 91.1%
10 | Grok 4.20 0309 v2 | xAI | 91.1%
11 | GPT-5.2 | OpenAI | 90.3%
12 | Grok 4.3 | xAI | 90.1%
13 | Qwen3.7 Plus | Alibaba | 90.0%
14 | DeepSeek V4 Flash | DeepSeek | 89.4%
15 | DeepSeek V4 Pro | DeepSeek | 88.8%
16 | Qwen3.6 Max Preview | Alibaba | 88.8%
17 | Grok 4.20 0309 | xAI | 88.5%
18 | Muse Spark | Meta | 88.4%
19 | Qwen3.6 Plus | Alibaba | 88.2%
20 | GPT-5.4 mini | OpenAI | 87.5%
21 | MiniMax-M2.7 | MiniMax | 87.4%
22 | GPT-5.1 | OpenAI | 87.3%
23 | MiMo-V2-Pro | Xiaomi | 87.0%
24 | GLM-5.1 | Z AI | 86.8%
25 | Nemotron 3 Ultra 550B A55B | NVIDIA | 86.7%
26 | Hy3-preview | Tencent | 86.7%
27 | MiMo-V2.5-Pro | Xiaomi | 86.6%
28 | Claude Opus 4.5 | Anthropic | 86.6%
29 | Qwen3.5 27B | Alibaba | 85.8%
30 | Ring-2.6-1T | InclusionAI | 85.7%
31 | Gemma 4 31B | Google | 85.7%
32 | Qwen3.5 122B A10B | Alibaba | 85.7%
33 | MiMo-V2-Omni-0327 | Xiaomi | 85.5%
34 | MiMo-V2.5 | Xiaomi | 84.9%
35 | GLM-5-Turbo | Z AI | 84.7%
36 | GPT-5.5 Instant | OpenAI | 84.6%
37 | Qwen3.5 35B A3B | Alibaba | 84.5%
38 | Qwen3.6 27B | Alibaba | 84.2%
39 | Gemini 3.1 Pro | Google | 84.2%
40 | Qwen3.6 35B A3B | Alibaba | 84.1%
41 | DeepSeek V3.2 | DeepSeek | 84.0%
42 | JT-35B-Flash | China Mobile | 82.9%
43 | Gemini 3.5 Flash | Google | 82.8%
44 | MiMo-V2-Omni | Xiaomi | 82.8%
45 | Step 3.5 Flash 2603 | StepFun | 82.6%
46 | Qwen3.5 Omni Plus | Alibaba | 82.6%
47 | Gemini 3.1 Flash-Lite Preview | Google | 82.2%
48 | GPT-5.4 nano | OpenAI | 81.7%
49 | Step 3.7 Flash | StepFun | 80.9%
50 | GLM 5V Turbo | Z AI | 80.9%
51 | Qwen3.5 9B | Alibaba | 80.6%
52 | NVIDIA Nemotron 3 Super 120B A12B | NVIDIA | 80.0%
53 | Claude Sonnet 4.6 | Anthropic | 79.7%
54 | EXAONE 4.5 33B | LG AI Research | 79.4%
55 | Gemma 4 26B A4B | Google | 79.2%
56 | Claude Opus 4.1 | Anthropic | 79.1%
57 | Gemini 2.5 Flash | Google | 79.0%
58 | Gemini 3 Pro | Google | 78.5%
59 | Qwen3.5 4B | Alibaba | 77.1%
60 | Mistral Small 4 | Mistral | 76.9%
61 | Nemotron Cascade 2 30B A3B | NVIDIA | 75.8%
62 | North Mini Code | Cohere | 75.7%
63 | Gemma 4 12B | Google | 75.3%
64 | Ling-2.6-1T | InclusionAI | 75.2%
65 | Trinity Large Thinking | Arcee AI | 75.2%
66 | Mistral Medium 3.5 | Mistral | 74.8%
67 | Qwen3.5 Omni Flash | Alibaba | 74.2%
68 | Kimi K2 | Moonshot AI | 74.1%
69 | Claude Sonnet 4 | Anthropic | 74.0%
70 | Sarvam 105B | Sarvam | 73.8%
71 | HyperNova 60B 2605 | Multiverse Computing | 73.3%
72 | Solar Pro 3 | Upstage | 72.4%
73 | Claude Sonnet 3.7 | Anthropic | 68.3%
74 | Mistral Large 3 | Mistral | 68.0%
75 | JT-MINI | China Mobile | 67.6%
76 | GPT-5 | OpenAI | 67.3%
77 | Gemini 2.0 Flash | Google | 63.6%
78 | Sarvam 30B | Sarvam | 63.3%
79 | Ling 2.6 Flash | InclusionAI | 59.3%
80 | Gemma 4 E4B | Google | 57.6%
81 | Claude 3.5 Sonnet | Anthropic | 56.0%
82 | DeepSeek-V3 | DeepSeek | 55.7%
83 | LFM2.5-8B-A1B | Liquid AI | 51.3%
84 | NVIDIA Nemotron 3 Nano 4B | NVIDIA | 51.3%
85 | Grok-2 | xAI | 51.0%
86 | Claude 3 Opus | Anthropic | 48.9%
87 | Granite 4.1 30B | IBM | 48.1%
88 | LFM2 24B A2B | Liquid AI | 47.4%
89 | Nemotron 3 Nano Omni 30B A3B Reasoning | NVIDIA | 46.9%
90 | Qwen3.5 2B | Alibaba | 45.6%
91 | Granite 4.1 8B | IBM | 43.3%
92 | Gemma 4 E2B | Google | 43.3%
93 | Claude 3 Sonnet | Anthropic | 40.0%
94 | Claude 3 Haiku | Anthropic | 37.4%
95 | Gemini 1.5 Pro | Google | 37.1%
96 | Mistral Large | Mistral | 35.1%
97 | Claude 2.1 | Anthropic | 31.9%
98 | Granite 4.1 3B | IBM | 31.4%
99 | MiniCPM-V 4.6 1.3B | OpenBMB | 30.5%
100 | MiniCPM5-1B | OpenBMB | 27.8%
101 | Qwen3.5 0.8B | Alibaba | 11.1%
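
The table numbers rows consecutively even where scores are equal (for example, ranks 5 and 6 both show 92.0%). The Python sketch below illustrates how such a "highest score first" ranking could be produced, alongside an alternative in which tied scores share a rank. It is only an illustration: the `Entry` type and both function names are hypothetical, the six rows are copied from the top of the table, and the tie-break rule (keep input order) is an assumption, since the page does not say how ties are ordered.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    model: str
    company: str
    score: float  # percent, as shown in the Score column


# Six rows copied from the top of the table, deliberately shuffled.
ROWS = [
    Entry("Qwen3.7 Max", "Alibaba", 92.3),
    Entry("Claude Opus 4.8", "Anthropic", 92.0),
    Entry("GPT-5.5", "OpenAI", 93.5),
    Entry("GPT-5.4", "OpenAI", 92.0),
    Entry("Claude Fable 5", "Anthropic", 92.6),
    Entry("MiniMax-M3", "MiniMax", 92.9),
]


def rank_sequential(entries: List[Entry]) -> List[Tuple[int, Entry]]:
    """Sort highest score first and number rows 1..n.

    Python's sort is stable, so tied scores keep their input order
    (an assumed tie-break; the source does not state one).
    """
    ordered = sorted(entries, key=lambda e: e.score, reverse=True)
    return list(enumerate(ordered, start=1))


def rank_competition(entries: List[Entry]) -> List[Tuple[int, Entry]]:
    """Same ordering, but rows with equal scores share the first tied rank."""
    ranked: List[Tuple[int, Entry]] = []
    for position, entry in rank_sequential(entries):
        if ranked and ranked[-1][1].score == entry.score:
            rank = ranked[-1][0]  # reuse the rank of the previous tied row
        else:
            rank = position
        ranked.append((rank, entry))
    return ranked


if __name__ == "__main__":
    for rank, e in rank_sequential(ROWS):
        print(f"{rank} | {e.model} | {e.company} | {e.score:.1f}%")
```

With sequential numbering the two 92.0% rows receive ranks 5 and 6, as in the table; with competition ranking both would receive rank 5.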