Models are ranked by their MATH, GPQA Diamond, and MMLU benchmark scores, covering advanced mathematics, PhD-level science questions, and general knowledge.
Updated April 2026 · Source: AI Flash Report model database
| # | Model | Vendor | MATH | GPQA Diamond | MMLU |
|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | Google | 89.4% | 84.2% | 93.8% |
| 2 | GPT-5.2 | OpenAI | 88.5% | 80.1% | 92.8% |
| 3 | o1-preview | OpenAI | 83% | 78% | — |
| 4 | GPT-5 | OpenAI | 85.0% | 74.2% | 91.0% |
| 5 | DeepSeek V3.2 | DeepSeek | 85.6% | 68.4% | 90.1% |
| 6 | Claude Sonnet 4 | Anthropic | 76.8% | 74.0% | 91.2% |
| 7 | Claude Sonnet 3.7 | Anthropic | 74.1% | 68.3% | 89.5% |
| 8 | Claude 3.5 Sonnet | Anthropic | 71.1% | 59.4% | 88.7% |
| 9 | Claude 3 Opus | Anthropic | 60.1% | — | 86.8% |
| 10 | Claude 2 | Anthropic | 88.0% | — | 78.5% |
| 11 | Claude Opus 4.5 | Anthropic | — | 82.4% | 91.5% |
| 12 | Claude Opus 4.1 | Anthropic | — | 79.1% | 90.8% |
| 13 | Gemini 3 Pro | Google | — | 78.5% | 93.2% |
| 14 | Claude Sonnet 4.6 | Anthropic | — | 78.4% | 92.1% |
| 15 | GPT-5.1 | OpenAI | — | 77.8% | 92.5% |