MMLU-Pro leaderboard
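MMLU-Pro is a 12,000-question multiple-choice benchmark covering 14 subject areas. It extends MMLU with harder, reasoning-focused questions and ten answer options per question instead of four, which makes it better at separating frontier models.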
36 models ranked, highest score first.
| # | Model | Company | Score (accuracy) |
|---|---|---|---|
| 1 | Gemini 3.1 Pro | Google | 93.8% |
| 2 | Claude Sonnet 4.6 | Anthropic | 92.1% |
| 3 | Kimi K2 | Moonshot AI | 91.3% |
| 4 | Claude Opus 4.1 | Anthropic | 91.2% |
| 5 | Gemini Ultra | Google | 90.0% |
| 6 | Claude Opus 4.5 | Anthropic | 89.5% |
| 7 | Gemini 3 Pro | Google | 89.4% |
| 8 | GLM-5 | Zhipu AI | 88.7% |
| 9 | Claude Sonnet 4 | Anthropic | 88.7% |
| 10 | GPT-5.2 | OpenAI | 87.4% |
| 11 | GPT-5.1 | OpenAI | 87.0% |
| 12 | GPT-4 | OpenAI | 86.4% |
| 13 | DeepSeek V3.2 | DeepSeek | 86.2% |
| 14 | Claude Sonnet 3.7 | Anthropic | 86.1% |
| 15 | Gemini 2.5 Flash | Google | 83.2% |
| 16 | Mistral Large 3 | Mistral | 80.7% |
| 17 | GPT-5 | OpenAI | 80.6% |
| 18 | Gemini Pro | Google | 79.1% |
| 19 | Claude 2 | Anthropic | 78.5% |
| 20 | PaLM 2 | Google | 78.3% |
| 21 | Gemini 2.0 Flash | Google | 78.2% |
| 22 | DeepSeek-V3 | DeepSeek | 75.2% |
| 23 | Claude 3 Haiku | Anthropic | 75.2% |
| 24 | Claude 3.5 Sonnet | Anthropic | 75.1% |
| 25 | Claude 1.3 | Anthropic | 75.0% |
| 26 | Grok-2 | xAI | 70.9% |
| 27 | ChatGPT (GPT-3.5 Turbo) | OpenAI | 70.0% |
| 28 | Claude 3 Opus | Anthropic | 69.6% |
| 29 | GPT-4 Turbo | OpenAI | 69.4% |
| 30 | PaLM | Google | 69.3% |
| 31 | Llama 2 70B | Meta | 68.9% |
| 32 | Gemini 1.5 Pro | Google | 65.7% |
| 33 | Claude 3 Sonnet | Anthropic | 57.9% |
| 34 | Mistral Large | Mistral | 51.5% |
| 35 | Claude 2.1 | Anthropic | 49.5% |
| 36 | GPT-3 | OpenAI | 43.9% |
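The ordering above can be reproduced from raw scores with a simple descending sort. The sketch below uses a handful of rows copied from the table; the page does not state how ties (such as GLM-5 and Claude Sonnet 4 at 88.7%) are broken, so keeping equal scores in their input order is an assumption made here for illustration.

```python
# A few rows copied from the table: (model, company, score in percent).
rows = [
    ("GLM-5", "Zhipu AI", 88.7),
    ("Claude Sonnet 4", "Anthropic", 88.7),
    ("Kimi K2", "Moonshot AI", 91.3),
    ("GPT-5.2", "OpenAI", 87.4),
]

# Sort by score, highest first. Python's sort is stable even with
# reverse=True, so models with equal scores keep their input order
# (an assumed tie-break; the leaderboard does not specify one).
ranked = sorted(rows, key=lambda r: r[2], reverse=True)

# Emit the rows in the same markdown table layout as the leaderboard.
for rank, (model, company, score) in enumerate(ranked, start=1):
    print(f"| {rank} | {model} | {company} | {score:.1f}% |")
```

With these inputs, Kimi K2 ranks first and GLM-5 stays ahead of Claude Sonnet 4 only because it appears first in the input list.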