AI Flash Report

MMLU-Pro leaderboard

MMLU-Pro is a 12,000-question multitask benchmark covering 14 subject areas, designed to differentiate frontier models on harder reasoning.

36 models ranked, highest score first.

MMLU-Pro leaderboard — 36 models ranked by score
# Model Company Score
1 Gemini 3.1 Pro Google 93.8%
2 Claude Sonnet 4.6 Anthropic 92.1%
3 Kimi K2 Moonshot AI 91.3%
4 Claude Opus 4.1 Anthropic 91.2%
5 Gemini Ultra Google 90.0%
6 Claude Opus 4.5 Anthropic 89.5%
7 Gemini 3 Pro Google 89.4%
8 GLM-5 Zhipu AI 88.7%
9 Claude Sonnet 4 Anthropic 88.7%
10 GPT-5.2 OpenAI 87.4%
11 GPT-5.1 OpenAI 87.0%
12 GPT-4 OpenAI 86.4%
13 DeepSeek V3.2 DeepSeek 86.2%
14 Claude Sonnet 3.7 Anthropic 86.1%
15 Gemini 2.5 Flash Google 83.2%
16 Mistral Large 3 Mistral 80.7%
17 GPT-5 OpenAI 80.6%
18 Gemini Pro Google 79.1%
19 Claude 2 Anthropic 78.5%
20 PaLM 2 Google 78.3%
21 Gemini 2.0 Flash Google 78.2%
22 DeepSeek-V3 DeepSeek 75.2%
23 Claude 3 Haiku Anthropic 75.2%
24 Claude 3.5 Sonnet Anthropic 75.1%
25 Claude 1.3 Anthropic 75.0%
26 Grok-2 xAI 70.9%
27 ChatGPT (GPT-3.5 Turbo) OpenAI 70.0%
28 Claude 3 Opus Anthropic 69.6%
29 GPT-4 Turbo OpenAI 69.4%
30 PaLM Google 69.3%
31 Llama 2 70B Meta 68.9%
32 Gemini 1.5 Pro Google 65.7%
33 Claude 3 Sonnet Anthropic 57.9%
34 Mistral Large Mistral 51.5%
35 Claude 2.1 Anthropic 49.5%
36 GPT-3 OpenAI 43.9%