AI Flash Report

MMLU-Pro leaderboard

MMLU-Pro is a roughly 12,000-question multiple-choice benchmark spanning 14 subject areas. Compared with the original MMLU, it expands each question from four answer options to ten and emphasizes harder reasoning, which makes it better at differentiating frontier models.

33 models ranked, highest score first.

MMLU-Pro leaderboard — 33 models ranked by score
 #  Model                     Company       Score
 1  Gemini 3.1 Pro            Google        93.8%
 2  Claude Opus 4.5           Anthropic     92.8%
 3  Claude Sonnet 4.6         Anthropic     92.1%
 4  Kimi K2                   Moonshot AI   91.3%
 5  Claude Opus 4.1           Anthropic     91.2%
 6  GPT-5.2                   OpenAI        90.8%
 7  DeepSeek V3.2             DeepSeek      90.1%
 8  Gemini Ultra              Google        90.0%
 9  Mistral Large 3           Mistral       89.4%
10  Gemini 3 Pro              Google        89.4%
11  GPT-5.1                   OpenAI        89.2%
12  GLM-5                     Zhipu AI      88.7%
13  Claude Sonnet 4           Anthropic     88.7%
14  Claude 3.5 Sonnet         Anthropic     88.7%
15  DeepSeek-V3               DeepSeek      88.5%
16  GPT-5                     OpenAI        87.5%
17  Claude 3 Opus             Anthropic     86.8%
18  GPT-4                     OpenAI        86.4%
19  Claude Sonnet 3.7         Anthropic     86.1%
20  Gemini 2.5 Flash          Google        82.8%
21  Gemini 1.5 Pro            Google        81.9%
22  Gemini Pro                Google        79.1%
23  Claude 3 Sonnet           Anthropic     79.0%
24  Claude 2                  Anthropic     78.5%
25  PaLM 2                    Google        78.3%
26  Gemini 2.0 Flash          Google        76.4%
27  Claude 3 Haiku            Anthropic     75.2%
28  Claude 1.3                Anthropic     75.0%
29  Claude 2.1                Anthropic     73.1%
30  ChatGPT (GPT-3.5 Turbo)   OpenAI        70.0%
31  PaLM                      Google        69.3%
32  Llama 2 70B               Meta          68.9%
33  GPT-3                     OpenAI        43.9%
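For readers who want to work with these numbers programmatically, the ranking above can be reproduced with a short, self-contained sketch. The data is transcribed from the table; the variable names are ours and not part of any published artifact.

```python
# MMLU-Pro leaderboard as (model, company, score %) tuples,
# transcribed from the table above.
leaderboard = [
    ("Gemini 3.1 Pro", "Google", 93.8),
    ("Claude Opus 4.5", "Anthropic", 92.8),
    ("Claude Sonnet 4.6", "Anthropic", 92.1),
    ("Kimi K2", "Moonshot AI", 91.3),
    ("Claude Opus 4.1", "Anthropic", 91.2),
    ("GPT-5.2", "OpenAI", 90.8),
    ("DeepSeek V3.2", "DeepSeek", 90.1),
    ("Gemini Ultra", "Google", 90.0),
    ("Mistral Large 3", "Mistral", 89.4),
    ("Gemini 3 Pro", "Google", 89.4),
    ("GPT-5.1", "OpenAI", 89.2),
    ("GLM-5", "Zhipu AI", 88.7),
    ("Claude Sonnet 4", "Anthropic", 88.7),
    ("Claude 3.5 Sonnet", "Anthropic", 88.7),
    ("DeepSeek-V3", "DeepSeek", 88.5),
    ("GPT-5", "OpenAI", 87.5),
    ("Claude 3 Opus", "Anthropic", 86.8),
    ("GPT-4", "OpenAI", 86.4),
    ("Claude Sonnet 3.7", "Anthropic", 86.1),
    ("Gemini 2.5 Flash", "Google", 82.8),
    ("Gemini 1.5 Pro", "Google", 81.9),
    ("Gemini Pro", "Google", 79.1),
    ("Claude 3 Sonnet", "Anthropic", 79.0),
    ("Claude 2", "Anthropic", 78.5),
    ("PaLM 2", "Google", 78.3),
    ("Gemini 2.0 Flash", "Google", 76.4),
    ("Claude 3 Haiku", "Anthropic", 75.2),
    ("Claude 1.3", "Anthropic", 75.0),
    ("Claude 2.1", "Anthropic", 73.1),
    ("ChatGPT (GPT-3.5 Turbo)", "OpenAI", 70.0),
    ("PaLM", "Google", 69.3),
    ("Llama 2 70B", "Meta", 68.9),
    ("GPT-3", "OpenAI", 43.9),
]

# Rank by score, highest first. sorted() is stable, so tied
# models (e.g. the three at 88.7%) keep their table order.
ranked = sorted(leaderboard, key=lambda entry: entry[2], reverse=True)

# Spread between the best and worst scores, rounded to one decimal.
spread = round(ranked[0][2] - ranked[-1][2], 1)

print(f"Top: {ranked[0][0]} ({ranked[0][2]}%)")
print(f"Bottom: {ranked[-1][0]} ({ranked[-1][2]}%)")
print(f"Spread: {spread} points")
```

Because the sort is stable, models with identical scores retain the order shown in the table, matching the published ranks.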