AI Flash Report

HumanEval leaderboard

HumanEval is OpenAI's benchmark of 164 hand-written programming problems that scores models on the functional correctness of their code completions.
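A model passes a HumanEval problem when its generated completion satisfies the problem's unit tests; scores are typically reported as pass@1. A minimal sketch of that scoring scheme, using the unbiased pass@k estimator from the HumanEval paper and problem HumanEval/0 (`has_close_elements`) from the public release; the candidate solution shown is illustrative, not any particular model's output:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n-c, k) / C(n, k), given n samples of which c passed."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def check(candidate):
    # Unit tests in the style of HumanEval/0's hidden test suite
    assert candidate([1.0, 2.0, 3.0], 0.5) is False
    assert candidate([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True

# An illustrative model-written completion for HumanEval/0:
# "check if any two numbers are closer than the given threshold"
def has_close_elements(numbers, threshold):
    return any(abs(a - b) < threshold
               for i, a in enumerate(numbers)
               for b in numbers[i + 1:])

check(has_close_elements)            # passes -> counts toward pass@1
print(pass_at_k(n=10, c=9, k=1))     # 9 of 10 samples passed -> 0.9
```

With one sample per problem (n = k = 1), pass@1 reduces to the fraction of problems whose single completion passes all tests, which is the number the leaderboard below reports.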

20 models ranked, highest score first.

| # | Model | Company | Score |
|---|-------|---------|-------|
| 1 | GPT-5.3 Codex | OpenAI | 96.8% |
| 2 | Claude Sonnet 4.6 | Anthropic | 95.2% |
| 3 | GPT-5.2 Codex | OpenAI | 95.1% |
| 4 | DeepSeek V3.2 | DeepSeek | 92.5% |
| 5 | Claude 3.5 Sonnet | Anthropic | 92.0% |
| 6 | Mistral Large 3 | Mistral | 91.2% |
| 7 | Claude 3 Opus | Anthropic | 84.9% |
| 8 | DeepSeek-V3 | DeepSeek | 82.6% |
| 9 | Claude 3 Haiku | Anthropic | 75.9% |
| 10 | Gemini Ultra | Google | 74.4% |
| 11 | Claude 3 Sonnet | Anthropic | 73.0% |
| 12 | Gemini 1.5 Pro | Google | 71.9% |
| 13 | Claude 2 | Anthropic | 71.2% |
| 14 | Claude 2.1 | Anthropic | 70.0% |
| 15 | Gemini Pro | Google | 67.7% |
| 16 | GPT-4 | OpenAI | 67.0% |
| 17 | Code Llama 34B | Meta | 48.8% |
| 18 | ChatGPT (GPT-3.5 Turbo) | OpenAI | 48.1% |
| 19 | Llama 2 70B | Meta | 29.9% |
| 20 | Codex | OpenAI | 28.8% |