AI Flash Report

HumanEval leaderboard

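HumanEval is OpenAI's benchmark of 164 hand-written programming problems. Each problem gives the model a function signature and docstring, and a completion counts as correct only if it passes the problem's unit tests.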
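To make "function-completion correctness" concrete, here is a minimal sketch of how a HumanEval-style score could be computed. The two problems, their completions, and the `passes` helper are hypothetical illustrations, not items from the real benchmark or its official harness. The sketch also assumes one completion per problem and runs code with plain `exec`, whereas a real evaluation would sandbox untrusted model output.

```python
# Hypothetical HumanEval-style problems: prompts, completions, and tests
# below are illustrative and are NOT taken from the real benchmark.
PROBLEMS = [
    {
        "entry_point": "add",
        "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
        "completion": "    return a + b\n",          # correct completion
        "test": "def check(f):\n    assert f(2, 3) == 5\n    assert f(-1, 1) == 0\n",
    },
    {
        "entry_point": "is_even",
        "prompt": 'def is_even(n):\n    """Return True if n is even."""\n',
        "completion": "    return n % 2 == 1\n",     # incorrect completion
        "test": "def check(f):\n    assert f(4) is True\n    assert f(7) is False\n",
    },
]


def passes(problem):
    """Return True if prompt + completion passes every assertion in the test.

    Assumption: plain exec() is used for brevity. It is unsafe for
    untrusted model output; a real harness would isolate execution.
    """
    namespace = {}
    program = problem["prompt"] + problem["completion"] + "\n" + problem["test"]
    try:
        exec(program, namespace)
        namespace["check"](namespace[problem["entry_point"]])
        return True
    except Exception:
        return False


results = [passes(p) for p in PROBLEMS]
# Score = percentage of problems whose completion passed its tests.
score = 100.0 * sum(results) / len(results)
print(results, f"{score:.1f}%")
```

Under these assumptions, a leaderboard percentage is simply the share of the 164 problems solved, so a single problem is worth roughly 0.6 percentage points.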

26 models ranked, highest score first.

#   Model                    Company    Score
1   GPT-5.3 Codex            OpenAI     96.8%
2   Gemini 2.5 Flash         Google     96.2%
3   Claude Sonnet 4.6        Anthropic  95.2%
4   GPT-5.2 Codex            OpenAI     95.1%
5   GPT-5                    OpenAI     95.1%
6   DeepSeek V3.2            DeepSeek   92.5%
7   GPT-4 Turbo              OpenAI     91.8%
8   Mistral Large 3          Mistral    91.2%
9   Gemini 2.0 Flash         Google     90.7%
10  DeepSeek-V3              DeepSeek   90.6%
11  Claude 3.5 Sonnet        Anthropic  89.9%
12  Grok-2                   xAI        86.3%
13  Claude 3 Opus            Anthropic  84.8%
14  Gemini 1.5 Pro           Google     83.4%
15  Claude 3 Haiku           Anthropic  75.7%
16  Gemini Ultra             Google     74.4%
17  Claude 3 Sonnet          Anthropic  71.3%
18  Claude 2                 Anthropic  71.2%
19  Mistral Large            Mistral    70.6%
20  Gemini Pro               Google     67.7%
21  GPT-4                    OpenAI     67.0%
22  Code Llama 34B           Meta       48.8%
23  ChatGPT (GPT-3.5 Turbo)  OpenAI     48.1%
24  Llama 2 70B              Meta       29.9%
25  Codex                    OpenAI     28.8%
26  Claude 2.1               Anthropic  15.9%