# HumanEval leaderboard
HumanEval is OpenAI's benchmark of 164 hand-written Python programming problems: each task gives a function signature and docstring, and the model's completion is judged on functional correctness against unit tests. The 20 models below are ranked by score, highest first.
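To make "function-completion correctness" concrete: a completion counts as passing only if the generated function body satisfies the task's unit tests when executed. Below is a minimal sketch of that check; the task, completion, and test here are illustrative stand-ins written in HumanEval's format, not actual dataset entries, and a real harness would also sandbox and time-limit execution.

```python
# Minimal sketch of HumanEval-style functional-correctness checking.
# The task below is an illustrative stand-in, not a real HumanEval entry.

PROMPT = '''
def add_numbers(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# The model's completion is the function body appended to the prompt.
COMPLETION = "    return a + b\n"

# Each task ships unit tests; a completion passes only if they all hold.
TEST = '''
def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''

def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Execute prompt + completion, then run the task's tests against it."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)  # define the candidate function
        exec(test, namespace)                 # define check()
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False

print(passes(PROMPT, COMPLETION, TEST, "add_numbers"))  # True
```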
| # | Model | Company | Score |
|---|---|---|---|
| 1 | GPT-5.3 Codex | OpenAI | 96.8% |
| 2 | Claude Sonnet 4.6 | Anthropic | 95.2% |
| 3 | GPT-5.2 Codex | OpenAI | 95.1% |
| 4 | DeepSeek V3.2 | DeepSeek | 92.5% |
| 5 | Claude 3.5 Sonnet | Anthropic | 92.0% |
| 6 | Mistral Large 3 | Mistral | 91.2% |
| 7 | Claude 3 Opus | Anthropic | 84.9% |
| 8 | DeepSeek-V3 | DeepSeek | 82.6% |
| 9 | Claude 3 Haiku | Anthropic | 75.9% |
| 10 | Gemini Ultra | Google | 74.4% |
| 11 | Claude 3 Sonnet | Anthropic | 73.0% |
| 12 | Gemini 1.5 Pro | Google | 71.9% |
| 13 | Claude 2 | Anthropic | 71.2% |
| 14 | Claude 2.1 | Anthropic | 70.0% |
| 15 | Gemini Pro | Google | 67.7% |
| 16 | GPT-4 | OpenAI | 67.0% |
| 17 | Code Llama 34B | Meta | 48.8% |
| 18 | ChatGPT (GPT-3.5 Turbo) | OpenAI | 48.1% |
| 19 | Llama 2 70B | Meta | 29.9% |
| 20 | Codex | OpenAI | 28.8% |
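A note on how such percentages are computed: HumanEval results are conventionally reported as pass@k, most often pass@1 on leaderboards like this one. The Codex paper that introduced the benchmark defines an unbiased estimator, pass@k = 1 − C(n−c, k) / C(n, k), for n sampled completions per problem of which c pass. A small sketch (the example numbers are hypothetical):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval (Codex) paper:
    1 - C(n-c, k) / C(n, k), for n samples of which c passed."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical example: 200 samples drawn, 58 correct.
print(pass_at_k(200, 58, 1))   # 0.29 -- pass@1 is just the empirical rate
print(pass_at_k(200, 58, 10))  # chance that at least 1 of 10 draws passes
```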