# SWE-bench Verified leaderboard
SWE-bench Verified is a human-validated subset of SWE-bench: real-world software engineering tasks drawn from GitHub issues, where a model must produce a working patch. The 14 models below are ranked by score, highest first.
| # | Model | Company | Score |
|---|---|---|---|
| 1 | GPT-5.3 Codex | OpenAI | 82.4% |
| 2 | Claude Sonnet 4.6 | Anthropic | 80.8% |
| 3 | Claude Opus 4.5 | Anthropic | 78.9% |
| 4 | GPT-5.2 Codex | OpenAI | 78.2% |
| 5 | Claude Opus 4.1 | Anthropic | 74.5% |
| 6 | GPT-5.2 | OpenAI | 72.5% |
| 7 | Gemini 3.1 Pro | Google | 72.3% |
| 8 | Claude Sonnet 4 | Anthropic | 72.3% |
| 9 | GPT-5.1 | OpenAI | 70.1% |
| 10 | Gemini 3 Pro | Google | 68.2% |
| 11 | GPT-5 | OpenAI | 67.4% |
| 12 | Kimi K2 | Moonshot AI | 65.8% |
| 13 | Claude Sonnet 3.7 | Anthropic | 62.3% |
| 14 | Claude 3.5 Sonnet | Anthropic | 49.0% |
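For anyone consuming these numbers programmatically, the ranking above can be reproduced from the raw scores with a small sketch. The `SCORES` list and `rank` helper below are illustrative, not part of any official leaderboard tooling; ties (e.g. the two 72.3% entries) keep their input order.

```python
# Snapshot of the leaderboard above as (model, score %) pairs,
# listed here in an arbitrary order to show that rank() re-sorts them.
SCORES = [
    ("Claude 3.5 Sonnet", 49.0),
    ("GPT-5.3 Codex", 82.4),
    ("Claude Sonnet 4.6", 80.8),
    ("Claude Opus 4.5", 78.9),
    ("GPT-5.2 Codex", 78.2),
    ("Claude Opus 4.1", 74.5),
    ("GPT-5.2", 72.5),
    ("Gemini 3.1 Pro", 72.3),
    ("Claude Sonnet 4", 72.3),
    ("GPT-5.1", 70.1),
    ("Gemini 3 Pro", 68.2),
    ("GPT-5", 67.4),
    ("Kimi K2", 65.8),
    ("Claude Sonnet 3.7", 62.3),
]

def rank(scores):
    """Sort by score descending (stable, so ties keep input order)
    and assign 1-based ranks."""
    ordered = sorted(scores, key=lambda row: row[1], reverse=True)
    return [(i + 1, name, score) for i, (name, score) in enumerate(ordered)]

leaderboard = rank(SCORES)
# leaderboard[0] → (1, "GPT-5.3 Codex", 82.4)
```

Python's `sorted` is stable, so for equal scores the relative input order is preserved rather than broken arbitrarily.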