AI Flash Report

SWE-bench Verified leaderboard

SWE-bench Verified is a curated, human-validated subset of real-world software engineering tasks drawn from GitHub issues. For each task, a model must produce a patch that resolves the issue and passes the repository's tests.

14 models ranked, highest score first.

Rank  Model              Company      Score
1     GPT-5.3 Codex      OpenAI       82.4%
2     Claude Sonnet 4.6  Anthropic    80.8%
3     Claude Opus 4.5    Anthropic    78.9%
4     GPT-5.2 Codex      OpenAI       78.2%
5     Claude Opus 4.1    Anthropic    74.5%
6     GPT-5.2            OpenAI       72.5%
7     Gemini 3.1 Pro     Google       72.3%
8     Claude Sonnet 4    Anthropic    72.3%
9     GPT-5.1            OpenAI       70.1%
10    Gemini 3 Pro       Google       68.2%
11    GPT-5              OpenAI       67.4%
12    Kimi K2            Moonshot AI  65.8%
13    Claude Sonnet 3.7  Anthropic    62.3%
14    Claude 3.5 Sonnet  Anthropic    49.0%