AI Flash Report

TAU2-bench leaderboard

88 models ranked, highest score first.

Rank | Model | Company | Score
1 | JT-35B-Flash | China Mobile | 99.1%
2 | Claude Fable 5 | Anthropic | 98.5%
3 | Step 3.7 Flash | StepFun | 98.5%
4 | GLM 5V Turbo | Z AI | 98.5%
5 | GLM-5-Turbo | Z AI | 98.5%
6 | Grok 4.3 | xAI | 97.7%
7 | GLM-5.1 | Z AI | 97.7%
8 | Qwen3.6 Plus | Alibaba | 97.7%
9 | Grok 4.20 0309 | xAI | 96.5%
10 | DeepSeek V4 Pro | DeepSeek | 96.2%
11 | Qwen3.6 Max Preview | Alibaba | 95.9%
12 | Kimi K2.6 | Kimi | 95.9%
13 | Qwen3.6 35B A3B | Alibaba | 95.3%
14 | DeepSeek V4 Flash | DeepSeek | 95.0%
15 | MiMo-V2-Pro | Xiaomi | 95.0%
16 | Qwen3.7 Max | Alibaba | 94.7%
17 | Claude Opus 4.8 | Anthropic | 94.4%
18 | Mistral Medium 3.5 | Mistral | 94.2%
19 | Qwen3.6 27B | Alibaba | 94.2%
20 | MiMo-V2.5-Pro | Xiaomi | 94.2%
21 | GPT-5.5 | OpenAI | 93.9%
22 | Qwen3.5 27B | Alibaba | 93.9%
23 | Qwen3.5 122B A10B | Alibaba | 93.6%
24 | Qwen3.7 Plus | Alibaba | 93.0%
25 | JT-MINI | China Mobile | 93.0%
26 | Grok 4.20 0309 v2 | xAI | 93.0%
27 | Hy3-preview | Tencent | 92.7%
28 | Ring-2.6-1T | InclusionAI | 92.4%
29 | Qwen3.5 4B | Alibaba | 92.1%
30 | Muse Spark | Meta | 91.5%
31 | MiMo-V2-Omni | Xiaomi | 91.2%
32 | MiMo-V2.5 | Xiaomi | 90.6%
33 | DeepSeek V3.2 | DeepSeek | 90.6%
34 | Trinity Large Thinking | Arcee AI | 90.1%
35 | Ling-2.6-1T | InclusionAI | 89.8%
36 | Claude Opus 4.5 | Anthropic | 89.5%
37 | Qwen3.5 35B A3B | Alibaba | 89.2%
38 | MiniMax-M3 | MiniMax | 88.9%
39 | Claude Opus 4.7 | Anthropic | 88.6%
40 | Qwen3.5 Omni Plus | Alibaba | 88.3%
41 | MiMo-V2-Omni-0327 | Xiaomi | 88.0%
42 | MiniCPM-V 4.6 1.3B | OpenBMB | 87.7%
43 | Step 3.5 Flash 2603 | StepFun | 87.4%
44 | GPT-5.4 | OpenAI | 87.1%
45 | Qwen3.5 9B | Alibaba | 86.8%
46 | Solar Pro 3 | Upstage | 86.3%
47 | Ling 2.6 Flash | InclusionAI | 86.0%
48 | GPT-5.3 Codex | OpenAI | 86.0%
49 | MiniMax-M2.7 | MiniMax | 84.8%
50 | GPT-5.2 | OpenAI | 84.8%
51 | Qwen3.5 Omni Flash | Alibaba | 84.5%
52 | Nemotron 3 Ultra 550B A55B | NVIDIA | 83.3%
53 | GPT-5.4 mini | OpenAI | 83.3%
54 | GPT-5.1 | OpenAI | 81.9%
55 | MiniCPM5-1B | OpenBMB | 81.0%
56 | Claude Sonnet 4.6 | Anthropic | 78.9%
57 | EXAONE 4.5 33B | LG AI Research | 78.1%
58 | GPT-5.4 nano | OpenAI | 76.0%
59 | Qwen3.5 2B | Alibaba | 69.0%
60 | NVIDIA Nemotron 3 Super 120B A12B | NVIDIA | 67.8%
61 | GPT-5 | OpenAI | 67.0%
62 | HyperNova 60B 2605 | Multiverse Computing | 63.2%
63 | Gemma 4 31B | Google | 59.9%
64 | Gemini 3.5 Flash | Google | 58.8%
65 | Nemotron Cascade 2 30B A3B | NVIDIA | 53.2%
66 | GPT-5.5 Instant | OpenAI | 49.4%
67 | Qwen3.5 0.8B | Alibaba | 47.7%
68 | Sarvam 105B | Sarvam | 46.8%
69 | Nemotron 3 Nano Omni 30B A3B Reasoning | NVIDIA | 45.3%
70 | Gemma 4 26B A4B | Google | 43.6%
71 | Granite 4.1 30B | IBM | 42.1%
72 | Mistral Small 4 | Mistral | 41.2%
73 | North Mini Code | Cohere | 37.4%
74 | Gemma 4 12B | Google | 36.3%
75 | Sarvam 30B | Sarvam | 34.5%
76 | Gemini 2.5 Flash | Google | 31.6%
77 | Gemini 3.1 Flash-Lite Preview | Google | 31.3%
78 | Gemini 2.0 Flash | Google | 29.5%
79 | NVIDIA Nemotron 3 Nano 4B | NVIDIA | 28.1%
80 | Granite 4.1 8B | IBM | 27.8%
81 | Mistral Large 3 | Mistral | 24.6%
82 | DeepSeek-V3 | DeepSeek | 22.8%
83 | Claude 3 Haiku | Anthropic | 21.1%
84 | Gemma 4 E4B | Google | 20.8%
85 | Gemma 4 E2B | Google | 20.8%
86 | Granite 4.1 3B | IBM | 19.6%
87 | LFM2.5-8B-A1B | Liquid AI | 16.1%
88 | LFM2 24B A2B | Liquid AI | 11.1%