AI Flash Report

TAU2-bench leaderboard

88 models ranked, highest score first.

| Rank | Model | Company | Score |
|---|---|---|---|
| 1 | JT-35B-Flash | China Mobile | 99.1% |
| 2 | Step 3.7 Flash | StepFun | 98.5% |
| 3 | GLM 5V Turbo | Z AI | 98.5% |
| 4 | GLM-5-Turbo | Z AI | 98.5% |
| 5 | Grok 4.3 | xAI | 97.7% |
| 6 | GLM-5.1 | Z AI | 97.7% |
| 7 | Qwen3.6 Plus | Alibaba | 97.7% |
| 8 | Grok 4.20 0309 | xAI | 96.5% |
| 9 | DeepSeek V4 Pro | DeepSeek | 96.2% |
| 10 | Qwen3.6 Max Preview | Alibaba | 95.9% |
| 11 | Kimi K2.6 | Kimi | 95.9% |
| 12 | Gemini 3.1 Pro Preview | Google | 95.6% |
| 13 | Qwen3.5 397B A17B | Alibaba | 95.6% |
| 14 | Qwen3.6 35B A3B | Alibaba | 95.3% |
| 15 | DeepSeek V4 Flash | DeepSeek | 95.0% |
| 16 | MiMo-V2-Pro | Xiaomi | 95.0% |
| 17 | Qwen3.7 Max | Alibaba | 94.7% |
| 18 | Claude Opus 4.8 | Anthropic | 94.4% |
| 19 | Mistral Medium 3.5 | Mistral | 94.2% |
| 20 | Qwen3.6 27B | Alibaba | 94.2% |
| 21 | MiMo-V2.5-Pro | Xiaomi | 94.2% |
| 22 | GPT-5.5 | OpenAI | 93.9% |
| 23 | Qwen3.5 27B | Alibaba | 93.9% |
| 24 | Qwen3.5 122B A10B | Alibaba | 93.6% |
| 25 | Qwen3.7 Plus | Alibaba | 93.0% |
| 26 | JT-MINI | China Mobile | 93.0% |
| 27 | Grok 4.20 0309 v2 | xAI | 93.0% |
| 28 | Hy3-preview | Tencent | 92.7% |
| 29 | Ring-2.6-1T | InclusionAI | 92.4% |
| 30 | Qwen3.5 4B | Alibaba | 92.1% |
| 31 | Muse Spark | Meta | 91.5% |
| 32 | MiMo-V2-Omni | Xiaomi | 91.2% |
| 33 | MiMo-V2.5 | Xiaomi | 90.6% |
| 34 | DeepSeek V3.2 | DeepSeek | 90.6% |
| 35 | Trinity Large Thinking | Arcee AI | 90.1% |
| 36 | Ling-2.6-1T | InclusionAI | 89.8% |
| 37 | Claude Opus 4.5 | Anthropic | 89.5% |
| 38 | Qwen3.5 35B A3B | Alibaba | 89.2% |
| 39 | MiniMax-M3 | MiniMax | 88.9% |
| 40 | Claude Opus 4.7 | Anthropic | 88.6% |
| 41 | Qwen3.5 Omni Plus | Alibaba | 88.3% |
| 42 | MiMo-V2-Omni-0327 | Xiaomi | 88.0% |
| 43 | MiniCPM-V 4.6 1.3B | OpenBMB | 87.7% |
| 44 | Step 3.5 Flash 2603 | StepFun | 87.4% |
| 45 | GPT-5.4 | OpenAI | 87.1% |
| 46 | Qwen3.5 9B | Alibaba | 86.8% |
| 47 | Solar Pro 3 | Upstage | 86.3% |
| 48 | Ling 2.6 Flash | InclusionAI | 86.0% |
| 49 | GPT-5.3 Codex | OpenAI | 86.0% |
| 50 | MiniMax-M2.7 | MiniMax | 84.8% |
| 51 | GPT-5.2 | OpenAI | 84.8% |
| 52 | Qwen3.5 Omni Flash | Alibaba | 84.5% |
| 53 | Nemotron 3 Ultra 550B A55B | NVIDIA | 83.3% |
| 54 | GPT-5.4 mini | OpenAI | 83.3% |
| 55 | GPT-5.1 | OpenAI | 81.9% |
| 56 | MiniCPM5-1B | OpenBMB | 81.0% |
| 57 | Claude Sonnet 4.6 | Anthropic | 78.9% |
| 58 | EXAONE 4.5 33B | LG AI Research | 78.1% |
| 59 | GPT-5.4 nano | OpenAI | 76.0% |
| 60 | Mercury 2 | Inception | 70.8% |
| 61 | Qwen3.5 2B | Alibaba | 69.0% |
| 62 | NVIDIA Nemotron 3 Super 120B A12B | NVIDIA | 67.8% |
| 63 | GPT-5 | OpenAI | 67.0% |
| 64 | Gemma 4 31B | Google | 59.9% |
| 65 | Gemini 3.5 Flash | Google | 58.8% |
| 66 | Nemotron Cascade 2 30B A3B | NVIDIA | 53.2% |
| 67 | GPT-5.5 Instant | OpenAI | 49.4% |
| 68 | Qwen3.5 0.8B | Alibaba | 47.7% |
| 69 | Sarvam 105B | Sarvam | 46.8% |
| 70 | Nemotron 3 Nano Omni 30B A3B Reasoning | NVIDIA | 45.3% |
| 71 | Gemma 4 26B A4B | Google | 43.6% |
| 72 | Granite 4.1 30B | IBM | 42.1% |
| 73 | Mistral Small 4 | Mistral | 41.2% |
| 74 | Gemma 4 12B | Google | 34.8% |
| 75 | Sarvam 30B | Sarvam | 34.5% |
| 76 | Gemini 2.5 Flash | Google | 31.6% |
| 77 | Gemini 3.1 Flash-Lite Preview | Google | 31.3% |
| 78 | Gemini 2.0 Flash | Google | 29.5% |
| 79 | NVIDIA Nemotron 3 Nano 4B | NVIDIA | 28.1% |
| 80 | Granite 4.1 8B | IBM | 27.8% |
| 81 | Mistral Large 3 | Mistral | 24.6% |
| 82 | DeepSeek-V3 | DeepSeek | 22.8% |
| 83 | Claude 3 Haiku | Anthropic | 21.1% |
| 84 | Gemma 4 E4B | Google | 20.8% |
| 85 | Gemma 4 E2B | Google | 20.8% |
| 86 | Granite 4.1 3B | IBM | 19.6% |
| 87 | LFM2 24B A2B | Liquid AI | 11.1% |
| 88 | Tiny Aya Global | Cohere | 0.0% |