AI models that process text, images, audio, and video. Ranked by modality breadth and benchmark performance.
Updated April 2026 · Source: AI Flash Report model database
| # | Model | Input modalities | Output modalities | Context window | Input price (per 1M tokens) |
|---|---|---|---|---|---|
| #1 | Gemini 3.1 Pro (Google) | text, image, audio, video, PDF | text | 2M tokens | $2.50 |
| #2 | Gemini 3 Pro (Google) | text, image, audio, video, PDF | text | 1M tokens | $2.50 |
| #3 | Gemini 2.0 Flash (Google) | text, image, audio, video | text, image, audio | 1M tokens | $0.10 |
| #4 | Gemini Ultra (Google) | text, image, audio, video | text | — | — |
| #5 | Gemini 2.5 Flash (Google) | text, image, audio, video | text | 1M tokens | $0.30 |
| #6 | Gemini 1.5 Pro (Google) | text, image, audio, video | text | 1M tokens | — |
| #7 | GPT-5.2 (OpenAI) | text, image, audio | text, audio | 400K tokens | $2.00 |
| #8 | GPT-5.1 (OpenAI) | text, image, audio | text, audio | 400K tokens | $2.25 |
| #9 | GPT-5 (OpenAI) | text, image, audio | text, audio | 400K tokens | $2.50 |
| #10 | Claude Sonnet 4.6 (Anthropic) | text, image, PDF | text | 500K tokens | $3.00 |
| #11 | Claude Opus 4.5 (Anthropic) | text, image, PDF | text | 500K tokens | $15.00 |
| #12 | Claude Sonnet 4 (Anthropic) | text, image, PDF | text | 200K tokens | $3.00 |
| #13 | Claude Opus 4.1 (Anthropic) | text, image, PDF | text | 200K tokens | $15.00 |
| #14 | Claude Sonnet 3.7 (Anthropic) | text, image, PDF | text | 200K tokens | $3.00 |
| #15 | Claude 3.5 Sonnet (Anthropic) | text, image, PDF | text | 200K tokens | $3.00 |