Best Multimodal AI Models

AI models that process text, images, audio, and video. Ranked by modality breadth and benchmark performance.

Updated April 2026 · Source: AI Flash Report model database

| # | Model | Provider | Input modalities | Output modalities | Context | Input price (per 1M tokens) |
|---|-------|----------|------------------|-------------------|---------|-----------------------------|
| 1 | Gemini 3.1 Pro | Google | text, image, audio, video, PDF | text | 2M tokens | $2.50 |
| 2 | Gemini 3 Pro | Google | text, image, audio, video, PDF | text | 1M tokens | $2.50 |
| 3 | Gemini 2.0 Flash | Google | text, image, audio, video | text, image, audio | 1M tokens | $0.10 |
| 4 | Gemini Ultra | Google | text, image, audio, video | text | not listed | not listed |
| 5 | Gemini 2.5 Flash | Google | text, image, audio, video | text | 1M tokens | $0.30 |
| 6 | Gemini 1.5 Pro | Google | text, image, audio, video | text | 1M tokens | not listed |
| 7 | GPT-5.2 | OpenAI | text, image, audio | text, audio | 400K tokens | $2.00 |
| 8 | GPT-5.1 | OpenAI | text, image, audio | text, audio | 400K tokens | $2.25 |
| 9 | GPT-5 | OpenAI | text, image, audio | text, audio | 400K tokens | $2.50 |
| 10 | Claude Sonnet 4.6 | Anthropic | text, image, PDF | text | 500K tokens | $3.00 |
| 11 | Claude Opus 4.5 | Anthropic | text, image, PDF | text | 500K tokens | $15.00 |
| 12 | Claude Sonnet 4 | Anthropic | text, image, PDF | text | 200K tokens | $3.00 |
| 13 | Claude Opus 4.1 | Anthropic | text, image, PDF | text | 200K tokens | $15.00 |
| 14 | Claude Sonnet 3.7 | Anthropic | text, image, PDF | text | 200K tokens | $3.00 |
| 15 | Claude 3.5 Sonnet | Anthropic | text, image, PDF | text | 200K tokens | $3.00 |
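
To use the listing for model selection, one option is to filter by the input modalities a task needs and then sort by input price. The sketch below is illustrative only: the `Model` class and `cheapest_supporting` function are hypothetical names, the rows are a small sample copied from the listing above, and models with no listed price are skipped rather than given a guessed value.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Model:
    name: str
    inputs: frozenset            # input modalities as listed above
    price_per_m: Optional[float]  # USD per 1M input tokens; None if not listed


# A sample of rows copied from the listing (not the full ranking).
MODELS = [
    Model("Gemini 3.1 Pro", frozenset({"text", "image", "audio", "video", "PDF"}), 2.50),
    Model("Gemini 2.0 Flash", frozenset({"text", "image", "audio", "video"}), 0.10),
    Model("Gemini 1.5 Pro", frozenset({"text", "image", "audio", "video"}), None),
    Model("GPT-5.2", frozenset({"text", "image", "audio"}), 2.00),
    Model("Claude Sonnet 4.6", frozenset({"text", "image", "PDF"}), 3.00),
]


def cheapest_supporting(models: Iterable[Model], required: Iterable[str]) -> List[Model]:
    """Return models that accept every required modality, cheapest first.

    Models without a listed price are excluded, since they cannot be compared.
    """
    needed = frozenset(required)
    priced = [
        m for m in models
        if needed <= m.inputs and m.price_per_m is not None
    ]
    return sorted(priced, key=lambda m: m.price_per_m)


if __name__ == "__main__":
    for m in cheapest_supporting(MODELS, {"audio", "video"}):
        print(f"{m.name}: ${m.price_per_m:.2f} per 1M input tokens")
```

This ignores the ranking's benchmark component, which the listing does not publish, so it complements the ranking order rather than reproducing it.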
