What are the best multimodal AI models in 2026?


Accepted Answer

Based on current benchmark data, Google's Gemini 3.1 Pro ranks #1 among multimodal AI models as of April 2026. The full ranked list is below.

Rank	Model	Provider	Input modalities	Output modalities	Context window	Input price per 1M tokens
#1	Gemini 3.1 Pro	Google	text, image, audio, video, PDF	text	2M tokens	$2.50
#2	Gemini 3 Pro	Google	text, image, audio, video, PDF	text	1M tokens	$2.50
#3	Gemini 2.0 Flash	Google	text, image, audio, video	text, image, audio	1M tokens	$0.10
#4	Gemini Ultra	Google	text, image, audio, video	text	—	—
#5	Gemini 2.5 Flash	Google	text, image, audio, video	text	1M tokens	$0.30
#6	Gemini 1.5 Pro	Google	text, image, audio, video	text	1M tokens	—
#7	GPT-5.2	OpenAI	text, image, audio	text, audio	400K tokens	$2.00
#8	GPT-5.1	OpenAI	text, image, audio	text, audio	400K tokens	$2.25
#9	GPT-5	OpenAI	text, image, audio	text, audio	400K tokens	$2.50
#10	Claude Sonnet 4.6	Anthropic	text, image, PDF	text	500K tokens	$3.00
#11	Claude Opus 4.5	Anthropic	text, image, PDF	text	500K tokens	$15.00
#12	Claude Sonnet 4	Anthropic	text, image, PDF	text	200K tokens	$3.00
#13	Claude Opus 4.1	Anthropic	text, image, PDF	text	200K tokens	$15.00
#14	Claude Sonnet 3.7	Anthropic	text, image, PDF	text	200K tokens	$3.00
#15	Claude 3.5 Sonnet	Anthropic	text, image, PDF	text	200K tokens	$3.00
