Models.do

Llama 3.2 90B Vision Instruct is a 90-billion-parameter multimodal model from Meta, designed for demanding visual reasoning and language tasks. It targets image captioning, visual question answering, and combined image-text comprehension. The model was pre-trained on large multimodal datasets and fine-tuned with human feedback.

The model is aimed at industries that need advanced multimodal AI, particularly those performing complex, real-time visual and textual analysis.

See the original model card for details.

Usage of this model is subject to Meta's Acceptable Use Policy.

Provider	Model ID	Context	Max Output	Input Cost	Output Cost	Throughput	Latency
Together	together	131K	2K	$1.20/M	$1.20/M	29.5 t/s	895 ms
DeepInfra	deepInfra	33K	16K	$0.35/M	$0.40/M	18.9 t/s	2570 ms
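
The table's per-million-token prices, throughput, and latency can be turned into a rough per-request estimate. The sketch below is illustrative only: the `Provider` class and both helper functions are hypothetical names, the figures are copied from the table above, and it assumes that "Latency" means the delay before the first output token and that "Throughput" is the steady output rate afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    input_usd_per_m: float   # USD per million input tokens
    output_usd_per_m: float  # USD per million output tokens
    throughput_tps: float    # output tokens per second
    latency_ms: float        # assumed: delay before the first output token


# Figures copied from the provider table.
PROVIDERS = [
    Provider("Together", 1.20, 1.20, 29.5, 895),
    Provider("DeepInfra", 0.35, 0.40, 18.9, 2570),
]


def estimate_cost_usd(p: Provider, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: tokens times the per-million price, summed."""
    total = input_tokens * p.input_usd_per_m + output_tokens * p.output_usd_per_m
    return total / 1_000_000


def estimate_seconds(p: Provider, output_tokens: int) -> float:
    """Rough wall-clock time: initial latency plus generation time."""
    return p.latency_ms / 1000 + output_tokens / p.throughput_tps


if __name__ == "__main__":
    # Example request: 10,000 input tokens and 500 output tokens.
    for p in PROVIDERS:
        cost = estimate_cost_usd(p, 10_000, 500)
        secs = estimate_seconds(p, 500)
        print(f"{p.name}: ${cost:.4f}, about {secs:.1f} s")
```

Under these assumptions, the cheaper provider in the table is also the slower one for the same request, so the choice depends on whether cost or response time matters more.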

Meta: Llama 3.2 90B Vision Instruct

131,072 Token Context

Advanced Coding

Agentic Workflows

Vision Capabilities
