Back

Qwen: Qwen2.5-VL 7B Instruct

Qwen
Input: text
Input: image
Output: text
Released: Aug 28, 2024Updated: Mar 28, 2025

Qwen2.5 VL 7B is a multimodal LLM from the Qwen Team with the following key enhancements:

  • SoTA understanding of images of various resolution & ratio: Qwen2.5-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.

  • Understanding videos of 20min+: Qwen2.5-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.

  • Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2.5-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.

  • Multilingual Support: to serve global users, besides English and Chinese, Qwen2.5-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

For more details, see this blog post and GitHub repo.

Usage of this model is subject to Tongyi Qianwen LICENSE AGREEMENT.

32,768 Token Context

Process and analyze large documents and conversations.

Advanced Coding

Improved capabilities in front-end development and full-stack updates.

Agentic Workflows

Autonomously navigate multi-step processes with improved reliability.

Vision Capabilities

Process and understand images alongside text inputs.

Available On

ProviderModel IDContextMax OutputInput CostOutput CostThroughputLatency
Hyperbolichyperbolic33K-$0.20/M$0.20/M61.8 t/s715 ms
inference.netinferenceNet128K8K$0.20/M$0.20/M51.5 t/s2217 ms
kluster.aiklusterAi33K33K$0.30/M$0.30/M22.7 t/s3406 ms
Standard Pricing
Input Tokens
$0.0000002

per 1K tokens

Output Tokens
$0.0000002

per 1K tokens

Image Processing
$0.0001445

per image

Do Work. With AI.