Qwen3 VL 32B Instruct

Name: Qwen3 VL 32B Instruct
Brand: Qwen
Price: 0.70 USD per 1M input tokens

32B

by Qwen

Qwen3-VL-32B-Instruct is a large-scale multimodal vision-language model designed for high-precision understanding and reasoning across text, images, and video. With 32 billion parameters, it combines deep visual perception with advanced text...


Input Price: $0.70/1M tokens
Output Price: $8.40/1M tokens
Intelligence: 24.7
Coding: 14.5

Specifications

Technical details and pricing.

Provider: Qwen
Context Window: 131,072 tokens
Release Date: Oct 21, 2025
Modalities: Text, Image → Text
Capabilities: Vision
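
As a rough illustration of the context window figure above, the following sketch checks whether a request fits in 131,072 tokens. The function name `fits_context` is hypothetical, the token counts are assumed to be known in advance, and counting the requested output against the same window is an assumption that may not match every provider.

```python
# Context window size taken from the specifications above.
CONTEXT_WINDOW_TOKENS = 131_072


def fits_context(prompt_tokens: int, max_output_tokens: int = 0) -> bool:
    """Return True if prompt plus requested output fits in the window.

    Assumption: output tokens share the same window as the prompt.
    """
    if prompt_tokens < 0 or max_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOW_TOKENS
```

Under that assumption, a 100,000-token prompt leaves room for at most 31,072 output tokens.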

Benchmarks

10 benchmark scores from Artificial Analysis.

GPQA: 73.3%
MMLU Pro: 81.8%
HLE: 9.6%
LiveCodeBench: 73.8%
AIME 2025: 84.7%
SciCode: 28.5%
LCR: 55.3%
IFBench: 59.4%
Tau2: 45.6%
TerminalBench Hard: 7.6%

Composite Indices

Intelligence, Coding, Math

Standard Benchmarks

Academic and industry benchmarks

Frequently Asked Questions

What is Qwen3 VL 32B Instruct good for?

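Use Qwen3 VL 32B Instruct for tasks that combine text and images, such as understanding and reasoning about visual input, as well as everyday tasks like writing, summarizing, brainstorming, and getting clear explanations.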

How much does Qwen3 VL 32B Instruct cost?

Pricing is based on usage. Current rates are $0.70/1M tokens for input and $8.40/1M tokens for output.
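
To make the usage-based pricing concrete, here is a minimal sketch that estimates the cost of a single request from the two listed rates. The function name `estimate_cost_usd` is hypothetical, and the sketch assumes billing is strictly linear in token counts with no minimum charge, rounding, or discounts.

```python
# Rates from the listing above, in USD per 1M tokens.
INPUT_USD_PER_MILLION = 0.70
OUTPUT_USD_PER_MILLION = 8.40


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at the listed rates.

    Assumption: cost scales linearly with tokens, with no extra fees.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000
```

For example, a request with 10,000 input tokens and 2,000 output tokens comes to about $0.0238, of which roughly $0.0168 is output. Because the output rate is twelve times the input rate, long responses dominate the bill.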

Can I try Qwen3 VL 32B Instruct for free?

Yes. You can start a chat instantly and test the model before deciding on a plan.

Does Qwen3 VL 32B Instruct support images or audio?

Qwen3 VL 32B Instruct accepts text and image input and produces text output. Audio is not among its listed modalities.

Similar Models

Other models you might want to explore.

Qwen3 VL 8B Instruct

Qwen

Qwen3-VL-8B-Instruct is a multimodal vision-language model from the Qwen3-VL series, built for high-fidelity understanding and reasoning across text, images, and video.


Qwen3 VL 8B Thinking

Qwen

Qwen3-VL-8B-Thinking is the reasoning-optimized variant of the Qwen3-VL-8B multimodal model, designed for advanced visual and textual reasoning across complex scenes, documents, and temporal sequences.


Qwen3.5 397B A17B

Qwen

The Qwen3.5 series 397B-A17B native vision-language model is built on a hybrid architecture that integrates a linear attention mechanism with a sparse mixture-of-experts model, achieving higher inference efficiency.


Benchmarks and pricing are sourced from Artificial Analysis where available. OpenRouter specs are used as a fallback.