MiMo-V2-Omni

by Xiaomi

MiMo-V2-Omni is a frontier omni-modal model that natively processes image, video, and audio inputs within a unified architecture. It combines strong multimodal perception with agentic capabilities (visual grounding, multi-step planning, tool use, and code execution), making it well suited to complex real-world tasks that span modalities. It supports a 256K-token context window.

Input Price: $0.00/1M tokens
Output Price: $0.00/1M tokens
Intelligence: 43.4
Coding: 35.5

Specifications

Technical details and pricing.

Provider: Xiaomi
Context Window: 262,144 tokens
Release Date: Mar 19, 2026
Modalities: Text, Audio, Image, Video → Text
Capabilities: Vision, Audio Input
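
The 262,144-token context window equals 256 × 1,024, which is where the "256K" shorthand comes from. As a rough illustration, the Python sketch below checks whether a request fits that window. The helper name `fits_in_context` and the choice to count a reserved reply budget against the same window are assumptions made for this example, and real token counts depend on the model's tokenizer, which this page does not describe.

```python
# Context window from the specifications above: 262,144 tokens (256 * 1024).
CONTEXT_WINDOW = 262_144


def fits_in_context(prompt_tokens: int, reserved_output_tokens: int = 0) -> bool:
    """Return True if the prompt plus a reserved reply budget fits the window.

    Hypothetical helper for illustration only. It assumes the reply shares
    the window with the prompt, and that token counts come from the model's
    own tokenizer.
    """
    if prompt_tokens < 0 or reserved_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + reserved_output_tokens <= CONTEXT_WINDOW
```

For example, `fits_in_context(200_000, 50_000)` is true under these assumptions, because 250,000 tokens is below the 262,144-token limit.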

Benchmarks

7 benchmark scores from Artificial Analysis.

GPQA: 82.8%
HLE: 19.9%
SciCode: 36.7%
LCR: 66.7%
IFBench: 53.5%
Tau2: 91.2%
TerminalBench Hard: 34.8%

Frequently Asked Questions

What is MiMo-V2-Omni good for?

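MiMo-V2-Omni is suited to tasks that combine text with image, video, or audio input, and to agentic work such as visual grounding, multi-step planning, tool use, and code execution. It also handles everyday tasks like writing, summarizing, brainstorming, and giving clear explanations.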

How much does MiMo-V2-Omni cost?

Pricing is based on usage. Current rates are $0.00/1M tokens for input and $0.00/1M tokens for output.
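
Per-token pricing of this kind is usually applied by multiplying each token count by its rate and dividing by one million. The Python sketch below shows that arithmetic. The function name `estimate_cost_usd` and its parameters are hypothetical, and actual billing rules such as rounding, minimums, or per-modality rates are not described on this page.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_million: float,
    output_rate_per_million: float,
) -> float:
    """Estimate the cost of a request from per-1M-token rates in USD.

    Hypothetical helper for illustration; it ignores any rounding or
    minimum-charge rules a provider might apply.
    """
    total = (
        input_tokens * input_rate_per_million
        + output_tokens * output_rate_per_million
    )
    return total / 1_000_000


# Rates listed on this page: $0.00 per 1M tokens for both input and output.
cost = estimate_cost_usd(120_000, 4_000, 0.00, 0.00)
```

With the listed rates of $0.00 for both input and output, the estimate is zero for any token counts.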

Can I try MiMo-V2-Omni for free?

Yes. You can start a chat instantly and test the model before deciding on a plan.

Does MiMo-V2-Omni support images or audio?

Yes. MiMo-V2-Omni accepts image, audio, and video inputs alongside text, and produces text output.

Benchmarks and pricing are sourced from Artificial Analysis where available. OpenRouter specs are used as a fallback.