MiMo-V2-Omni
by Xiaomi
MiMo-V2-Omni is a frontier omni-modal model that natively processes image, video, and audio inputs within a unified architecture. It combines strong multimodal perception with agentic capability, including visual grounding, multi-step planning, tool use, and code execution, making it well-suited for complex real-world tasks that span modalities. It supports a 256K context window.
Specifications
Technical details and pricing.
Benchmarks
7 benchmark scores from Artificial Analysis.
Composite Indices
Intelligence, Coding, Math
Standard Benchmarks
Academic and industry benchmarks
Frequently Asked Questions
What is MiMo-V2-Omni good for?
Use MiMo-V2-Omni for everyday tasks like writing, summarizing, brainstorming, and getting clear explanations.
How much does MiMo-V2-Omni cost?
Pricing is based on usage. Current rates are $0.00/1M tokens for input and $0.00/1M tokens for output.
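As a sketch of how usage-based, per-million-token pricing works, the helper below computes a bill from token counts and the listed rates. The function name and the sample token counts are illustrative, not part of any official SDK.

```python
def usage_cost(input_tokens: int, output_tokens: int,
               input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Return USD cost given token counts and per-1M-token rates."""
    return (input_tokens / 1_000_000) * input_rate_per_m \
         + (output_tokens / 1_000_000) * output_rate_per_m

# At the listed rates ($0.00/1M input, $0.00/1M output), any usage costs $0.00:
print(usage_cost(50_000, 10_000, 0.00, 0.00))  # → 0.0
```

The same helper works for any per-million-token price schedule: plugging in nonzero rates scales linearly with token volume.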
Can I try MiMo-V2-Omni for free?
Yes. You can start a chat instantly and test the model before deciding on a plan.
Does MiMo-V2-Omni support images or audio?
Yes. MiMo-V2-Omni natively processes image, video, and audio inputs.
Similar Models
Other models you might want to explore.
Benchmarks and pricing are sourced from Artificial Analysis where available. OpenRouter specs are used as a fallback.