# AI Model Ranking (LLM Leaderboard)

## Most Expensive AI Models

Premium language models, sorted by price.
Column legend:

- Model: AI model name and provider organization
- Input/1M: cost per 1 million input tokens (the text you send to the model)
- Output/1M: cost per 1 million output tokens (the text the model generates for you)
- MMLU-Pro: Massive Multitask Language Understanding (Professional); tests broad knowledge across 14 subjects including STEM, humanities, and social sciences
- GPQA: Graduate-level Google-Proof Q&A benchmark; tests PhD-level reasoning and advanced intelligence
- AIME 2025: American Invitational Mathematics Examination 2025; tests advanced mathematical problem-solving ability
- Release: when the model was released; newer models may have more capabilities

| Model | Input/1M | Output/1M | MMLU-Pro | GPQA | AIME 2025 | Release | Compare |
|---|---|---|---|---|---|---|---|
| #1 o1-pro by OpenAI | $150.00 | $600.00 | - | - | - | Mar 19, 2025 | |
| #2 GPT-4 by OpenAI | $30.00 | $60.00 | - | - | - | Mar 14, 2023 | |
| #3 o3-pro by OpenAI | $20.00 | $80.00 | - | 84.5% | - | Jun 10, 2025 | |
| #4 Claude 4.1 Opus (Reasoning) by Anthropic | $15.00 | $75.00 | 88.0% | 80.9% | 80.3% | Aug 5, 2025 | |
| #5 Claude 4.1 Opus (Non-reasoning) by Anthropic | $15.00 | $75.00 | - | - | - | Aug 5, 2025 | |
| #6 Claude 3 Opus by Anthropic | $15.00 | $75.00 | 69.6% | 48.9% | - | Mar 4, 2024 | |
| #7 Claude 4 Opus (Non-reasoning) by Anthropic | $15.00 | $75.00 | 86.0% | 70.1% | 36.3% | May 22, 2025 | |
| #8 Claude 4 Opus (Reasoning) by Anthropic | $15.00 | $75.00 | 87.3% | 79.6% | 73.3% | May 22, 2025 | |
| #9 o1-preview by OpenAI | $16.50 | $66.00 | - | - | - | Sep 12, 2024 | |
| #10 o1 by OpenAI | $15.00 | $60.00 | 84.1% | 74.7% | - | Dec 5, 2024 | |
| #11 GPT-4 Turbo by OpenAI | $10.00 | $30.00 | 69.4% | - | - | Nov 6, 2023 | |
| #12 Claude Opus 4.5 (Reasoning) by Anthropic | $5.00 | $25.00 | 89.5% | 86.6% | 91.3% | Nov 24, 2025 | |
| #13 Claude Opus 4.5 (Non-reasoning) by Anthropic | $5.00 | $25.00 | 88.9% | 81.0% | 62.7% | Nov 24, 2025 | |
| #14 GPT-4o (May '24) by OpenAI | $5.00 | $15.00 | 74.0% | 52.6% | - | May 13, 2024 | |
| #15 GPT-4o (March 2025, chatgpt-4o-latest) by OpenAI | $5.00 | $15.00 | 80.3% | 65.5% | 25.7% | Mar 27, 2025 | |
| #16 GPT-4o (ChatGPT) by OpenAI | $5.00 | $15.00 | 77.3% | 51.1% | - | Feb 15, 2025 | |
| #17 Claude 4.5 Sonnet (Reasoning) by Anthropic | $3.00 | $15.00 | 87.5% | 83.4% | 88.0% | Sep 29, 2025 | |
| #18 Claude 4.5 Sonnet (Non-reasoning) by Anthropic | $3.00 | $15.00 | 86.0% | 72.7% | 37.0% | Sep 29, 2025 | |
| #19 Grok 4 by xAI | $3.00 | $15.00 | 86.6% | 87.7% | 92.7% | Jul 10, 2025 | |
| #20 Claude 3.5 Sonnet (Oct '24) by Anthropic | $3.00 | $15.00 | 77.2% | 59.9% | - | Oct 22, 2024 | |
| #21 Claude 3.5 Sonnet (June '24) by Anthropic | $3.00 | $15.00 | 75.1% | 56.0% | - | Jun 21, 2024 | |
| #22 Claude 3 Sonnet by Anthropic | $3.00 | $15.00 | 57.9% | 40.0% | - | Mar 4, 2024 | |
| #23 Claude 4 Sonnet (Non-reasoning) by Anthropic | $3.00 | $15.00 | 83.7% | 68.3% | 38.0% | May 22, 2025 | |
| #24 Claude 4 Sonnet (Reasoning) by Anthropic | $3.00 | $15.00 | 84.2% | 77.7% | 74.3% | May 22, 2025 | |
| #25 Claude 3.7 Sonnet (Non-reasoning) by Anthropic | $3.00 | $15.00 | 80.3% | 65.6% | 21.0% | Feb 24, 2025 | |
| #26 Claude 3.7 Sonnet (Reasoning) by Anthropic | $3.00 | $15.00 | 83.7% | 77.2% | 56.3% | Feb 24, 2025 | |
| #27 Mistral Large (Feb '24) by Mistral | $4.00 | $12.00 | 51.5% | 35.1% | - | Feb 26, 2024 | |
| #28 Sonar Pro by Perplexity | $3.00 | $15.00 | 75.5% | 57.8% | - | Jan 21, 2025 | |
| #29 Grok 3 by xAI | $3.00 | $15.00 | 79.9% | 69.3% | 58.0% | Feb 19, 2025 | |
| #30 Command-R+ (Apr '24) by Cohere | $3.00 | $15.00 | 43.2% | 32.3% | - | Apr 4, 2024 | |
| #31 Nova Premier by Amazon | $2.50 | $12.50 | 73.3% | 56.9% | 17.3% | Apr 30, 2025 | |
| #32 Gemini 3 Pro Preview (high) by Google | $2.00 | $12.00 | 89.8% | 90.8% | 95.7% | Nov 18, 2025 | |
| #33 Command A by Cohere | $2.50 | $10.00 | 71.2% | 52.7% | 13.0% | Mar 13, 2025 | |
| #34 GPT-4o (Aug '24) by OpenAI | $2.50 | $10.00 | - | 52.1% | - | Aug 6, 2024 | |
| #35 GPT-4o (Nov '24) by OpenAI | $2.50 | $10.00 | 74.8% | 54.3% | 6.0% | Nov 20, 2024 | |
| #36 Command-R+ (Aug '24) by Cohere | $2.50 | $10.00 | 42.7% | 33.7% | - | Aug 30, 2024 | |
| #37 Llama 3.1 Instruct 405B by Meta | $3.75 | $6.75 | 73.2% | 51.5% | 3.0% | Jul 23, 2024 | |
| #38 Mistral Medium by Mistral | $2.75 | $8.10 | 49.1% | 34.9% | - | Dec 11, 2023 | |
| #39 Grok 2 (Dec '24) by xAI | $2.00 | $10.00 | 70.9% | 51.0% | - | Dec 12, 2024 | |
| #40 o3 by OpenAI | $2.00 | $8.00 | 85.3% | 82.7% | 88.3% | Apr 16, 2025 | |
| #41 Jamba 1.7 Large by AI21 Labs | $2.00 | $8.00 | 57.7% | 39.0% | 2.3% | Jul 7, 2025 | |
| #42 GPT-4.1 by OpenAI | $2.00 | $8.00 | 80.6% | 66.6% | 34.7% | Apr 14, 2025 | |
| #43 Jamba 1.6 Large by AI21 Labs | $2.00 | $8.00 | 56.5% | 38.7% | - | Mar 6, 2025 | |
| #44 Jamba 1.5 Large by AI21 Labs | $2.00 | $8.00 | 57.2% | 42.7% | - | Aug 22, 2024 | |
| #45 GPT-5 (minimal) by OpenAI | $1.25 | $10.00 | 80.6% | 67.3% | 31.7% | Aug 7, 2025 | |
| #46 GPT-5.1 (Non-reasoning) by OpenAI | $1.25 | $10.00 | 80.1% | 64.3% | 38.0% | Nov 13, 2025 | |
| #47 GPT-5 (low) by OpenAI | $1.25 | $10.00 | 86.0% | 80.8% | 83.0% | Aug 7, 2025 | |
| #48 GPT-5 (ChatGPT) by OpenAI | $1.25 | $10.00 | 82.0% | 68.6% | 48.3% | Aug 7, 2025 | |
| #49 GPT-5 (medium) by OpenAI | $1.25 | $10.00 | 86.7% | 84.2% | 91.7% | Aug 7, 2025 | |
| #50 GPT-5.1 (high) by OpenAI | $1.25 | $10.00 | 87.0% | 87.3% | 94.0% | Nov 13, 2025 | |
| #51 GPT-5 (high) by OpenAI | $1.25 | $10.00 | 87.1% | 85.4% | 94.3% | Aug 7, 2025 | |
| #52 GPT-5 Codex (high) by OpenAI | $1.25 | $10.00 | 86.5% | 83.7% | 98.7% | Sep 23, 2025 | |
| #53 Gemini 2.5 Pro by Google | $1.25 | $10.00 | 86.2% | 84.4% | 87.7% | Jun 5, 2025 | |
| #54 Gemini 2.5 Pro Preview (Mar' 25) by Google | $1.25 | $10.00 | 85.8% | 83.6% | - | Mar 25, 2025 | |
| #55 Gemini 2.5 Pro Preview (May' 25) by Google | $1.25 | $10.00 | 83.7% | 82.2% | - | May 6, 2025 | |
| #56 Qwen3 Coder 480B A35B Instruct by Alibaba | $1.50 | $7.50 | 78.8% | 61.8% | 39.3% | Jul 22, 2025 | |
| #57 Mistral Large 2 (Nov '24) by Mistral | $2.00 | $6.00 | 69.7% | 48.6% | 14.0% | Nov 18, 2024 | |
| #58 Mistral Large 2 (Jul '24) by Mistral | $2.00 | $6.00 | 68.3% | 47.2% | - | Jul 24, 2024 | |
| #59 Pixtral Large by Mistral | $2.00 | $6.00 | 70.1% | 50.5% | 2.3% | Nov 18, 2024 | |
| #60 Qwen2.5 Max by Alibaba | $1.60 | $6.40 | 76.2% | 58.7% | - | Jan 28, 2025 | |
| #61 Magistral Medium 1.2 by Mistral | $2.00 | $5.00 | 81.5% | 73.9% | 82.0% | Sep 18, 2025 | |
| #62 Magistral Medium 1 by Mistral | $2.00 | $5.00 | 75.3% | 67.9% | 40.3% | Jun 10, 2025 | |
| #63 Qwen3 VL 32B (Reasoning) by Alibaba | $0.70 | $8.40 | 81.8% | 73.3% | 84.7% | Oct 21, 2025 | |
| #64 Qwen3 235B A22B 2507 (Reasoning) by Alibaba | $0.70 | $8.40 | 84.3% | 79.0% | 91.0% | Jul 25, 2025 | |
| #65 Qwen3 VL 235B A22B (Reasoning) by Alibaba | $0.70 | $8.40 | 83.6% | 77.2% | 88.3% | Sep 23, 2025 | |
| #66 Qwen3 32B (Reasoning) by Alibaba | $0.70 | $8.40 | 79.8% | 66.8% | 73.0% | Apr 28, 2025 | |
| #67 Qwen3 235B A22B (Reasoning) by Alibaba | $0.70 | $8.40 | 82.8% | 70.0% | 82.0% | Apr 28, 2025 | |
| #68 Qwen3 Max by Alibaba | $1.20 | $6.00 | 84.1% | 76.4% | 80.7% | Sep 23, 2025 | |
| #69 Qwen3 Max Thinking by Alibaba | $1.20 | $6.00 | 82.4% | 77.6% | 82.3% | Nov 3, 2025 | |
| #70 Qwen3 Max (Preview) by Alibaba | $1.20 | $6.00 | 83.8% | 76.4% | 75.0% | Sep 5, 2025 | |
| #71 DeepSeek R1 0528 (May '25) by DeepSeek | $1.35 | $4.00 | 84.9% | 81.3% | 76.0% | May 28, 2025 | |
| #72 DeepSeek R1 (Jan '25) by DeepSeek | $1.35 | $4.00 | 84.4% | 70.8% | 68.0% | Jan 20, 2025 | |
| #73 Claude 4.5 Haiku (Reasoning) by Anthropic | $1.00 | $5.00 | 76.0% | 67.2% | 83.7% | Oct 15, 2025 | |
| #74 Claude 4.5 Haiku (Non-reasoning) by Anthropic | $1.00 | $5.00 | 80.0% | 64.6% | 39.0% | Oct 15, 2025 | |
| #75 Sonar Reasoning by Perplexity | $1.00 | $5.00 | - | 62.3% | - | Jan 28, 2025 | |
| #76 Reka Core by Reka AI | $2.00 | $2.00 | - | - | - | Apr 15, 2024 | |
| #77 o3-mini by OpenAI | $1.10 | $4.40 | 79.1% | 74.8% | - | Jan 31, 2025 | |
| #78 o4-mini (high) by OpenAI | $1.10 | $4.40 | 83.2% | 78.4% | 90.7% | Apr 16, 2025 | |
| #79 o3-mini (high) by OpenAI | $1.10 | $4.40 | 80.2% | 77.3% | - | Jan 31, 2025 | |
| #80 Qwen3 Next 80B A3B (Reasoning) by Alibaba | $0.50 | $6.00 | 82.4% | 75.9% | 84.3% | Sep 11, 2025 | |
| #81 Claude 3.5 Haiku by Anthropic | $0.80 | $4.00 | 63.4% | 40.8% | - | Oct 22, 2024 | |
| #82 Hermes 4 - Llama-3.1 405B (Reasoning) by Nous Research | $1.00 | $3.00 | 82.9% | 72.7% | 69.7% | Aug 27, 2025 | |
| #83 Hermes 4 - Llama-3.1 405B (Non-reasoning) by Nous Research | $1.00 | $3.00 | 72.9% | 53.6% | 15.3% | Aug 27, 2025 | |
| #84 Mistral Small (Feb '24) by Mistral | $1.00 | $3.00 | 41.9% | 30.2% | - | Feb 26, 2024 | |
| #85 Nova Pro by Amazon | $0.80 | $3.20 | 69.1% | 49.9% | 7.0% | Dec 3, 2024 | |
| #86 Qwen3 14B (Reasoning) by Alibaba | $0.35 | $4.20 | 77.4% | 60.4% | 55.7% | Apr 28, 2025 | |
| #87 Cogito v2.1 (Reasoning) by Deep Cogito | $1.25 | $1.25 | 84.9% | 76.8% | 72.7% | Nov 18, 2025 | |
| #88 DeepSeek V3 0324 by DeepSeek | $1.14 | $1.25 | 81.9% | 65.5% | 41.0% | Mar 25, 2025 | |
| #89 Qwen3 235B A22B 2507 Instruct by Alibaba | $0.70 | $2.80 | 82.8% | 75.3% | 71.7% | Jul 21, 2025 | |
| #90 Qwen3 VL 32B Instruct by Alibaba | $0.70 | $2.80 | 79.1% | 67.1% | 68.3% | Oct 21, 2025 | |
| #91 Qwen3 VL 235B A22B Instruct by Alibaba | $0.70 | $2.80 | 82.3% | 71.2% | 70.7% | Sep 23, 2025 | |
| #92 Qwen3 235B A22B (Non-reasoning) by Alibaba | $0.70 | $2.80 | 76.2% | 61.3% | 23.7% | Apr 28, 2025 | |
| #93 Qwen3 32B (Non-reasoning) by Alibaba | $0.70 | $2.80 | 72.7% | 53.5% | 19.7% | Apr 28, 2025 | |
| #94 Kimi K2 0905 by Moonshot AI | $0.99 | $2.50 | 81.9% | 76.7% | 57.3% | Sep 5, 2025 | |
| #95 Kimi K2 Thinking by Moonshot AI | $0.60 | $2.50 | 84.8% | 83.8% | 94.7% | Nov 6, 2025 | |
| #96 Kimi K2 by Moonshot AI | $0.60 | $2.50 | 82.4% | 76.6% | 57.0% | Jul 11, 2025 | |
| #97 GLM-4.6 (Reasoning) by Z AI | $0.60 | $2.20 | 82.9% | 78.0% | 86.0% | Sep 30, 2025 | |
| #98 GLM-4.6 (Non-reasoning) by Z AI | $0.60 | $2.20 | 78.4% | 63.2% | 44.3% | Sep 30, 2025 | |
| #99 Sonar by Perplexity | $1.00 | $1.00 | 68.9% | 47.1% | - | Jan 21, 2025 | |
| #100 Ring-1T by InclusionAI | $0.57 | $2.28 | 80.6% | 59.5% | 89.3% | Oct 13, 2025 | |
## Understanding the AI Model Leaderboard
This comprehensive AI model leaderboard helps you compare and choose the best large language models (LLMs) for your needs. We track standardized AI benchmarks, token pricing, inference speed, and model capabilities across all major AI providers, including OpenAI, Anthropic, Google, Meta, xAI, and DeepSeek.
### Core AI Benchmarks Explained
- MMLU-Pro: tests broad knowledge across 14 academic subjects including STEM, humanities, and social sciences; the foundational general-intelligence benchmark
- GPQA: Graduate-level Google-Proof Q&A benchmark; measures PhD-level reasoning and advanced problem-solving capabilities
- AIME 2025: American Invitational Mathematics Examination; evaluates competition-level mathematical reasoning and problem solving
- Coding Index: composite of coding benchmarks such as LiveCodeBench and SciCode; measures programming ability
- Math Index: composite of mathematical reasoning tests such as AIME and MATH-500
### Key Metrics to Consider

- Token Pricing: compare input vs. output token costs per million; crucial for estimating API expenses and optimizing usage patterns (see the cost sketch after this list)
- Inference Speed: measured in tokens/second; determines response time for chatbots, streaming, and real-time applications
- Release Date: newer models often incorporate the latest training techniques and more recent knowledge cutoffs
- Benchmark Scores: percentage scores (0-100%) make it easy to compare model capabilities at a glance
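To make the token-pricing arithmetic concrete, here is a minimal Python sketch; the `request_cost` helper and the example token counts are illustrative, not part of any provider's SDK:

```python
# Estimate the dollar cost of one API request from per-million-token prices.
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Example: a 3,000-token prompt with a 1,000-token reply on GPT-5
# ($1.25 input / $10.00 output per million tokens, from the table above).
cost = request_cost(3_000, 1_000, 1.25, 10.00)
print(f"${cost:.4f}")  # $0.0138
```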
## How to Choose the Right AI Model for Your Use Case

### For Research & Analysis

Prioritize models with high MMLU-Pro (70%+) and GPQA (60%+) scores for complex reasoning tasks, academic research, and technical documentation (see the filtering sketch after these profiles).

### For Cost Optimization

Sort by input/output pricing; smaller models can deliver roughly 80% of flagship performance at 10% of the cost on simple tasks.

### For Math & STEM

Filter by Math Index or AIME 2025 scores (50%+) for quantitative analysis, engineering calculations, and scientific applications.
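The threshold filters above are easy to apply programmatically. A hedged sketch, assuming the leaderboard rows are available as plain dictionaries (the field names are our own, not a real API schema; the values are copied from the table above):

```python
# Keep only models meeting the research-grade thresholds named above:
# MMLU-Pro >= 70% and GPQA >= 60%.
models = [
    {"name": "Gemini 3 Pro Preview (high)", "mmlu_pro": 89.8, "gpqa": 90.8},
    {"name": "Claude Opus 4.5 (Reasoning)", "mmlu_pro": 89.5, "gpqa": 86.6},
    {"name": "Command A",                   "mmlu_pro": 71.2, "gpqa": 52.7},
    {"name": "Nova Pro",                    "mmlu_pro": 69.1, "gpqa": 49.9},
]

research_grade = [m for m in models
                  if m["mmlu_pro"] >= 70.0 and m["gpqa"] >= 60.0]
for m in research_grade:
    print(m["name"])  # prints only the first two models
```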
All benchmark scores and pricing data are updated daily from Artificial Analysis to reflect the latest model versions and capabilities. Use the sort filters above to find AI models by intelligence, cost, coding ability, math performance, speed, or release date.
## Frequently Asked Questions
### What is MMLU-Pro and why is it the standard AI intelligence benchmark?

MMLU-Pro (Massive Multitask Language Understanding - Professional) is the most comprehensive AI benchmark, testing models across 14 academic subjects including mathematics, science, history, law, and ethics. Scores on this leaderboard range from the low 40s (basic competency) to nearly 90% (near-expert level). Models scoring above 75% demonstrate strong general intelligence suitable for professional applications, while scores below 60% indicate limitations in complex reasoning tasks.
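Those bands can be written down directly; a small sketch, with tier labels that are our own shorthand rather than an official scale:

```python
def mmlu_pro_tier(score: float) -> str:
    """Map an MMLU-Pro score (0-100%) to the rough bands described above."""
    if score >= 75.0:
        return "strong general intelligence"  # suitable for professional use
    if score >= 60.0:
        return "moderate"                     # capable, with reasoning gaps
    return "limited complex reasoning"

print(mmlu_pro_tier(89.8))  # strong general intelligence (Gemini 3 Pro Preview)
print(mmlu_pro_tier(57.9))  # limited complex reasoning (Claude 3 Sonnet)
```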
### What does GPQA measure and which models score highest?

GPQA (Graduate-level Google-Proof Q&A) tests PhD-level reasoning with questions designed to be "Google-proof", requiring deep understanding rather than simple fact retrieval. Top scorers in the table above, such as Gemini 3 Pro Preview (90.8%), Grok 4 (87.7%), and GPT-5.1 (87.3%), excel at GPQA, making them ideal for research, technical analysis, and complex problem-solving. Models below 50% GPQA struggle with advanced reasoning and may give superficial answers to complex questions.
### What is AIME 2025 and how does it evaluate AI mathematical ability?
AIME 2025 (American Invitational Mathematics Examination) is an elite math competition benchmark that tests advanced problem-solving, algebra, geometry, and number theory. Scores above 80% (like GPT-5 Codex at 98.7% or GPT-5.1 at 94%) indicate exceptional mathematical reasoning suitable for engineering, scientific computing, and quantitative analysis. Models scoring below 50% may struggle with multi-step mathematical problems or require explicit problem breakdown.
### How is AI model pricing calculated and what's considered cost-effective?

AI model pricing is measured per 1 million tokens (approximately 750,000 words). Input pricing covers the text you send, while output pricing covers generated responses. Budget models like GPT-5 nano cost $0.05/$0.40 per million tokens, mid-tier models like Llama 3.3 70B cost $0.54/$0.71, while premium models like GPT-5 cost $1.25/$10.00. For a typical application with a 3:1 input-to-output ratio, budget models can be an order of magnitude or more cheaper than flagship models while maintaining 70-80% of the performance.
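A 3:1 input-to-output ratio means that of every million tokens processed, roughly 750k are input and 250k are output, which yields a blended price per million. A minimal sketch (`blended_price` is an illustrative helper; the GPT-5 nano prices come from this FAQ rather than the table):

```python
# Blended price per 1M tokens processed at a given input share (default 3:1).
def blended_price(input_per_m: float, output_per_m: float,
                  input_share: float = 0.75) -> float:
    return input_share * input_per_m + (1 - input_share) * output_per_m

print(blended_price(1.25, 10.00))  # GPT-5:      $3.4375 per 1M tokens
print(blended_price(0.05, 0.40))   # GPT-5 nano: $0.1375 per 1M tokens (~25x cheaper)
```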
### Which AI models are best for coding and programming tasks?

Sort by Coding Index to see the top programming models. Our Coding Index combines coding benchmarks such as LiveCodeBench and SciCode. Top performers include GPT-5.1 (57.5), GPT-5 Codex (53.5), and GPT-5 mini (51.4). These models excel at code generation, debugging, refactoring, and explaining complex algorithms. For budget-conscious developers, models with a Coding Index of 40+ offer excellent value for routine programming tasks.
### How often are AI model benchmarks and rankings updated?

Our leaderboard syncs daily with the Artificial Analysis API to ensure benchmark scores (MMLU-Pro, GPQA, AIME 2025), pricing, and inference-speed data reflect the latest model versions. New model releases appear immediately under the "Newest" sort option. Benchmark scores can change when providers release updated versions; for example, GPT-5.1, released in November 2025, achieved an intelligence index of 69.7 compared to GPT-5's 68.5 from August 2025.
### What inference speed (tokens/second) do I need for my application?
Inference speed determines how fast models generate responses. For real-time chatbots and interactive applications, target 100+ tokens/second (models like gpt-oss-120B at 340 tok/s). For background processing and batch jobs, 50-100 tok/s is sufficient. Premium reasoning models like GPT-5 (103 tok/s) balance speed and capability. Note that higher inference speed doesn't always mean better quality - slower models often deliver more thoughtful, detailed responses.
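A rough way to size this requirement: divide the expected response length by the model's generation speed. The sketch below ignores network latency and time-to-first-token, both of which add real-world delay:

```python
# Approximate wall-clock time to generate a response of a given length.
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    return output_tokens / tokens_per_second

# A 500-token answer at the speeds quoted above:
print(f"{generation_seconds(500, 340):.1f} s")  # gpt-oss-120B: 1.5 s
print(f"{generation_seconds(500, 103):.1f} s")  # GPT-5:        4.9 s
```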
### Can I test these AI models for free before committing?

Yes! Try our free AI chat interface to test different models instantly without creating an account. Many providers also offer free tiers: OpenAI (ChatGPT with daily limits), Anthropic (Claude with usage caps), and Google (Gemini free tier), while open-weight models like Llama 3.3 are free to self-host. Compare performance on your specific use case before upgrading to paid plans.