Llama 3.3 Nemotron Super 49B v1 (Reasoning) vs Grok 4

Comparing 2 AI models · 6 benchmarks · NVIDIA, xAI

Most Affordable

Llama 3.3 Nemotron Super 49B v1 (Reasoning)

$0.00/1M

Highest Intelligence

Grok 4

87.7% GPQA

Best for Coding

Grok 4

55.1 Coding Index

Price Difference

$0.00 to $3.00/1M

input cost range

Composite Indices

Intelligence, Coding, Math

Standard Benchmarks

Academic and industry benchmarks

Benchmark Winners

6 tests

Llama 3.3 Nemotron Super 49B v1 (Reasoning)

No clear wins

Grok 4

GPQA
MMLU Pro
HLE
LiveCodeBench
MATH 500
AIME 2025
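
The tally above can be reproduced from the six headline scores reported in this comparison. The following is a minimal Python sketch; the dictionary layout and variable names are illustrative assumptions, not part of any published data feed.

```python
# Headline benchmark scores in percent, copied from this comparison.
# Tuple order: (Llama 3.3 Nemotron Super 49B v1 (Reasoning), Grok 4)
NEMOTRON = "Llama 3.3 Nemotron Super 49B v1 (Reasoning)"
GROK = "Grok 4"

scores = {
    "GPQA": (64.3, 87.7),
    "MMLU Pro": (78.5, 86.6),
    "HLE": (6.5, 23.9),
    "LiveCodeBench": (27.7, 81.9),
    "MATH 500": (95.9, 99.0),
    "AIME 2025": (54.7, 92.7),
}

# Record which model has the higher score on each benchmark.
wins = {NEMOTRON: [], GROK: []}
for bench, (nemotron_score, grok_score) in scores.items():
    if nemotron_score > grok_score:
        wins[NEMOTRON].append(bench)
    elif grok_score > nemotron_score:
        wins[GROK].append(bench)
    # Ties are credited to neither model.

for model, benches in wins.items():
    print(f"{model}: {', '.join(benches) if benches else 'No clear wins'}")
```

With the scores above, every benchmark is credited to Grok 4, which matches the "No clear wins" entry for the NVIDIA model.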

Metric	Llama 3.3 Nemotron Super 49B v1 (Reasoning) (NVIDIA)	Grok 4 (xAI)
Pricing Per 1M tokens
Input Cost	$0.00/1M	$3.00/1M
Output Cost	$0.00/1M	$15.00/1M
Blended Cost 3:1 input/output ratio	—	$6.00/1M
Specifications
Organization Model creator	NVIDIA	xAI
Release Date Launch date	Mar 18, 2025	Jul 10, 2025
Performance & Speed
Throughput Output speed	—	37.2 tok/s
Time to First Token (TTFT) Initial response delay	—	9172ms
Latency Time to first answer token	—	9172ms
Composite Indices
Intelligence Index Overall reasoning capability	35.5	65.3
Coding Index Programming ability	18.7	55.1
Math Index Mathematical reasoning	54.7	92.7
Standard Benchmarks
GPQA Graduate-level reasoning	64.3%	87.7%
MMLU Pro Advanced knowledge	78.5%	86.6%
HLE Humanity's Last Exam	6.5%	23.9%
LiveCodeBench Real-world coding tasks	27.7%	81.9%
MATH 500 Mathematical problems	95.9%	99.0%
AIME 2025 Advanced math competition	54.7%	92.7%
AIME (Original) Math olympiad problems	58.3%	94.3%
SciCode Scientific code generation	28.2%	45.7%
LCR Long-context reasoning	17.0%	68.0%
IFBench Instruction-following	38.1%	53.7%
TAU-bench v2 Tool use & agentic tasks	—	74.9%
TerminalBench Hard CLI command generation	0.0%	37.6%
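
The $6.00/1M blended figure for Grok 4 is consistent with a token-weighted average of the input and output prices at the stated 3:1 input/output ratio. The sketch below assumes that is how the figure is derived; the function name `blended_cost` is hypothetical.

```python
def blended_cost(input_per_m: float, output_per_m: float, ratio: float = 3.0) -> float:
    """Weighted average price per 1M tokens.

    Assumes `ratio` input tokens are consumed for every 1 output token.
    """
    return (ratio * input_per_m + output_per_m) / (ratio + 1)


# Grok 4 list prices from the table: $3.00 input, $15.00 output per 1M tokens.
grok4_blended = blended_cost(3.00, 15.00)
print(f"${grok4_blended:.2f}/1M")  # $6.00/1M
```

Workloads that generate more output than this ratio assumes will see a higher effective price, since output tokens cost five times as much as input tokens for Grok 4.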

Key Takeaways

Llama 3.3 Nemotron Super 49B v1 (Reasoning) has the lowest listed price at $0.00/1M for both input and output tokens, which may suit high-volume or cost-constrained projects, though it trails Grok 4 on every benchmark reported here.

Grok 4 leads on reasoning benchmarks with an 87.7% GPQA score, compared with 64.3% for Llama 3.3 Nemotron Super 49B v1 (Reasoning).

Grok 4 scores 55.1 on the Coding Index, compared with 18.7, making it the stronger of the two for software development and code generation tasks.

Context window sizes are not reported for either model in this comparison.