TurboQuant is a post-training quantization (PTQ) method introduced by Together AI that compresses large language model weights to INT4 precision while keeping model quality close to the FP16 baseline. It is designed to be fast to apply (requiring only a small calibration dataset), to produce structured, hardware-aligned kernels that fully utilize modern GPU tensor core throughput, and to work across a wide range of model families without per-model tuning.
Core Design Goals
Near-lossless INT4 compression
TurboQuant targets the INT4-weight, FP16-activation (W4A16) configuration. At INT4, each weight occupies 4 bits rather than 16, yielding roughly a 4× reduction in weight memory (slightly less once per-group metadata is counted) and a corresponding improvement in memory-bandwidth-bound inference throughput.
Hardware alignment
Quantization groups and tile sizes are chosen to match the memory access patterns of GPU tensor cores, ensuring that the dequantization overhead during inference is minimized and that peak hardware utilization is achievable.
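One concrete reason layout matters is that INT4 values are not individually addressable: kernels store two 4-bit codes per byte and unpack them in registers. The exact packing order is kernel-specific; the sketch below assumes a simple low-nibble-first layout purely for illustration, not TurboQuant's actual format.

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack unsigned 4-bit codes (0..15) two per byte, low nibble first."""
    assert q.size % 2 == 0 and q.min() >= 0 and q.max() <= 15
    pairs = q.astype(np.uint8).reshape(-1, 2)
    return (pairs[:, 0] | (pairs[:, 1] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Recover the 4-bit codes; kernels do this in registers before the GEMM."""
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    return np.stack([low, high], axis=1).reshape(-1)
```

Because unpacking happens on every weight load, aligning the packed layout with tensor core tile shapes is what keeps dequantization off the critical path.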
Fast calibration
Unlike methods that require expensive layer-by-layer optimization or gradient-based fine-tuning, TurboQuant can be applied with a small unlabeled calibration corpus (typically 512–1024 samples) in minutes rather than hours.
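To make "fast calibration" concrete, the sketch below shows the kind of lightweight statistic such a pipeline might collect: a per-channel max-abs over activations from the calibration corpus. The function name and the single-statistic design are illustrative assumptions, not TurboQuant's documented procedure.

```python
import numpy as np

def calibrate_channel_scales(activations: list) -> np.ndarray:
    """Per-input-channel max-abs over a small calibration set.

    Each array is (tokens, channels); one forward pass per sample suffices,
    which is why calibration takes minutes rather than hours.
    """
    stats = np.zeros(activations[0].shape[1])
    for a in activations:
        stats = np.maximum(stats, np.abs(a).max(axis=0))
    return stats

# Hypothetical usage on random stand-in data for a small calibration corpus.
rng = np.random.default_rng(0)
batches = [rng.standard_normal((32, 64)) for _ in range(16)]
scales = calibrate_channel_scales(batches)
```

Statistics like these can then inform per-channel or per-group scaling without any gradient-based optimization.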
Relationship to Other Quantization Methods
TurboQuant belongs to the family of weight-only PTQ methods alongside:
- GPTQ: Iterative second-order weight updates per layer; produces accurate results but can be slow for very large models.
- AWQ (Activation-Aware Weight Quantization): Identifies salient channels based on activation magnitudes and applies per-channel scaling; fast and accurate.
- SqueezeLLM: Combines dense and sparse quantization for non-uniform bit allocation.
- QuIP#: Incoherence processing via random orthogonal transforms for near-lossless INT2–INT4 compression.
TurboQuant's focus is on combining the throughput benefits of hardware-aligned INT4 kernels with the practicality of a simple calibration pipeline, positioning it as a production-ready option rather than a research prototype.
How INT4 Quantization Works
LLM weights are quantized using a group quantization scheme:
- Weights are divided into groups of g elements (commonly g = 128).
- For each group, a scale factor and zero-point are computed from the min/max values.
- Each weight value is mapped to the nearest integer in [0, 15] (unsigned INT4).
- During inference, weights are dequantized on-the-fly before matrix multiplication.
The group size g controls the accuracy–compression trade-off: smaller groups yield more accurate quantization at the cost of higher metadata overhead.
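The four steps above can be sketched directly. This is a generic asymmetric min/max group quantizer, not TurboQuant's specific kernel logic; the reconstruction error per weight is bounded by about half the group's scale.

```python
import numpy as np

def quantize_group(w: np.ndarray):
    """Map one group of FP weights to unsigned INT4 via asymmetric min/max.

    Returns the 4-bit codes plus the scale and zero-point that must be
    stored as per-group metadata.
    """
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 15.0 if hi > lo else 1.0
    zero = round(-lo / scale)
    q = np.clip(np.round(w / scale + zero), 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_group(q: np.ndarray, scale: float, zero: int) -> np.ndarray:
    """On-the-fly reconstruction performed before the matmul at inference."""
    return (q.astype(np.float32) - zero) * scale

# Quantize a weight row in groups of g = 128, then reconstruct it.
g = 128
rng = np.random.default_rng(0)
w = rng.standard_normal(2 * g).astype(np.float32)
groups = [quantize_group(grp) for grp in w.reshape(-1, g)]
w_hat = np.concatenate([dequantize_group(*grp) for grp in groups])
max_err = float(np.abs(w - w_hat).max())  # bounded by ~scale/2 per group
```

On the metadata trade-off: with g = 128 and, say, one FP16 scale plus one 8-bit zero-point per group, the overhead is (16 + 8) / 128 ≈ 0.19 extra bits per weight; halving g to 64 halves the quantization step within each group but doubles that overhead.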
Performance Characteristics
Memory savings
A 70B parameter model in FP16 requires approximately 140 GB of memory. In W4A16 INT4, this falls to ~40 GB, making it possible to run 70B models on a single 80 GB A100 or H100.
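The arithmetic behind those figures is straightforward; the metadata assumption below (one FP16 scale and one 8-bit zero-point per group of 128) is illustrative, since exact layouts vary by implementation.

```python
# Back-of-envelope weight memory for a 70B-parameter model.
params = 70e9
fp16_gb = params * 2 / 1e9          # 2 bytes per weight -> 140 GB

# INT4 with group size 128, assuming per-group FP16 scale + 8-bit zero-point.
g = 128
bits_per_weight = 4 + (16 + 8) / g  # ~4.19 effective bits per weight
int4_gb = params * bits_per_weight / 8 / 1e9
```

This gives roughly 37 GB for the weights alone; activations, the KV cache, and runtime buffers account for the gap up to the ~40 GB figure quoted above.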
Throughput gains
Because transformer inference is often memory-bandwidth-bound during autoregressive decoding, reducing weight size by 4× can yield close to 4× token generation throughput on the same hardware—provided the dequantization kernel is efficient.
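A roofline-style estimate makes the "close to 4×" claim concrete: at batch size 1, each decoded token must stream the full weight matrix from HBM, so throughput is roughly bandwidth divided by weight bytes. The 2 TB/s bandwidth figure is an assumed H100-class number for illustration.

```python
# Decode throughput upper bound: tokens/s ~= HBM bandwidth / weight bytes.
bandwidth_bytes_s = 2.0e12                        # ~2 TB/s HBM (assumed)
params = 70e9
bytes_per_param = {"FP16": 2.0, "INT4": 4.1875 / 8}   # INT4 incl. metadata
tok_s = {k: bandwidth_bytes_s / (params * b) for k, b in bytes_per_param.items()}
speedup = tok_s["INT4"] / tok_s["FP16"]           # just under 4x
```

The speedup falls slightly short of 4× because of the per-group metadata, and in practice dequantization compute and attention overheads shave it further.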
Accuracy
On standard benchmarks (MMLU, HellaSwag, GSM8K), well-implemented W4A16 quantization typically shows a degradation of less than 1–2 percentage points compared to FP16, with larger models being more resilient to quantization.
Use Cases
- Serving large models on fewer GPUs: Reduces hardware cost and deployment complexity.
- Increased batch sizes: Lower per-token memory footprint allows larger batches at the same GPU memory budget.
- Edge and on-device inference: Enables models that would otherwise not fit on consumer hardware.
- Cost-efficient API serving: Higher throughput per GPU reduces per-token inference cost.
TurboQuant exemplifies the broader industry trend toward deploying quantized LLMs in production, where the goal is to match FP16 quality as closely as possible while maximizing the efficiency gains that lower-precision arithmetic enables on modern accelerators.