Teraflops (TFLOPS, 10¹² floating-point operations per second) is a unit of computational throughput used to quantify the peak arithmetic capability of processors, GPUs, and AI accelerators. In machine learning, TFLOPS figures serve as the standard benchmark for comparing the raw throughput of hardware and for estimating the training feasibility of large models.
## Definition and Scale
| Unit | Operations per Second |
|---|---|
| GFLOPS (Gigaflops) | 10⁹ |
| TFLOPS (Teraflops) | 10¹² |
| PFLOPS (Petaflops) | 10¹⁵ |
| EFLOPS (Exaflops) | 10¹⁸ |
Modern AI accelerators are commonly rated in tens to hundreds of TFLOPS at standard precisions, or thousands of TFLOPS for sparse or integer operations.
## Precision and TFLOPS
Hardware vendors publish multiple TFLOPS figures depending on numeric format:
- **FP64 (double precision):** Used in scientific simulations; typically the lowest figure, as it requires the most hardware resources per operation.
- **FP32 (single precision):** The traditional default for neural network training. Data-center GPUs typically rate FP32 at 2× their FP64 throughput, while consumer GPUs, which devote little silicon to FP64, run FP32 at 32–64× the FP64 rate.
- **BF16 / FP16 (half precision):** The dominant format for LLM training. Dedicated matrix units (Tensor Cores and similar engines) typically deliver 2× the TF32 matrix rate and an order of magnitude more than scalar FP32 (e.g., 312 vs. 19.5 TFLOPS on the A100).
- **INT8 / FP8 (quantized):** Used primarily for inference. Modern accelerators typically rate INT8/FP8 at 2× the FP16 figure, though achieved utilization in practice is lower.
- **Sparse TFLOPS:** Some vendors (e.g., NVIDIA with the A100/H100) advertise a further 2× throughput via 2:4 structured sparsity, contingent on two of every four consecutive weights being zero (50% sparsity).
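To make the precision ladder concrete, here is one accelerator's published peak ratings (NVIDIA A100 SXM vendor figures, dense unless noted) with their ratios to scalar FP32:

```python
# Illustrative: NVIDIA A100 SXM vendor peak throughput by numeric format.
# Values in TFLOPS (TOPS for INT8); dense unless noted.
a100_peak = {
    "FP64": 9.7,
    "FP64 Tensor Core": 19.5,
    "FP32 (scalar)": 19.5,
    "TF32 Tensor Core": 156.0,
    "FP16/BF16 Tensor Core": 312.0,
    "INT8 Tensor Core": 624.0,               # TOPS; 2x the FP16 figure
    "FP16 Tensor Core (2:4 sparse)": 624.0,  # 2x dense via structured sparsity
}

for fmt, peak in a100_peak.items():
    ratio = peak / a100_peak["FP32 (scalar)"]
    print(f"{fmt:30s} {peak:>7.1f}   {ratio:4.1f}x scalar FP32")
```

The recurring pattern (each step down in precision roughly doubles peak throughput) holds across most recent accelerators, not just this example.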
## TFLOPS in AI Hardware
Representative peak figures (FP16/BF16, dense):
- NVIDIA H100 SXM: ~989 TFLOPS (FP8: ~1,979 TFLOPS; structured sparsity doubles both)
- NVIDIA A100 80GB: ~312 TFLOPS (FP16/BF16 Tensor Core; TF32: ~156 TFLOPS)
- Google TPU v4: ~275 TFLOPS (BF16)
- AMD MI300X: ~1,307 TFLOPS (FP16)
- Apple M4 Max: ~38 TOPS (INT8, Neural Engine; a TOPS rating, not floating-point)
## TFLOPS and Training Cost
The computational cost of training a neural network is often estimated using the formula:
C ≈ 6 × N × D
where C is total training FLOPs, N is the number of model parameters, and D is the number of training tokens (the approximation used in Chinchilla-style scaling analyses). Dividing C by the cluster's sustained throughput (peak FLOP/s × Model FLOP Utilization, or MFU) yields an estimate of training time.
For example, training a 70B-parameter model on 1.4T tokens requires roughly 5.9 × 10²³ FLOPs (~590 zettaFLOPs). At 50% MFU on a cluster of 1,024 H100s, each delivering ~989 dense FP16 TFLOPS, training would take approximately 13 days.
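The arithmetic can be sketched as a small estimator. The GPU count, peak rating, and MFU below are the illustrative assumptions from the example above, not measured values:

```python
# Sketch: estimating training time from the C ≈ 6·N·D rule of thumb.
# All inputs are illustrative assumptions, not measured values.

def training_days(params: float, tokens: float,
                  num_gpus: int, peak_tflops: float, mfu: float) -> float:
    """Wall-clock days to train at a given Model FLOP Utilization."""
    total_flops = 6 * params * tokens                  # C ≈ 6·N·D
    sustained = num_gpus * peak_tflops * 1e12 * mfu    # FLOP/s actually achieved
    return total_flops / sustained / 86_400            # 86,400 seconds per day

# 70B parameters, 1.4T tokens, 1,024 GPUs at 989 dense FP16 TFLOPS, 50% MFU
days = training_days(70e9, 1.4e12, 1024, 989, 0.5)
print(f"{days:.1f} days")  # → 13.4 days
```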
## TFLOPS vs. Real-World Performance
Peak TFLOPS figures are theoretical maximums rarely achieved in practice. Key gaps include:
- Memory bandwidth bottleneck: Many operations are memory-bound, not compute-bound.
- Operator fusion and kernel efficiency: Unoptimized kernels may use <30% of peak TFLOPS.
- Communication overhead: In multi-GPU training, all-reduce operations consume wall-clock time not counted in compute.
- Batch size effects: Small batch sizes reduce tensor core utilization.
Model FLOP Utilization (MFU) and Hardware FLOP Utilization (HFU) are metrics that track the fraction of peak TFLOPS actually used, and are more informative than raw TFLOPS for benchmarking.
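MFU itself can be computed from observed token throughput. This sketch reuses the ~6·N FLOPs-per-token approximation from the training-cost section; the model size, throughput, and hardware figures are illustrative assumptions:

```python
# Sketch: Model FLOP Utilization from observed training throughput.
# Uses the ~6·N FLOPs-per-token approximation (forward + backward pass).

def model_flop_utilization(params: float, tokens_per_s: float,
                           num_gpus: int, peak_tflops: float) -> float:
    """Fraction of peak cluster FLOP/s the model's math actually uses."""
    model_flops_per_s = 6 * params * tokens_per_s
    peak_flops_per_s = num_gpus * peak_tflops * 1e12
    return model_flops_per_s / peak_flops_per_s

# Hypothetical run: a 70B model processing 400k tokens/s on 1,024 GPUs,
# each rated at 989 dense FP16 TFLOPS
u = model_flop_utilization(70e9, 4.0e5, 1024, 989)
print(f"MFU = {u:.1%}")  # → MFU = 16.6%
```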
## Teraflops vs. Tokens per Second
For inference workloads, tokens per second is often a more practical metric than TFLOPS because it directly reflects user-facing latency and throughput. TFLOPS figures remain essential, however, for capacity planning and for comparing hardware before deployment.
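The link between the two metrics can be sketched with the ~2·N forward-pass FLOPs-per-generated-token approximation. This yields only a compute-bound ceiling, since decode is usually memory-bandwidth-bound; the GPU rating and MFU below are assumptions:

```python
# Sketch: compute-bound upper limit on decode tokens/s from peak TFLOPS.
# Uses the ~2·N FLOPs-per-generated-token approximation (forward pass only);
# real decode throughput is typically memory-bandwidth-bound and far lower.

def max_tokens_per_s(params: float, peak_tflops: float, mfu: float) -> float:
    flops_per_token = 2 * params
    return peak_tflops * 1e12 * mfu / flops_per_token

# 70B model on a single GPU rated at 989 dense FP16 TFLOPS, optimistic 40% MFU
tps = max_tokens_per_s(70e9, 989, 0.4)
print(f"{tps:.0f} tokens/s (compute-bound ceiling)")
```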
Teraflops are the lingua franca of AI hardware comparison, but interpreting them correctly requires attention to precision format, sparsity assumptions, and the gap between peak and sustained throughput in real workloads.