LLM Quantization Explained: INT4, INT8, FP8, AWQ, and GPTQ in 2026
Quantization is how you run a 70B model on hardware that wasn’t designed for it. It’s also how you double the throughput of a model you’re already running. This guide explains every major quantization format and method — what the abbreviations mean, how much VRAM they save, what quality they sacrifice, and which to use when.
What Is Quantization and Why Does It Matter?
Neural network weights are numbers. By default, those numbers are stored in BF16 (Brain Float 16) — 2 bytes per parameter. A 70B parameter model therefore occupies 140GB in BF16. Quantization reduces the precision of those numbers, compressing the storage requirement.
INT8 uses 1 byte per parameter, halving VRAM. INT4 uses half a byte per parameter, quartering VRAM. A 70B model in INT4 needs 35GB for the raw weights, or roughly 36–40GB once per-group scales and zero points are included, so it fits on a single RTX PRO 6000 Blackwell (96GB), which cannot hold the 140GB BF16 version.
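The arithmetic above is parameter count times bytes per parameter. A minimal sketch, assuming 1 GB = 10^9 bytes and ignoring scales, zero points, and runtime overhead; `weight_memory_gb` is a hypothetical helper name, not part of any library:

```python
# Bits per parameter for the formats discussed in this guide.
BITS_PER_PARAM = {"FP32": 32, "BF16": 16, "FP8": 8, "INT8": 8, "INT4": 4}

def weight_memory_gb(params_billions: float, fmt: str) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes), excluding scales and overhead."""
    bytes_per_param = BITS_PER_PARAM[fmt] / 8
    return params_billions * bytes_per_param

sizes = {fmt: weight_memory_gb(70, fmt) for fmt in ("BF16", "INT8", "INT4")}
# sizes == {'BF16': 140.0, 'INT8': 70.0, 'INT4': 35.0}
```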
The tradeoff: lower precision means some information is lost. Modern quantization methods are designed to minimize that loss while maximizing compression. The art is in which values get quantized aggressively and which are preserved.
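To make the information loss concrete, here is a sketch of the simplest scheme, symmetric round-to-nearest INT8 with a single scale per tensor. The function names and sample weights are illustrative assumptions, and production methods use finer-grained scales:

```python
def quantize_absmax_int8(values):
    """Symmetric round-to-nearest INT8: map the largest magnitude to 127."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 for all-zero input
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]

weights = [0.42, -1.27, 0.003, 0.8]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)
# Worst-case error is half a quantization step; the tiny weight 0.003 rounds to code 0.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The smallest weight is lost entirely because one large value sets the step size for the whole tensor, which is the problem the methods later in this guide try to manage.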
The Formats: What Each One Means
FP32 — Full Precision (Reference)
4 bytes per parameter. Used during training for optimizer states and master weights. Rarely used for inference in 2026: it offers no quality benefit over BF16 and doubles the VRAM.
BF16 — Brain Float 16 (Modern Standard)
2 bytes per parameter. The standard precision for training and full-quality inference on modern hardware. Blackwell, Hopper, and Ada architecture GPUs have native BF16 support with hardware tensor cores. All VRLA Tech AI systems are optimized for BF16 workloads.
FP8 — 8-bit Floating Point
1 byte per parameter. NVIDIA Blackwell and Hopper architectures support FP8 natively in hardware tensor cores, making it both memory-efficient and computationally fast. FP8 inference quality is nearly indistinguishable from BF16 for most tasks — it’s the practical sweet spot between quality and efficiency in 2026. vLLM and TensorRT-LLM both support FP8 inference on Blackwell GPUs.
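The difference from an integer format is that FP8 spacing grows with magnitude, so small values keep relative precision. The sketch below assumes the E4M3 variant (4 exponent bits, 3 mantissa bits, largest finite value 448), one of the two common FP8 encodings. It is a software illustration of the rounding behaviour, not how tensor cores implement it, and `round_to_e4m3` is a hypothetical name:

```python
import math

E4M3_MAX = 448.0      # largest finite E4M3 value
E4M3_MIN_EXP = -6     # exponent of the smallest normal value
MANTISSA_BITS = 3

def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
    # Clamping to E4M3_MIN_EXP makes small values fall onto the subnormal grid.
    exp = max(math.frexp(mag)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)  # spacing between neighbours in this binade
    return sign * round(mag / step) * step

small = round_to_e4m3(0.3)      # spacing near 0.3 is 1/32
large = round_to_e4m3(-300.0)   # spacing near 300 is 32
clipped = round_to_e4m3(1000.0) # out-of-range values saturate
```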
INT8 — 8-bit Integer
1 byte per parameter. Typically slightly lower quality than FP8, because a uniform integer grid handles outlier values less gracefully than a floating-point format, but supported on nearly all GPU hardware. bitsandbytes INT8 quantization is a standard method that roughly halves weight VRAM relative to BF16. Good for teams with older GPU hardware that doesn’t support FP8.
INT4 — 4-bit Integer
0.5 bytes per parameter. Maximum VRAM savings — 75% reduction vs BF16. Quality degradation is noticeable on complex reasoning and math tasks but acceptable for many production use cases like summarization, classification, and code completion. INT4 is the format that enables 70B models to run on a single GPU.
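Half a byte per parameter means two 4-bit codes share each byte. The sketch below shows one possible packing (low nibble first); real inference kernels use their own memory layouts, and the function names are assumptions for illustration:

```python
def pack_int4(codes):
    """Pack signed 4-bit codes (-8..7) two per byte, low nibble first."""
    codes = list(codes)
    if len(codes) % 2:
        codes.append(0)  # pad to an even count
    packed = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)

def unpack_int4(packed, count):
    """Inverse of pack_int4: recover `count` signed 4-bit codes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)  # sign-extend
    return codes[:count]

original = [-8, 7, 3, -1, 0]
packed = pack_int4(original)          # 5 codes fit in 3 bytes
roundtrip = unpack_int4(packed, len(original))
```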
Quantization Methods: AWQ, GPTQ, GGUF, and More
AWQ — Activation-Aware Weight Quantization
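AWQ runs a small calibration set through the model and uses the activation statistics to identify salient weight channels: the small fraction of weights (on the order of 1%) with the largest impact on output quality. Rather than keeping those weights at higher precision, it rescales the salient channels before quantization so they lose less to rounding, with the inverse scale folded into the preceding operation. Every weight still ends up at the same bit width, so the format stays hardware-friendly. The result: better quality preservation at INT4 than naive uniform quantization.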
AWQ models are the current best-practice INT4 format for vLLM deployment. Most major open-source models have community-provided AWQ versions available. AWQ quantization is also fast to apply to new models.
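One way to see how important weights can be protected is per-channel scaling: multiply a salient weight by s before rounding and divide by s afterwards, which shrinks its rounding error as long as the group maximum is unchanged. The toy below uses made-up weights, activations, and a hand-picked scale of 2.0; the real method searches for the scales from calibration data, so treat this as a sketch of the idea only:

```python
def quantize_rtn_int4(row):
    """Round-to-nearest symmetric INT4 over one weight group; returns dequantized values."""
    step = max(abs(w) for w in row) / 7
    return [max(-7, min(7, round(w / step))) * step for w in row]

def output_error(row, approx, activations):
    """Absolute error of the dot product between weights and activations."""
    exact = sum(w * x for w, x in zip(row, activations))
    got = sum(w * x for w, x in zip(approx, activations))
    return abs(exact - got)

weights = [0.2, 1.0, -0.7, 0.1]
activations = [10.0, 0.5, 0.3, 0.2]  # channel 0 carries large activations
scales = [2.0, 1.0, 1.0, 1.0]        # hypothetical per-channel scales

plain = quantize_rtn_int4(weights)
scaled = quantize_rtn_int4([w * s for w, s in zip(weights, scales)])
aware = [q / s for q, s in zip(scaled, scales)]  # 1/s is normally folded into the activations

plain_error = output_error(weights, plain, activations)
aware_error = output_error(weights, aware, activations)
```

In this example the output error drops to roughly a quarter of the plain round-to-nearest error, because the heavily used channel is the one that gains precision.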
GPTQ — Post-Training Quantization for Generative Pre-trained Transformers
GPTQ quantizes weights layer by layer, minimizing each layer’s output error on a small calibration set. It uses approximate second-order (Hessian) information to adjust the weights not yet quantized so that they compensate for the rounding error already introduced. It produces high-quality INT4 models, often slightly behind AWQ in benchmarks, and remains widely used thanks to excellent tool support and the availability of pre-quantized models.
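The second-order step can be illustrated on a two-weight layer: after rounding the first weight, the remaining weight is shifted using the inverse Hessian so the layer output changes as little as possible. This is a single compensation step with a made-up Hessian and grid, not the full algorithm (which works through whole columns in blocks and then quantizes the compensated weights too):

```python
def quantize_to_grid(w, step=0.25):
    """Round one weight to a uniform grid (a stand-in for an INT4 grid)."""
    return round(w / step) * step

def layer_error(delta, H):
    """Squared output error delta^T H delta, with H = X X^T from calibration inputs."""
    return sum(delta[i] * H[i][j] * delta[j] for i in range(2) for j in range(2))

# Hypothetical 2-weight layer; the two inputs are positively correlated.
w = [0.60, 0.40]
H = [[2.0, 1.2], [1.2, 1.0]]
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
H_inv = [[H[1][1] / det, -H[0][1] / det], [-H[1][0] / det, H[0][0] / det]]

q0 = quantize_to_grid(w[0])  # quantize weight 0 first
err = w[0] - q0
# Second-order update: shift the not-yet-quantized weight to absorb the error.
w1_compensated = w[1] - err / H_inv[0][0] * H_inv[1][0]

plain_error = layer_error([q0 - w[0], 0.0], H)
compensated_error = layer_error([q0 - w[0], w1_compensated - w[1]], H)
```

Because weight 0 was rounded down and the inputs are correlated, weight 1 is nudged up, and the layer error before the second rounding is lower than with independent rounding.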
GPTQ is supported in vLLM, Hugging Face Transformers, and text-generation-webui. Many Llama, Mistral, and Qwen variants have community GPTQ versions available.
GGUF — GPT-Generated Unified Format
GGUF is the format used by llama.cpp and Ollama. It supports a range of quantization levels (Q2_K through Q8_0) and enables CPU+GPU hybrid inference — the model loads partially into GPU VRAM and partially into system RAM, with compute split between CPU and GPU. GGUF enables running large models on consumer hardware with modest VRAM by offloading layers to system RAM. Quality varies by quantization level: Q5_K_M and Q6_K offer near-BF16 quality; Q2_K shows noticeable degradation.
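A rough way to reason about offloading is to ask how many layers fit in VRAM after reserving room for the KV cache and runtime buffers. The sketch assumes all layers are the same size and uses a hypothetical 2GB reserve; the result is the kind of number you would pass to llama.cpp’s `--n-gpu-layers` option, and `gpu_layers_that_fit` is an illustrative helper, not part of llama.cpp:

```python
def gpu_layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                        reserve_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in VRAM after a fixed reserve."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))

# A 40GB quantized model with 80 layers on a 24GB card: the rest stays in system RAM.
on_gpu = gpu_layers_that_fit(model_gb=40.0, n_layers=80, vram_gb=24.0)
# A 4GB model fits entirely, so the answer is capped at the layer count.
all_layers = gpu_layers_that_fit(model_gb=4.0, n_layers=32, vram_gb=24.0)
```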
NF4 — 4-bit NormalFloat
NF4 is the 4-bit data type introduced with QLoRA fine-tuning. Its 16 levels are spaced to be information-theoretically optimal for normally distributed values, which matches the typical weight distribution of neural networks and so minimizes quantization error. It is used for the frozen base model weights in QLoRA and is not typically used for inference deployment.
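The idea behind a normal-distribution-aware data type can be sketched by placing 16 levels at quantiles of a standard normal, so levels are dense near zero where most weights sit. This builds an NF4-style codebook for illustration only: the actual NF4 table is constructed differently (it is asymmetric and includes an exact zero), and the function names here are assumptions:

```python
from statistics import NormalDist

def normal_quantile_codebook(bits: int = 4):
    """Levels at the midpoints of equal-probability bins of N(0, 1), scaled to [-1, 1]."""
    n = 2 ** bits
    quantiles = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]

def quantize_block(block, codebook):
    """Absmax-scale a block, then snap each value to the nearest codebook level."""
    absmax = max(abs(v) for v in block) or 1.0
    indices = [min(range(len(codebook)), key=lambda i: abs(v / absmax - codebook[i]))
               for v in block]
    return indices, absmax

codebook = normal_quantile_codebook()
indices, absmax = quantize_block([0.02, -0.5, 0.11, 0.3], codebook)
restored = [codebook[i] * absmax for i in indices]

# Levels are packed more tightly near zero than at the tails.
center_gap = codebook[8] - codebook[7]
tail_gap = codebook[15] - codebook[14]
```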
VRAM Requirements by Format and Model Size
| Model | BF16 | FP8 | INT8 | INT4 (AWQ/GPTQ) |
|---|---|---|---|---|
| 7B | 14GB | 7GB | 7GB | 4–5GB |
| 13B | 26GB | 13GB | 13GB | 7–8GB |
| 30B | 60GB | 30GB | 30GB | 16–18GB |
| 70B | 140GB | 70GB | 70GB | 36–40GB |
| 180B | 360GB | 180GB | 180GB | 92–100GB |
Add 15–25% for KV cache and framework overhead during inference at typical batch sizes.
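The KV cache part of that overhead can be estimated directly: each token stores one key and one value vector per layer and per KV head. The sketch below assumes a Llama-style 70B shape (80 layers, 8 KV heads, head dimension 128) and a 2-byte cache; these numbers and the helper names are assumptions, so substitute your model’s configuration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, tokens, bytes_per_value=2):
    """KV cache size: one key and one value vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * tokens * bytes_per_value

def with_overhead(weights_gb, low=0.15, high=0.25):
    """Apply the 15-25% rule of thumb to a weight footprint."""
    return weights_gb * (1 + low), weights_gb * (1 + high)

# Assumed Llama-style 70B shape: 80 layers, 8 KV heads (grouped-query attention), head_dim 128.
per_token = kv_cache_bytes(80, 8, 128, tokens=1)                  # 327,680 bytes per token
batch_gb = kv_cache_bytes(80, 8, 128, tokens=16 * 4096) / 1e9     # 16 sequences of 4,096 tokens
low_gb, high_gb = with_overhead(70.0)                             # 70B model in FP8 or INT8
```

Under these assumptions, 16 concurrent 4,096-token sequences need about 21.5GB of cache, so long contexts or large batches can push the total past the rule-of-thumb range.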
Quality Comparison by Format
| Format | Approx. quality vs BF16 | Best For | Notes |
|---|---|---|---|
| FP8 | ~99% | Production inference | Native hardware support on Blackwell/Hopper |
| INT8 | ~97–98% | Production inference | Widely supported; good quality |
| AWQ INT4 | ~94–96% | Production inference, VRAM-constrained | Best INT4 quality; good vLLM support |
| GPTQ INT4 | ~93–95% | Production inference | Wide availability of pre-quantized models |
| GGUF Q5_K_M | ~95–97% | CPU+GPU hybrid, Ollama | Good quality; flexible offloading |
| GGUF Q4_K_M | ~92–94% | Consumer hardware | Most common Ollama format |
| GGUF Q2_K | ~80–85% | Extreme VRAM constraints only | Noticeable quality loss |
Practical recommendation for production on-premise: FP8 if your hardware supports it (Blackwell or Hopper), AWQ INT4 if you need to fit a model in constrained VRAM. Don’t use GGUF for production high-throughput serving — it’s designed for developer workstations and edge deployments, not server throughput.
Choosing the Right Format for Your Hardware
- RTX PRO 6000 Blackwell (96GB): FP8 or BF16 for 7B–30B models; AWQ INT4 for 70B inference on a single card
- 2x RTX PRO 6000 Blackwell (192GB total): BF16 or FP8 for 70B models, split across both cards with tensor parallelism over PCIe (this card has no NVLink); no quantization needed for most workloads
- RTX 5090 (32GB): AWQ INT4 or GGUF for 13B–30B; BF16/FP8 for 7B and smaller
- Developer workstation (24GB): GGUF Q4_K_M or Q5_K_M for local LLM experimentation via Ollama
VRLA Tech LLM servers ship with the quantization stack pre-installed
Our systems come with vLLM, llama.cpp, AWQ and GPTQ quantization tools, and the correct CUDA/driver stack for your GPU — all pre-configured. Run your first quantized inference job on the same day the system arrives.
Frequently Asked Questions
What quantization format should I use for production LLM inference?
FP8 on Blackwell or Hopper hardware for the best balance of quality and speed. AWQ INT4 for VRAM-constrained deployments. Avoid INT4 for math, complex code generation, or reasoning-heavy tasks, where quality loss is most noticeable.
Can I fine-tune a quantized model?
Not directly. QLoRA uses the quantized model as a frozen base (NF4 format) with LoRA adapters trained in BF16 — effectively fine-tuning around the quantized weights. This is not the same as fine-tuning the quantized weights themselves, which generally degrades quality further.
Is GGUF good for production inference serving?
GGUF (llama.cpp) is excellent for developer workstations and single-user scenarios. For production multi-user inference serving, vLLM with AWQ or FP8 delivers substantially higher throughput. GGUF’s CPU offloading trades speed for capacity: layers held in system RAM run at CPU and memory-bus speeds, which limits throughput.




