LLM Quantization Explained: INT4, INT8, FP8, AWQ, and GPTQ in 2026

Quantization is how you run a 70B model on hardware that wasn’t designed for it. It’s also how you double the throughput of a model you’re already running. This guide explains every major quantization format and method — what the abbreviations mean, how much VRAM they save, what quality they sacrifice, and which to use when.

What Is Quantization and Why Does It Matter?

Neural network weights are numbers. By default, those numbers are stored in BF16 (Brain Float 16) — 2 bytes per parameter. A 70B parameter model therefore occupies 140GB in BF16. Quantization reduces the precision of those numbers, compressing the storage requirement.

INT8 uses 1 byte per parameter — halving VRAM. INT4 uses half a byte per parameter — quartering VRAM. A 70B model in INT4 occupies roughly 35–40GB, enabling it to run on a single RTX PRO 6000 Blackwell (96GB) that couldn’t fit it in BF16.

The tradeoff: lower precision means some information is lost. Modern quantization methods are designed to minimize that loss while maximizing compression. The art is in which values get quantized aggressively and which are preserved.
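The bytes-per-parameter arithmetic is simple enough to script. A minimal sketch (the function name is ours; bytes-per-parameter values match the formats described below, and the result covers weights only):

```python
# Approximate weight-only VRAM for a model at a given precision.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "bf16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,
}

def weight_vram_gb(params_billions: float, fmt: str) -> float:
    """Weights only -- KV cache and framework overhead come on top."""
    return params_billions * BYTES_PER_PARAM[fmt]

print(weight_vram_gb(70, "bf16"))  # 140.0
print(weight_vram_gb(70, "int4"))  # 35.0
```

Note that real INT4 checkpoints land slightly above this floor (36–40GB for 70B) because scales, zero points, and some unquantized layers are stored at higher precision.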

The Formats: What Each One Means

FP32 — Full Precision (Reference)

4 bytes per parameter. Used for optimizer states during training. Rarely used for inference in 2026: no quality benefit over BF16 for inference, and double the VRAM.

BF16 — Brain Float 16 (Modern Standard)

2 bytes per parameter. The standard precision for training and full-quality inference on modern hardware. Blackwell, Hopper, and Ada architecture GPUs have native BF16 support with hardware tensor cores. All VRLA Tech AI systems are optimized for BF16 workloads.

FP8 — 8-bit Floating Point

1 byte per parameter. NVIDIA Blackwell and Hopper architectures support FP8 natively in hardware tensor cores, making it both memory-efficient and computationally fast. FP8 inference quality is nearly indistinguishable from BF16 for most tasks — it’s the practical sweet spot between quality and efficiency in 2026. vLLM and TensorRT-LLM both support FP8 inference on Blackwell GPUs.

INT8 — 8-bit Integer

1 byte per parameter. Slightly lower quality than FP8 due to integer quantization, but well-supported across all hardware. bitsandbytes INT8 quantization is a standard method for reducing inference VRAM by 50%. Good for teams with older GPU hardware that doesn’t support FP8.

INT4 — 4-bit Integer

0.5 bytes per parameter. Maximum VRAM savings — 75% reduction vs BF16. Quality degradation is noticeable on complex reasoning and math tasks but acceptable for many production use cases like summarization, classification, and code completion. INT4 is the format that enables 70B models to run on a single GPU.
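To make the precision loss concrete, here is a toy absmax (symmetric) round-trip: the naive uniform scheme that methods like AWQ and GPTQ improve on. Pure Python, and the helper names are ours:

```python
def quantize_absmax(weights, bits=4):
    """Symmetric absmax quantization: map floats to signed ints."""
    qmax = 2 ** (bits - 1) - 1          # 7 for INT4, 127 for INT8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.82, -0.31, 0.05, -1.2, 0.44]
q4, s4 = quantize_absmax(w, bits=4)
restored = dequantize(q4, s4)

# Worst per-weight round-trip error; shrinks as bits increase.
err = max(abs(a - b) for a, b in zip(w, restored))
print(q4, round(err, 4))  # [5, -2, 0, -7, 3] 0.0743
```

Rerunning with `bits=8` drops the worst-case error by roughly an order of magnitude, which is why INT8 quality sits so much closer to BF16 than INT4 does.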

Quantization Methods: AWQ, GPTQ, GGUF, and More

AWQ — Activation-Aware Weight Quantization

AWQ analyzes activation patterns during calibration to identify the most important weights — those that have the highest impact on output quality — and protects them from aggressive quantization. Less important weights are quantized more aggressively. The result: better quality preservation at INT4 than naive uniform quantization.

AWQ models are the current best-practice INT4 format for vLLM deployment. Most major open-source models have community-provided AWQ versions available. AWQ quantization is also fast to apply to new models.
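A toy illustration of the core idea (not the actual AWQ algorithm, which searches per-channel scales against calibration data): scale a salient weight up before quantizing so it uses more of the integer grid, then fold the inverse scale into the activations.

```python
def absmax_quant(ws, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in ws) / qmax
    return [round(w / scale) * scale for w in ws]

# One group of weights; suppose index 0 is "salient" because
# calibration showed its activations are consistently large.
w = [0.05, 1.0, -0.8, 0.3]

naive = absmax_quant(w)
err_naive = abs(naive[0] - w[0])        # small weight rounds to 0

# AWQ-style: scale the salient weight up before quantizing,
# divide back after (the inverse scale folds into activations,
# so the layer's mathematical output is unchanged).
s = 4.0
q = absmax_quant([w[0] * s] + w[1:])
err_awq = abs(q[0] / s - w[0])

print(err_naive, err_awq)  # the protected weight loses far less
```

In the naive pass the salient weight rounds all the way to zero (100% relative error); with the protective scale its error drops severalfold, while the other weights are untouched because the group's absmax did not change.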

GPTQ — Generative Pre-trained Transformer Quantization

GPTQ quantizes weights layer by layer, minimizing the quantization error within each layer using second-order information about the weight distribution. It produces high-quality INT4 models: slightly lower quality than AWQ in most benchmarks, but still widely used thanks to excellent tool support and broad availability of pre-quantized models.

GPTQ is supported in vLLM, Hugging Face Transformers, and text-generation-webui. Many Llama, Mistral, and Qwen variants have community GPTQ versions available.

GGUF — GPT-Generated Unified Format

GGUF is the format used by llama.cpp and Ollama. It supports a range of quantization levels (Q2_K through Q8_0) and enables CPU+GPU hybrid inference — the model loads partially into GPU VRAM and partially into system RAM, with compute split between CPU and GPU. GGUF enables running large models on consumer hardware with modest VRAM by offloading layers to system RAM. Quality varies by quantization level: Q5_K_M and Q6_K offer near-BF16 quality; Q2_K shows noticeable degradation.
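A rough way to reason about the CPU+GPU split: estimate per-layer weight size at your quantization level and count how many layers fit in VRAM. A sketch under simplifying assumptions (uniform layers, weights only; llama.cpp's actual offload choice also depends on context length, KV cache, and buffers, and the bits-per-weight figure for Q4_K_M is approximate):

```python
def layers_on_gpu(total_params_b, n_layers, bits_per_weight,
                  vram_gb, reserve_gb=2.0):
    """How many transformer layers fit in VRAM at a GGUF quant level."""
    model_gb = total_params_b * bits_per_weight / 8   # weights only
    per_layer_gb = model_gb / n_layers                # assume uniform layers
    usable = max(vram_gb - reserve_gb, 0)             # leave room for buffers
    return min(n_layers, int(usable / per_layer_gb))

# 70B model at ~4.5 bits/weight (Q4_K_M-ish), 80 layers, 24GB card:
print(layers_on_gpu(70, 80, 4.5, 24))  # 44
```

Roughly half the layers land on the GPU in this scenario; the rest run from system RAM on the CPU, which is exactly the throughput tradeoff discussed below.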

NF4 — Normalized Float 4-bit

Used in QLoRA fine-tuning. NF4 is an information-theoretically optimal 4-bit data type for normally distributed weights, designed to minimize quantization error for the typical weight distribution of neural networks. Used specifically for the frozen base model weights in QLoRA — not typically used for inference deployment.
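In practice, NF4 appears as a loading option rather than a file format. A sketch of a typical QLoRA-style load using Hugging Face Transformers and bitsandbytes (the model id is a placeholder; the argument names follow the BitsAndBytesConfig API, so check them against your installed versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Frozen base weights stored in NF4; compute runs in BF16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # the QLoRA data type
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,      # also quantize the quant constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",           # placeholder model id
    quantization_config=bnb_config,
)
# LoRA adapters (e.g. via peft) are then trained on top in BF16.
```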

VRAM Requirements by Format and Model Size

Model | BF16  | FP8   | INT8  | INT4 (AWQ/GPTQ)
7B    | 14GB  | 7GB   | 7GB   | 4–5GB
13B   | 26GB  | 13GB  | 13GB  | 7–8GB
30B   | 60GB  | 30GB  | 30GB  | 16–18GB
70B   | 140GB | 70GB  | 70GB  | 36–40GB
180B  | 360GB | 180GB | 180GB | 92–100GB

Add 15–25% for KV cache and framework overhead during inference at typical batch sizes.
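The overhead is dominated by the KV cache, which you can estimate from the model's architecture. A sketch (the Llama-70B-style config numbers below are illustrative assumptions, not measured values):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch,
                bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, per token."""
    b = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch
    return b / 1e9

# Llama-70B-style config: 80 layers, 8 KV heads (GQA), head_dim 128.
# 8 concurrent sequences at 4k context, BF16 cache:
print(round(kv_cache_gb(80, 8, 128, 4096, 8), 1))  # 10.7
```

Against the 36–40GB of INT4 weights for a 70B model, roughly 10GB of cache at this batch size is squarely in the 15–25% range quoted above; longer contexts and bigger batches grow it linearly.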

Quality Comparison by Format

Format      | Quality vs BF16 | Best For                              | Notes
FP8         | ~99%            | Production inference                  | Native hardware support on Blackwell/Hopper
INT8        | ~97–98%         | Production inference                  | Widely supported; good quality
AWQ INT4    | ~94–96%         | Production inference, VRAM-constrained | Best INT4 quality; good vLLM support
GPTQ INT4   | ~93–95%         | Production inference                  | Wide availability of pre-quantized models
GGUF Q5_K_M | ~95–97%         | CPU+GPU hybrid, Ollama                | Good quality; flexible offloading
GGUF Q4_K_M | ~92–94%         | Consumer hardware                     | Most common Ollama format
GGUF Q2_K   | ~80–85%         | Extreme VRAM constraints only         | Noticeable quality loss

Practical recommendation for production on-premise: FP8 if your hardware supports it (Blackwell or Hopper), AWQ INT4 if you need to fit a model in constrained VRAM. Don’t use GGUF for production high-throughput serving — it’s designed for developer workstations and edge deployments, not server throughput.

Choosing the Right Format for Your Hardware

  • RTX PRO 6000 Blackwell (96GB): FP8 or BF16 for 7B–30B models; AWQ INT4 for 70B inference on a single card
  • 2x RTX PRO 6000 NVLink (192GB): BF16 or FP8 for 70B models; no quantization needed for most workloads
  • RTX 5090 (32GB): AWQ INT4 or GGUF for 13B–30B; BF16/FP8 for 7B and smaller
  • Developer workstation (24GB): GGUF Q4_K_M or Q5_K_M for local LLM experimentation via Ollama

VRLA Tech LLM servers ship with the quantization stack pre-installed

Our systems come with vLLM, llama.cpp, AWQ and GPTQ quantization tools, and the correct CUDA/driver stack for your GPU — all pre-configured. Run your first quantized inference job on the same day the system arrives.

View LLM server configurations →  |  Get a quote →

Need an on-premise LLM inference server configured for your model?

Tell us your model size and throughput requirements. Our engineers will spec the right system with the right quantization strategy.

Talk to an engineer →

Frequently Asked Questions

What quantization format should I use for production LLM inference?

FP8 on Blackwell or Hopper hardware for maximum quality and speed. AWQ INT4 for VRAM-constrained deployments. Avoid INT4 for math, code generation, or reasoning-heavy tasks where quality loss is most noticeable.

Can I fine-tune a quantized model?

Not directly. QLoRA uses the quantized model as a frozen base (NF4 format) with LoRA adapters trained in BF16 — effectively fine-tuning around the quantized weights. This is not the same as fine-tuning the quantized weights themselves, which generally degrades quality further.

Is GGUF good for production inference serving?

GGUF (llama.cpp) is excellent for developer workstations and single-user scenarios. For production multi-user inference serving, vLLM with AWQ or FP8 delivers substantially higher throughput. GGUF’s CPU offloading is a tradeoff that limits throughput.
