LLM Quantization Explained: INT4, INT8, FP8, AWQ, and GPTQ in 2026

Quantization is how you run a 70B model on hardware that wasn’t designed for it. It’s also how you double the throughput of a model you’re already running. This guide explains every major quantization format and method — what the abbreviations mean, how much VRAM they save, what quality they sacrifice, and which to use when.

What Is Quantization and Why Does It Matter?

Neural network weights are numbers. By default, those numbers are stored in BF16 (Brain Float 16) — 2 bytes per parameter. A 70B parameter model therefore occupies 140GB in BF16. Quantization reduces the precision of those numbers, compressing the storage requirement.

INT8 uses 1 byte per parameter — halving VRAM. INT4 uses half a byte per parameter — quartering VRAM. A 70B model in INT4 occupies roughly 35–40GB, enabling it to run on a single RTX PRO 6000 Blackwell (96GB) that couldn’t fit it in BF16.

The tradeoff: lower precision means some information is lost. Modern quantization methods are designed to minimize that loss while maximizing compression. The art is in which values get quantized aggressively and which are preserved.
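The bytes-per-parameter arithmetic is simple enough to script. A minimal sketch (the function name is ours; bytes-per-parameter values match the formats described below, and the result covers weights only):

```python
# Approximate weight-only VRAM for a model at a given precision.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "bf16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,
}

def weight_vram_gb(params_billions: float, fmt: str) -> float:
    """Weights only -- KV cache and framework overhead come on top."""
    return params_billions * BYTES_PER_PARAM[fmt]

print(weight_vram_gb(70, "bf16"))  # 140.0
print(weight_vram_gb(70, "int4"))  # 35.0
```

Note that real INT4 checkpoints land slightly above this floor (36–40GB for 70B) because scales, zero points, and some unquantized layers are stored at higher precision.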

The Formats: What Each One Means

FP32 — Full Precision (Reference)

4 bytes per parameter. Used for optimizer states during training. Rarely used for inference in 2026: no quality benefit over BF16 for inference, and double the VRAM.

BF16 — Brain Float 16 (Modern Standard)

2 bytes per parameter. The standard precision for training and full-quality inference on modern hardware. Blackwell, Hopper, and Ada architecture GPUs have native BF16 support with hardware tensor cores. All VRLA Tech AI systems are optimized for BF16 workloads.

FP8 — 8-bit Floating Point

1 byte per parameter. NVIDIA Blackwell and Hopper architectures support FP8 natively in hardware tensor cores, making it both memory-efficient and computationally fast. FP8 inference quality is nearly indistinguishable from BF16 for most tasks — it’s the practical sweet spot between quality and efficiency in 2026. vLLM and TensorRT-LLM both support FP8 inference on Blackwell GPUs.

INT8 — 8-bit Integer

1 byte per parameter. Slightly lower quality than FP8 due to integer quantization, but well-supported across all hardware. bitsandbytes INT8 quantization is a standard method for reducing inference VRAM by 50%. Good for teams with older GPU hardware that doesn’t support FP8.

INT4 — 4-bit Integer

0.5 bytes per parameter. Maximum VRAM savings — 75% reduction vs BF16. Quality degradation is noticeable on complex reasoning and math tasks but acceptable for many production use cases like summarization, classification, and code completion. INT4 is the format that enables 70B models to run on a single GPU.
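To make the precision loss concrete, here is a toy absmax (symmetric) round-trip: the naive uniform scheme that methods like AWQ and GPTQ improve on. Pure Python, and the helper names are ours:

```python
def quantize_absmax(weights, bits=4):
    """Symmetric absmax quantization: map floats to signed ints."""
    qmax = 2 ** (bits - 1) - 1          # 7 for INT4, 127 for INT8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.82, -0.31, 0.05, -1.2, 0.44]
q4, s4 = quantize_absmax(w, bits=4)
restored = dequantize(q4, s4)

# Worst per-weight round-trip error; shrinks as bits increase.
err = max(abs(a - b) for a, b in zip(w, restored))
print(q4, round(err, 4))  # [5, -2, 0, -7, 3] 0.0743
```

Rerunning with `bits=8` drops the worst-case error by roughly an order of magnitude, which is why INT8 quality sits so much closer to BF16 than INT4 does.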

Quantization Methods: AWQ, GPTQ, GGUF, and More

AWQ — Activation-Aware Weight Quantization

AWQ analyzes activation patterns during calibration to identify the most important weights — those that have the highest impact on output quality — and protects them from aggressive quantization. Less important weights are quantized more aggressively. The result: better quality preservation at INT4 than naive uniform quantization.

AWQ models are the current best-practice INT4 format for vLLM deployment. Most major open-source models have community-provided AWQ versions available. AWQ quantization is also fast to apply to new models.
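A toy illustration of the core idea (not the actual AWQ algorithm, which searches per-channel scales against calibration data): scale a salient weight up before quantizing so it uses more of the integer grid, then fold the inverse scale into the activations.

```python
def absmax_quant(ws, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in ws) / qmax
    return [round(w / scale) * scale for w in ws]

# One group of weights; suppose index 0 is "salient" because
# calibration showed its activations are consistently large.
w = [0.05, 1.0, -0.8, 0.3]

naive = absmax_quant(w)
err_naive = abs(naive[0] - w[0])        # small weight rounds to 0

# AWQ-style: scale the salient weight up before quantizing,
# divide back after (the inverse scale folds into activations,
# so the layer's mathematical output is unchanged).
s = 4.0
q = absmax_quant([w[0] * s] + w[1:])
err_awq = abs(q[0] / s - w[0])

print(err_naive, err_awq)  # the protected weight loses far less
```

In the naive pass the salient weight rounds all the way to zero (100% relative error); with the protective scale its error drops severalfold, while the other weights are untouched because the group's absmax did not change.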

GPTQ — Generative Pre-trained Transformer Quantization

GPTQ quantizes weights layer by layer, minimizing the quantization error within each layer using second-order information about the weight distribution. It produces high-quality INT4 models: slightly lower quality than AWQ in most benchmarks, but still widely used thanks to excellent tool support and broad availability of pre-quantized models.

GPTQ is supported in vLLM, Hugging Face Transformers, and text-generation-webui. Many Llama, Mistral, and Qwen variants have community GPTQ versions available.

GGUF — GPT-Generated Unified Format

GGUF is the format used by llama.cpp and Ollama. It supports a range of quantization levels (Q2_K through Q8_0) and enables CPU+GPU hybrid inference — the model loads partially into GPU VRAM and partially into system RAM, with compute split between CPU and GPU. GGUF enables running large models on consumer hardware with modest VRAM by offloading layers to system RAM. Quality varies by quantization level: Q5_K_M and Q6_K offer near-BF16 quality; Q2_K shows noticeable degradation.
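A rough way to reason about the CPU+GPU split: estimate per-layer weight size at your quantization level and count how many layers fit in VRAM. A sketch under simplifying assumptions (uniform layers, weights only; llama.cpp's actual offload choice also depends on context length, KV cache, and buffers, and the bits-per-weight figure for Q4_K_M is approximate):

```python
def layers_on_gpu(total_params_b, n_layers, bits_per_weight,
                  vram_gb, reserve_gb=2.0):
    """How many transformer layers fit in VRAM at a GGUF quant level."""
    model_gb = total_params_b * bits_per_weight / 8   # weights only
    per_layer_gb = model_gb / n_layers                # assume uniform layers
    usable = max(vram_gb - reserve_gb, 0)             # leave room for buffers
    return min(n_layers, int(usable / per_layer_gb))

# 70B model at ~4.5 bits/weight (Q4_K_M-ish), 80 layers, 24GB card:
print(layers_on_gpu(70, 80, 4.5, 24))  # 44
```

Roughly half the layers land on the GPU in this scenario; the rest run from system RAM on the CPU, which is exactly the throughput tradeoff discussed below.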

NF4 — Normalized Float 4-bit

Used in QLoRA fine-tuning. NF4 is an information-theoretically optimal 4-bit data type for normally distributed weights, designed to minimize quantization error for the typical weight distribution of neural networks. Used specifically for the frozen base model weights in QLoRA — not typically used for inference deployment.
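In practice, NF4 appears as a loading option rather than a file format. A sketch of a typical QLoRA-style load using Hugging Face Transformers and bitsandbytes (the model id is a placeholder; the argument names follow the BitsAndBytesConfig API, so check them against your installed versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Frozen base weights stored in NF4; compute runs in BF16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # the QLoRA data type
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,      # also quantize the quant constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",           # placeholder model id
    quantization_config=bnb_config,
)
# LoRA adapters (e.g. via peft) are then trained on top in BF16.
```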

VRAM Requirements by Format and Model Size

Model | BF16  | FP8   | INT8  | INT4 (AWQ/GPTQ)
7B    | 14GB  | 7GB   | 7GB   | 4–5GB
13B   | 26GB  | 13GB  | 13GB  | 7–8GB
30B   | 60GB  | 30GB  | 30GB  | 16–18GB
70B   | 140GB | 70GB  | 70GB  | 36–40GB
180B  | 360GB | 180GB | 180GB | 92–100GB

Add 15–25% for KV cache and framework overhead during inference at typical batch sizes.
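The overhead is dominated by the KV cache, which you can estimate from the model's architecture. A sketch (the Llama-70B-style config numbers below are illustrative assumptions, not measured values):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch,
                bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, per token."""
    b = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch
    return b / 1e9

# Llama-70B-style config: 80 layers, 8 KV heads (GQA), head_dim 128.
# 8 concurrent sequences at 4k context, BF16 cache:
print(round(kv_cache_gb(80, 8, 128, 4096, 8), 1))  # 10.7
```

Against the 36–40GB of INT4 weights for a 70B model, roughly 10GB of cache at this batch size is squarely in the 15–25% range quoted above; longer contexts and bigger batches grow it linearly.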

Quality Comparison by Format

Format      | Quality vs BF16 | Best For                              | Notes
FP8         | ~99%            | Production inference                  | Native hardware support on Blackwell/Hopper
INT8        | ~97–98%         | Production inference                  | Widely supported; good quality
AWQ INT4    | ~94–96%         | Production inference, VRAM-constrained | Best INT4 quality; good vLLM support
GPTQ INT4   | ~93–95%         | Production inference                  | Wide availability of pre-quantized models
GGUF Q5_K_M | ~95–97%         | CPU+GPU hybrid, Ollama                | Good quality; flexible offloading
GGUF Q4_K_M | ~92–94%         | Consumer hardware                     | Most common Ollama format
GGUF Q2_K   | ~80–85%         | Extreme VRAM constraints only         | Noticeable quality loss

Practical recommendation for production on-premise: FP8 if your hardware supports it (Blackwell or Hopper), AWQ INT4 if you need to fit a model in constrained VRAM. Don’t use GGUF for production high-throughput serving — it’s designed for developer workstations and edge deployments, not server throughput.

Choosing the Right Format for Your Hardware

  • RTX PRO 6000 Blackwell (96GB): FP8 or BF16 for 7B–30B models; AWQ INT4 for 70B inference on a single card
  • 2x RTX PRO 6000 NVLink (192GB): BF16 or FP8 for 70B models; no quantization needed for most workloads
  • RTX 5090 (32GB): AWQ INT4 or GGUF for 13B–30B; BF16/FP8 for 7B and smaller
  • Developer workstation (24GB): GGUF Q4_K_M or Q5_K_M for local LLM experimentation via Ollama

VRLA Tech LLM servers ship with the quantization stack pre-installed

Our systems come with vLLM, llama.cpp, AWQ and GPTQ quantization tools, and the correct CUDA/driver stack for your GPU — all pre-configured. Run your first quantized inference job on the same day the system arrives.

View LLM server configurations →  |  Get a quote →

Need an on-premise LLM inference server configured for your model?

Tell us your model size and throughput requirements. Our engineers will spec the right system with the right quantization strategy.

Talk to an engineer →

Frequently Asked Questions

What quantization format should I use for production LLM inference?

FP8 on Blackwell or Hopper hardware for maximum quality and speed. AWQ INT4 for VRAM-constrained deployments. Avoid INT4 for math, code generation, or reasoning-heavy tasks where quality loss is most noticeable.

Can I fine-tune a quantized model?

Not directly. QLoRA uses the quantized model as a frozen base (NF4 format) with LoRA adapters trained in BF16 — effectively fine-tuning around the quantized weights. This is not the same as fine-tuning the quantized weights themselves, which generally degrades quality further.

Is GGUF good for production inference serving?

GGUF (llama.cpp) is excellent for developer workstations and single-user scenarios. For production multi-user inference serving, vLLM with AWQ or FP8 delivers substantially higher throughput. GGUF’s CPU offloading is a tradeoff that limits throughput.
