VRAM is the single hard constraint in local LLM deployment. If your GPU has 24GB and your model needs 40GB, you get an out-of-memory error or slow CPU offloading. This is the reference table we use when configuring systems — covering every major open-weight LLM in production use as of June 2026 at every quantization level that matters.
How VRAM requirements are calculated
Model weights account for the largest share. FP16 is 2 bytes per parameter, FP8 is 1 byte, and Q4_K_M is nominally about 0.5 bytes. A 70B model at FP16 requires approximately 140GB. The same model at Q4_K_M requires approximately 38–40GB, slightly above the nominal 35GB because Q4_K_M keeps some tensors at higher precision.
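The per-parameter arithmetic above can be sketched in a few lines. This is a minimal estimate of weight memory only, for a dense model; the function name `weight_gb` and the `BYTES_PER_PARAM` table are illustrative, and the byte values are the nominal figures quoted in the text rather than measured file sizes.

```python
# Nominal bytes per parameter, taken from the figures quoted in the text.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "Q4_K_M": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Estimate weight memory in GB (1 GB = 1e9 bytes) for a dense model."""
    # billions of parameters * bytes per parameter = billions of bytes = GB
    return params_billion * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"70B @ {precision}: ~{weight_gb(70, precision):.0f} GB")
```

The nominal Q4_K_M result for a 70B model (35 GB) is best read as a lower bound; the 38–40GB figure quoted above is the safer number to plan around.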
KV cache stores the key-value pairs for every token in the context window. It grows linearly with context length and scales with batch size for multi-user serving. At 4K context, KV cache adds a few GB on top of weights. At 128K context, it can add tens of GB.
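As a rough sketch of that linear growth, the function below uses the standard estimate of two vectors (key and value) per layer per token. The layer count, KV head count, and head dimension are assumptions chosen to resemble a 70B-class model with grouped-query attention; substitute the values from your model's config.

```python
def kv_cache_gib(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # FP16 cache entries (assumption)
) -> float:
    """Estimate KV cache size in GiB for a standard transformer decoder."""
    # Factor of 2: one key vector and one value vector per layer per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens * batch_size / 2**30


if __name__ == "__main__":
    # Assumed shape: 80 layers, 8 KV heads, head_dim 128.
    for ctx in (4_096, 131_072):
        print(f"{ctx:>7} tokens: ~{kv_cache_gib(80, 8, 128, ctx):.2f} GiB")
```

With these assumed dimensions a single sequence needs about 1.25 GiB at 4K and about 40 GiB at 128K, and every additional concurrent sequence multiplies that figure.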
MoE models break the simple multiplication rule. Mixture-of-Experts architectures like Llama 4, DeepSeek V3, and Qwen 3 30B-A3B have a total parameter count and a smaller active parameter count. During inference, only the active parameters run per token — but all expert weights must be loaded into VRAM. A model listed as “109B total, 17B active” requires VRAM for all 109B parameters.
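To make the total-versus-active distinction concrete, here is a small sketch. The helper name `moe_weight_gb` is hypothetical, and the calculation covers weights only: the resident figure is what must fit in VRAM, while the active figure only indicates how much weight data is touched per token.

```python
def moe_weight_gb(
    total_params_b: float, active_params_b: float, bytes_per_param: float
) -> tuple[float, float]:
    """Return (resident_gb, active_gb) for a Mixture-of-Experts model."""
    # Every expert must be loaded, so VRAM is sized by the total count.
    resident_gb = total_params_b * bytes_per_param
    # Only the routed experts are read per token; this drives speed, not capacity.
    active_gb = active_params_b * bytes_per_param
    return resident_gb, active_gb


if __name__ == "__main__":
    # The "109B total, 17B active" example from the text, at FP8 (1 byte/param).
    resident, active = moe_weight_gb(109, 17, 1.0)
    print(f"VRAM needed for weights: ~{resident:.0f} GB")
    print(f"Weights read per token:  ~{active:.0f} GB")
```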
Quantization quick reference
| Precision | Bytes/Param | Quality vs FP16 | Notes |
|---|---|---|---|
| FP16 | 2.0 | Baseline | Full quality; maximum VRAM cost |
| FP8 | 1.0 | ~99% | Supported by Blackwell and Hopper generation GPUs |
| Q8_0 | ~1.0 | ~99% | Software quantization; similar to FP8 in practice |
| Q4_K_M | ~0.5 | ~97–99% | Optimal balance; recommended for most deployments |
| Q3_K_M | ~0.375 | ~93–95% | Noticeable quality loss on reasoning tasks |
| Q2_K | ~0.25 | ~85–90% | Significant degradation; only for extreme VRAM constraints |
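
One way to use the table is to pick the highest-quality precision whose weights fit a given VRAM budget. The sketch below assumes a dense model and a fixed reserve for KV cache and runtime overhead; the `reserve_gb` default of 4 GB is an illustrative placeholder, not a recommendation, and `best_precision` is a hypothetical helper name.

```python
# (precision, nominal bytes per parameter), ordered from highest to lowest
# quality, copied from the quick-reference table. Q8_0 is omitted because it
# has the same nominal size as FP8.
PRECISIONS = [
    ("FP16", 2.0),
    ("FP8", 1.0),
    ("Q4_K_M", 0.5),
    ("Q3_K_M", 0.375),
    ("Q2_K", 0.25),
]


def best_precision(params_billion: float, vram_gb: float, reserve_gb: float = 4.0):
    """Return the highest-quality precision whose weights fit, or None."""
    budget_gb = vram_gb - reserve_gb  # VRAM left for weights after the reserve
    for name, bytes_per_param in PRECISIONS:
        if params_billion * bytes_per_param <= budget_gb:
            return name
    return None


if __name__ == "__main__":
    for vram in (16, 32, 96):
        print(f"70B on {vram} GB: {best_precision(70, vram)}")
```

Because the byte values are nominal, a result that only just fits should be treated as a warning sign rather than a green light.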
Meta Llama 4 VRAM requirements
| Variant | Architecture | Q4 VRAM | FP8 VRAM | FP16 VRAM | Minimum GPU |
|---|---|---|---|---|---|
| Llama 4 Scout | MoE, 16 experts | ~55–60 GB | ~109 GB | ~218 GB | RTX PRO 6000 (96GB) at Q4 |
| Llama 4 Maverick | MoE, 128 experts | ~200 GB | ~400 GB | ~800 GB | 4× H100 (320GB) at Q4 |
Llama 4 Scout is the workstation-deployable variant. At Q4, it fits on a single RTX PRO 6000 Blackwell (96GB) with roughly 36–41GB of headroom. Its 10M-token context window is the largest of any open model in 2026, but the KV cache grows with context length, so very long contexts will exhaust that headroom long before the 10M limit.
DeepSeek VRAM requirements
| Variant | Architecture | Q4 VRAM | Minimum GPU |
|---|---|---|---|
| DeepSeek-R1-Distill-Llama-8B | Dense | ~5 GB | RTX 5090 (32GB) |
| DeepSeek-R1-Distill-Qwen-14B | Dense | ~9 GB | RTX 5090 (32GB) |
| DeepSeek-R1-Distill-Qwen-32B | Dense | ~20 GB | RTX 5090 (32GB) |
| DeepSeek V3 (671B) at Q4 | MoE | ~336 GB | 4× H200 minimum |
| DeepSeek V3 (671B) at FP8 | MoE | ~700 GB | 8× H200 (1,128GB) |
For most enterprise teams, the 32B distill at Q4 is the practical choice: approximately 20GB VRAM, runs on an RTX 5090, and benchmarks strongly on coding and reasoning tasks.
Qwen 3 VRAM requirements
| Variant | Architecture | Q4_K_M VRAM | Q8 VRAM | Minimum GPU |
|---|---|---|---|---|
| Qwen 3 8B | Dense | ~5 GB | ~9 GB | RTX 5090 (32GB) |
| Qwen 3 14B | Dense | ~9 GB | ~15 GB | RTX 5090 (32GB) |
| Qwen 3 30B-A3B | MoE | ~18 GB | ~32 GB | RTX 5090 (32GB) |
| Qwen 3 32B | Dense | ~19–20 GB | ~35 GB | RTX 5090 (32GB) |
| Qwen 3 72B | Dense | ~40–45 GB | ~75 GB | RTX PRO 6000 (96GB) |
| Qwen 3 235B-A22B | MoE | ~120 GB | ~240 GB | 2× RTX PRO 6000 (192GB) |
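Qwen 3 30B-A3B is one of the most compute-efficient serious reasoning models available: only about 3B parameters are active per token, so each token costs far less compute than a dense 30B model. All 30B parameters must still be resident in VRAM, so size the GPU for the full model rather than for the active 3B. The model uses Grouped Query Attention, which limits KV cache growth: expanding context from 8K to 64K adds only approximately 1.2GB. That combination makes it a strong fit for single-GPU workstations.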
Gemma 4 VRAM requirements
| Variant | Architecture | Q4 VRAM | FP16 VRAM | Minimum GPU |
|---|---|---|---|---|
| Gemma 4 E2B | MoE | ~4 GB | ~10 GB | 8GB GPU |
| Gemma 4 E4B | MoE | ~6 GB | ~16 GB | RTX 3060 12GB |
| Gemma 4 26B-A4B | MoE | ~16–18 GB | ~52 GB | RTX 5090 (32GB) |
| Gemma 4 31B | Dense | ~17–20 GB | ~62 GB | RTX 5090 (32GB) |
All Gemma 4 variants use Apache 2.0 licensing — fully permissive for commercial use with no revenue restrictions.
Llama 3.x VRAM requirements
| Variant | Architecture | Q4_K_M VRAM | FP8 VRAM | Minimum GPU |
|---|---|---|---|---|
| Llama 3.2 3B | Dense | ~2.5 GB | ~3 GB | 8GB GPU |
| Llama 3.1 8B | Dense | ~4.5 GB | ~8 GB | RTX 5090 (32GB) |
| Llama 3.2 11B Vision | Dense | ~7 GB | ~11 GB | RTX 5090 (32GB) |
| Llama 3.3 70B | Dense | ~38–40 GB | ~70 GB | RTX PRO 6000 (96GB) |
| Llama 3.2 90B Vision | Dense | ~50 GB | ~90 GB | RTX PRO 6000 (96GB) |
GPU tiers and what they run
| GPU | VRAM | Largest Models | Best For |
|---|---|---|---|
| RTX 4060 Ti 16GB | 16 GB | Up to ~14B at Q4 | Entry-level local inference |
| RTX 5090 | 32 GB | Up to ~32B at Q4 | Developer workstations, 7B–32B inference |
| Dual RTX 5090 | 64 GB | Llama 4 Scout at Q4 (~55–60 GB, limited KV cache headroom) | Researchers, multi-model setups |
| RTX PRO 6000 Blackwell | 96 GB | 70B at Q4 comfortably; 70B at FP8 | Production 70B inference, QLoRA fine-tuning |
| Dual RTX PRO 6000 | 192 GB | Up to 235B MoE at Q4 (Qwen 3 235B-A22B) | Large-model serving, team inference servers |
| H100 PCIe (80GB) | 80 GB | 70B at FP8; up to ~35B at FP16 | Datacenter inference, training |
| H200 (141GB) | 141 GB | 70B at FP8 with long context; Llama 4 Scout at FP8 (~109 GB) | Large-model production serving |
| B200 (180GB) | 180 GB | Up to ~80B dense at FP16 | Hyperscale training and serving |
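
The tier table can also be read programmatically. The sketch below checks which tiers cover a given weight footprint plus KV cache with a safety margin. The VRAM figures come from the table; the 10% `overhead_fraction` is an assumed margin, and treating a multi-GPU setup as a single memory pool ignores tensor-parallel overhead, so results for dual-GPU tiers are optimistic.

```python
# VRAM per tier in GB, copied from the table above.
GPU_VRAM_GB = {
    "RTX 4060 Ti 16GB": 16,
    "RTX 5090": 32,
    "Dual RTX 5090": 64,
    "RTX PRO 6000 Blackwell": 96,
    "Dual RTX PRO 6000": 192,
    "H100 PCIe": 80,
    "H200": 141,
    "B200": 180,
}


def gpus_that_fit(
    weights_gb: float, kv_cache_gb: float = 0.0, overhead_fraction: float = 0.10
) -> list[str]:
    """Return tiers whose VRAM covers weights + KV cache + margin, smallest first."""
    required_gb = (weights_gb + kv_cache_gb) * (1 + overhead_fraction)
    fitting = [name for name, vram in GPU_VRAM_GB.items() if vram >= required_gb]
    return sorted(fitting, key=GPU_VRAM_GB.get)


if __name__ == "__main__":
    # Example: a 70B model at Q4 (~40 GB) with an assumed 10 GB of KV cache.
    print(gpus_that_fit(40, kv_cache_gb=10))
```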
VRLA Tech builds systems at every tier in this table. Use our ROI calculator to compare 3-year on-premise cost against cloud GPU spend for your workload.
FAQ: LLM VRAM requirements 2026
How much VRAM do I need to run an LLM locally in 2026?
The minimum practical VRAM for running LLMs locally in 2026 depends on the model you want to run. 8GB VRAM runs 7B–8B models at Q4_K_M. 16–24GB runs 14B–32B models at Q4. 48–96GB runs 70B models at Q4 or FP8. For 70B inference on a single GPU, the RTX PRO 6000 Blackwell (96GB) is the only workstation GPU with sufficient VRAM and headroom for KV cache. VRLA Tech has built AI workstations in Los Angeles since 2016 and configures systems for every tier. Call 213-810-3013 or visit vrlatech.com.
How much VRAM does Llama 4 Scout need?
Llama 4 Scout (109B total parameters, 17B active MoE) requires approximately 55–60 GB of VRAM at Q4_K_M quantization. This fits on a single RTX PRO 6000 Blackwell (96GB), dual RTX 5090s (64GB combined), or a single H100 80GB at Q4. At FP16, Scout requires approximately 218GB. VRLA Tech builds Scout-ready workstations and servers pre-configured with Ollama, vLLM, and llama.cpp.
How much VRAM does DeepSeek V3 need?
DeepSeek V3 has 671B total parameters and 37B active per token (MoE). At FP8, weights alone require approximately 671GB VRAM. Full production deployment requires 8× H200 141GB (1,128GB combined). DeepSeek-R1-Distill-Qwen-32B needs approximately 20GB at Q4 — fits on an RTX 5090. DeepSeek-R1-Distill-Llama-8B needs approximately 5GB.
How much VRAM does Qwen 3 need?
Qwen 3 8B needs approximately 5GB at Q4_K_M. Qwen 3 14B needs approximately 9GB. Qwen 3 32B needs approximately 19–20GB, fitting on an RTX 5090. Qwen 3 30B-A3B (MoE) needs approximately 18GB at Q4, because all 30B parameters must be loaded even though only 3B are active per token. Qwen 3 72B needs approximately 40–45GB, requiring an RTX PRO 6000 Blackwell. Qwen 3 235B-A22B (MoE) needs approximately 120GB, fitting on 2× RTX PRO 6000 Blackwell.
What is the difference between Q4 and FP16 VRAM requirements?
FP16 stores each parameter as a 2-byte float. Q4_K_M compresses weights to approximately 0.5 bytes, nominally a 4× reduction. A 70B model at FP16 requires approximately 140GB. The same model at Q4_K_M requires approximately 38–40GB, so the reduction in practice is closer to 3.5×. Quality loss at Q4_K_M is typically 1–3% on standard benchmarks.
Where can I buy a custom AI workstation with enough VRAM to run large LLMs?
VRLA Tech builds custom AI workstations and GPU servers in Los Angeles configured for LLM inference. Systems are available with NVIDIA RTX PRO 6000 Blackwell (96GB), dual RTX PRO 6000 Blackwell (192GB), and multi-GPU EPYC servers scaling to 8× GPU setups. Every system ships with CUDA, PyTorch, vLLM, and Ollama pre-installed. 3-year parts warranty and lifetime US-based engineer support. Visit vrlatech.com or call 213-810-3013.
What GPU runs a 70B model on a single card?
The NVIDIA RTX PRO 6000 Blackwell (96GB ECC GDDR7) is the only workstation GPU that runs a 70B model on a single card with practical KV cache headroom. At Q4_K_M, a 70B model requires approximately 38–40GB, leaving approximately 56GB for context. At FP8, it requires approximately 70GB, leaving approximately 26GB for KV cache at standard context lengths. VRLA Tech builds RTX PRO 6000 Blackwell workstations in Los Angeles with lifetime US-based engineer support.
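The headroom figures above can be turned into a rough context budget. This sketch divides the VRAM left after weights by the per-token KV cache cost; the model dimensions (80 layers, 8 KV heads, head dimension 128) and the FP16 cache are assumptions for a 70B-class model, and the estimate ignores activation buffers and framework overhead, so real limits will be lower.

```python
def max_context_tokens(
    vram_gb: float,
    weights_gb: float,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # FP16 cache entries (assumption)
) -> int:
    """Estimate how many cached tokens fit in the VRAM left after weights."""
    headroom_bytes = (vram_gb - weights_gb) * 1e9
    # One key and one value vector per layer per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return max(0, int(headroom_bytes // bytes_per_token))


if __name__ == "__main__":
    # 96 GB card, 70B model at Q4 (~40 GB) and at FP8 (~70 GB).
    for label, weights in (("Q4", 40), ("FP8", 70)):
        tokens = max_context_tokens(96, weights, 80, 8, 128)
        print(f"{label}: ~{tokens:,} tokens across all concurrent sequences")
```

The token budget is shared across concurrent requests, so a multi-user server divides it by the number of active sequences.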
Built by the VRLA Tech engineering team in Los Angeles. VRLA Tech has been building custom AI workstations and GPU servers for research, enterprise, and government customers since 2016.




