LLM VRAM Requirements 2026: Every Major Model

By VRLA Tech · Reference Guide · June 2026 · Last verified: June 2026

VRAM is the single hard constraint in local LLM deployment. If your GPU has 24GB and your model needs 40GB, you get either an out-of-memory error or slow CPU offloading. These are the reference tables we use when configuring systems. They cover every major open-weight LLM in production use as of June 2026 at every quantization level that matters.

How VRAM requirements are calculated

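Model weights account for the largest share. FP16 is 2 bytes per parameter. FP8 is 1 byte. Q4_K_M is nominally about 0.5 bytes, but real files run somewhat larger because of per-block scaling metadata and because some tensors are kept at higher precision. A 70B model at FP16 requires approximately 140GB. The same model at Q4_K_M requires approximately 38–40GB, rather than the 35GB a flat 0.5 bytes per parameter would give.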
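
The weight arithmetic above can be sketched in a few lines of Python. This is a rough estimate only: the function name `estimate_weight_gb` and the flat bytes-per-parameter values are illustrative assumptions, and real quantized files tend to be somewhat larger than the nominal figure.

```python
def estimate_weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Rough weight footprint in GB (1 GB = 1e9 bytes). Weights only:
    no KV cache, activations, or runtime overhead."""
    return params_billions * bytes_per_param


# Nominal bytes-per-parameter values taken from the text above.
for label, bpp in [("FP16", 2.0), ("FP8", 1.0), ("Q4 (nominal)", 0.5)]:
    print(f"70B at {label}: ~{estimate_weight_gb(70, bpp):.0f} GB")
```

Treat the result as a lower bound and add headroom for the KV cache before choosing a GPU.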

KV cache stores the key-value pairs for every token in the context window. It grows linearly with context length and scales with batch size for multi-user serving. At 4K context, KV cache adds a few GB on top of weights. At 128K context, it can add tens of GB.
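
A minimal sketch of the usual KV cache estimate follows. The layer count, KV head count, and head dimension are assumed values for a 70B-class dense model, not figures from this guide; read the real ones from your model's configuration file. The cache is assumed to be stored in FP16 (2 bytes per value).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every token.
    The leading 2 counts one key tensor plus one value tensor per layer."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


# Assumed shape for illustration only (70B-class dense model).
SHAPE = dict(n_layers=80, n_kv_heads=8, head_dim=128)

for ctx in (4096, 131072):
    gb = kv_cache_bytes(context_tokens=ctx, **SHAPE) / 1e9
    print(f"{ctx:>7} tokens, batch 1: ~{gb:.1f} GB")
```

Because the result is linear in both `context_tokens` and `batch_size`, serving four concurrent users at the same context length needs four times the cache.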

MoE models break the simple multiplication rule. Mixture-of-Experts architectures like Llama 4, DeepSeek V3, and Qwen 3 30B-A3B have a total parameter count and a smaller active parameter count. During inference, only the active parameters run per token — but all expert weights must be loaded into VRAM. A model listed as “109B total, 17B active” requires VRAM for all 109B parameters.
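
The distinction between resident and active parameters can be made concrete with a short sketch. The function name `moe_footprint` is hypothetical, and the 0.5 bytes per parameter is the nominal Q4 figure, so the result is a lower bound on real file sizes.

```python
def moe_footprint(total_params_b, active_params_b, bytes_per_param):
    """Return (resident weight GB, fraction of weights used per token).
    VRAM is sized by the total parameter count; per-token compute
    scales with the active count."""
    resident_gb = total_params_b * bytes_per_param
    active_fraction = active_params_b / total_params_b
    return resident_gb, active_fraction


# "109B total, 17B active" at nominal Q4.
resident, fraction = moe_footprint(109, 17, 0.5)
print(f"Resident weights: ~{resident:.1f} GB; active per token: {fraction:.0%}")
```

The VRAM budget follows the first number, while generation speed follows the second.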

Quantization quick reference

Precision	Bytes/Param	Quality vs FP16	Notes
FP16	2.0	Baseline	Full quality; maximum VRAM cost
FP8	1.0	~99%	Supported by Blackwell and Hopper generation GPUs
Q8_0	~1.0	~99%	Software quantization; similar to FP8 in practice
Q4_K_M	~0.5	~97–99%	Optimal balance; recommended for most deployments
Q3_K_M	~0.375	~93–95%	Noticeable quality loss on reasoning tasks
Q2_K	~0.25	~85–90%	Significant degradation; only for extreme VRAM constraints
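
One way to use the table is to walk it from highest to lowest precision and stop at the first level that fits. The sketch below does that under simplifying assumptions: flat bytes-per-parameter values from the table, and a `reserve_gb` allowance for KV cache and runtime overhead that is a placeholder you should tune for your workload.

```python
# Nominal bytes per parameter, ordered from highest to lowest precision.
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "FP8": 1.0,
    "Q8_0": 1.0,
    "Q4_K_M": 0.5,
    "Q3_K_M": 0.375,
    "Q2_K": 0.25,
}


def best_precision(params_billions, vram_gb, reserve_gb=4.0):
    """Return the highest-precision format whose weights plus a reserve
    fit in vram_gb, or None if nothing fits."""
    for name, bpp in BYTES_PER_PARAM.items():
        if params_billions * bpp + reserve_gb <= vram_gb:
            return name
    return None


print(best_precision(70, vram_gb=96, reserve_gb=10))
print(best_precision(70, vram_gb=32))
```

A result below Q4_K_M is a signal to consider a smaller model rather than accept the quality loss listed in the table.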

Meta Llama 4 VRAM requirements

Variant	Architecture	Q4 VRAM	FP8 VRAM	FP16 VRAM	Minimum GPU
Llama 4 Scout	MoE, 16 experts	~55–60 GB	~109 GB	~218 GB	RTX PRO 6000 (96GB) at Q4
Llama 4 Maverick	MoE, 128 experts	~200 GB	~400 GB	~800 GB	4× H100 (320GB) at Q4

Llama 4 Scout is the workstation-deployable variant. At Q4, it fits on a single RTX PRO 6000 Blackwell (96GB) with roughly 36–41GB of headroom. Its 10M-token context window is the largest of any open model in 2026, but the KV cache grows with context length, so very long contexts will consume most of the VRAM left after model weights.

DeepSeek VRAM requirements

Variant	Architecture	Q4 VRAM	Minimum GPU
DeepSeek-R1-Distill-Llama-8B	Dense	~5 GB	RTX 5090 (32GB)
DeepSeek-R1-Distill-Qwen-14B	Dense	~9 GB	RTX 5090 (32GB)
DeepSeek-R1-Distill-Qwen-32B	Dense	~20 GB	RTX 5090 (32GB)
DeepSeek V3 (671B) at Q4	MoE	~336 GB	4× H200 minimum
DeepSeek V3 (671B) at FP8	MoE	~700 GB	8× H200 (1,128GB)

For most enterprise teams, the 32B distill at Q4 is the practical choice: approximately 20GB VRAM, runs on an RTX 5090, and benchmarks strongly on coding and reasoning tasks.

Qwen 3 VRAM requirements

Variant	Architecture	Q4_K_M VRAM	Q8 VRAM	Minimum GPU
Qwen 3 8B	Dense	~5 GB	~9 GB	RTX 5090 (32GB)
Qwen 3 14B	Dense	~9 GB	~15 GB	RTX 5090 (32GB)
Qwen 3 30B-A3B	MoE	~17–19 GB	~32 GB	RTX 5090 (32GB)
Qwen 3 32B	Dense	~19–20 GB	~35 GB	RTX 5090 (32GB)
Qwen 3 72B	Dense	~40–45 GB	~75 GB	RTX PRO 6000 (96GB)
Qwen 3 235B-A22B	MoE	~120 GB	~240 GB	2× RTX PRO 6000 (192GB)

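Qwen 3 30B-A3B is efficient in compute rather than in VRAM. Only about 3B parameters are active per token, so it typically decodes at close to small-model speed, but all 30B parameters must still be resident, which puts Q4 weights at roughly 17–19GB. The model uses Grouped Query Attention, which limits KV cache growth: expanding context from 8K to 64K adds only approximately 1.2GB. A 30B-class model that fits on a single 32GB GPU with room left for long context is a strong option for developer workstations.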
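
Grouped Query Attention shrinks the KV cache by sharing each key/value head across several query heads. The sketch below compares per-token cache size with and without that sharing. The layer and head counts are hypothetical values chosen for illustration, not a published model configuration, and the cache is assumed to be FP16.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes per token: keys plus values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Hypothetical shape, for illustration only.
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM = 48, 32, 4, 128

# Without GQA every query head has its own key/value head.
mha = kv_bytes_per_token(LAYERS, QUERY_HEADS, HEAD_DIM)
# With GQA only KV_HEADS key/value heads are cached.
gqa = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)

print(f"Full multi-head cache: {mha / 1024:.0f} KiB per token")
print(f"GQA cache:             {gqa / 1024:.0f} KiB per token")
print(f"Reduction:             {mha // gqa}x")
```

The saving equals the ratio of query heads to KV heads, and it applies to every token of context, which is why it matters most at long context lengths.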

Gemma 4 VRAM requirements

Variant	Architecture	Q4 VRAM	FP16 VRAM	Minimum GPU
Gemma 4 E2B	MoE	~4 GB	~10 GB	8GB GPU
Gemma 4 E4B	MoE	~6 GB	~16 GB	RTX 3060 12GB
Gemma 4 26B-A4B	MoE	~16–18 GB	~52 GB	RTX 5090 (32GB)
Gemma 4 31B	Dense	~17–20 GB	~62 GB	RTX 5090 (32GB)

All Gemma 4 variants use Apache 2.0 licensing — fully permissive for commercial use with no revenue restrictions.

Llama 3.x VRAM requirements

Variant	Architecture	Q4_K_M VRAM	FP8 VRAM	Minimum GPU
Llama 3.2 3B	Dense	~2.5 GB	~3 GB	8GB GPU
Llama 3.1 8B	Dense	~4.5 GB	~8 GB	RTX 5090 (32GB)
Llama 3.2 11B Vision	Dense	~7 GB	~11 GB	RTX 5090 (32GB)
Llama 3.3 70B	Dense	~38–40 GB	~70 GB	RTX PRO 6000 (96GB)
Llama 3.2 90B Vision	Dense	~50 GB	~90 GB	RTX PRO 6000 (96GB)

GPU tiers and what they run

GPU	VRAM	Largest models that fit	Best For
RTX 4060 Ti 16GB	16 GB	Up to ~14B	Entry-level local inference
RTX 5090	32 GB	Up to ~32B	Developer workstations, 7B–32B inference
Dual RTX 5090	64 GB	Llama 4 Scout at Q4 (~55–60 GB, limited KV cache headroom)	Researchers, multi-model setups
RTX PRO 6000 Blackwell	96 GB	Up to 70B at FP8; 70B at Q4 comfortably	Production 70B inference, QLoRA fine-tuning
Dual RTX PRO 6000	192 GB	Up to 235B MoE (Qwen 3 235B-A22B)	Large-model serving, team inference servers
H100 PCIe (80GB)	80 GB	70B at FP8; ~35B at FP16	Datacenter inference, training
H200 (141GB)	141 GB	70B at FP8 with long context; Llama 4 Scout at FP8	Large-model production serving
B200 (180GB)	180 GB	Up to ~80B dense at FP16; ~160B at FP8	Hyperscale training and serving
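
The tier table can also be queried programmatically. This is a minimal sketch: the VRAM figures are copied from the table, while the function name `gpus_that_fit` and the default 8GB headroom are assumptions. It treats the combined VRAM of a dual-GPU setup as one pool, which ignores the per-GPU overhead of splitting a model.

```python
# VRAM per configuration in GB, copied from the tier table.
GPU_VRAM_GB = {
    "RTX 4060 Ti 16GB": 16,
    "RTX 5090": 32,
    "Dual RTX 5090": 64,
    "RTX PRO 6000 Blackwell": 96,
    "Dual RTX PRO 6000": 192,
    "H100 PCIe": 80,
    "H200": 141,
    "B200": 180,
}


def gpus_that_fit(weights_gb, headroom_gb=8.0):
    """List configurations with room for the weights plus headroom,
    ordered from smallest to largest VRAM."""
    need = weights_gb + headroom_gb
    fits = [(vram, name) for name, vram in GPU_VRAM_GB.items() if vram >= need]
    return [name for vram, name in sorted(fits)]


# A 70B model at Q4 (~40 GB of weights).
print(gpus_that_fit(40))
```

Raise `headroom_gb` for long contexts or multi-user serving, where the KV cache can rival the weights in size.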

VRLA Tech builds systems at every tier in this table. Use our ROI calculator to compare 3-year on-premise cost against cloud GPU spend for your workload.

Not sure which GPU fits your model?

Tell us your model, quantization target, and expected concurrent users. VRLA Tech engineers will recommend the right configuration and send a firm quote within one business day.

Custom AI workstations and GPU servers built for LLM inference

Built in Los Angeles since 2016. 3-year parts warranty and lifetime US-based engineer support on every system.

FAQ: LLM VRAM requirements 2026

How much VRAM do I need to run an LLM locally in 2026?

The VRAM you need to run LLMs locally in 2026 depends on the model. 8GB runs 7B–8B models at Q4_K_M. 16–24GB runs 14B–32B models at Q4. 48–96GB runs 70B models at Q4 or FP8. For 70B inference on a single GPU, the RTX PRO 6000 Blackwell (96GB) is the only workstation GPU with sufficient VRAM and headroom for KV cache. VRLA Tech has built AI workstations for every tier in Los Angeles since 2016. Call 213-810-3013 or visit vrlatech.com.

How much VRAM does Llama 4 Scout need?

Llama 4 Scout (109B total parameters, 17B active, MoE) requires approximately 55–60GB of VRAM at Q4_K_M quantization. That fits on a single RTX PRO 6000 Blackwell (96GB) or a single H100 80GB. Dual RTX 5090s (64GB combined) can hold the weights but leave only a few GB for KV cache. At FP16, Scout requires approximately 218GB. VRLA Tech builds Scout-ready workstations and servers pre-configured with Ollama, vLLM, and llama.cpp.

How much VRAM does DeepSeek V3 need?

DeepSeek V3 has 671B total parameters and 37B active per token (MoE). At FP8, weights alone require approximately 671GB of VRAM. Full production deployment requires 8× H200 141GB (1,128GB combined). DeepSeek-R1-Distill-Qwen-32B needs approximately 20GB at Q4 and fits on an RTX 5090. DeepSeek-R1-Distill-Llama-8B needs approximately 5GB.

How much VRAM does Qwen 3 need?

Qwen 3 8B needs approximately 5GB at Q4_K_M. Qwen 3 14B needs approximately 9GB. Qwen 3 32B needs approximately 19–20GB, fitting on an RTX 5090. Qwen 3 30B-A3B (MoE) needs approximately 17–19GB at Q4, because all 30B parameters must be loaded even though only about 3B are active per token. Qwen 3 72B needs approximately 40–45GB, requiring an RTX PRO 6000 Blackwell. Qwen 3 235B-A22B (MoE) needs approximately 120GB, fitting on 2× RTX PRO 6000 Blackwell (192GB combined).

What is the difference between Q4 and FP16 VRAM requirements?

FP16 stores each parameter as a 2-byte float. Q4_K_M compresses weights to approximately 0.5 bytes per parameter, roughly a 3.5–4× reduction in practice. A 70B model at FP16 requires approximately 140GB. The same model at Q4_K_M requires approximately 38–40GB. Quality loss at Q4_K_M is typically 1–3% on standard benchmarks. Going below Q4 (Q3, Q2) produces noticeable reasoning degradation.

Where can I buy a custom AI workstation with enough VRAM to run large LLMs?

VRLA Tech builds custom AI workstations and GPU servers in Los Angeles configured for LLM inference. Systems are available with NVIDIA RTX PRO 6000 Blackwell (96GB), dual RTX PRO 6000 Blackwell (192GB), and multi-GPU EPYC servers scaling to 8× GPU setups. Every system ships with CUDA, PyTorch, vLLM, and Ollama pre-installed. 3-year parts warranty and lifetime US-based engineer support. Visit vrlatech.com or call 213-810-3013.

What GPU runs a 70B model on a single card?

The NVIDIA RTX PRO 6000 Blackwell (96GB ECC GDDR7) is the only workstation GPU that runs a 70B model on a single card with practical KV cache headroom. At Q4_K_M, a 70B model requires approximately 38–40GB, leaving approximately 56GB for context. At FP8, it requires approximately 70GB, leaving approximately 26GB for KV cache at standard context lengths. VRLA Tech builds RTX PRO 6000 Blackwell workstations in Los Angeles with lifetime US-based engineer support.
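
The headroom figures above can be turned into an approximate context budget. This sketch assumes an FP16 KV cache and a 70B-class shape (80 layers, 8 KV heads, head dimension 128). Those shape values are assumptions for illustration, and the estimate ignores activations and runtime overhead, so real limits will be lower.

```python
def max_context_tokens(vram_gb, weights_gb, kv_bytes_per_token):
    """Tokens of KV cache that fit in the VRAM left after weights
    (total across all concurrent sequences)."""
    free_bytes = int((vram_gb - weights_gb) * 1e9)
    return max(free_bytes, 0) // kv_bytes_per_token


# Assumed shape: keys + values * layers * KV heads * head dim * 2 bytes.
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2

for label, weights_gb in [("Q4_K_M", 40), ("FP8", 70)]:
    tokens = max_context_tokens(96, weights_gb, KV_BYTES_PER_TOKEN)
    print(f"70B at {label} on 96 GB: ~{tokens:,} tokens of KV cache")
```

The budget is shared across users: eight concurrent sessions each get one eighth of the total.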

Built by the VRLA Tech engineering team in Los Angeles. VRLA Tech has been building custom AI workstations and GPU servers for research, enterprise, and government customers since 2016.
