VRAM is the single hard constraint in local LLM deployment. If your GPU has 24GB and your model needs 40GB, you get an out-of-memory error or slow CPU offloading. This is the reference table we use when configuring systems — covering every major open-weight LLM in production use as of June 2026 at every quantization level that matters.


How VRAM requirements are calculated

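Model weights account for the largest share of VRAM. FP16 uses 2 bytes per parameter, FP8 uses 1 byte, and Q4_K_M uses approximately 0.5 bytes. A 70B model at FP16 therefore requires approximately 140GB (70 billion × 2 bytes). The same model at Q4_K_M requires approximately 38–40GB rather than the nominal 35GB, because Q4_K_M stores per-block scale factors and keeps some tensors at higher precision, which puts its effective size slightly above 0.5 bytes per parameter.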
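
The multiplication above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function name `weight_vram_gb` is our own, and it counts weights only, with no KV cache or runtime overhead.

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate VRAM for model weights only, in decimal GB.

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return params_billion * bytes_per_param


# Nominal bytes-per-parameter figures quoted in the paragraph above.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "Q4_K_M": 0.5}

for name, bpp in BYTES_PER_PARAM.items():
    print(f"70B at {name}: ~{weight_vram_gb(70, bpp):.0f} GB")
# 70B at FP16: ~140 GB
# 70B at FP8: ~70 GB
# 70B at Q4_K_M: ~35 GB
```

Treat the result as a floor. Real quantized files can come in somewhat above the nominal figure, which is why the 70B Q4_K_M number used throughout this article is 38–40GB.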

KV cache stores the key-value pairs for every token in the context window. It grows linearly with context length and scales with batch size for multi-user serving. At 4K context, KV cache adds a few GB on top of weights. At 128K context, it can add tens of GB.
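
A rough way to see the linear growth is to compute the cache size directly. The sketch below assumes an FP16 cache and a 70B-class architecture with grouped-query attention (80 layers, 8 KV heads, head dimension 128). Those defaults are illustrative assumptions, so substitute the values from your model's config.

```python
def kv_cache_bytes(
    context_tokens: int,
    batch_size: int = 1,
    num_layers: int = 80,      # assumed: 70B-class model
    num_kv_heads: int = 8,     # assumed: grouped-query attention
    head_dim: int = 128,       # assumed
    bytes_per_value: int = 2,  # FP16 cache
) -> int:
    """KV cache size in bytes: one key and one value vector per layer, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


def to_gib(num_bytes: int) -> float:
    return num_bytes / 2**30


print(to_gib(kv_cache_bytes(4096)))                  # 1.25
print(to_gib(kv_cache_bytes(131072)))                # 40.0
print(to_gib(kv_cache_bytes(4096, batch_size=8)))    # 10.0
```

Under these assumptions a single 4K sequence costs about 1.25 GiB, eight concurrent 4K sequences cost 10 GiB, and one 128K sequence costs 40 GiB, which is where the "few GB" and "tens of GB" figures come from.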

MoE models break the simple multiplication rule. Mixture-of-Experts architectures like Llama 4, DeepSeek V3, and Qwen 3 30B-A3B have a total parameter count and a smaller active parameter count. During inference, only the active parameters run per token — but all expert weights must be loaded into VRAM. A model listed as “109B total, 17B active” requires VRAM for all 109B parameters.
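
To make the distinction concrete, the sketch below sizes memory from the total parameter count and reports the active share separately. The `MoEModel` class is a hypothetical helper written for this example, and the numbers are the "109B total, 17B active" case from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class MoEModel:
    total_params_b: float   # all experts, in billions
    active_params_b: float  # parameters used per token, in billions

    def weights_gb(self, bytes_per_param: float) -> float:
        """VRAM for weights is driven by TOTAL parameters, not active ones."""
        return self.total_params_b * bytes_per_param

    def active_fraction(self) -> float:
        """Share of the weights that actually run for each token."""
        return self.active_params_b / self.total_params_b


model = MoEModel(total_params_b=109, active_params_b=17)
print(model.weights_gb(1.0))               # 109.0 (FP8)
print(model.weights_gb(2.0))               # 218.0 (FP16)
print(round(model.active_fraction(), 3))   # 0.156
```

The active count tells you roughly how fast each token is computed. It does not reduce the amount of VRAM you have to buy.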

Quantization quick reference

| Precision | Bytes/Param | Quality vs FP16 | Notes |
|---|---|---|---|
| FP16 | 2.0 | Baseline | Full quality; maximum VRAM cost |
| FP8 | 1.0 | ~99% | Supported by Blackwell and Hopper generation GPUs |
| Q8_0 | ~1.0 | ~99% | Software quantization; similar to FP8 in practice |
| Q4_K_M | ~0.5 | ~97–99% | Optimal balance; recommended for most deployments |
| Q3_K_M | ~0.375 | ~93–95% | Noticeable quality loss on reasoning tasks |
| Q2_K | ~0.25 | ~85–90% | Significant degradation; only for extreme VRAM constraints |
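
One way to use the table is to walk it from highest to lowest precision and stop at the first level that fits your card. The sketch below does that with the nominal bytes-per-parameter values from the table. The function name `best_precision` and the 8 GB default headroom for KV cache and runtime overhead are assumptions made for illustration, not measured values.

```python
from typing import Optional

# Nominal bytes per parameter from the table, ordered from highest precision down.
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "FP8": 1.0,
    "Q8_0": 1.0,
    "Q4_K_M": 0.5,
    "Q3_K_M": 0.375,
    "Q2_K": 0.25,
}


def best_precision(
    params_billion: float, vram_gb: float, headroom_gb: float = 8.0
) -> Optional[str]:
    """Return the highest precision whose weights plus headroom fit in VRAM."""
    for name, bpp in BYTES_PER_PARAM.items():
        if params_billion * bpp + headroom_gb <= vram_gb:
            return name
    return None  # nothing fits, even at Q2_K


print(best_precision(70, 96))  # FP8
print(best_precision(32, 32))  # Q4_K_M
print(best_precision(70, 8))   # None
```

Raise `headroom_gb` for long contexts or multi-user serving, since KV cache grows with both.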

Meta Llama 4 VRAM requirements

| Variant | Architecture | Q4 VRAM | FP8 VRAM | FP16 VRAM | Minimum GPU |
|---|---|---|---|---|---|
| Llama 4 Scout | MoE, 16 experts | ~55–60 GB | ~109 GB | ~218 GB | RTX PRO 6000 (96GB) at Q4 |
| Llama 4 Maverick | MoE, 128 experts | ~200 GB | ~400 GB | ~800 GB | 4× H100 (320GB) at Q4 |

Llama 4 Scout is the workstation-deployable variant. At Q4, it fits on a single RTX PRO 6000 Blackwell (96GB) with roughly 36–41GB of headroom. Its 10M-token context window is the largest of any open model in 2026, but that headroom goes quickly: KV cache grows linearly with context length, so very long contexts will use most of the VRAM left after the weights are loaded.


DeepSeek VRAM requirements

| Variant | Architecture | Precision | VRAM | Minimum GPU |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Llama-8B | Dense | Q4 | ~5 GB | RTX 5090 (32GB) |
| DeepSeek-R1-Distill-Qwen-14B | Dense | Q4 | ~9 GB | RTX 5090 (32GB) |
| DeepSeek-R1-Distill-Qwen-32B | Dense | Q4 | ~20 GB | RTX 5090 (32GB) |
| DeepSeek V3 (671B) | MoE | Q4 | ~336 GB | 4× H200 minimum |
| DeepSeek V3 (671B) | MoE | FP8 | ~700 GB | 8× H200 (1,128GB) |

For most enterprise teams, the 32B distill at Q4 is the practical choice: approximately 20GB VRAM, runs on an RTX 5090, and benchmarks strongly on coding and reasoning tasks.


Qwen 3 VRAM requirements

| Variant | Architecture | Q4_K_M VRAM | Q8 VRAM | Minimum GPU |
|---|---|---|---|---|
| Qwen 3 8B | Dense | ~5 GB | ~9 GB | RTX 5090 (32GB) |
| Qwen 3 14B | Dense | ~9 GB | ~15 GB | RTX 5090 (32GB) |
| Qwen 3 30B-A3B | MoE | ~18–19 GB | ~32 GB | RTX 5090 (32GB) |
| Qwen 3 32B | Dense | ~19–20 GB | ~35 GB | RTX 5090 (32GB) |
| Qwen 3 72B | Dense | ~40–45 GB | ~75 GB | RTX PRO 6000 (96GB) |
| Qwen 3 235B-A22B | MoE | ~120 GB | ~240 GB | 2× RTX PRO 6000 (192GB) |

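Qwen 3 30B-A3B is the most compute-efficient serious reasoning model in this table. Only about 3B of its 30B parameters are active per token, so it generates at speeds closer to a small dense model. All expert weights must still be resident in VRAM, so at Q4 it needs roughly 18–19GB, not the footprint of a 3B model. It also uses Grouped Query Attention, which limits KV cache growth: expanding context from 8K to 64K adds only approximately 1.2GB. With expert weights offloaded to system RAM it can run on much smaller GPUs at reduced speed, which makes it attractive for edge and embedded deployments.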


Gemma 4 VRAM requirements

| Variant | Architecture | Q4 VRAM | FP16 VRAM | Minimum GPU |
|---|---|---|---|---|
| Gemma 4 E2B | MoE | ~4 GB | ~10 GB | 8GB GPU |
| Gemma 4 E4B | MoE | ~6 GB | ~16 GB | RTX 3060 12GB |
| Gemma 4 26B-A4B | MoE | ~16–18 GB | ~52 GB | RTX 5090 (32GB) |
| Gemma 4 31B | Dense | ~17–20 GB | ~62 GB | RTX 5090 (32GB) |

All Gemma 4 variants use Apache 2.0 licensing — fully permissive for commercial use with no revenue restrictions.


Llama 3.x VRAM requirements

| Variant | Architecture | Q4_K_M VRAM | FP8 VRAM | Minimum GPU |
|---|---|---|---|---|
| Llama 3.2 3B | Dense | ~2.5 GB | ~3 GB | 8GB GPU |
| Llama 3.1 8B | Dense | ~4.5 GB | ~8 GB | RTX 5090 (32GB) |
| Llama 3.2 11B Vision | Dense | ~7 GB | ~11 GB | RTX 5090 (32GB) |
| Llama 3.3 70B | Dense | ~38–40 GB | ~70 GB | RTX PRO 6000 (96GB) |
| Llama 3.2 90B Vision | Dense | ~50 GB | ~90 GB | RTX PRO 6000 (96GB) |

GPU tiers and what they run

| GPU | VRAM | Models That Fit | Best For |
|---|---|---|---|
| RTX 4060 Ti 16GB | 16 GB | Up to ~14B at Q4 | Entry-level local inference |
| RTX 5090 | 32 GB | Up to ~32B at Q4 | Developer workstations, 7B–32B inference |
| Dual RTX 5090 | 64 GB | Up to ~60 GB of Q4 weights (Llama 4 Scout) | Researchers, multi-model setups |
| RTX PRO 6000 Blackwell | 96 GB | 70B at FP8; 70B at Q4 comfortably | Production 70B inference, QLoRA fine-tuning |
| Dual RTX PRO 6000 | 192 GB | Up to 235B MoE at Q4 (Qwen 3 235B-A22B) | Large-model serving, team inference servers |
| H100 PCIe (80GB) | 80 GB | 70B at FP8; ~35B at FP16 | Datacenter inference, training |
| H200 (141GB) | 141 GB | 70B at FP8 with long-context headroom; Llama 4 Scout at FP8 (~109 GB) | Large-model production serving |
| B200 (180GB) | 180 GB | Up to ~80B dense at FP16; ~160B at FP8 | Hyperscale training and serving |
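
If you already know how much VRAM your model needs, the tier lookup can be sketched as a simple search over the VRAM column of this table. The function name `smallest_tier` is our own. The sketch also assumes that the combined VRAM of a dual-GPU setup is fully usable, which ignores the overhead of splitting a model across cards.

```python
from typing import Optional

# (GPU tier, total VRAM in GB), taken from the table above.
GPU_TIERS = [
    ("RTX 4060 Ti 16GB", 16),
    ("RTX 5090", 32),
    ("Dual RTX 5090", 64),
    ("RTX PRO 6000 Blackwell", 96),
    ("Dual RTX PRO 6000", 192),
    ("H100 PCIe", 80),
    ("H200", 141),
    ("B200", 180),
]


def smallest_tier(required_gb: float) -> Optional[str]:
    """Return the tier with the least VRAM that still covers required_gb."""
    fitting = [(vram, name) for name, vram in GPU_TIERS if vram >= required_gb]
    if not fitting:
        return None  # needs a larger multi-GPU server
    return min(fitting)[1]


print(smallest_tier(20))   # RTX 5090
print(smallest_tier(40))   # Dual RTX 5090
print(smallest_tier(70))   # H100 PCIe
print(smallest_tier(500))  # None
```

Pass in weights plus your expected KV cache, not weights alone, or the result will be a tier too small for real workloads.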

VRLA Tech builds systems at every tier in this table. Use our ROI calculator to compare 3-year on-premise cost against cloud GPU spend for your workload.

FAQ: LLM VRAM requirements 2026

How much VRAM do I need to run an LLM locally in 2026?

The minimum practical VRAM depends on the model you want to run. 8GB runs 7B–8B models at Q4_K_M. 16–24GB runs 14B–32B models at Q4. 48GB runs 70B models at Q4, and 80–96GB runs them at FP8. For 70B inference on a single GPU, the RTX PRO 6000 Blackwell (96GB) is the only workstation GPU with sufficient VRAM and headroom for KV cache. VRLA Tech has built AI workstations for every tier in Los Angeles since 2016. Call 213-810-3013 or visit vrlatech.com.

How much VRAM does Llama 4 Scout need?

Llama 4 Scout (109B total parameters, 17B active MoE) requires approximately 55–60 GB of VRAM at Q4_K_M quantization. This fits on a single RTX PRO 6000 Blackwell (96GB), dual RTX 5090s (64GB combined), or a single H100 80GB at Q4. At FP16, Scout requires approximately 218GB. VRLA Tech builds Scout-ready workstations and servers pre-configured with Ollama, vLLM, and llama.cpp.

How much VRAM does DeepSeek V3 need?

DeepSeek V3 has 671B total parameters and 37B active per token (MoE). At FP8, weights alone require approximately 671GB of VRAM, or roughly 700GB once runtime overhead is included. Full production deployment requires 8× H200 141GB (1,128GB combined). For smaller budgets, DeepSeek-R1-Distill-Qwen-32B needs approximately 20GB at Q4 and fits on an RTX 5090, and DeepSeek-R1-Distill-Llama-8B needs approximately 5GB.

How much VRAM does Qwen 3 need?

Qwen 3 8B needs approximately 5GB at Q4_K_M. Qwen 3 14B needs approximately 9GB. Qwen 3 32B needs approximately 19–20GB, fitting on an RTX 5090. Qwen 3 30B-A3B (MoE) needs approximately 18–19GB at Q4, because all 30B parameters must be loaded even though only 3B are active per token. Qwen 3 72B needs approximately 40–45GB, requiring an RTX PRO 6000 Blackwell. Qwen 3 235B-A22B (MoE) needs approximately 120GB, fitting on 2× RTX PRO 6000 Blackwell.

What is the difference between Q4 and FP16 VRAM requirements?

FP16 stores each parameter as a 2-byte float. Q4_K_M compresses weights to approximately 0.5 bytes per parameter, a reduction of roughly 3.5–4× in practice. A 70B model at FP16 requires approximately 140GB. The same model at Q4_K_M requires approximately 38–40GB. Quality loss at Q4_K_M is typically 1–3% on standard benchmarks.

Where can I buy a custom AI workstation with enough VRAM to run large LLMs?

VRLA Tech builds custom AI workstations and GPU servers in Los Angeles configured for LLM inference. Systems are available with NVIDIA RTX PRO 6000 Blackwell (96GB), dual RTX PRO 6000 Blackwell (192GB), and multi-GPU EPYC servers scaling to 8× GPU setups. Every system ships with CUDA, PyTorch, vLLM, and Ollama pre-installed. 3-year parts warranty and lifetime US-based engineer support. Visit vrlatech.com or call 213-810-3013.

What GPU runs a 70B model on a single card?

The NVIDIA RTX PRO 6000 Blackwell (96GB ECC GDDR7) is the only workstation GPU that runs a 70B model on a single card with practical KV cache headroom. At Q4_K_M, a 70B model requires approximately 38–40GB, leaving approximately 56GB for context. At FP8, it requires approximately 70GB, leaving approximately 26GB for KV cache at standard context lengths. VRLA Tech builds RTX PRO 6000 Blackwell workstations in Los Angeles with lifetime US-based engineer support.
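
To turn that headroom into a rough context budget, divide it by the KV cache cost per token. The sketch below is an estimate under stated assumptions: an FP16 cache and a 70B-class architecture with 80 layers, 8 KV heads, and head dimension 128, which works out to 327,680 bytes per token. The function name and default are ours, and the sketch ignores activation memory and runtime overhead.

```python
# Assumed KV cost per token: 2 (K and V) * 80 layers * 8 KV heads * 128 dims * 2 bytes.
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2  # 327,680 bytes


def max_context_tokens(
    total_vram_gb: int,
    weights_gb: int,
    kv_bytes_per_token: int = KV_BYTES_PER_TOKEN,
) -> int:
    """Upper bound on total cached tokens (across all sequences) after weights load."""
    headroom_bytes = (total_vram_gb - weights_gb) * 10**9
    if headroom_bytes <= 0:
        return 0
    return headroom_bytes // kv_bytes_per_token


print(max_context_tokens(96, 40))  # 170898 tokens at Q4_K_M
print(max_context_tokens(96, 70))  # 79345 tokens at FP8
```

The budget is shared by every concurrent request, so eight users at FP8 would get roughly 10K tokens each under these assumptions.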


Built by the VRLA Tech engineering team in Los Angeles. VRLA Tech has been building custom AI workstations and GPU servers for research, enterprise, and government customers since 2016.
