How Much VRAM Do I Need for AI?
VRAM is the single most important spec for AI hardware. Get it wrong and the model either fails to load or runs at a fraction of its potential. Get it right and the GPU becomes a long-lived asset. This guide walks through the actual tiers, the math behind them, and what models fit where.
Why VRAM Matters More Than CUDA Cores
VRAM determines what you can run. Compute determines how fast. A model that exceeds available VRAM either refuses to load or spills layers to system RAM with a 5-10x speed penalty. The correct buying priority is: VRAM capacity, then memory bandwidth, then compute throughput.
For AI inference, the GPU is mostly moving weights through tensor cores. If the weights do not fit in VRAM, none of the other specs matter.
The VRAM Math, Briefly
Total VRAM consumption is the sum of three things:
- Model weights. Roughly equal to (parameters × bytes per parameter). At FP16 that is 2 bytes per parameter; at Q4 it is roughly 0.5 bytes per parameter.
- KV cache. Grows linearly with context length. Small at short contexts, a major share of the budget at 32K+.
- Overhead. CUDA runtime, framework buffers, activations during inference. Budget 1-3GB on top of the model.
A useful rule of thumb: at Q4_K_M quantization, a model needs roughly parameters_in_billions × 0.6 GB of VRAM for weights plus a context-dependent KV cache.
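The three-part sum above can be sketched in a few lines of Python. This is a minimal estimate, not a measurement: the function names and the 2 GB default overhead are illustrative assumptions (the text only gives a 1-3GB range), and sizes are in decimal GB.

```python
def estimate_vram_gb(params_billions, bytes_per_param, kv_cache_gb=0.0, overhead_gb=2.0):
    """Rough total VRAM: weights + KV cache + overhead, in decimal GB.

    overhead_gb defaults to 2.0, an assumed midpoint of the 1-3GB range.
    """
    # 1 billion parameters at 1 byte each is about 1 GB of weights.
    weights_gb = params_billions * bytes_per_param
    return weights_gb + kv_cache_gb + overhead_gb


def q4_rule_of_thumb_gb(params_billions):
    """Weights-only estimate at Q4_K_M using the 0.6 GB per billion rule."""
    return params_billions * 0.6


if __name__ == "__main__":
    # 70B at FP16 (2 bytes per parameter), weights only.
    print(estimate_vram_gb(70, 2.0, kv_cache_gb=0.0, overhead_gb=0.0))
    # 70B at Q4_K_M by the rule of thumb, weights only.
    print(round(q4_rule_of_thumb_gb(70), 1))
```

Treat the result as a lower bound for planning; real usage depends on the runtime and the context length actually served.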
The VRAM Tiers
24GB Tier — Entry Professional
Cards: NVIDIA RTX PRO 4000 Blackwell (24GB GDDR7 ECC), RTX 4090 (24GB GDDR6X, consumer), RTX 5000 Ada Generation (32GB, slightly above this tier's nominal capacity).
What fits: 7B to 13B parameter models at Q4 or Q8 with comfortable headroom. 24B to 30B models at aggressive Q4 quantization with limited context. KV cache for long contexts on smaller models.
Right for: Local LLM development on 7-13B models, single-user prototyping, CAD and rendering workloads, smaller computer vision models. The minimum tier for any serious AI work.
Not enough for: 70B models at usable quality, fine-tuning beyond 7B, multi-user inference.
48GB Tier — Mid Professional
Cards: NVIDIA RTX 6000 Ada Generation (48GB GDDR6 ECC), NVIDIA L40S (48GB GDDR6 ECC).
What fits: 70B models at Q4_K_M quantization with modest context (Llama 3.1 70B at Q4 uses ~40-43GB). 30B class models at Q8 with long context. LoRA fine-tuning on 13B to 30B models. Production single-GPU inference for 30B class models.
Right for: Serious local LLM work on 70B class models, mid-size fine-tuning, ISV-certified visualization workloads, single-user enterprise AI development.
Not enough for: 70B at Q8 with long context, full fine-tuning of 70B+, frontier models.
96GB Tier — Top Professional Workstation
Cards: NVIDIA RTX PRO 6000 Blackwell, both Workstation Edition and Server Edition (96GB GDDR7 ECC).
What fits: 70B models at Q4 with long context (32K+) comfortably. 70B at Q8 with moderate context. LoRA and QLoRA fine-tuning on 70B class models. Up to 32B models at FP16 for full-precision experiments. Multiple concurrent inference workloads via MIG partitioning.
Right for: Top-tier single-GPU workstations for AI development, single-card 70B production inference, fine-tuning on 70B class models with adapter methods, agentic AI development with long context.
Notable limit: The RTX PRO 6000 Blackwell is PCIe-only — no NVLink. Multi-GPU configurations communicate over PCIe Gen 5 x16, which is enough for tensor parallelism but slower than NVLink for training-scale workloads.
80GB to 192GB Tier — Datacenter HBM
Cards: H100 SXM (80GB HBM3), H200 SXM (141GB HBM3e), B200 (180-192GB HBM3e).
What fits: H200 at 141GB can hold the FP16 weights of Llama 70B (~140GB) on a single GPU, though that leaves almost nothing for KV cache and runtime overhead. B200 at 180-192GB holds the same with room to spare. With NVLink (900GB/s on H100/H200, faster on B200), multi-GPU tensor parallelism scales efficiently for training and large-batch inference.
Right for: Production inference serving at scale, full pre-training and full fine-tuning of 70B+ models, frontier research, large multi-tenant deployments.
The tradeoff: Datacenter form factor only. These cards are designed for rack servers with high airflow and 700W to 1000W power per GPU. They are not workstation cards.
Model-to-VRAM Quick Reference
| Model class | Q4_K_M VRAM | Q8_0 VRAM | FP16 VRAM | Practical tier |
|---|---|---|---|---|
| 7B (Mistral, Llama 3.1 8B) | ~5 GB | ~9 GB | ~16 GB | 24 GB |
| 13B (Llama 2 13B class) | ~8 GB | ~14 GB | ~26 GB | 24 GB |
| 30B-34B | ~20 GB | ~36 GB | ~70 GB | 24-48 GB |
| 70B (Llama 3.1 70B) | ~43 GB | ~75 GB | ~140 GB | 48-96 GB |
| 405B (Llama 3.1 405B) | ~230 GB | ~410 GB | ~810 GB | Multi-GPU HBM |
Numbers exclude KV cache and runtime overhead. Add 1-3GB for short context and 3-8GB for moderately long context. At 32K and beyond on 70B class models, the KV cache alone can reach 10GB or more.
Quantization Tradeoffs
Quantization is how you fit big models into smaller VRAM budgets. The cost is quality, and the curve is not linear.
| Quantization | VRAM vs FP16 | Quality impact |
|---|---|---|
| FP16 | 100% | Native, no loss |
| Q8_0 | ~50% | Near-lossless |
| Q5_K_M | ~35% | Very close to Q8 |
| Q4_K_M | ~28% | ~5% quality loss, sweet spot |
| Q3_K_M | ~22% | Noticeable degradation |
| Q2_K | ~18% | Significant degradation |
Practical advice: Q4_K_M is the production default for most local LLM workloads. Q5 or Q8 if VRAM allows. Below Q3, coherence drops sharply — a smaller model at higher precision usually beats a larger one at Q2.
For specialized cases, AWQ INT4 (~35GB for Llama 3.1 70B) and GPTQ deliver similar quality to Q4_K_M with better throughput on supported runtimes like vLLM and TensorRT-LLM.
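As a rough illustration of the quantization table, the sketch below scales an FP16 footprint by the table's approximate ratios and picks the highest-precision format that fits a weight budget. The dictionary and function names are hypothetical, and the ratios are the rounded figures from the table, so results can differ by a few GB from measured file sizes.

```python
# Approximate size relative to FP16, copied from the table above.
QUANT_RATIO_VS_FP16 = {
    "FP16": 1.00,
    "Q8_0": 0.50,
    "Q5_K_M": 0.35,
    "Q4_K_M": 0.28,
    "Q3_K_M": 0.22,
    "Q2_K": 0.18,
}


def quantized_weights_gb(params_billions, quant):
    """Weights-only size in GB: FP16 size (2 bytes per parameter) times the ratio."""
    fp16_gb = params_billions * 2.0
    return fp16_gb * QUANT_RATIO_VS_FP16[quant]


def best_quant_that_fits(params_billions, weight_budget_gb):
    """Return the highest-precision format whose weights fit, or None."""
    by_precision = sorted(QUANT_RATIO_VS_FP16.items(), key=lambda kv: kv[1], reverse=True)
    for quant, _ratio in by_precision:
        if quantized_weights_gb(params_billions, quant) <= weight_budget_gb:
            return quant
    return None


if __name__ == "__main__":
    # A 70B model with about 45 GB left for weights on a 48 GB card.
    print(best_quant_that_fits(70, 45))
```

The budget passed in should already exclude KV cache and overhead, since this sketch only sizes the weights.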
KV Cache and Long Context
The KV cache stores the attention keys and values for every token in the context window. It grows linearly with context length and linearly with batch size, since each concurrent sequence keeps its own cache.
For Llama 3.1 70B at FP16 KV cache:
- 4K context: ~1.3 GB
- 8K context: ~2.6 GB
- 32K context: ~10 GB
- 128K context: ~40 GB
At 128K context, the KV cache rivals the model weights themselves. For long-context workloads, KV cache quantization (Q8 or Q4) cuts this to roughly a half or a quarter, with minimal quality impact in most runtimes.
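The context figures above can be reproduced with the standard per-token KV cache formula. The sketch below assumes Llama 3.1 70B's published architecture (80 layers, 8 key-value heads under grouped-query attention, head dimension 128); those values are not stated in this guide, and the function names are illustrative. Results are in GiB (2^30 bytes).

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   bytes_per_value=2, batch_size=1):
    """KV cache size in bytes.

    The leading 2 covers keys and values; bytes_per_value=2 means FP16.
    Batch size multiplies the total because each sequence has its own cache.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


# Assumed architecture for Llama 3.1 70B (not given in the text above).
LLAMA_31_70B = {"num_layers": 80, "num_kv_heads": 8, "head_dim": 128}


def llama_70b_kv_gib(context_tokens, bytes_per_value=2, batch_size=1):
    """KV cache in GiB for the assumed Llama 3.1 70B configuration."""
    total = kv_cache_bytes(context_tokens=context_tokens,
                           bytes_per_value=bytes_per_value,
                           batch_size=batch_size,
                           **LLAMA_31_70B)
    return total / 2**30


if __name__ == "__main__":
    for tokens in (4096, 8192, 32768, 131072):
        print(tokens, llama_70b_kv_gib(tokens))
```

Passing `bytes_per_value=1` approximates an 8-bit KV cache, which halves each figure.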
Multi-GPU VRAM Pooling
Two GPUs can serve a model that exceeds either card's individual VRAM by splitting the model across both cards. Splitting whole layers between cards is pipeline (layer) parallelism; splitting each layer's weight matrices across cards is tensor parallelism. Both work over PCIe on workstation GPUs and over NVLink on datacenter cards.
Examples:
- Two RTX 6000 Ada (48GB each) → 96GB pooled, runs Llama 3.1 70B at Q8.
- Two RTX PRO 6000 Blackwell (96GB each) → 192GB pooled, runs Llama 3.1 70B at FP16 with long context.
- Two H100 SXM (80GB each) → 160GB pooled with NVLink at 900GB/s.
- Two H200 SXM (141GB each) → 282GB pooled, runs Llama 3.1 405B at Q4.
The catch: PCIe Gen 5 x16 is roughly 64GB/s in each direction. NVLink on H100/H200 is roughly 900GB/s in total, or about 450GB/s in each direction. For inference, PCIe is usually sufficient. For training, NVLink matters significantly.
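A quick way to sanity-check the pooling examples is to compare a model's footprint against the combined capacity minus a per-card reserve, and to compare how long a given transfer takes over each link. This is a simplified sketch: the 2 GB per-GPU reserve and the function names are assumptions, and it ignores uneven splits and communication buffers.

```python
def fits_pooled(model_gb, per_gpu_gb, num_gpus, reserve_per_gpu_gb=2.0):
    """True if the model fits in the pooled VRAM after a per-GPU reserve.

    reserve_per_gpu_gb is an assumed allowance for runtime overhead.
    """
    usable_gb = (per_gpu_gb - reserve_per_gpu_gb) * num_gpus
    return model_gb <= usable_gb


def transfer_seconds(gigabytes, link_gb_per_s):
    """Idealized time to move data over a link, ignoring latency."""
    return gigabytes / link_gb_per_s


if __name__ == "__main__":
    # Llama 3.1 70B at Q8 (~75 GB per the quick reference) on two 48 GB cards.
    print(fits_pooled(75, per_gpu_gb=48, num_gpus=2))
    # Moving 1 GB over PCIe Gen 5 x16 at ~64 GB/s versus a 900 GB/s NVLink figure.
    print(transfer_seconds(1, 64), transfer_seconds(1, 900))
```

The link comparison uses the bandwidth figures quoted in the text; sustained rates in practice are lower than these peaks.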
VRAM for Fine-Tuning
Fine-tuning needs more VRAM than inference for the same model. The optimizer state, gradients, and activations all live in VRAM during training.
| Method | VRAM vs inference | Notes |
|---|---|---|
| Full fine-tuning (Adam) | ~4x | Adam keeps two extra weight-sized state tensors (first and second moments) in FP32, on top of gradients |
| Mixed-precision full fine-tuning | ~3x | Standard practice with bf16/fp16 |
| LoRA | ~1.5-2x | Trains small adapter, freezes base model |
| QLoRA | ~1.2-1.5x | 4-bit base + LoRA adapter |
For LoRA on 7B, 16-24GB is enough. For LoRA on 70B, 48-96GB. For full fine-tuning of 70B, multiple datacenter GPUs.
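The multipliers in the table can be applied mechanically to an inference footprint to get a planning range. In this sketch the method keys and function name are hypothetical, and the baseline is assumed to be the inference footprint at the precision the base model is held in during training (for QLoRA, the 4-bit footprint).

```python
# (low, high) multipliers versus inference VRAM, taken from the table above.
FINE_TUNE_MULTIPLIER = {
    "full_adam": (4.0, 4.0),
    "mixed_precision_full": (3.0, 3.0),
    "lora": (1.5, 2.0),
    "qlora": (1.2, 1.5),
}


def fine_tune_vram_range_gb(inference_gb, method):
    """Return a (low, high) VRAM estimate in GB for the given method."""
    low, high = FINE_TUNE_MULTIPLIER[method]
    return inference_gb * low, inference_gb * high


if __name__ == "__main__":
    # A model that needs ~40 GB for inference, trained with LoRA.
    print(fine_tune_vram_range_gb(40, "lora"))
```

Batch size and sequence length drive activation memory, so real runs can land outside this range.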
How to Pick the Right Tier
- Identify the largest model you actually run. Not the largest you might want to run someday — the one you use weekly.
- Decide quantization tolerance. Q4_K_M saves roughly 70-75% of VRAM versus FP16 with ~5% quality loss. Production inference can usually accept Q4; research often needs Q8 or FP16.
- Add context budget. For 32K context, add 5-10GB to your weight estimate. At 128K, a 70B model's FP16 KV cache is closer to 40GB.
- Add fine-tuning multiplier if applicable. Multiply by 1.5-2x for LoRA, 3-4x for full fine-tuning.
- Round up to the next tier. Buying just enough leaves no headroom for new models or longer context.
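The steps above can be sketched as a small helper. Several choices here are assumptions rather than anything the guide prescribes: the function name, the 2 GB overhead default, applying the fine-tuning multiplier to weights plus context, and treating only the 24, 48, and 96 GB workstation tiers as single-card options.

```python
WORKSTATION_TIERS_GB = (24, 48, 96)


def pick_tier(params_billions, gb_per_billion=0.6, context_gb=0.0,
              fine_tune_multiplier=1.0, overhead_gb=2.0):
    """Return (tier_gb, needed_gb); tier_gb is None beyond 96 GB.

    gb_per_billion=0.6 is the Q4_K_M rule of thumb; use 2.0 for FP16.
    """
    weights_gb = params_billions * gb_per_billion
    # Assumed: the multiplier scales weights and context, overhead is added once.
    needed_gb = (weights_gb + context_gb) * fine_tune_multiplier + overhead_gb
    for tier_gb in WORKSTATION_TIERS_GB:
        if needed_gb <= tier_gb:
            return tier_gb, needed_gb
    # Nothing fits on one workstation card: datacenter or multi-GPU territory.
    return None, needed_gb


if __name__ == "__main__":
    # 70B at Q4_K_M with a ~10 GB long-context budget.
    print(pick_tier(70, context_gb=10.0))
```

A result that lands within a few GB of a tier's capacity is a signal to move up a tier, in line with the last step.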




