LLM Hardware Requirements: Inference and Fine-Tuning Sizing Guide

By VRLA Tech · Los Angeles · Updated June 2026

Picking hardware for LLM work is mostly a VRAM math problem with two questions stacked on top: are you running inference or fine-tuning, and what method are you using to fine-tune. This guide walks through the math for both, maps it to real GPUs, and gives concrete configurations from a 7B local setup to 405B production serving.

The two questions that decide everything

Before sizing a single component, two answers determine the entire build:

1. Inference or fine-tuning? Inference holds the model weights, the KV cache, and a small activation buffer. Fine-tuning additionally holds gradients and optimizer states, which roughly triple or quadruple the memory footprint over inference for the same model.

2. If fine-tuning, what method? Full fine-tuning updates every weight and demands the largest VRAM footprint. LoRA freezes the base and trains small adapter matrices. QLoRA quantizes the frozen base to 4-bit and trains adapters on top, which is the most memory-efficient method and produces results comparable to full fine-tuning on most domain adaptation tasks.

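Rough multipliers over inference VRAM, measured against inference at the precision each method keeps the base model in: Full fine-tuning ≈ 3 to 4x · LoRA ≈ 1.5 to 2x · QLoRA ≈ 1.2 to 1.5x. So a 70B model that needs ~43GB for Q4 inference needs ~50-65GB for QLoRA, while LoRA needs ~100-140GB and full fine-tuning in FP16 needs ~400-600GB (3 to 4x the ~140GB FP16 inference footprint).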
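The multipliers can be applied mechanically. The sketch below is a minimal illustration, not a sizing tool: the function name and the choice of which inference figure to feed in (FP16 inference for full fine-tuning, Q4 inference for QLoRA) are assumptions made for this example.

```python
# Rough fine-tuning VRAM ranges from the multipliers quoted in the text.
# Each entry is (low, high) as a multiple of inference VRAM.
MULTIPLIERS = {
    "full": (3.0, 4.0),
    "lora": (1.5, 2.0),
    "qlora": (1.2, 1.5),
}


def finetune_range_gb(inference_gb: float, method: str) -> tuple[float, float]:
    """Return the (low, high) fine-tuning VRAM estimate in GB."""
    low, high = MULTIPLIERS[method]
    return inference_gb * low, inference_gb * high


if __name__ == "__main__":
    # Assumed baselines: ~140GB FP16 inference for full fine-tuning and
    # ~43GB Q4 inference for QLoRA (both figures come from the text).
    for method, baseline in [("full", 140.0), ("qlora", 43.0)]:
        low, high = finetune_range_gb(baseline, method)
        print(f"70B {method}: ~{low:.0f}-{high:.0f} GB")
```

The result is only as good as the baseline passed in, so treat it as a first-pass range and confirm against a real run before buying hardware.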

VRAM math from first principles

Three numbers determine inference VRAM:

1. Model weights

Parameters × bits-per-weight ÷ 8 = bytes. Llama 3.1 70B at FP16 is 70B × 2 bytes = 140GB. At Q4 (roughly 4.5 bits effective for K-quants) it drops to ~40-43GB. Mistral 7B at FP16 is ~14GB; at Q4 ~5GB.
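The weight formula is simple enough to check by hand, but a small helper avoids slips when comparing several models. This is a minimal sketch; the function name is illustrative, parameter counts are the nominal ones used in this guide, and the 4.5-bit value is the effective K-quant figure quoted above.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight memory in decimal GB: parameters x bits-per-weight / 8."""
    # params_billion * 1e9 parameters * bits / 8 bytes, divided by 1e9 bytes/GB.
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for name, params, bits in [
        ("Llama 3.1 70B FP16", 70, 16),
        ("Llama 3.1 70B Q4 (~4.5 bits)", 70, 4.5),
        ("Mistral 7B FP16", 7, 16),
    ]:
        print(f"{name}: ~{weight_gb(params, bits):.1f} GB")
```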

2. KV cache

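The KV cache holds attention keys and values for every token in the active context. It grows linearly with context length and batch size. Per-token size is 2 (keys and values) × layers × KV heads × head dimension × bytes per value, so it depends heavily on the attention layout. As a rough rule for FP16 KV cache with full multi-head attention: 7B ≈ 0.5 MB/token, 70B ≈ 2.5 MB/token, which puts a 70B-class model at 32K context at ~80GB of KV cache alone. Models that use grouped-query attention (GQA) cache far fewer heads: Llama 3.1 70B (80 layers, 8 KV heads, head dimension 128) comes to ~0.31 MB/token, or ~10GB at 32K and ~40GB at 128K per sequence. Either way, long-context and high-concurrency serving is dominated by KV memory, which is why modern frameworks quantize the KV cache itself.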
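Per-token KV figures follow directly from the attention shape. The sketch below assumes the standard layout of one key vector and one value vector per layer per KV head. The shape used (80 layers, head dimension 128, 64 query heads, 8 KV heads) is the published Llama 3.1 70B configuration; the function names are illustrative.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: float = 2.0) -> float:
    """Bytes of KV cache added per token of context for one sequence."""
    # Factor of 2: each layer stores one key and one value vector per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, batch: int = 1,
                 bytes_per_value: float = 2.0) -> float:
    """Total KV cache in GiB; linear in both context length and batch size."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return per_token * context_tokens * batch / 2**30


if __name__ == "__main__":
    ctx = 32 * 1024
    # 70B-shaped model caching all 64 heads (full multi-head attention).
    print(kv_bytes_per_token(80, 64, 128) / 2**20, "MiB/token")
    print(kv_cache_gib(80, 64, 128, ctx), "GiB at 32K")
    # Same shape with 8 KV heads (grouped-query attention).
    print(kv_cache_gib(80, 8, 128, ctx), "GiB at 32K with GQA")
    # Storing the cache at 8 bits per value halves it again.
    print(kv_cache_gib(80, 8, 128, ctx, bytes_per_value=1.0), "GiB at 32K, 8-bit KV")
```

The multi-head case reproduces the 2.5 MB/token and ~80GB rule of thumb, while the 8-KV-head case comes out eight times smaller, so it is worth checking a model's KV head count before budgeting for long context.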

3. Framework overhead and activations

Drivers, CUDA context, the inference framework, and intermediate activations consume 1 to 4GB at idle plus 10 to 20% of total VRAM in active use. Build headroom in.
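Putting the three numbers together gives a total budget. The following is a rough sketch under stated assumptions: headroom is treated as a fraction of total VRAM that should stay free, the default overhead and headroom values sit in the middle of the ranges above, and the example inputs are placeholders rather than measurements.

```python
def inference_vram_gb(weights_gb: float, kv_cache_gb: float,
                      overhead_gb: float = 2.0,
                      headroom_frac: float = 0.15) -> float:
    """Total VRAM to budget so that headroom_frac of the card stays free."""
    if not 0.0 <= headroom_frac < 1.0:
        raise ValueError("headroom_frac must be in [0, 1)")
    used = weights_gb + kv_cache_gb + overhead_gb
    # Gross up: used memory should be at most (1 - headroom_frac) of the total.
    return used / (1.0 - headroom_frac)


if __name__ == "__main__":
    # Placeholder inputs for a 70B model at Q4_K_M with a short context.
    total = inference_vram_gb(weights_gb=42.0, kv_cache_gb=3.0, overhead_gb=3.0)
    print(f"Budget roughly {total:.0f} GB of VRAM")
```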

Quantization tradeoffs

Quantization	VRAM vs FP16	Quality loss	When to use
FP16 / BF16	100%	Reference	Training, research, full fine-tuning
Q8_0	~50%	Near-zero	Safe default when VRAM allows
Q6_K	~38%	<1%	Better than Q4, smaller than Q8
Q5_K_M	~32%	~1-2%	Balance for memory-constrained Q8 use cases
Q4_K_M	~25%	~3-5%	Production sweet spot for most workloads
Q4_K_S / AWQ INT4	~22-25%	~4-6%	Tighter VRAM at slight quality cost
Q3_K_M	~19%	Measurable	Last resort; reasoning degrades
Q2_K	~13%	Severe	Not recommended for production

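For RAG and retrieval-heavy workloads, Q4_K_M is usually indistinguishable from FP16 because the retrieved context dominates output quality. For complex multi-step reasoning, code generation, and math, Q8 or FP16 is preferred. The VRAM column uses nominal bit widths; K-quants also store per-block scales, so real files run somewhat larger (Q4_K_M lands closer to 30% of FP16, which matches the ~40-43GB figure for 70B).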
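One way to use the table is to pick the highest-precision format whose weights fit a given budget. The sketch below is illustrative only: the fractions are copied from the table's VRAM column, the function name is hypothetical, and the budget passed in should be the VRAM left for weights after reserving space for KV cache and overhead.

```python
# (format, approximate weight size as a fraction of FP16), highest precision first.
QUANTS = [
    ("FP16", 1.00),
    ("Q8_0", 0.50),
    ("Q6_K", 0.38),
    ("Q5_K_M", 0.32),
    ("Q4_K_M", 0.25),
    ("Q3_K_M", 0.19),
    ("Q2_K", 0.13),
]


def best_quant(fp16_weights_gb: float, weight_budget_gb: float):
    """Return the highest-precision format that fits, or None if nothing fits."""
    for name, fraction in QUANTS:
        if fp16_weights_gb * fraction <= weight_budget_gb:
            return name
    return None


if __name__ == "__main__":
    # 70B model (~140GB at FP16) with 80GB and 40GB left for weights.
    print(best_quant(140.0, 80.0))
    print(best_quant(140.0, 40.0))
```

Because the fractions are approximate, leave a few GB of margin when a format only just fits.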

Inference VRAM by model size

Model	FP16	Q8	Q4_K_M	+ 32K KV (Q8)
Mistral / Llama 7B	~14GB	~8GB	~5GB	~3GB
Llama 13B	~26GB	~14GB	~8GB	~5GB
Mistral / Qwen 32-34B	~66GB	~34GB	~20GB	~10GB
Llama 3.1 70B	~140GB	~75GB	~40-43GB	~20-25GB
Llama 3.1 405B	~810GB	~430GB	~230-250GB	~80-100GB

Add framework overhead (~2-4GB) and active-batch headroom (~10-20%) on top of these numbers when sizing real GPUs.
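To turn a table row into a GPU choice, add the overhead and headroom and compare against the capacity tiers used later in this guide (24GB, 48GB, 96GB, and dual 96GB). A minimal sketch follows; the default overhead and headroom values are mid-range assumptions, and the function name is illustrative.

```python
# VRAM capacities of the workstation tiers described in this guide, in GB.
TIERS_GB = [24, 48, 96, 192]


def smallest_tier(weights_gb: float, kv_gb: float,
                  overhead_gb: float = 3.0, headroom: float = 0.15):
    """Return (tier_gb or None, required_gb) for one model configuration."""
    # Headroom is applied on top of weights + KV, then fixed overhead is added.
    required = (weights_gb + kv_gb) * (1.0 + headroom) + overhead_gb
    for capacity in TIERS_GB:
        if required <= capacity:
            return capacity, required
    return None, required  # larger than any workstation tier


if __name__ == "__main__":
    # Table values: 7B at Q4_K_M, and 70B at Q4_K_M, each with 32K of Q8 KV cache.
    for label, weights, kv in [("7B Q4_K_M", 5.0, 3.0), ("70B Q4_K_M", 41.5, 22.5)]:
        tier, need = smallest_tier(weights, kv)
        print(f"{label}: need ~{need:.0f} GB -> {tier} GB tier")
```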

Fine-tuning VRAM by method

Model	QLoRA (4-bit base)	LoRA (FP16 base)	Full FT (FP16)
7B	~10-14GB	~22-30GB	~60-90GB
13B	~16-22GB	~40-55GB	~110-160GB
32-34B	~30-45GB	~90-130GB	~280-400GB
70B	~50-65GB	~100-140GB	~400-600GB
405B	~280-340GB	~600-800GB	~2.5-3 TB

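Ranges reflect batch size, sequence length, and optimizer choice (AdamW vs 8-bit Adam vs Adafactor). Lower end assumes 8-bit optimizer states and modest batch; upper end assumes FP32 optimizer and longer sequences. Two caveats apply to the 70B and 405B rows. First, FP16 weights alone occupy ~140GB for 70B and ~810GB for 405B, so the LoRA figures in those rows are only reachable with the frozen base held at 8-bit; with a true FP16 base, budget the FP16 weight size plus adapter, optimizer, and activation memory. Second, FP16 weights and gradients plus FP32 Adam states cost about 12 bytes per parameter (~840GB for 70B) before activations, so the full fine-tuning figures in those rows assume 8-bit or factored optimizer states, or optimizer offload to system memory.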
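The optimizer's share of full fine-tuning memory is easiest to see as bytes per parameter. The sketch below counts only weights, gradients, and optimizer states; it excludes activations and any FP32 master copy of the weights. The byte counts are assumptions for illustration: 2 bytes each for FP16 weights and gradients, 8 bytes for two FP32 Adam moments, and 2 bytes for two 8-bit moments.

```python
def full_ft_state_gb(params_billion: float,
                     weight_bytes: float = 2.0,
                     grad_bytes: float = 2.0,
                     optim_bytes_per_param: float = 8.0) -> float:
    """Weights + gradients + optimizer states in decimal GB (no activations)."""
    bytes_per_param = weight_bytes + grad_bytes + optim_bytes_per_param
    return params_billion * bytes_per_param


if __name__ == "__main__":
    for label, optim_bytes in [("FP32 Adam states", 8.0), ("8-bit Adam states", 2.0)]:
        for size in (7, 70):
            gb = full_ft_state_gb(size, optim_bytes_per_param=optim_bytes)
            print(f"{size}B, {label}: ~{gb:.0f} GB before activations")
```

Under these assumptions a 7B model with FP32 Adam states lands inside the table's range, while a 70B model only does so with compact optimizer states, which is worth keeping in mind when reading the larger rows.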

Hardware tiers and what each runs

Tier 1: 24GB single GPU

Cards: RTX 4090 (24GB GDDR6X), RTX PRO 4000 Blackwell (24GB GDDR7 ECC). The RTX 5090 (32GB GDDR7) also belongs in this tier, with 8GB of extra headroom.

Inference: 7B at FP16 or Q8 with full context. 13B at Q8 with reduced context, Q4 with full. 32-34B at Q4_K_M with limited context.

Fine-tuning: 7B LoRA. 7B QLoRA with comfortable batches. 13B QLoRA.

Form factor: A VRLA Tech AMD Ryzen Workstation or Intel Core Workstation handles this tier well.

Tier 2: 48GB single GPU

Cards: RTX 6000 Ada (48GB GDDR6 ECC), L40S (48GB GDDR6 ECC).

Inference: 13B at FP16. 32-34B at Q8 with full context. 70B at Q4_K_M with limited context.

Fine-tuning: 13B LoRA. 32-34B QLoRA. 70B QLoRA on a single card with careful batch and sequence settings.

Form factor: A VRLA Tech Threadripper PRO Workstation is the typical platform here.

Tier 3: 96GB single GPU

Cards: RTX PRO 6000 Blackwell Workstation Edition (96GB GDDR7 ECC, 1.79 TB/s, 600W).

Inference: 70B at Q4_K_M with long context. 70B at Q8 with reduced context. 32-34B at FP16. Multiple smaller models served concurrently.

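Fine-tuning: 70B QLoRA with full context and comfortable batches. 32-34B LoRA with modest batch and sequence settings. 7B full fine-tuning (13B full fine-tuning needs ~110-160GB and does not fit on one card).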

Form factor: Threadripper PRO Workstation, single GPU. This is the most common LLM development configuration in 2026.

Tier 4: Dual 96GB workstation (192GB total)

Cards: 2x RTX PRO 6000 Blackwell over PCIe Gen 5 x16 (no NVLink on this card).

Inference: 70B at FP16 with reduced context. 70B at Q8 with full 128K context. 405B at Q4_K_M needs ~230-250GB for weights alone, so it does not fit in 192GB; plan on three or four 96GB cards for it.

Fine-tuning: 70B LoRA. 13B full fine-tuning (32-34B full fine-tuning needs ~280-400GB and belongs in Tier 5). 70B QLoRA with very long context.

Form factor: Threadripper PRO Workstation with sufficient power and cooling for two 600W GPUs.

Tier 5: 4-8 GPU datacenter

Cards: H100 SXM5 (80GB HBM3, 3.35 TB/s, NVLink 900 GB/s), H200 SXM (141GB HBM3e, 4.8 TB/s), B200 (180-192GB HBM3e, 8 TB/s).

Inference: 405B at Q8 or FP16. Multi-user serving at scale. Long-context (128K+) FP16.

Fine-tuning: 70B full fine-tuning. 405B LoRA and QLoRA. Pre-training and continued pre-training.

Form factor: VRLA Tech AMD EPYC GPU servers in 4U or 8U chassis with NVLink fabric.

Workflow-to-build mapping

Goal	Recommended build	Approx GPU VRAM
Run Mistral 7B locally for development	Single 24GB workstation	24GB
Fine-tune 7B with LoRA	Single 24GB workstation	24GB
Run Llama 3.1 70B at Q4 for one user	Single 48GB or 96GB workstation	48-96GB
Run Llama 3.1 70B at Q8 with long context	Single 96GB or dual 48GB workstation	96GB
QLoRA fine-tune 70B	Single 96GB workstation	96GB
LoRA fine-tune 70B	Dual 96GB workstation	192GB
Serve 70B FP16 to multiple concurrent users	Dual 96GB workstation or 4x H100 server	192-320GB
Run Llama 3.1 405B at Q4	3-4x 96GB workstation or 4x H100 server	290-380GB
Full fine-tune 70B	4-8x H100 / H200 SXM server	640-1130GB
Run 405B FP16 / fine-tune 405B	8x H200 or 8x B200 SXM server	1-1.5 TB+

The non-GPU components that matter

CPU and PCIe

For single-GPU and dual-GPU LLM workstations, AMD Threadripper PRO 9000WX on the WRX90 platform provides 128 PCIe Gen 5 lanes, 8-channel DDR5 ECC RDIMM, and up to 96 Zen 5 cores. For 4-GPU and larger systems, AMD EPYC 9005 Turin on SP5 provides 128-160 PCIe Gen 5 lanes and 12-channel memory. CPU choice matters less than GPU choice for inference throughput but matters significantly for data preprocessing, multi-GPU coordination, and embedding pipelines.

System memory

A useful rule of thumb: system memory should equal or exceed total GPU VRAM. For a 96GB GPU build, 128GB DDR5 ECC RDIMM is the floor; 256GB is comfortable. For dual 96GB or larger setups, 256GB to 512GB DDR5 ECC RDIMM is standard. System memory holds the dataset shards during training and the model weights before they load to GPU.
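The rule of thumb maps onto standard memory configurations. The following is a minimal sketch, assuming a hypothetical list of available capacities: the floor is the smallest capacity at or above total VRAM, and the comfortable figure is the next step up.

```python
import bisect

# Assumed set of system memory capacities in GB; adjust to what the platform offers.
CAPACITIES_GB = [64, 128, 256, 512, 1024, 2048]


def system_ram_gb(total_vram_gb: float) -> tuple[int, int]:
    """Return (floor, comfortable) system memory in GB for a given total VRAM."""
    i = bisect.bisect_left(CAPACITIES_GB, total_vram_gb)
    if i >= len(CAPACITIES_GB):
        raise ValueError("total VRAM exceeds the largest listed capacity")
    floor = CAPACITIES_GB[i]
    comfortable = CAPACITIES_GB[min(i + 1, len(CAPACITIES_GB) - 1)]
    return floor, comfortable


if __name__ == "__main__":
    for vram in (96, 192):
        floor, comfortable = system_ram_gb(vram)
        print(f"{vram} GB VRAM: {floor} GB floor, {comfortable} GB comfortable")
```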

Storage

Model weights are large. A single 70B FP16 checkpoint is 140GB; a 405B FP16 checkpoint is over 800GB. Multiple checkpoints, fine-tuning datasets, and embeddings push storage requirements quickly. NVMe Gen 4 or Gen 5 is the standard; 4TB minimum for development workstations, 16TB+ for fine-tuning systems, 50TB+ for serving multiple models with version control.
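Storage needs can be estimated the same way as VRAM. This sketch assumes each checkpoint stores only the model weights at a fixed number of bytes per parameter; optimizer state saved alongside a checkpoint would add to this. The checkpoint count and dataset size in the example are placeholders.

```python
def checkpoint_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Size of one weights-only checkpoint in decimal GB (FP16 by default)."""
    return params_billion * bytes_per_param


def storage_tb(params_billion: float, n_checkpoints: int,
               datasets_tb: float = 0.0, bytes_per_param: float = 2.0) -> float:
    """Total storage in decimal TB for checkpoints plus datasets."""
    checkpoints_tb = n_checkpoints * checkpoint_gb(params_billion, bytes_per_param) / 1000
    return checkpoints_tb + datasets_tb


if __name__ == "__main__":
    # Placeholder scenario: ten 70B FP16 checkpoints plus 1 TB of datasets.
    print(f"~{storage_tb(70, 10, datasets_tb=1.0):.1f} TB")
```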

Power and cooling

Each RTX PRO 6000 Blackwell pulls 600W under load; each H100 SXM5 pulls 700W; B200 pulls 1000W. A dual 96GB workstation needs a 1600W+ PSU. A 4-GPU H100 server pulls roughly 4-5 kW continuous. VRLA Tech workstations ship with PSU and cooling sized for sustained 100% load, not idle.
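A quick power budget makes the PSU sizing concrete. The sketch below is a rough estimate under stated assumptions: the GPU wattage comes from the text, while the CPU wattage, the allowance for the rest of the platform, and the 20% margin are placeholder values to replace with figures for the actual build.

```python
def sustained_load_w(gpu_count: int, gpu_w: float, cpu_w: float,
                     other_w: float = 0.0) -> float:
    """Sustained system draw in watts at full load."""
    return gpu_count * gpu_w + cpu_w + other_w


def psu_w(load_w: float, margin: float = 0.2) -> float:
    """PSU rating in watts that leaves the given margin above sustained load."""
    return load_w * (1.0 + margin)


if __name__ == "__main__":
    # Two 600W GPUs; 350W CPU and 100W for drives, fans, and memory are assumptions.
    load = sustained_load_w(gpu_count=2, gpu_w=600, cpu_w=350, other_w=100)
    print(f"Sustained load ~{load:.0f} W, PSU with margin ~{psu_w(load):.0f} W")
```

With these placeholder inputs the sustained draw alone is close to 1600W, which is why the figure for a dual 96GB build is a floor rather than a target.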

NVLink: when it matters

NVLink provides 900 GB/s of aggregate bidirectional GPU-to-GPU bandwidth per GPU (H100 SXM, H200 SXM) versus roughly 64 GB/s per direction (about 128 GB/s bidirectional) on PCIe Gen 5 x16, a gap of roughly 7x. For workloads with heavy GPU-to-GPU traffic (full fine-tuning with large models, training with gradient synchronization, tensor-parallel inference with large activations), NVLink delivers measurably better throughput.

For LoRA and QLoRA workloads on workstation GPUs, PCIe Gen 5 x16 is sufficient because gradient traffic is small (only adapter weights are updated). For multi-GPU inference, PCIe is workable but NVLink is faster at long contexts and large batches.

The RTX PRO 6000 Blackwell does not have NVLink. Multi-GPU configurations on that card communicate over PCIe Gen 5. For NVLink, the H100, H200, and B200 SXM-form-factor GPUs in EPYC GPU servers are the path.
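The difference between full fine-tuning and adapter training shows up clearly in per-step transfer time. The sketch below is a deliberately simple lower bound: it divides payload by link bandwidth and ignores all-reduce algorithm overhead, latency, and overlap with compute. The link speeds are the headline figures from this section, and the 0.2% adapter fraction is a hypothetical value chosen for illustration.

```python
def sync_seconds(payload_gb: float, link_gb_per_s: float) -> float:
    """Idealized time to move payload_gb across a link, in seconds."""
    return payload_gb / link_gb_per_s


if __name__ == "__main__":
    full_grads_gb = 140.0                   # FP16 gradients for a 70B model
    adapter_grads_gb = full_grads_gb * 0.002  # hypothetical: adapters are 0.2% of weights
    for link, speed in [("PCIe Gen 5 x16", 64.0), ("NVLink", 900.0)]:
        full_t = sync_seconds(full_grads_gb, speed)
        lora_t = sync_seconds(adapter_grads_gb, speed)
        print(f"{link}: full ~{full_t:.2f} s/step, adapters ~{lora_t * 1000:.1f} ms/step")
```

Even on PCIe, the adapter traffic is a few milliseconds per step under these assumptions, while full gradients take seconds, which is the reason the interconnect matters for one and not the other.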

Cloud vs on-premise

On-premise pays off when GPU utilization is high and sustained. A rough rule: if a workstation or server runs at high utilization more than ~8 hours per day every day, the break-even point versus cloud rental is typically 6 to 14 months. For sporadic, burst, or evaluation workloads, cloud is the right tool. The VRLA Tech AI ROI calculator models the comparison against current cloud GPU rates.
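The break-even logic can be sketched in a few lines. This is an illustrative simplification, not the VRLA Tech calculator: it compares hardware cost against cloud rental net of on-premise running cost, and every price in the example is a hypothetical placeholder.

```python
def break_even_months(hardware_cost: float, cloud_rate_per_hour: float,
                      hours_per_day: float,
                      onprem_cost_per_hour: float = 0.0,
                      days_per_month: float = 30.0) -> float:
    """Months until cumulative cloud spend avoided equals the hardware cost."""
    monthly_saving = ((cloud_rate_per_hour - onprem_cost_per_hour)
                      * hours_per_day * days_per_month)
    if monthly_saving <= 0:
        return float("inf")  # cloud is never more expensive at this usage
    return hardware_cost / monthly_saving


if __name__ == "__main__":
    # Hypothetical inputs: $30,000 system, $6.00/h equivalent cloud capacity,
    # 16 h/day of use, $0.25/h for on-premise power and cooling.
    months = break_even_months(30_000, 6.00, 16, onprem_cost_per_hour=0.25)
    print(f"Break-even after ~{months:.1f} months")
```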

For regulated workloads (HIPAA, ITAR, FedRAMP) and any case where model weights or training data cannot leave the customer environment, on-premise is the only viable answer. VRLA Tech builds for healthcare, defense contractors, law firms, and pharma and biotech with on-premise compliance in mind.

Common mistakes

Sizing for weights only. A 48GB card cannot serve a 40GB Q4 model at long context. KV cache adds 10-25GB at typical contexts. Budget total VRAM, not weight VRAM.
Buying for the frontier instead of the workload. If the actual workload is 7B and 13B inference, a 96GB workstation is wasted. Size to the model that runs daily, not the model in next year's roadmap.
Ignoring framework overhead. vLLM, TGI, and TensorRT-LLM consume 2-4GB at idle. llama.cpp is lighter (~1-2GB). Account for it.
Skipping ECC memory. Long fine-tuning runs are vulnerable to bit-flips. DDR5 ECC RDIMM and GPU ECC (which RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S, and all datacenter GPUs provide) matter.
Choosing FP16 when Q4_K_M would do. A 70B FP16 build costs 3-4x what a 70B Q4 build costs and scores only ~3-5% better on benchmarks for most workloads.
Underspeccing the PSU and cooling. A 600W GPU under sustained inference load needs a PSU sized with headroom above the system's combined rated TDP, not one that merely matches it.

Hardware FAQ

What is the minimum VRAM to run Mistral 7B locally?

Mistral 7B at Q4_K_M quantization needs roughly 5 to 6GB of VRAM for the weights plus 1 to 3GB for the KV cache. A 12GB GPU runs it comfortably at modest context lengths. A 16GB card handles longer contexts and small batches. A 24GB card runs Mistral 7B at Q8 with full 32K context and headroom for serving. At FP16 the same model needs roughly 14GB just for weights, which is why even local hobbyist setups quantize.

What hardware do I need to fine-tune Llama 3.1 70B?

It depends on the fine-tuning method. QLoRA on Llama 3.1 70B fits on a single 48GB GPU (RTX 6000 Ada or L40S) with careful batch and sequence settings, or comfortably on a 96GB RTX PRO 6000 Blackwell. LoRA on 70B requires roughly 100 to 140GB of VRAM and typically runs on two 96GB GPUs, two H100 80GB, or one to two H200 141GB. Full fine-tuning of 70B needs roughly 400 to 600GB of VRAM in FP16 with optimizer states, which requires a multi-GPU H100, H200, or B200 server with NVLink. Most production fine-tuning uses LoRA or QLoRA.

What is the difference between LoRA, QLoRA, and full fine-tuning?

Full fine-tuning updates every weight in the model and requires roughly 3 to 4x the inference VRAM to hold weights, gradients, and optimizer states. LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices, cutting VRAM to roughly 1.5 to 2x inference. QLoRA combines LoRA with 4-bit quantization of the frozen base, cutting VRAM to roughly 1.2 to 1.5x inference. For most domain adaptation work, QLoRA produces results comparable to full fine-tuning at a fraction of the hardware cost.

How do I calculate VRAM for LLM inference?

Start with model weights: parameters times bits-per-weight divided by 8 gives bytes. A 70B model at FP16 is 70 × 2 = 140GB. At Q4 that drops to roughly 35 to 43GB. Add KV cache: roughly 0.5 to 1GB at 4K context for 7B models, scaling with model size and context length. Add framework overhead (typically 1 to 4GB). Add headroom for activations and batch processing (roughly 10 to 20% of total). For 70B at Q4_K_M with 8K context, budget 48GB minimum, 64GB comfortable.

Does quantization hurt LLM quality?

Q8 quantization is near-lossless and is the safe default when VRAM allows. Q4_K_M loses roughly 3 to 5% on benchmarks versus FP16 and is the production sweet spot for most workloads. Q4_K_S and Q3_K_M show measurable degradation on reasoning tasks. Below Q3, model coherence drops sharply and outputs become unreliable. For RAG and retrieval-heavy workloads, Q4_K_M is usually indistinguishable from FP16 because the retrieved context dominates output quality. For complex multi-step reasoning, Q8 or FP16 is preferred.

What is the KV cache and how does it scale?

The KV cache stores attention keys and values during generation so they are not recomputed for every new token. It grows linearly with context length and batch size. For a 70B FP16 model with full multi-head attention the KV cache is roughly 2.5MB per token, meaning 4K context uses 10GB, 32K uses 80GB, and 128K uses 320GB. Grouped-query attention shrinks this sharply: Llama 3.1 70B caches 8 KV heads instead of 64, which works out to roughly 0.31MB per token, or about 10GB at 32K and 40GB at 128K per sequence. Modern frameworks support KV cache quantization to Q8 or Q4 (cutting the footprint by 2 to 4x) and paged attention (vLLM, TGI) for efficient batching. For long-context and high-concurrency workloads, KV cache often dominates total VRAM.

Do I need NVLink for LLM workloads?

For inference, NVLink helps but is not required. Two 96GB RTX PRO 6000 Blackwell GPUs over PCIe Gen 5 x16 run 70B at Q8 effectively, and three or four run 405B at Q4. For training and full fine-tuning of large models, NVLink matters more because gradient synchronization is bandwidth-intensive. H100 SXM and H200 SXM provide 900 GB/s of aggregate bidirectional NVLink bandwidth versus roughly 64 GB/s per direction (about 128 GB/s bidirectional) on PCIe Gen 5 x16. For LoRA and QLoRA workloads on workstation GPUs, PCIe is sufficient because gradient traffic is small.

Can I run Llama 3.1 405B without datacenter GPUs?

Llama 3.1 405B at Q4 requires roughly 230 to 250GB of VRAM for inference. That fits on three 96GB RTX PRO 6000 Blackwell GPUs (288GB) split across the cards over PCIe, which is workstation-class hardware, though it leaves little room for KV cache; four cards give more context headroom and an even split for tensor parallelism. At Q8 it needs roughly 450GB, requiring datacenter GPUs (4x H200 141GB SXM or 8x H100 80GB; 4x H100 totals only 320GB). For full FP16 (~810GB), 8x H200 or 8x B200 SXM with NVLink is the standard single-node configuration; 8x H100 totals 640GB and falls short. 405B fine-tuning is a datacenter workload regardless of method.

Ready to buy?

Where can I buy a workstation built for LLM fine-tuning?

VRLA Tech builds custom workstations sized for LLM inference and fine-tuning in Los Angeles. Each build is configured around the specific model size and workflow, from 24GB single-GPU systems for 7B class models up to dual 96GB RTX PRO 6000 Blackwell builds for 70B QLoRA and inference. VRLA Tech has been building custom AI hardware since 2016 and ships with a 3-year parts warranty plus lifetime US-based engineer support. Enterprise clients include General Dynamics, Los Alamos National Laboratory, Johns Hopkins University, Miami University, and George Washington University.

What VRLA Tech workstation do I need to run Llama 3.1 70B locally?

A VRLA Tech AMD Threadripper PRO Workstation with a single 96GB RTX PRO 6000 Blackwell runs Llama 3.1 70B at Q4 with long context and at Q8 with reduced context. Add a second 96GB card for Q8 with full 128K context or to serve multiple concurrent users. VRLA Tech configures the build with DDR5 ECC RDIMM, NVMe storage, and validated cooling for sustained inference loads.

Can VRLA Tech build a system to fine-tune Llama 70B?

Yes. For QLoRA on 70B, VRLA Tech recommends a Threadripper PRO Workstation with one or two 96GB RTX PRO 6000 Blackwell GPUs. For LoRA on 70B, dual 96GB is the baseline. For full fine-tuning of 70B, VRLA Tech builds AMD EPYC GPU servers with 4 to 8 datacenter GPUs (H100, H200, B200) and NVLink. Configurations are validated end-to-end before shipping.

What is the price range for an LLM fine-tuning workstation from VRLA Tech?

VRLA Tech LLM workstations are configured to the workload, from 24GB single-GPU builds for 7B class fine-tuning up to dual 96GB RTX PRO 6000 Blackwell builds for 70B LoRA and three- to four-GPU 96GB builds for 405B Q4 inference. Multi-GPU EPYC servers handle full fine-tuning. Submit model sizes, precision, fine-tuning method, and concurrency at vrlatech.com/contact for a current quote.

Does VRLA Tech support on-premise LLM deployment for regulated industries?

Yes. VRLA Tech builds on-premise AI workstations and GPU servers for HIPAA-bound healthcare, defense contractors, law firms, and quantitative finance. On-premise hardware keeps model weights, training data, and inference traffic inside the customer environment.

How long does it take VRLA Tech to deliver an LLM workstation?

Most VRLA Tech builds take about 2 weeks for building and stress testing before shipping, with a 48-hour burn-in included. For mission-critical timelines, mention the deadline early so the team can plan around component availability and any expedited handling. Request a quote at vrlatech.com/contact.

Does VRLA Tech install and configure the LLM stack (vLLM, Ollama, llama.cpp)?

VRLA Tech ships workstations with NVIDIA drivers, CUDA, and the customer's chosen base OS validated and ready. Customers typically install their preferred inference framework (vLLM, TGI, Ollama, llama.cpp, TensorRT-LLM) themselves, and VRLA Tech's lifetime US-based engineer support covers hardware configuration questions. For larger deployments and AI training clusters, VRLA Tech can pre-configure framework stacks on request.

Can VRLA Tech price-match other AI workstation builders?

VRLA Tech price-matches comparable configurations from other US-based AI workstation builders. Submit a competitor quote with the request and VRLA Tech will match or beat it on equivalent hardware. Note that VRLA Tech configurations include DDR5 ECC RDIMM, 48-hour burn-in, and a 3-year parts warranty plus lifetime US-based engineer support, which not every competitor includes.

Does VRLA Tech offer financing or net terms for LLM hardware?

Yes. VRLA Tech accepts purchase orders from qualified enterprises, universities, and government entities, and works with PO financing partners for net-30, net-60, and longer terms on larger orders. Standard payment methods include wire, ACH, credit card, and PO. Request financing options when submitting a quote at vrlatech.com/contact.

Can VRLA Tech help me decide between a workstation and a server for LLM work?

Yes. VRLA Tech sales engineers help match the right form factor to the workload. Workstations suit single-developer inference, model evaluation, and LoRA or QLoRA fine-tuning up to 70B. GPU servers suit multi-user inference serving, full fine-tuning, and 405B-class workloads.

What CPU does VRLA Tech recommend for LLM workstations?

For single-GPU and dual-GPU LLM workstations, VRLA Tech recommends AMD Threadripper PRO 9000WX for its 128 PCIe Gen 5 lanes, 8-channel DDR5 ECC RDIMM, and up to 96 Zen 5 cores. For four-GPU and larger systems, VRLA Tech uses AMD EPYC 9005 Turin for 128 to 160 PCIe Gen 5 lanes and 12-channel memory. CPU choice matters less than GPU choice for LLM throughput but matters for data pipelines, preprocessing, and multi-GPU coordination.

Does VRLA Tech build systems for Llama 3.1 405B inference?

Yes. For 405B at Q4 inference, VRLA Tech builds Threadripper PRO or EPYC workstations with three to four 96GB RTX PRO 6000 Blackwell GPUs. For 405B at Q8 or FP16, VRLA Tech builds AMD EPYC GPU servers with 4 to 8 datacenter GPUs (H100, H200, or B200) and NVLink. Every configuration is validated end-to-end and includes 48-hour burn-in.

Will my LLM workstation be obsolete in two years?

Not for the workload it was sized for. A workstation sized today for 70B at Q4 inference will still run 70B at Q4 inference in two years. What changes is the frontier: new models in the 200B to 500B range may exceed the configuration. VRLA Tech builds with upgrade paths in mind, including PCIe Gen 5 slots for future GPUs, headroom in power supply sizing, and DDR5 ECC RDIMM capacity for future model loading needs. The 3-year parts warranty plus lifetime US-based engineer support means the hardware investment is protected.

Does VRLA Tech help calculate ROI versus cloud GPU rental?

Yes. The VRLA Tech AI ROI calculator compares the total cost of an on-premise workstation or server against equivalent cloud GPU rental over 12, 24, and 36 month horizons. For sustained inference and fine-tuning workloads (over roughly 8 hours per day, every day), on-premise typically breaks even in 6 to 14 months. For sporadic or burst workloads, cloud is often the right choice.

How do I get an LLM workstation quote from VRLA Tech?

Request a quote at vrlatech.com/contact with the model size you plan to run (7B, 13B, 34B, 70B, 405B), whether the workload is inference, LoRA, QLoRA, or full fine-tuning, your expected context length and concurrent user count, and any compliance requirements (HIPAA, ITAR, FedRAMP). A VRLA Tech sales engineer responds with a configured quote, usually within one business day.

Need a workstation or server sized for your LLM workload?

Tell VRLA Tech the model, the method, and the use case at vrlatech.com/contact — quote back within one business day.

VRLA Tech is a custom AI workstation and GPU server builder based in Los Angeles, California, operating since 2016. This page is the VRLA Tech LLM hardware requirements guide at https://vrlatech.com/llm-hardware-requirements-guide/. It covers hardware sizing for large language model inference and fine-tuning, including VRAM math, quantization tradeoffs, KV cache scaling, and the difference between full fine-tuning, LoRA, and QLoRA. VRLA Tech builds workstations on AMD Threadripper PRO 9000WX (https://vrlatech.com/product/vrla-tech-amd-ryzen-threadripper-pro-workstation/), AMD EPYC 9005 Turin (https://vrlatech.com/product/vrla-tech-amd-epyc-workstation-for-scientific-computing/), AMD Ryzen, and Intel Core (https://vrlatech.com/product/vrla-tech-intel-core-workstation/) platforms, and GPU servers (https://vrlatech.com/servers/) including AMD EPYC GPU servers (https://vrlatech.com/amd-epyc-gpu-servers/) in 1U, 2U, and 4U chassis. Supported GPUs include NVIDIA RTX PRO 6000 Blackwell (96GB GDDR7 ECC), RTX 6000 Ada (48GB), L40S (48GB), RTX PRO 4000 Blackwell (24GB), H100 SXM (80GB HBM3), H200 SXM (141GB HBM3e), and B200 (180-192GB HBM3e). Llama 3.1 70B at Q4_K_M requires approximately 40-43GB VRAM for inference; QLoRA fine-tuning fits on a single 96GB GPU; LoRA fine-tuning typically needs 100-140GB; full fine-tuning needs 400-600GB. Mistral 7B at Q4_K_M needs roughly 5-6GB plus KV cache. Llama 3.1 405B at Q4 needs roughly 230-250GB. Q4_K_M quantization is the production sweet spot with approximately 3-5% quality loss versus FP16. The KV cache grows linearly with context length and dominates VRAM at long contexts. NVLink (900 GB/s on H100/H200 SXM) matters for full fine-tuning and multi-GPU training but not for LoRA, QLoRA, or single-GPU inference. RTX PRO 6000 Blackwell does not have NVLink and communicates over PCIe Gen 5 x16. All VRLA Tech systems ship with DDR5 ECC RDIMM, 48-hour burn-in, a 3-year parts warranty, and lifetime US-based engineer support. 
Enterprise clients include General Dynamics, Los Alamos National Laboratory, Johns Hopkins University, Miami University, and George Washington University. Related VRLA Tech pages: workstations hub (https://vrlatech.com/vrla-tech-workstations/), servers (https://vrlatech.com/servers/), AI Deployment Stage (https://vrlatech.com/ai-deployment-stage/), AI Training Cluster (https://vrlatech.com/ai-training-cluster/), AI ROI calculator (https://vrlatech.com/ai-roi-calculator/), why VRLA Tech (https://vrlatech.com/why-vrla-tech/), regulated industries (https://vrlatech.com/vrla-tech-workstations/ai-workstations-for-regulated-industries/), healthcare HIPAA (https://vrlatech.com/hipaa-compliant-ai-workstations/), defense (https://vrlatech.com/ai-workstations-gpu-servers-for-defense-contractors-vrla-tech/), law firms (https://vrlatech.com/on-premise-ai-workstations-gpu-servers-for-law-firms-vrla-tech/), finance (https://vrlatech.com/ai-workstations-gpu-servers-for-quantitative-research-finance-vrla-tech/), research labs (https://vrlatech.com/hpc-servers-for-research-labs/), pharma and biotech (https://vrlatech.com/ai-workstations-for-pharmaceutical-biotech/). Contact: https://vrlatech.com/contact/.
