AI Workstation Buying Checklist: 15 Questions to Answer Before You Order

By VRLA Tech · Los Angeles · Updated June 2026

Buying an AI workstation or GPU server is mostly a question of getting the requirements right. The hardware decisions that follow are largely mechanical. This checklist walks through the 15 questions to answer before sending a quote request, with links into deeper VRLA Tech guides where each topic warrants more detail.

The checklist

1. What model sizes will run on this system?

List every model class the workstation needs to handle: 7B, 13B, 32-34B, 70B, 405B. The largest model in the list usually drives the VRAM decision. If 405B is on the list, the build is either a multi-GPU workstation or a server-class system. If the list stops at 7B-13B, a single 24-32GB card is usually enough.

2. What precision will each model run at?

FP16 needs roughly 2x the VRAM of Q8 and 3-4x the VRAM of Q4. For Llama 3.1 70B: Q4_K_M ≈ 43GB, Q8 ≈ 75GB, FP16 ≈ 140GB. Q4_K_M is the usual production sweet spot; see the VRLA Tech VRAM sizing guide for the full math.
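
The weight figures above follow from parameter count times bytes per weight. A minimal sketch, assuming illustrative bytes-per-weight values: the Q8 and Q4_K_M numbers below are back-derived from the 70B figures quoted here, not taken from any framework's specification, and the function name is hypothetical.

```python
# Approximate bytes stored per weight at each precision.
# FP16 is exactly 2 bytes. The Q8 and Q4_K_M values are ASSUMPTIONS
# back-derived from the 70B figures in this article; quantized formats
# store scale data alongside the weights, so they sit slightly above
# 1.0 and 0.5 bytes.
BYTES_PER_WEIGHT = {"FP16": 2.0, "Q8": 1.07, "Q4_K_M": 0.61}


def weight_vram_gb(params_billions: float, precision: str) -> float:
    """Approximate footprint of the weights alone, in GB (no KV cache)."""
    return params_billions * BYTES_PER_WEIGHT[precision]


for precision in BYTES_PER_WEIGHT:
    print(f"70B at {precision}: ~{weight_vram_gb(70, precision):.0f} GB")
# 70B at FP16: ~140 GB
# 70B at Q8: ~75 GB
# 70B at Q4_K_M: ~43 GB
```

This covers weights only. KV cache, framework overhead, and headroom come on top.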

3. Inference only, or also fine-tuning?

Fine-tuning requires roughly 1.2-1.5x inference VRAM for QLoRA, 1.5-2x for LoRA, or 3-4x for full fine-tuning. Full fine-tuning of 70B pushes to NVLink-connected SXM servers; LoRA and QLoRA fit on workstation GPUs. See the VRLA Tech LLM hardware requirements guide for fine-tuning method details.
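
The multipliers above can be applied as a quick range estimate. This is a sketch only: the function and table names are hypothetical, and it assumes the inference VRAM passed in is measured at the precision the fine-tuning method actually uses (for example a 4-bit footprint for QLoRA, FP16 for full fine-tuning).

```python
# (low, high) multipliers on inference VRAM, taken from the paragraph above.
FINE_TUNE_MULTIPLIER = {
    "qlora": (1.2, 1.5),
    "lora": (1.5, 2.0),
    "full": (3.0, 4.0),
}


def fine_tune_vram_range_gb(inference_vram_gb: float, method: str) -> tuple[float, float]:
    """Return the (low, high) VRAM estimate in GB for a fine-tuning method."""
    low, high = FINE_TUNE_MULTIPLIER[method]
    return inference_vram_gb * low, inference_vram_gb * high


# 70B at FP16 (~140GB for inference) with full fine-tuning:
low, high = fine_tune_vram_range_gb(140, "full")
print(f"Full fine-tuning: {low:.0f}-{high:.0f} GB")

# 70B at Q4_K_M (~43GB for inference) with QLoRA:
low, high = fine_tune_vram_range_gb(43, "qlora")
print(f"QLoRA: {low:.0f}-{high:.0f} GB")
```

The first case lands far beyond any single workstation GPU, which is why full fine-tuning of 70B points to multi-GPU servers, while the QLoRA case fits on a single 96GB card.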

4. How many concurrent users?

Single-user inference is the cheapest configuration. Multi-user serving multiplies VRAM requirements (per-user KV cache, larger batches) and pushes toward dual-GPU or 4-GPU configurations. See the VRLA Tech single-GPU vs multi-GPU guide for concurrency-to-configuration mapping.

5. What context length is needed?

KV cache grows linearly with context. At 4K context the cache adds roughly 1-2GB on a 7B model and up to about 10GB on a 70B model; architectures that use grouped-query attention need less. At 128K context it can rival the model weights themselves. For long-context production workloads, budget VRAM for KV cache separately from model weights and consider KV cache quantization (Q8 or Q4) in the framework.
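
The linear growth is easy to check with the standard KV cache size calculation: two tensors (keys and values) per layer, each holding one vector per KV head per token. A minimal sketch; the layer counts, head counts, and head dimension below are assumed 70B-class shapes chosen for illustration, so substitute the values from the actual model config.

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, batch: int = 1,
                 bytes_per_value: int = 2) -> float:
    """KV cache size in GiB. bytes_per_value=2 corresponds to an FP16 cache."""
    # Leading factor of 2 = one key tensor plus one value tensor per layer.
    total_bytes = (2 * layers * kv_heads * head_dim
                   * bytes_per_value * context_tokens * batch)
    return total_bytes / 1024**3


# ASSUMED shape: 80 layers, head_dim 128, full multi-head attention (64 KV heads).
print(kv_cache_gib(80, 64, 128, context_tokens=4096))     # 10.0

# Same shape with grouped-query attention (8 KV heads), at 4K and 128K context.
print(kv_cache_gib(80, 8, 128, context_tokens=4096))      # 1.25
print(kv_cache_gib(80, 8, 128, context_tokens=131072))    # 40.0
```

Under these assumptions the 10GB figure at 4K corresponds to a full multi-head-attention layout, and even with grouped-query attention a single 128K-token sequence needs about 40 GiB, close to the Q4_K_M weight footprint of a 70B model. The `batch` argument shows why concurrent users multiply the requirement.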

6. Workstation or server form factor?

Tower workstation = single developer, PCIe Gen 5, tower chassis. Rackmount EPYC GPU server = multi-user serving, training, NVLink (SXM), redundant power, hot-swap fans. See the VRLA Tech workstation vs server guide for the decision framework.

7. Which GPU?

For workstation: RTX PRO 6000 Blackwell (96GB) for top-tier; RTX 6000 Ada (48GB) for mid-tier; RTX 5090 (32GB, no ECC) for development only. For server: H100 SXM5 (80GB), H200 SXM (141GB), or B200 SXM (180-192GB) depending on workload scale. See the VRLA Tech best GPU for AI workstations guide for the full lineup.

8. Single-GPU or multi-GPU?

A second GPU helps for three specific reasons: model exceeds one GPU's VRAM, multiple concurrent users, or larger training batches. Otherwise single-GPU is the better build (no communication overhead, smaller PSU, simpler thermals). See the VRLA Tech single-GPU vs multi-GPU guide.

9. NVLink or PCIe?

NVLink (among the GPUs listed above, available only on the SXM datacenter parts) matters for tensor-parallel training, tensor-parallel serving of large models, and full fine-tuning. PCIe Gen 5 is sufficient for single-GPU work, LoRA and QLoRA, data-parallel serving, and most workstation workloads. See the VRLA Tech NVLink vs PCIe for AI guide.

10. Which CPU platform?

For multi-GPU workstations: AMD Threadripper PRO 9000WX (128 PCIe Gen 5 lanes). For 4-GPU and larger: AMD EPYC 9005 Turin (128-160 lanes, 12-channel DDR5). For single-GPU builds: AMD Ryzen or Intel Core. See the VRLA Tech EPYC vs Threadripper PRO vs Xeon W guide.

11. How much system memory?

Rule of thumb: system RAM equals or exceeds total GPU VRAM. For a 96GB GPU build, 128GB DDR5 ECC RDIMM minimum, 256GB comfortable. Dual 96GB: 256-512GB. System RAM holds datasets during preprocessing and model state during loading.

12. How much storage?

NVMe Gen 4 or Gen 5, 4TB minimum for development workstations. 8-16TB for active fine-tuning systems. 50TB+ for production multi-model serving with version control. Storage is usually under-spec'd in initial quotes; size for a year of model versions and dataset growth.

13. What is the power and cooling budget?

Single 96GB GPU = 1000W+ PSU. Dual 96GB = 1600-2000W. 4-GPU server = 4-5 kW continuous (dedicated 30A or 40A circuit). Confirm the room's HVAC can handle continuous 1-5 kW of heat output. Power and cooling are real constraints, not afterthoughts.
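
A rough way to sanity-check these numbers is to add up sustained draw, apply headroom, and convert watts to heat. This is a sketch under stated assumptions: the 400W allowance for CPU, memory, storage, and fans and the 20% headroom are illustrative defaults, not VRLA Tech sizing rules, and the function names are hypothetical. The conversion factor of about 3.412 BTU/hr per watt is the standard one.

```python
WATTS_TO_BTU_PER_HOUR = 3.412  # 1 W of continuous draw is ~3.412 BTU/hr of heat


def psu_watts(gpu_watts: float, gpu_count: int,
              rest_of_system_watts: float = 400.0,  # ASSUMED CPU/RAM/fans allowance
              headroom: float = 0.20) -> float:     # ASSUMED safety margin
    """Suggested PSU capacity in watts for sustained full load."""
    sustained_load = gpu_watts * gpu_count + rest_of_system_watts
    return sustained_load * (1 + headroom)


def heat_btu_per_hour(continuous_watts: float) -> float:
    """Heat the room's HVAC must remove for a given continuous draw."""
    return continuous_watts * WATTS_TO_BTU_PER_HOUR


print(f"Dual 600W GPUs: ~{psu_watts(600, 2):.0f} W PSU")
print(f"5 kW server: ~{heat_btu_per_hour(5000):.0f} BTU/hr of heat")
```

With these defaults a dual 600W build lands inside the 1600-2000W range quoted above, and a 5 kW server produces roughly 17,000 BTU/hr, which is the number to take to whoever manages the room's cooling.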

14. Are there compliance requirements?

HIPAA, ITAR, FedRAMP, CJIS, and others constrain hardware choice. Implications: ECC memory throughout, professional or datacenter GPU drivers (not GeForce), on-premise deployment, sometimes FIPS-validated storage and secure boot. See VRLA Tech vertical pages for industry-specific requirements: healthcare HIPAA, defense, law firms, pharma, quantitative finance, research labs.

15. On-premise or cloud?

Run the math before deciding. Sustained workloads (8+ hours per day, every day) typically pay back on-premise hardware in 6 to 14 months versus cloud rental. Sporadic workloads favor cloud. Regulated workloads where data cannot leave the customer environment favor on-premise regardless. See the VRLA Tech AI ROI calculator.
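
The break-even math itself is short. The sketch below is not the VRLA Tech ROI calculator; it is a simplified model with hypothetical names, and every dollar figure in the example is a placeholder to be replaced with a real quote and a real cloud rate.

```python
from typing import Optional


def break_even_months(hardware_cost: float,
                      cloud_rate_per_gpu_hour: float,
                      gpu_count: int,
                      hours_per_day: float,
                      onprem_monthly_opex: float = 0.0,
                      days_per_month: int = 30) -> Optional[float]:
    """Months until on-premise hardware pays for itself versus cloud rental.

    Returns None when cloud is cheaper per month than running the
    on-premise system, in which case there is no break-even point.
    """
    cloud_monthly = (cloud_rate_per_gpu_hour * gpu_count
                     * hours_per_day * days_per_month)
    monthly_savings = cloud_monthly - onprem_monthly_opex
    if monthly_savings <= 0:
        return None
    return hardware_cost / monthly_savings


# PLACEHOLDER inputs: $30,000 system, $3.00 per GPU-hour, 2 GPUs,
# 24 hours per day, $300 per month for power and cooling.
months = break_even_months(30_000, 3.00, 2, 24, onprem_monthly_opex=300)
print(f"Break-even after ~{months:.1f} months")
```

Lowering `hours_per_day` in the example stretches the payback period quickly, which is the reason sporadic workloads favor cloud.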

The 5-minute version

If a full 15-step exercise is more than the situation calls for, the short version is three questions:

  1. What is the largest model that will run, and at what precision? This sets VRAM.
  2. Inference only, or fine-tuning? This sets the VRAM multiplier and whether NVLink matters.
  3. One user or many? This sets single-GPU vs multi-GPU.

From there, the rest of the configuration (CPU, system RAM, storage, PSU, form factor) follows mechanically. A VRLA Tech sales engineer can walk through the full sizing exercise in a 30-minute call.

What to send when requesting a quote

To get a useful quote back within one business day, send:

  • Target model sizes and precision (e.g., "Llama 3.1 70B at Q4 inference, Mistral 7B FP16 for fine-tuning")
  • Workload split (e.g., "80% inference, 20% LoRA fine-tuning")
  • Concurrency target (e.g., "5-10 concurrent users at 8K context")
  • Form factor preference if known (workstation tower vs rackmount server)
  • Compliance requirements (HIPAA, ITAR, FedRAMP, etc.)
  • Budget range and timeline
  • Any existing infrastructure constraints (rack space, power circuits, cooling capacity)

Submit at vrlatech.com/contact. The more specific the input, the more targeted the quote.

Common mistakes the checklist prevents

  • Buying GPU first, sizing workload second. The workload should drive every hardware decision.
  • Sizing for weights only, ignoring KV cache. Long-context workloads need substantial KV cache headroom.
  • Choosing FP16 when Q4_K_M would do. Most workloads run Q4_K_M with little quality loss at roughly 1/4 to 1/3 the VRAM of FP16.
  • Underspeccing the PSU. 600W per GPU means 1200W for two GPUs before counting CPU and system. Size the PSU for sustained load, not rated TDP.
  • Skipping ECC. Production fine-tuning and compliance-bound work both need ECC throughout.
  • Buying consumer GPUs for production builds. RTX 5090 is excellent for individual development; it lacks ECC and datacenter driver support for production.
  • Buying NVLink when PCIe Gen 5 would do. NVLink matters for specific workloads; for most multi-GPU work, PCIe Gen 5 is sufficient.
  • Comparing on-premise to cloud without running the math. The ROI calculator resolves the question in five minutes.

Hardware FAQ

What should I decide first when buying an AI workstation?
Start with the workload, not the hardware. Specifically: what model sizes will run (7B, 13B, 32-34B, 70B, 405B), what precision (Q4, Q8, FP16), whether the workload is inference only or also includes fine-tuning, what fine-tuning method (LoRA, QLoRA, full), and how many concurrent users. Every hardware decision downstream (GPU, VRAM, CPU, system memory, storage, PSU, form factor) follows from this. Buying a GPU first and figuring out the workload afterward is the most common mistake.
How do I size GPU VRAM for my AI workload?
Use the formula: parameters times bytes-per-weight, plus KV cache (which scales with context length and batch size), plus framework overhead (1-4GB), plus 10-20% headroom. For Llama 3.1 70B at Q4_K_M with 8K context, budget 48GB minimum and 64GB comfortable. For fine-tuning, multiply inference VRAM by 1.2-1.5x for QLoRA, 1.5-2x for LoRA, or 3-4x for full fine-tuning. Always size for the worst-case workload, not the average.
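
The formula above translates directly into a few lines of code. A minimal sketch with hypothetical names: the 0.61 bytes-per-weight value for Q4_K_M and the 2.5GB KV cache for a single 8K-context sequence are assumptions for illustration, and the overhead and headroom defaults are picked from inside the ranges quoted above.

```python
def inference_vram_gb(params_billions: float,
                      bytes_per_weight: float,
                      kv_cache_gb: float,
                      overhead_gb: float = 2.0,   # within the 1-4GB range above
                      headroom: float = 0.15) -> float:  # within the 10-20% range
    """Weights + KV cache + framework overhead, plus proportional headroom."""
    base = params_billions * bytes_per_weight + kv_cache_gb + overhead_gb
    return base * (1 + headroom)


# 70B at Q4_K_M (ASSUMED ~0.61 bytes per weight) with an ASSUMED 2.5GB
# KV cache for one 8K-context sequence.
estimate = inference_vram_gb(70, 0.61, kv_cache_gb=2.5)
print(f"Estimated VRAM: ~{estimate:.0f} GB")
```

Under these assumptions the estimate falls between the 48GB minimum and the 64GB comfortable figure. More concurrent sequences or longer context raise `kv_cache_gb` and move the result toward the upper end.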
What system memory do I need for an AI workstation?
A practical rule of thumb is that system RAM should equal or exceed total GPU VRAM. A single 96GB GPU build wants 128GB DDR5 ECC RDIMM minimum; 256GB is comfortable. Dual 96GB builds want 256-512GB. System RAM holds the dataset during preprocessing, the model weights before they load to GPU, and the OS and framework state. DDR5 ECC RDIMM is the standard for any professional or compliance-bound workstation.
How much storage does an AI workstation need?
More than feels obvious. A single Llama 3.1 70B FP16 checkpoint is 140GB; 405B FP16 is over 800GB. Multiple model versions, fine-tuning datasets, embeddings, and checkpoints accumulate quickly. Floor: 4TB NVMe Gen 4 or Gen 5 for development workstations. Comfortable: 8-16TB for active fine-tuning systems. Production serving with multiple models and version control: 50TB or more, often split between NVMe scratch and HDD or SSD archive tiers.
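
The checkpoint sizes quoted above are parameter count times 2 bytes, and a storage budget is those sizes multiplied by how many versions are kept. A rough sketch, with hypothetical names; the 25% free-space reserve and the example retention counts are assumptions, not recommendations from this guide.

```python
def fp16_checkpoint_gb(params_billions: float) -> float:
    """FP16 checkpoint size in GB: 2 bytes per parameter."""
    return params_billions * 2


def storage_needed_tb(checkpoint_sizes_gb: list[float],
                      versions_kept: int,
                      datasets_gb: float = 0.0,
                      free_fraction: float = 0.25) -> float:  # ASSUMED reserve
    """Capacity in TB needed to hold the data while keeping some space free."""
    used_gb = sum(checkpoint_sizes_gb) * versions_kept + datasets_gb
    return used_gb / (1 - free_fraction) / 1000


print(fp16_checkpoint_gb(70))    # 140
print(fp16_checkpoint_gb(405))   # 810

# Ten retained 70B FP16 checkpoints plus 1TB of datasets (example values).
need = storage_needed_tb([fp16_checkpoint_gb(70)], versions_kept=10, datasets_gb=1000)
print(f"~{need:.1f} TB")
```

Even this modest example uses most of a 4TB drive, which is why the 4TB figure is a floor rather than a target.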
What power supply do I need for a multi-GPU workstation?
Size the PSU for sustained 100% load, not idle or average draw. RTX PRO 6000 Blackwell pulls 600W under load; H100 SXM5 pulls 700W; B200 pulls 1000W. A dual 600W GPU workstation plus CPU, memory, and fans needs a 1600W or 2000W PSU with 80+ Platinum or Titanium efficiency. A 4-GPU server pulls 4-5 kW continuous, which usually requires a dedicated 30A or 40A circuit. Underspeccing the PSU causes voltage instability, thermal throttling, and reduced component lifetime.
Do I need ECC memory for AI work?
For any long-running fine-tuning workload, scientific computing, regulated environment (HIPAA, ITAR, FedRAMP), or production system, yes. ECC catches single-bit memory errors that would otherwise corrupt model weights during training or produce silently wrong inference outputs. DDR5 ECC RDIMM is standard on Threadripper PRO, EPYC, and Xeon W platforms. GPU ECC is available on RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S, RTX PRO 4000 Blackwell, and all datacenter GPUs. Consumer cards (RTX 5090, RTX 4090) and consumer CPU platforms typically do not have ECC.
Should I buy on-premise or rent cloud GPUs?
On-premise pays off when GPU utilization is high and sustained. A rough rule: if the workload runs at high utilization more than 8 hours per day, every day, on-premise typically breaks even versus cloud in 6 to 14 months. For sporadic or burst workloads (occasional fine-tuning runs, experimental work), cloud is usually cheaper. For regulated workloads where model weights or training data cannot leave the customer environment, on-premise is the only viable answer regardless of utilization.
Ready to buy?
Does VRLA Tech help me work through the AI workstation buying decision?
Yes. VRLA Tech sales engineers walk through the full sizing exercise: model sizes, precision, fine-tuning method, concurrency, compliance, and form factor. The output is a configured quote sized to the specific workload, not a generic product. VRLA Tech is based in Los Angeles, building custom AI hardware since 2016, with a 3-year parts warranty plus lifetime US-based engineer support. Enterprise clients include General Dynamics, Los Alamos National Laboratory, Johns Hopkins University, Miami University, and George Washington University.
Can VRLA Tech build a workstation matched to a specific LLM workload?
Yes. Every VRLA Tech build is configured to a specific workload. Submit the target model sizes, the precision (Q4, Q8, FP16), the fine-tuning method (LoRA, QLoRA, full), the concurrency, and any compliance requirements, and a VRLA Tech sales engineer responds with a configuration.
What does a VRLA Tech AI workstation quote include?
VRLA Tech quotes include the full bill of materials (CPU, motherboard, GPU(s), DDR5 ECC RDIMM, NVMe storage, PSU, cooling, chassis), the validated configuration, the lead time, the 3-year parts warranty plus lifetime US-based engineer support, and the 48-hour burn-in validation. Compliance options (FIPS-validated drives, secure boot, audit trail) are quoted when requested.
How long does VRLA Tech take to ship an AI workstation or server?
Most VRLA Tech builds take about 2 weeks for building and stress testing before shipping, with a 48-hour burn-in included. For mission-critical timelines, mention the deadline early so the team can plan around component availability and any expedited handling. Request a quote at vrlatech.com/contact.
Does VRLA Tech support on-premise AI for regulated industries?
Yes. VRLA Tech builds on-premise AI hardware for HIPAA-bound healthcare, defense contractors, law firms, pharma, quantitative finance, and research labs. Compliance options include ECC throughout, FIPS-validated storage, and secure boot configurations.
Does VRLA Tech price-match other AI workstation builders?
VRLA Tech price-matches comparable configurations from other US-based AI workstation builders. Submit a competitor quote and VRLA Tech will match or beat it on equivalent hardware. VRLA Tech configurations include DDR5 ECC RDIMM, 48-hour burn-in, validated cooling, and a 3-year parts warranty plus lifetime US-based engineer support.
Does VRLA Tech offer financing or net terms?
Yes. VRLA Tech accepts purchase orders from qualified enterprises, universities, and government entities, and works with PO financing partners for net-30, net-60, and longer terms on larger orders. Standard payment methods include wire, ACH, credit card, and PO. Request financing options at vrlatech.com/contact.
Does VRLA Tech help calculate ROI versus cloud GPU rental?
Yes. The VRLA Tech AI ROI calculator compares the total cost of an on-premise GPU workstation or server against equivalent cloud GPU rental over 12, 24, and 36 month horizons. For sustained workloads, on-premise typically breaks even in 6 to 14 months.
Does VRLA Tech help me decide between a workstation and a server?
Yes. VRLA Tech workstations suit single-developer fine-tuning, LoRA and QLoRA work, and inference of models that fit in one or two GPUs. VRLA Tech GPU servers suit multi-user inference serving, full fine-tuning, and 405B-class workloads. Sales engineers walk through the decision case by case.
Can VRLA Tech build AI training clusters?
Yes. VRLA Tech AI training clusters combine NVLink and NVSwitch within nodes with InfiniBand or 400G Ethernet between nodes for multi-node distributed training. NDR and XDR InfiniBand options are supported. For data center deployment, see the VRLA Tech data center page.
What support does VRLA Tech provide after delivery?
Every VRLA Tech workstation and server ships with a 3-year parts warranty plus lifetime US-based engineer support. Lifetime support covers hardware troubleshooting, driver and BIOS questions, upgrade guidance, and integration help. Engineers are based in Los Angeles.
How do I start the buying process with VRLA Tech?
Request a quote at vrlatech.com/contact with the target model sizes, the workload type (inference, fine-tuning, training), the concurrency, and any compliance requirements (HIPAA, ITAR, FedRAMP). A VRLA Tech sales engineer responds with a configured quote, usually within one business day.

Worked through the checklist?

Send the answers to vrlatech.com/contact — VRLA Tech sales engineers turn the requirements into a configured quote within one business day.
