Mistral Medium 3.5 Released: What It Means for Self Hosted AI

Published April 29, 2026

Mistral AI released Medium 3.5 today. It’s a 128 billion parameter dense model with open weights, a 256k context window, and the ability to self host on as few as four GPUs. For teams running AI on their own infrastructure, this is a big deal: a flagship class model that’s actually deployable outside hyperscaler data centers.

Here’s what the release means, what hardware you need to run it, and where it fits with everything else out there.

Mistral Medium 3.5: Quick Specs

Specification             Detail
Parameters                128B (dense)
Context Window            256k tokens
License                   Modified MIT (open weights)
Self Host Minimum         4 GPUs
API Pricing               $1.50 per 1M input tokens, $7.50 per 1M output tokens
SWE Bench Verified Score  77.6%
Vision Support            Yes (custom trained vision encoder)
Reasoning                 Configurable per request
Availability              Hugging Face, Le Chat, Mistral Vibe, NVIDIA NIM

Source: Mistral AI official announcement

Why This Release Matters

For the past two years, the highest performing open weight models have needed eight or more high end GPUs to self host at full precision. That hardware bar pushed teams toward cloud APIs whether they wanted to be there or not, especially for production workloads.

Mistral Medium 3.5 changes that. A 128B dense model running on four GPUs is a multi GPU server build, not a hyperscaler only deployment. Teams running their own infrastructure can now host a model that scores 77.6% on SWE Bench Verified, ahead of bigger competitors like Qwen3.5 397B A17B.

The bottom line: A 128B parameter model with 256k context and strong coding performance is now deployable on a single multi GPU server. Not a rack of them.

What this means in practice:

  • Teams worried about data privacy can keep AI workloads on premise
  • Production inference costs become a hardware decision instead of a per token expense (a rough break even sketch follows this list)
  • Fine tuning and customization stay fully under your control
  • Latency drops because requests don’t leave the local network
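
To put a number on the second point, here’s a rough break even sketch in Python. Only the per token API prices come from Mistral’s published pricing; the server cost and monthly token volumes are illustrative assumptions.

```python
# Rough API-vs-self-host break-even sketch. The server cost and token
# volumes are illustrative assumptions; only the per-token API prices
# come from Mistral's published pricing.

API_INPUT_PER_M = 1.50    # USD per 1M input tokens (published)
API_OUTPUT_PER_M = 7.50   # USD per 1M output tokens (published)
SERVER_COST = 250_000     # assumed cost of a 4x H100-class server, USD

def monthly_api_cost(input_m_tokens: float, output_m_tokens: float) -> float:
    """API spend for a month, with token volumes given in millions."""
    return input_m_tokens * API_INPUT_PER_M + output_m_tokens * API_OUTPUT_PER_M

# Example: a team pushing 5B input / 1B output tokens per month
monthly = monthly_api_cost(5_000, 1_000)
print(f"API spend: ${monthly:,.0f}/month")                      # ~$15,000/month
print(f"Hardware pays back in ~{SERVER_COST / monthly:.0f} months")
```

This ignores power, cooling, and operations labor, so treat it as a first order comparison rather than a full TCO model.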

Hardware Requirements: What You Actually Need

Mistral says Medium 3.5 can self host on “as few as four GPUs,” but the announcement doesn’t specify which GPUs. Based on the model size and standard precision options, here’s what that means in practice.

At FP8 Precision (Recommended for Production Inference)

A 128B dense model at FP8 needs around 128GB of VRAM for the weights, plus more headroom for KV cache and context. Four GPUs at 80GB each (320GB total) gives you room for production deployment with reasonable batch sizes and the full 256k context.
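
As a sanity check on those numbers, here’s the arithmetic in a short Python sketch. The layer and head counts are hypothetical, since Mistral hasn’t published the architecture details; they’re only there to illustrate why KV cache headroom matters at 256k context.

```python
# Back-of-envelope VRAM estimate for a 128B dense transformer. The layer
# and head counts below are illustrative guesses; Mistral has not
# published Medium 3.5's architecture details.

def weights_gb(params_b: float, bytes_per_param: float) -> float:
    # 1B parameters at 1 byte each (FP8) is roughly 1 GB
    return params_b * bytes_per_param

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 1) -> float:
    # 2x for keys and values, stored per token, per layer
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9

print(f"FP8 weights:  {weights_gb(128, 1):.0f} GB")    # ~128 GB
print(f"BF16 weights: {weights_gb(128, 2):.0f} GB")    # ~256 GB
# Hypothetical architecture: 88 layers, 8 KV heads (GQA), head_dim 128
print(f"FP8 KV cache, one 256k-token sequence: "
      f"{kv_cache_gb(88, 8, 128, 256_000):.0f} GB")    # ~46 GB
```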

Suitable configurations:

  • 4x NVIDIA H100 80GB
  • 4x NVIDIA H200 141GB
  • 4x NVIDIA RTX PRO 6000 Blackwell 96GB

At BF16 Precision (Higher Accuracy, More VRAM)

At BF16, the model needs roughly 256GB of VRAM for the weights alone. That usually means 4x H200 (141GB each, 564GB total) or moving to a bigger configuration.

Server Considerations Beyond GPUs

Running Medium 3.5 in production isn’t just about GPU count. The full stack matters:

  • CPU: Multi core server CPU (Intel Xeon, AMD EPYC, or Threadripper PRO) for data preprocessing, request handling, and tool orchestration
  • RAM: 256GB or more system memory for large context handling and concurrent users
  • Storage: NVMe SSDs for model loading, KV cache offloading, and dataset access
  • Networking: 10GbE minimum if the model is serving multiple users or feeding downstream systems
  • Power and Cooling: 4x H100 or H200 systems can pull 2,500W+ under sustained load and need proper thermal design

Mistral Medium 3.5 vs Other Open Weight Models

Model                 Parameters              License        Context  Self Host Floor
Mistral Medium 3.5    128B dense              Modified MIT   256k     4 GPUs
Mistral Large 3       675B MoE (41B active)   Apache 2.0     256k     8+ GPUs
Mistral Small 4       119B MoE (6B active)    Apache 2.0     128k     2 GPUs
Llama 3.1 405B        405B dense              Llama license  128k     8+ GPUs
Qwen3.5 397B A17B     397B MoE                Apache 2.0     128k     8+ GPUs

Medium 3.5 sits in an interesting middle ground: smaller than Large 3, but higher performing on coding benchmarks, where its 77.6% SWE Bench Verified score puts it ahead of larger competitors.

What’s New in This Release

The Medium 3.5 announcement included more than just model weights:

  • Mistral Vibe Remote Agents: Coding agents that run async in the cloud, can be spawned from CLI or Le Chat, and produce finished pull requests on GitHub.
  • Le Chat Work Mode: A new agentic mode for cross tool workflows including email triage, research synthesis, and multi step project execution.
  • Configurable Reasoning: Reasoning effort can be set per request, so the same model handles both quick replies and extended chain of thought tasks without separate deployments (a request sketch follows this list).
  • Vision Support: The model includes a vision encoder trained from scratch to handle variable image sizes and aspect ratios.
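
Mistral’s announcement doesn’t document the API field that controls reasoning effort, so the sketch below is only a guess at what a per request setting could look like against Mistral’s chat completions endpoint; the `reasoning_effort` field name and model id are assumptions.

```python
# Hypothetical sketch of per-request reasoning configuration. The
# "reasoning_effort" field name and the model id are assumptions;
# Mistral's announcement doesn't document the actual API parameter.
import requests

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "mistral-medium-3.5",  # assumed model id
        "messages": [{"role": "user", "content": "Summarize this diff."}],
        "reasoning_effort": "low",      # quick reply; "high" for extended chain of thought
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```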

How the Vibe Agentic Runtime Connects

Diagram (concept based on Mistral AI’s announcement): the Vibe Agentic Runtime sits at the center, connected on one side to the Vibe CLI, version control, issue tracking, monitoring and observability, and ChatOps, and on the other to its outputs: pull requests and artifacts, documentation, and reporting, with a human in the loop reviewing the results.

Frequently Asked Questions

What is Mistral Medium 3.5?

Mistral Medium 3.5 is a 128 billion parameter dense language model released by Mistral AI on April 29, 2026. It’s the company’s first flagship merged model, combining instruction following, reasoning, and coding in a single set of weights. It’s available as open weights under a modified MIT license.

Can I run Mistral Medium 3.5 on my own hardware?

Yes. Mistral says the model can be self hosted on as few as four GPUs. The practical minimum is four high VRAM data center GPUs like the NVIDIA H100 80GB or H200 141GB. The model weights are on Hugging Face under a modified MIT license.
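
As an illustration, here’s what a four GPU deployment could look like using vLLM’s offline Python API. The Hugging Face repo id and the FP8 quantization flag are assumptions; check the model card for the officially supported serving path.

```python
# Sketch of a 4-GPU vLLM deployment. The repo id is a guess at the
# Hugging Face name; FP8 support for this model is also an assumption.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Medium-3.5",  # hypothetical repo id
    tensor_parallel_size=4,                # shard weights across 4 GPUs
    quantization="fp8",                    # ~128GB of weights at FP8
    max_model_len=262_144,                 # full 256k context
)

out = llm.generate(["Explain KV cache offloading."],
                   SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)
```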

How much VRAM does Mistral Medium 3.5 need?

At FP8 precision, the 128B model needs about 128GB of VRAM for weights, plus more for KV cache and context. A four GPU configuration with 80GB per GPU (320GB total) gives enough headroom for production inference with the full 256k context window.

How does Mistral Medium 3.5 compare to GPT-4 or Claude?

Mistral Medium 3.5 scores 77.6% on SWE Bench Verified, a coding capability benchmark. Mistral says this is ahead of Devstral 2 and Qwen3.5 397B A17B. Direct comparisons with closed models like GPT-4 and Claude haven’t been independently published yet. The key practical difference: Medium 3.5 is open weight and can be self hosted, while GPT-4 and Claude are API only.

What is the context window for Mistral Medium 3.5?

The model supports a 256,000 token context window. Good for long documents, large codebases, and extended agentic conversations.

Is Mistral Medium 3.5 free?

The open weights are free to download from Hugging Face under a modified MIT license. Self hosting requires hardware investment. The Mistral API costs $1.50 per million input tokens and $7.50 per million output tokens. Using the model through Le Chat requires a Pro, Team, or Enterprise plan.

What’s the difference between Mistral Medium 3.5 and Mistral Large 3?

Mistral Large 3 is a 675 billion parameter MoE (Mixture of Experts) model with 41B active parameters per forward pass. Mistral Medium 3.5 is a 128B dense model. Large 3 is bigger and needs more hardware to self host, while Medium 3.5 is more accessible at four GPUs.

What hardware do I need to fine tune Mistral Medium 3.5?

Fine tuning needs more VRAM than inference. For full fine tuning of a 128B model, expect to need 8 or more high VRAM GPUs (H100 80GB or H200 141GB class). LoRA or QLoRA fine tuning can drop this requirement quite a bit, potentially fitting in a four GPU configuration with proper optimization.
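
For a sense of what the LoRA path looks like, here’s a minimal sketch using Hugging Face’s peft library. The repo id and target module names are assumptions based on earlier Mistral architectures, not confirmed details of Medium 3.5.

```python
# LoRA fine-tuning sketch with Hugging Face peft. The repo id and target
# module names are assumptions; verify against the published model card.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Medium-3.5",      # hypothetical repo id
    torch_dtype=torch.bfloat16,
    device_map="auto",                   # spread layers across available GPUs
)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"], # typical attention projections in Mistral models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()       # LoRA trains a small fraction of the 128B weights
```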

Can Mistral Medium 3.5 be used commercially?

Yes. The modified MIT license allows commercial use of the model weights. Review the specific license terms on Mistral’s Hugging Face repository before deployment.

Where can I download Mistral Medium 3.5?

The open weights are on Hugging Face. The model is also available through Mistral’s API, Le Chat, Mistral Vibe, and as an NVIDIA NIM containerized inference microservice.
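
A minimal download sketch with huggingface_hub follows; the exact repository name is an assumption.

```python
# Minimal weight-download sketch; the repo id is an assumption.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="mistralai/Mistral-Medium-3.5",  # hypothetical repo id
    local_dir="./mistral-medium-3.5",
)
print(f"Weights downloaded to {path}")
```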

What This Means for Teams Running On Prem AI

Medium 3.5’s hardware requirements put it in reach of organizations that have invested in real AI infrastructure but aren’t operating at hyperscaler scale. A four GPU server build is achievable for:

  • Research labs running internal tooling and experiments
  • Engineering teams deploying AI assisted coding agents
  • Companies with data privacy requirements that prevent cloud deployment
  • Studios and creative teams running AI workflows alongside traditional production

The math has shifted. A 128B parameter model with 256k context and competitive coding performance is now deployable on a single multi GPU server.

Building hardware for Mistral Medium 3.5 and other local AI workloads

VRLA Tech builds custom workstations and servers configured for self hosted AI workloads, including multi GPU configurations capable of running models like Mistral Medium 3.5. We work with teams across AI/ML, research, and enterprise to spec hardware that fits specific deployment requirements.

Explore AI Workstations

Source: Mistral AI’s official announcement, “Remote agents in Vibe. Powered by Mistral Medium 3.5.” Published April 29, 2026. Read the original announcement.
