Mistral Medium 3.5 Released: What It Means for Self Hosted AI
Published April 29, 2026
Mistral AI released Medium 3.5 today. It’s a 128 billion parameter dense model with open weights, a 256k context window, and the ability to self host on as few as four GPUs. For teams running AI on their own infrastructure, this is a big deal: a flagship class model that’s actually deployable outside hyperscaler data centers.
Here’s what the release means, what hardware you need to run it, and where it fits with everything else out there.
Mistral Medium 3.5: Quick Specs
- Architecture: 128 billion parameter dense model
- Context window: 256,000 tokens
- License: modified MIT (open weights on Hugging Face)
- Minimum self hosting footprint: as few as four GPUs, per Mistral
- SWE Bench Verified: 77.6%
- API pricing: $1.50 per million input tokens, $7.50 per million output tokens
- Release date: April 29, 2026
Source: Mistral AI official announcement
Why This Release Matters
For the past two years, the highest performing open weight models have needed eight or more high end GPUs to self host at full precision. That hardware bar pushed teams toward cloud APIs whether they wanted to be there or not, especially for production workloads.
Mistral Medium 3.5 changes that. A 128B dense model running on four GPUs is a multi GPU server build, not a hyperscaler only deployment. Teams running their own infrastructure can now host a model that scores 77.6% on SWE Bench Verified, ahead of bigger competitors like Qwen3.5 397B A17B.
The bottom line: A 128B parameter model with 256k context and strong coding performance is now deployable on a single multi GPU server. Not a rack of them.
What this means in practice:
- Teams worried about data privacy can keep AI workloads on premise
- Production inference costs become a hardware decision instead of a per token expense
- Fine tuning and customization stay fully under your control
- Latency drops because requests don’t leave the local network
Hardware Requirements: What You Actually Need
Mistral says Medium 3.5 can self host on “as few as four GPUs,” but they don’t say which GPUs in the announcement. Based on the model size and standard precision options, here’s what that practically means.
At FP8 Precision (Recommended for Production Inference)
A 128B dense model at FP8 needs around 128GB of VRAM for the weights, plus more headroom for KV cache and context. Four GPUs at 80GB each (320GB total) give you room for production deployment with reasonable batch sizes and the full 256k context.
Suitable configurations:
- 4x NVIDIA H100 80GB
- 4x NVIDIA H200 141GB
- 4x NVIDIA RTX PRO 6000 Blackwell 96GB
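For teams planning a deployment, a four GPU serving setup might look like the vLLM invocation below. This is a sketch, not an official recipe: the Hugging Face repository name is a guess, and the right flag values depend on the actual released checkpoint.

```shell
# Hypothetical vLLM launch for a 4x 80GB GPU node.
# "mistralai/Mistral-Medium-3.5" is an ASSUMED repo name; check Hugging Face.
vllm serve mistralai/Mistral-Medium-3.5 \
  --tensor-parallel-size 4 \
  --quantization fp8 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90
```

Tensor parallelism across all four GPUs keeps the FP8 weights and KV cache within the 320GB pool; lowering `--max-model-len` frees cache memory for larger batch sizes if you don't need the full 256k context.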
At BF16 Precision (Higher Accuracy, More VRAM)
At BF16, the model needs roughly 256GB of VRAM for the weights alone. That usually means 4x H200 (141GB each, 564GB total) or moving to a bigger configuration.
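The back-of-envelope arithmetic behind these figures is straightforward. Here's a sketch assuming 1 byte per parameter at FP8 and 2 bytes at BF16; the KV cache numbers (layer count, KV heads, head dimension) are assumptions for illustration, since Mistral hasn't published the architecture details.

```python
# Back-of-envelope VRAM estimate for a 128B dense model.
# Architecture numbers in kv_cache_gb are ASSUMPTIONS, not published specs.

PARAMS = 128e9  # 128 billion parameters

def weight_vram_gb(bytes_per_param: float) -> float:
    """VRAM for the model weights alone, in GB."""
    return PARAMS * bytes_per_param / 1e9

def kv_cache_gb(tokens: int, n_layers: int = 88, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 1) -> float:
    """Rough KV cache size: 2 entries (K and V) per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return tokens * per_token / 1e9

print(f"FP8 weights:  {weight_vram_gb(1):.0f} GB")              # 128 GB
print(f"BF16 weights: {weight_vram_gb(2):.0f} GB")              # 256 GB
print(f"KV cache, 256k ctx @ FP8: {kv_cache_gb(256_000):.0f} GB")
```

Under these assumptions, a full 256k token context adds tens of GB of KV cache on top of the weights, which is why the 320GB four GPU pool is the practical floor rather than a bare 128GB.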
Server Considerations Beyond GPUs
Running Medium 3.5 in production isn’t just about GPU count. The full stack matters:
- CPU: Multi core server CPU (Intel Xeon, AMD EPYC, or Threadripper PRO) for data preprocessing, request handling, and tool orchestration
- RAM: 256GB or more system memory for large context handling and concurrent users
- Storage: NVMe SSDs for model loading, KV cache offloading, and dataset access
- Networking: 10GbE minimum if the model is serving multiple users or feeding downstream systems
- Power and Cooling: 4x H100 or H200 systems can pull 2,500W+ under sustained load and need proper thermal design
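The power figure above follows from simple addition of component TDPs. A sketch using nominal spec values (real draw varies with load, and PSU inefficiency adds on top):

```python
# Rough sustained power budget for a 4-GPU inference server.
# TDP figures are nominal spec values, not measured draw.

GPU_TDP_W = {"H100 SXM": 700, "H100 PCIe": 350, "H200 SXM": 700}

def system_power_w(gpu: str, n_gpus: int = 4,
                   cpu_w: int = 350, overhead_w: int = 300) -> int:
    """GPUs plus CPU plus fans/drives/NICs, before PSU losses."""
    return GPU_TDP_W[gpu] * n_gpus + cpu_w + overhead_w

print(system_power_w("H100 SXM"))   # 700*4 + 350 + 300 = 3450 W
print(system_power_w("H100 PCIe"))  # 350*4 + 350 + 300 = 2050 W
```

An SXM build lands well above the 2,500W mark, which usually means 208V+ power circuits and dedicated cooling rather than office wall outlets.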
Mistral Medium 3.5 vs Other Open Weight Models
Medium 3.5 sits in an interesting middle ground: smaller than Large 3, yet higher performing on coding benchmarks, with its 77.6% on SWE Bench Verified ahead of larger competitors.
What’s New in This Release
The Medium 3.5 announcement included more than just model weights:
- Mistral Vibe Remote Agents: Coding agents that run async in the cloud, can be spawned from CLI or Le Chat, and produce finished pull requests on GitHub.
- Le Chat Work Mode: A new agentic mode for cross tool workflows including email triage, research synthesis, and multi step project execution.
- Configurable Reasoning: Reasoning effort can be set per request, so the same model handles both quick replies and extended chain of thought tasks without separate deployments.
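As a sketch of what per request reasoning control could look like in practice: the announcement doesn't document the request level knob, so the `reasoning_effort` field name and its values below are purely illustrative, not Mistral's actual API.

```python
# Illustrative request payloads only. "reasoning_effort" is an ASSUMED
# parameter name; check Mistral's API documentation for the real one.

def make_request(prompt: str, effort: str) -> dict:
    """Build a chat-completion style payload with a per-request effort hint."""
    return {
        "model": "mistral-medium-3.5",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,  # hypothetical knob: e.g. "low" / "high"
    }

quick = make_request("Summarize this diff.", "low")
deep = make_request("Find the race condition in this code.", "high")
print(quick["reasoning_effort"], deep["reasoning_effort"])
```

The point of the feature is the deployment simplification: one set of weights serves both latency sensitive replies and long chain of thought work, selected per request instead of per model.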
- Vision Support: The model includes a vision encoder trained from scratch to handle variable image sizes and aspect ratios.
How the Vibe Agentic Runtime Connects
Inputs such as the Vibe CLI, version control, issue tracking, monitoring and observability, and ChatOps feed into the Vibe Agentic Runtime, which produces pull requests and artifacts, documentation, and reporting. A human in the loop oversees the runtime throughout.
Diagram concept based on Mistral AI’s announcement.
Frequently Asked Questions
What is Mistral Medium 3.5?
Mistral Medium 3.5 is a 128 billion parameter dense language model released by Mistral AI on April 29, 2026. It’s the company’s first flagship merged model, combining instruction following, reasoning, and coding in a single set of weights. It’s available as open weights under a modified MIT license.
Can I run Mistral Medium 3.5 on my own hardware?
Yes. Mistral says the model can be self hosted on as few as four GPUs. The practical minimum is four high VRAM data center GPUs like the NVIDIA H100 80GB or H200 141GB. The model weights are on Hugging Face under a modified MIT license.
How much VRAM does Mistral Medium 3.5 need?
At FP8 precision, the 128B model needs about 128GB of VRAM for weights, plus more for KV cache and context. A four GPU configuration with 80GB per GPU (320GB total) gives enough headroom for production inference with the full 256k context window.
How does Mistral Medium 3.5 compare to GPT-4 or Claude?
Mistral Medium 3.5 scores 77.6% on SWE Bench Verified, a coding capability benchmark. Mistral says this is ahead of Devstral 2 and Qwen3.5 397B A17B. Direct comparisons with closed models like GPT-4 and Claude haven’t been independently published yet. The key practical difference: Medium 3.5 is open weight and can be self hosted, while GPT-4 and Claude are API only.
What is the context window for Mistral Medium 3.5?
The model supports a 256,000 token context window. Good for long documents, large codebases, and extended agentic conversations.
Is Mistral Medium 3.5 free?
The open weights are free to download from Hugging Face under a modified MIT license. Self hosting requires hardware investment. The Mistral API costs $1.50 per million input tokens and $7.50 per million output tokens. Using the model through Le Chat requires a Pro, Team, or Enterprise plan.
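The quoted API rates make the self hosting trade-off easy to model. A sketch of the cost arithmetic, where the example token volumes are hypothetical:

```python
# API cost at the quoted rates: $1.50 per 1M input, $7.50 per 1M output tokens.
INPUT_PER_M = 1.50
OUTPUT_PER_M = 7.50

def api_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Total API bill in USD for a given token volume."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# Hypothetical example: a coding-agent workload of 50M input / 5M output per day.
print(f"${api_cost_usd(50_000_000, 5_000_000):.2f} per day")  # $112.50 per day
```

At sustained volumes like that, a per token bill compounds quickly, which is the "hardware decision instead of a per token expense" point from earlier in the article.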
What’s the difference between Mistral Medium 3.5 and Mistral Large 3?
Mistral Large 3 is a 675 billion parameter MoE (Mixture of Experts) model with 41B active parameters per forward pass. Mistral Medium 3.5 is a 128B dense model. Large 3 is bigger and needs more hardware to self host, while Medium 3.5 is more accessible at four GPUs.
What hardware do I need to fine tune Mistral Medium 3.5?
Fine tuning needs more VRAM than inference. For full fine tuning of a 128B model, expect to need 8 or more high VRAM GPUs (H100 80GB or H200 141GB class). LoRA or QLoRA fine tuning can drop this requirement quite a bit, potentially fitting in a four GPU configuration with proper optimization.
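The gap between inference and full fine tuning comes from optimizer state. A sketch using a common rule of thumb (roughly 16 bytes per parameter for mixed precision training with Adam: BF16 weights and gradients plus FP32 master weights and two moment buffers); this is an assumption, ignores activation memory, and real requirements vary with the training setup.

```python
import math

# Rough fine-tuning memory estimate. The 16 bytes/param figure is an
# ASSUMED rule of thumb for mixed-precision Adam, excluding activations.

PARAMS = 128e9  # 128 billion parameters

def full_ft_gb(bytes_per_param: int = 16) -> float:
    """Weights + gradients + optimizer state for full fine-tuning, in GB."""
    return PARAMS * bytes_per_param / 1e9

def min_gpus(total_gb: float, gpu_vram_gb: int) -> int:
    """Smallest GPU count whose combined VRAM covers total_gb."""
    return math.ceil(total_gb / gpu_vram_gb)

need = full_ft_gb()
print(f"Full fine-tune: ~{need:.0f} GB -> {min_gpus(need, 141)}x H200-class GPUs")
```

That roughly 2TB figure is why full fine tuning pushes past eight GPUs, while LoRA and QLoRA, which freeze the base weights and train only small adapters, can fall back toward inference scale hardware.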
Can Mistral Medium 3.5 be used commercially?
Yes. The modified MIT license allows commercial use of the model weights. Review the specific license terms on Mistral’s Hugging Face repository before deployment.
Where can I download Mistral Medium 3.5?
The open weights are on Hugging Face. The model is also available through Mistral’s API, Le Chat, Mistral Vibe, and as an NVIDIA NIM containerized inference microservice.
What This Means for Teams Running On Prem AI
Medium 3.5’s hardware requirements put it in reach of organizations that have invested in real AI infrastructure but aren’t operating at hyperscaler scale. A four GPU server build is achievable for:
- Research labs running internal tooling and experiments
- Engineering teams deploying AI assisted coding agents
- Companies with data privacy requirements that prevent cloud deployment
- Studios and creative teams running AI workflows alongside traditional production
The math has shifted. A 128B parameter model with 256k context and competitive coding performance is now deployable on a single multi GPU server.
Building hardware for Mistral Medium 3.5 and other local AI workloads
VRLA Tech builds custom workstations and servers configured for self hosted AI workloads, including multi GPU configurations capable of running models like Mistral Medium 3.5. We work with teams across AI/ML, research, and enterprise to spec hardware that fits specific deployment requirements.
Source: Mistral AI’s official announcement, “Remote agents in Vibe. Powered by Mistral Medium 3.5.” Published April 29, 2026. Read the original announcement.