Mistral Medium 3.5 Released: What It Means for Self Hosted AI

Published April 29, 2026

Mistral AI released Medium 3.5 today. It’s a 128 billion parameter dense model with open weights, a 256k context window, and the ability to self host on as few as four GPUs. For teams running AI on their own infrastructure, this is a big deal: a flagship class model that’s actually deployable outside hyperscaler data centers.

Here’s what the release means, what hardware you need to run it, and where it fits with everything else out there.

Mistral Medium 3.5: Quick Specs

Specification             Detail
Parameters                128B (dense)
Context Window            256k tokens
License                   Modified MIT (open weights)
Self Host Minimum         4 GPUs
API Pricing               $1.50 per 1M input tokens, $7.50 per 1M output tokens
SWE Bench Verified Score  77.6%
Vision Support            Yes (custom trained vision encoder)
Reasoning                 Configurable per request
Availability              Hugging Face, Le Chat, Mistral Vibe, NVIDIA NIM

Source: Mistral AI official announcement

Why This Release Matters

For the past two years, the highest performing open weight models have needed eight or more high end GPUs to self host at full precision. That hardware bar pushed teams toward cloud APIs whether they wanted to be there or not, especially for production workloads.

Mistral Medium 3.5 changes that. A 128B dense model running on four GPUs is a multi GPU server build, not a hyperscaler only deployment. Teams running their own infrastructure can now host a model that scores 77.6% on SWE Bench Verified, ahead of bigger competitors like Qwen3.5 397B A17B.

The bottom line: A 128B parameter model with 256k context and strong coding performance is now deployable on a single multi GPU server. Not a rack of them.

What this means in practice:

  • Teams worried about data privacy can keep AI workloads on premise
  • Production inference costs become a hardware decision instead of a per token expense (a rough break even sketch follows this list)
  • Fine tuning and customization stay fully under your control
  • Latency drops because requests don’t leave the local network
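
To put a number on the second point, here’s a rough break even sketch in Python. Only the per token API prices come from Mistral’s published pricing; the server cost and monthly token volumes are illustrative assumptions.

```python
# Rough API-vs-self-host break-even sketch. The server cost and token
# volumes are illustrative assumptions; only the per-token API prices
# come from Mistral's published pricing.

API_INPUT_PER_M = 1.50    # USD per 1M input tokens (published)
API_OUTPUT_PER_M = 7.50   # USD per 1M output tokens (published)
SERVER_COST = 250_000     # assumed cost of a 4x H100-class server, USD

def monthly_api_cost(input_m_tokens: float, output_m_tokens: float) -> float:
    """API spend for a month, with token volumes given in millions."""
    return input_m_tokens * API_INPUT_PER_M + output_m_tokens * API_OUTPUT_PER_M

# Example: a team pushing 5B input / 1B output tokens per month
monthly = monthly_api_cost(5_000, 1_000)
print(f"API spend: ${monthly:,.0f}/month")                      # ~$15,000/month
print(f"Hardware pays back in ~{SERVER_COST / monthly:.0f} months")
```

This ignores power, cooling, and operations labor, so treat it as a first order comparison rather than a full TCO model.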

Hardware Requirements: What You Actually Need

Mistral says Medium 3.5 can self host on “as few as four GPUs,” but the announcement doesn’t specify which GPUs. Based on the model size and standard precision options, here’s what that means in practice.

At FP8 Precision (Recommended for Production Inference)

A 128B dense model at FP8 needs around 128GB of VRAM for the weights, plus more headroom for KV cache and context. Four GPUs at 80GB each (320GB total) gives you room for production deployment with reasonable batch sizes and the full 256k context.
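
As a sanity check on those numbers, here’s the arithmetic in a short Python sketch. The layer and head counts are hypothetical, since Mistral hasn’t published the architecture details; they’re only there to illustrate why KV cache headroom matters at 256k context.

```python
# Back-of-envelope VRAM estimate for a 128B dense transformer. The layer
# and head counts below are illustrative guesses; Mistral has not
# published Medium 3.5's architecture details.

def weights_gb(params_b: float, bytes_per_param: float) -> float:
    # 1B parameters at 1 byte each (FP8) is roughly 1 GB
    return params_b * bytes_per_param

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 1) -> float:
    # 2x for keys and values, stored per token, per layer
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9

print(f"FP8 weights:  {weights_gb(128, 1):.0f} GB")    # ~128 GB
print(f"BF16 weights: {weights_gb(128, 2):.0f} GB")    # ~256 GB
# Hypothetical architecture: 88 layers, 8 KV heads (GQA), head_dim 128
print(f"FP8 KV cache, one 256k-token sequence: "
      f"{kv_cache_gb(88, 8, 128, 256_000):.0f} GB")    # ~46 GB
```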

Suitable configurations:

  • 4x NVIDIA H100 80GB
  • 4x NVIDIA H200 141GB
  • 4x NVIDIA RTX PRO 6000 Blackwell 96GB

At BF16 Precision (Higher Accuracy, More VRAM)

At BF16, the model needs roughly 256GB of VRAM for the weights alone. That usually means 4x H200 (141GB each, 564GB total) or moving to a bigger configuration.

Server Considerations Beyond GPUs

Running Medium 3.5 in production isn’t just about GPU count. The full stack matters:

  • CPU: Multi core server CPU (Intel Xeon, AMD EPYC, or Threadripper PRO) for data preprocessing, request handling, and tool orchestration
  • RAM: 256GB or more system memory for large context handling and concurrent users
  • Storage: NVMe SSDs for model loading, KV cache offloading, and dataset access
  • Networking: 10GbE minimum if the model is serving multiple users or feeding downstream systems
  • Power and Cooling: 4x H100 or H200 systems can pull 2,500W+ under sustained load and need proper thermal design

Mistral Medium 3.5 vs Other Open Weight Models

Model                 Parameters              License        Context  Self Host Floor
Mistral Medium 3.5    128B dense              Modified MIT   256k     4 GPUs
Mistral Large 3       675B MoE (41B active)   Apache 2.0     256k     8+ GPUs
Mistral Small 4       119B MoE (6B active)    Apache 2.0     128k     2 GPUs
Llama 3.1 405B        405B dense              Llama license  128k     8+ GPUs
Qwen3.5 397B A17B     397B MoE                Apache 2.0     128k     8+ GPUs

Medium 3.5 sits in an interesting middle ground: smaller than Large 3, but higher performing on coding benchmarks, where its 77.6% SWE Bench Verified score puts it ahead of larger competitors.

What’s New in This Release

The Medium 3.5 announcement included more than just model weights:

  • Mistral Vibe Remote Agents: Coding agents that run async in the cloud, can be spawned from CLI or Le Chat, and produce finished pull requests on GitHub.
  • Le Chat Work Mode: A new agentic mode for cross tool workflows including email triage, research synthesis, and multi step project execution.
  • Configurable Reasoning: Reasoning effort can be set per request, so the same model handles both quick replies and extended chain of thought tasks without separate deployments (a request sketch follows this list).
  • Vision Support: The model includes a vision encoder trained from scratch to handle variable image sizes and aspect ratios.
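
Mistral’s announcement doesn’t document the API field that controls reasoning effort, so the sketch below is only a guess at what a per request setting could look like against Mistral’s chat completions endpoint; the `reasoning_effort` field name and model id are assumptions.

```python
# Hypothetical sketch of per-request reasoning configuration. The
# "reasoning_effort" field name and the model id are assumptions;
# Mistral's announcement doesn't document the actual API parameter.
import requests

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "mistral-medium-3.5",  # assumed model id
        "messages": [{"role": "user", "content": "Summarize this diff."}],
        "reasoning_effort": "low",      # quick reply; "high" for extended chain of thought
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```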

How the Vibe Agentic Runtime Connects

Diagram (concept based on Mistral AI’s announcement): the Vibe Agentic Runtime sits at the center, connected on one side to the Vibe CLI, version control, issue tracking, monitoring and observability, and ChatOps, and on the other to its outputs: pull requests and artifacts, documentation, and reporting, with a human in the loop reviewing the results.

Frequently Asked Questions

What is Mistral Medium 3.5?

Mistral Medium 3.5 is a 128 billion parameter dense language model released by Mistral AI on April 29, 2026. It’s the company’s first flagship merged model, combining instruction following, reasoning, and coding in a single set of weights. It’s available as open weights under a modified MIT license.

Can I run Mistral Medium 3.5 on my own hardware?

Yes. Mistral says the model can be self hosted on as few as four GPUs. The practical minimum is four high VRAM data center GPUs like the NVIDIA H100 80GB or H200 141GB. The model weights are on Hugging Face under a modified MIT license.
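
As an illustration, here’s what a four GPU deployment could look like using vLLM’s offline Python API. The Hugging Face repo id and the FP8 quantization flag are assumptions; check the model card for the officially supported serving path.

```python
# Sketch of a 4-GPU vLLM deployment. The repo id is a guess at the
# Hugging Face name; FP8 support for this model is also an assumption.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Medium-3.5",  # hypothetical repo id
    tensor_parallel_size=4,                # shard weights across 4 GPUs
    quantization="fp8",                    # ~128GB of weights at FP8
    max_model_len=262_144,                 # full 256k context
)

out = llm.generate(["Explain KV cache offloading."],
                   SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)
```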

How much VRAM does Mistral Medium 3.5 need?

At FP8 precision, the 128B model needs about 128GB of VRAM for weights, plus more for KV cache and context. A four GPU configuration with 80GB per GPU (320GB total) gives enough headroom for production inference with the full 256k context window.

How does Mistral Medium 3.5 compare to GPT-4 or Claude?

Mistral Medium 3.5 scores 77.6% on SWE Bench Verified, a coding capability benchmark. Mistral says this is ahead of Devstral 2 and Qwen3.5 397B A17B. Direct comparisons with closed models like GPT-4 and Claude haven’t been independently published yet. The key practical difference: Medium 3.5 is open weight and can be self hosted, while GPT-4 and Claude are API only.

What is the context window for Mistral Medium 3.5?

The model supports a 256,000 token context window. Good for long documents, large codebases, and extended agentic conversations.

Is Mistral Medium 3.5 free?

The open weights are free to download from Hugging Face under a modified MIT license. Self hosting requires hardware investment. The Mistral API costs $1.50 per million input tokens and $7.50 per million output tokens. Using the model through Le Chat requires a Pro, Team, or Enterprise plan.

What’s the difference between Mistral Medium 3.5 and Mistral Large 3?

Mistral Large 3 is a 675 billion parameter MoE (Mixture of Experts) model with 41B active parameters per forward pass. Mistral Medium 3.5 is a 128B dense model. Large 3 is bigger and needs more hardware to self host, while Medium 3.5 is more accessible at four GPUs.

What hardware do I need to fine tune Mistral Medium 3.5?

Fine tuning needs more VRAM than inference. For full fine tuning of a 128B model, expect to need 8 or more high VRAM GPUs (H100 80GB or H200 141GB class). LoRA or QLoRA fine tuning can drop this requirement quite a bit, potentially fitting in a four GPU configuration with proper optimization.
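
For a sense of what the LoRA path looks like, here’s a minimal sketch using Hugging Face’s peft library. The repo id and target module names are assumptions based on earlier Mistral architectures, not confirmed details of Medium 3.5.

```python
# LoRA fine-tuning sketch with Hugging Face peft. The repo id and target
# module names are assumptions; verify against the published model card.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Medium-3.5",      # hypothetical repo id
    torch_dtype=torch.bfloat16,
    device_map="auto",                   # spread layers across available GPUs
)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"], # typical attention projections in Mistral models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()       # LoRA trains a small fraction of the 128B weights
```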

Can Mistral Medium 3.5 be used commercially?

Yes. The modified MIT license allows commercial use of the model weights. Review the specific license terms on Mistral’s Hugging Face repository before deployment.

Where can I download Mistral Medium 3.5?

The open weights are on Hugging Face. The model is also available through Mistral’s API, Le Chat, Mistral Vibe, and as an NVIDIA NIM containerized inference microservice.
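
A minimal download sketch with huggingface_hub follows; the exact repository name is an assumption.

```python
# Minimal weight-download sketch; the repo id is an assumption.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="mistralai/Mistral-Medium-3.5",  # hypothetical repo id
    local_dir="./mistral-medium-3.5",
)
print(f"Weights downloaded to {path}")
```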

What This Means for Teams Running On Prem AI

Medium 3.5’s hardware requirements put it in reach of organizations that have invested in real AI infrastructure but aren’t operating at hyperscaler scale. A four GPU server build is achievable for:

  • Research labs running internal tooling and experiments
  • Engineering teams deploying AI assisted coding agents
  • Companies with data privacy requirements that prevent cloud deployment
  • Studios and creative teams running AI workflows alongside traditional production

The math has shifted. A 128B parameter model with 256k context and competitive coding performance is now deployable on a single multi GPU server.

Building hardware for Mistral Medium 3.5 and other local AI workloads

VRLA Tech builds custom workstations and servers configured for self hosted AI workloads, including multi GPU configurations capable of running models like Mistral Medium 3.5. We work with teams across AI/ML, research, and enterprise to spec hardware that fits specific deployment requirements.

Explore AI Workstations

Source: Mistral AI’s official announcement, “Remote agents in Vibe. Powered by Mistral Medium 3.5.” Published April 29, 2026. Read the original announcement.
