⚡ The Brief
- Mistral Large 3 (675B dense flagship) and Small 4 (119B, released April 2026) anchor the 2026 production lineup for diverse AI workloads.
- Devstral Small 2 (24B) has drawn over 308,000 downloads, highlighting the community’s focus on specialized coding models for self-hosting.
- Ministral 3 provides on-device inference models at 3B, 8B, and 14B parameter scales, optimized with BF16 and GGUF quantizations for edge deployments.
- Voxtral Mini (4B realtime) surpassed one million downloads, becoming the most popular voice model in the Mistral ecosystem.
- Most weights are under the permissive Apache 2.0 license, while flagship models use the Mistral Research License requiring commercial negotiation.
[IMAGE_PLACEHOLDER_HEADER]
Mistral’s 2026 lineup presents a coherent ecosystem of large dense models, specialized variants, and edge-optimized on-device offerings that cater to a broad range of AI applications. Moving away from earlier sparse MoE approaches like Mixtral, this family emphasizes simplicity, specialization, and scalability across deployment scenarios — from cluster-scale reasoning to real-time voice and mobile inference.
Comprehensive Overview of the Mistral 2026 Lineup
The 2026 Mistral family is structured around three main axes that address different AI demands:
- Flagship Dense Models: Including Mistral Large 3 (675B parameters) delivering top-tier reasoning and general-purpose capabilities.
- Mid-sized Dense Workhorses: Mistral Small 4 (119B parameters) balances performance and scalability for production workloads.
- Specialists & Edge Models: Devstral 2 excels at coding, Magistral Small targets reasoning, Ministral 3 supports on-device inference, and Voxtral leads voice applications.
This lineup replaces the legacy Mixtral MoE 8x22B and 8x7B models with a dense and specialist model strategy, prioritizing deployment simplicity and domain-specific performance.
As part of the broader open-source AI ecosystem, Mistral’s models are extensively hosted on Hugging Face, supporting self-hosting with detailed licensing and hardware guidance. If you’re building a production AI stack or exploring self-hosted large language models, understanding this lineup’s capabilities and constraints is essential.
[IMAGE_PLACEHOLDER_SECTION_1]
Detailed Analysis of Flagship and Mid-Tier Models
Mistral Large 3 (675B Dense Flagship)
Model ID: mistralai/Mistral-Large-3-675B-Instruct-2512
The Mistral Large 3 model is the pinnacle of the 2026 open-weight lineup, featuring 675 billion dense parameters without sparse MoE routing. This model provides:
- Unparalleled reasoning and generalist AI quality for complex, multi-step workflows requiring minimal external tooling.
- Cluster-scale infrastructure requirements: It demands multi-node GPU clusters with advanced parallelism (tensor + pipeline) due to its massive size.
- Research-grade evaluation potential: Comparable to top-tier proprietary models such as GPT-5.4-pro or Claude Opus 4.7, suitable for internal R&D.
Hardware & Deployment Notes: Even with NVFP4 quantization, the model requires 8–16 high-memory GPUs (e.g., NVIDIA H100 80GB) and sophisticated sharding strategies. It is targeted at organizations with existing GPU clusters and mature MLOps capabilities.
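As a minimal sketch of what serving at this scale might look like with vLLM, assuming the model ID above is live on Hugging Face and that your vLLM build supports the checkpoint; the parallelism degrees are illustrative starting points, not tuned values:

```python
from vllm import LLM, SamplingParams

# Illustrative sharding for a 2-node, 16-GPU deployment:
# tensor parallelism splits each layer across 8 GPUs per node,
# pipeline parallelism splits the layer stack across the 2 nodes.
llm = LLM(
    model="mistralai/Mistral-Large-3-675B-Instruct-2512",
    tensor_parallel_size=8,
    pipeline_parallel_size=2,
    dtype="bfloat16",    # swap in a quantized build to cut VRAM further
    max_model_len=8192,  # cap context length to keep the KV cache bounded
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Summarize the tradeoffs of dense vs. MoE models."], params)
print(outputs[0].outputs[0].text)
```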
Competitive Context:
- Vs. Llama 4 Scout/Maverick: Mistral Large 3 opts for full dense capacity favoring quality over MoE’s parameter efficiency.
- Vs. DeepSeek V4 Pro: DeepSeek emphasizes efficiency; Mistral Large 3 is a “bring all the GPUs” flagship prioritizing control.
- Vs. Qwen 3.5/3.6: Qwen’s largest models are smaller; Large 3 occupies a distinct tier requiring cluster-scale resources.
Overall, Large 3 is best suited for organizations prepared to invest heavily in infrastructure to maximize AI quality and control.
Mistral Small 4 (119B, Released April 2026)
Model IDs:
- mistralai/Mistral-Small-4-119B-2603 – base dense 119B-parameter model
- mistralai/Mistral-Small-4-119B-2603-eagle – speculative decoding partner model
- mistralai/Mistral-Small-4-119B-2603-NVFP4 – quantized NVFP4 variant
Mistral Small 4 strikes a pragmatic balance between performance, cost, and deployment complexity:
- Capable of production chat, agent workflows, knowledge-intensive tasks, and light coding.
- NVFP4 quantization greatly reduces VRAM usage, enabling deployment on 2–4 GPUs.
- Speculative decoding with the eagle model boosts throughput and lowers latency in high-demand scenarios.
Deployment Considerations: Running the full-precision model requires >200 GB VRAM, typically spread across multiple A100 or H100 GPUs. The NVFP4 variant and eagle pairing make it more accessible for mid-sized organizations.
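A sketch of how the eagle pairing might be wired up in vLLM; note that the speculative_config schema varies across vLLM versions, and the num_speculative_tokens value here is an arbitrary starting point:

```python
from vllm import LLM, SamplingParams

# NVFP4 main model plus the eagle draft model for speculative decoding.
# The draft proposes a few tokens per step; the main model verifies them
# in one forward pass, trading a little extra VRAM for lower latency.
llm = LLM(
    model="mistralai/Mistral-Small-4-119B-2603-NVFP4",
    tensor_parallel_size=2,
    speculative_config={
        "method": "eagle",
        "model": "mistralai/Mistral-Small-4-119B-2603-eagle",
        "num_speculative_tokens": 4,  # arbitrary; tune against your traffic
    },
)

outputs = llm.generate(
    ["Draft a short incident-report template."],
    SamplingParams(max_tokens=300),
)
print(outputs[0].outputs[0].text)
```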
Comparison with peers:
- Llama 4 Scout/Maverick: Smaller MoE models optimized for effective capacity, but Small 4 offers simpler dense deployment.
- Qwen 3.6 (27B/35B-A3B): Smaller and cheaper but less capable for heavier reasoning.
- DeepSeek V4 Flash: Focuses on throughput, while Small 4 balances quality and cost.
Mistral Small 4 is the recommended generalist for most production self-hosted AI needs.
[INTERNAL_LINK]
Specialized Models: Devstral, Magistral, Ministral, and Voxtral
Devstral 2: Coding Specialists at 123B and 24B
Model IDs:
- mistralai/Devstral-2-123B-Instruct-2512 – high-end 123B coding specialist
- mistralai/Devstral-Small-2-24B-Instruct-2512 – practical 24B coding model with 308,984+ downloads
Devstral 2 is tailored for code understanding, generation, and assistance:
- Supports inline code completion, refactoring, documentation, and code review bots.
- Enables on-premises coding assistants for compliance and privacy.
- 24B variant is widely adopted for single-GPU setups, balancing quality and resource use.
Comparison:
- Vs. GPT-5.x Codex: API models lead in raw quality; Devstral excels in predictable cost and privacy.
- Vs. DeepSeek V4 Pro: Devstral is more focused on coding specialization.
- Vs. Qwen 3.5/3.6 code models: Devstral Small 2’s recent training data and function-calling schema offer advantages for API scaffolding.
Ideal for teams building coding assistant stacks emphasizing self-hosting and integration into IDEs.
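A sketch of wiring Devstral Small 2 into tooling over an OpenAI-compatible endpoint, such as one exposed by vLLM’s `vllm serve`; the base URL and prompt are placeholders:

```python
from openai import OpenAI

# Point the standard OpenAI client at a self-hosted server, e.g. one
# started with: vllm serve mistralai/Devstral-Small-2-24B-Instruct-2512
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="mistralai/Devstral-Small-2-24B-Instruct-2512",
    messages=[
        {"role": "system", "content": "You are a code-review assistant."},
        {"role": "user", "content": "Review this function for bugs:\n\ndef mean(xs): return sum(xs) / len(xs)"},
    ],
    temperature=0.2,  # low temperature keeps code review output consistent
)
print(resp.choices[0].message.content)
```

Because the endpoint speaks the OpenAI wire format, the same client code works unchanged if you later swap in the 123B variant.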
[INTERNAL_LINK]
Magistral Small: Reasoning-Focused Variant
Model ID: Magistral-Small-2509 (17,583 downloads)
Designed for multi-step reasoning tasks requiring interpretability and explicit chain-of-thought capabilities, Magistral Small is optimized for:
- Complex planning, tool orchestration, and verification tasks.
- Evaluating and judging outputs from other models.
Positioning:
- More reasoning-specialized than Mistral Small 4’s generalist capabilities.
- Complementary to Devstral for coding agents (Magistral for reasoning, Devstral for code generation).
- Server-side deployment on 1–2 mid-memory GPUs.
While not leading benchmarks against proprietary models like Claude Opus 4.7, Magistral Small offers a valuable self-hosted alternative tuned for reasoning-centric applications.
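To illustrate the judge use case, a minimal transformers sketch; the rubric and prompt format below are invented for the example, not a documented Magistral schema:

```python
from transformers import pipeline

# Load Magistral Small as a local judge model (assumes it fits your GPU,
# possibly via a quantized variant).
judge = pipeline("text-generation", model="mistralai/Magistral-Small-2509")

candidate = "Paris is the capital of France, founded in 1850."
messages = [
    {"role": "user", "content": (
        "Judge the following answer for factual accuracy. "
        "Reason step by step, then end with VERDICT: PASS or FAIL.\n\n"
        f"Answer: {candidate}"
    )},
]
result = judge(messages, max_new_tokens=256)
# The pipeline returns the chat with the assistant reply appended last.
print(result[0]["generated_text"][-1]["content"])
```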
Ministral 3: On-Device Models for Edge and Mobile
Model IDs (January 2026):
- Ministral-3-3B-Instruct-2512-BF16
- Ministral-3-8B-Instruct-2512-BF16
- Ministral-3-14B-Instruct-2512-BF16
- Ministral-3-3B-Reasoning-2512-GGUF, plus ONNX variants
Ministral 3 targets consumer and edge hardware, supporting:
- Deployment on GPUs with 8–24 GB VRAM, suitable for laptops, edge servers, and mobile devices.
- Privacy-sensitive, latency-critical applications where cloud access is restricted.
- GGUF and ONNX quantized variants enable CPU-only or low-resource inference.
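For the CPU-only path, a llama-cpp-python sketch; the GGUF filename is a guess at how a downloaded quantized artifact would be named:

```python
from llama_cpp import Llama

# CPU-only inference from a local GGUF file; n_gpu_layers=0 keeps all
# layers on the CPU, and n_ctx bounds context length (and memory use).
llm = Llama(
    model_path="./ministral-3-3b-reasoning-2512-q4_k_m.gguf",  # hypothetical filename
    n_ctx=4096,
    n_gpu_layers=0,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "List three uses for on-device LLMs."}],
    max_tokens=200,
)
print(out["choices"][0]["message"]["content"])
```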
Comparative Landscape:
- Competes with Qwen’s 0.8B–7B models for mobile efficiency.
- Offers dense, easier-to-optimize alternatives to Llama 4’s small MoE variants.
- Complements Gemini 3.1 Flash’s cloud API speed with local inference capabilities.
Ministral 3 is the backbone for local-first AI stacks prioritizing privacy and responsiveness.
Voxtral: Realtime Voice and Text-to-Speech Models
Model IDs (March 2026):
- Voxtral-Mini-4B-Realtime-2602 – realtime streaming voice interaction
- Voxtral-4B-TTS-2603 – text-to-speech synthesis
Voxtral enables fully local voice pipelines with:
- Sub-200ms latency for voice assistants, call centers, and IVR systems.
- Duplex voice interactions with interruption and turn-taking support.
- Standalone TTS capabilities for speech synthesis.
With over one million downloads, Voxtral is widely adopted for regulated environments demanding data sovereignty and local processing.
Compared to API-based voice stacks (e.g., GPT-5 voice extensions), Voxtral offers control and privacy at the cost of infrastructure management.
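What a local TTS call might look like, assuming the model is served behind an OpenAI-compatible audio endpoint; the route and payload here are hypothetical, not a documented Voxtral API:

```python
import requests

# Hypothetical self-hosted TTS endpoint following the OpenAI
# /v1/audio/speech request shape; Voxtral's actual serving
# interface may differ.
resp = requests.post(
    "http://localhost:8000/v1/audio/speech",  # assumed local server
    json={
        "model": "Voxtral-4B-TTS-2603",
        "input": "Your appointment is confirmed for Tuesday at 3 PM.",
        "response_format": "wav",
    },
    timeout=30,
)
resp.raise_for_status()
with open("confirmation.wav", "wb") as f:
    f.write(resp.content)  # audio bytes never leave the premises
```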
[INTERNAL_LINK]
[IMAGE_PLACEHOLDER_SECTION_2]
Licensing, Hardware Requirements, and Deployment Considerations
Licensing Overview: Apache 2.0 vs Mistral Research License
Mistral’s 2026 models are released under two primary licenses:
- Apache 2.0 License: Permissive license allowing commercial use, modification, and redistribution. Most Ministral, Devstral Small 2, and Voxtral models fall under this.
- Mistral Research License: Restricts commercial deployment without a separate agreement. Flagship models like Mistral Large 3 use this license.
Important: Always verify the license on the Hugging Face model card before deploying commercially. Failure to comply can result in legal issues.
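A quick way to check the license tag programmatically before deployment, using huggingface_hub (the repo ID is the Small 4 base model from above):

```python
from huggingface_hub import model_info

info = model_info("mistralai/Mistral-Small-4-119B-2603")

# The license is exposed as a tag on the model card,
# e.g. "license:apache-2.0".
license_tags = [t for t in info.tags if t.startswith("license:")]
print(license_tags or "No license tag found; read the model card manually.")
```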
Hardware Requirements Summary
- Mistral Large 3 (675B): Multi-node GPU clusters (8–16+ A100/H100 GPUs), 1.3+ TB checkpoint size, NVFP4 quantization reduces to ~340 GB.
- Mistral Small 4 (119B): 2–4 GPUs (A100/H100 80GB), NVFP4 reduces VRAM to 60–80 GB; eagle speculative decoding improves latency and throughput.
- Devstral 123B: Similar to Small 4, 2–4 GPUs recommended.
- Devstral Small 2 (24B): Single 24–40 GB GPU (RTX 4090, A5000, L40), quantized variants fit in 12–16 GB GPUs.
- Magistral Small: 1–2 GPUs with 24–48 GB each, or a quantized single-GPU setup.
- Ministral 3 (3B/8B/14B): 3B runs on CPU or small GPUs (4–8 GB), 8B fits 12–16 GB GPUs, 14B needs 24+ GB GPUs or quantization.
- Voxtral 4B: 8–12 GB GPUs suffice; CPU-only possible with quantization.
Note that KV cache size, batch size, and context length significantly affect VRAM usage in production.
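A back-of-envelope KV cache estimate makes that point concrete; the layer and head counts below are invented placeholders, since Mistral has not published Small 4’s architecture:

```python
# Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim
# * bytes_per_element * context_length * batch_size.
layers, kv_heads, head_dim = 88, 8, 128   # hypothetical architecture
bytes_fp16, context, batch = 2, 32_768, 8

kv_bytes = 2 * layers * kv_heads * head_dim * bytes_fp16 * context * batch
print(f"KV cache: {kv_bytes / 1e9:.1f} GB")  # ~94.5 GB for this configuration
```

Even with weights quantized, a long-context, high-batch deployment can spend tens of gigabytes on KV cache alone, which is why the per-model VRAM figures above are floors, not ceilings.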
Why Mistral Pivoted Away From Mixtral Sparse MoE Models
The early Mixtral 8x22B and 8x7B MoE models highlighted parameter efficiency but introduced deployment complexity:
- Simplicity of dense models: Easier memory and latency reasoning, smoother deployment without complex expert routing.
- Specialization over generality: purpose-built variants (Devstral for code, Voxtral for voice, Ministral for edge) deliver stronger domain-specific performance than a single sparse generalist.

