⚡ The Brief
- DeepSeek V4-Pro and V4-Flash are open-weight Mixture-of-Experts (MoE) models released on April 27, 2026, featuring a commercially permissive MIT-derived license enabling production deployment.
- V4-Pro is designed for frontier-class benchmarks emphasizing maximum quality, while V4-Flash prioritizes lower latency and cost-efficiency by activating fewer parameters per token during inference.
- Multiple self-hosting options are available, including GPU cluster-focused frameworks like vLLM and SGLang, as well as consumer-friendly llama.cpp GGUF quantizations for CPU and edge devices.
- Hardware requirements range from a single 24GB consumer GPU (e.g., RTX 4090) for quantized Flash models to multi-node A100 or H100 clusters for full-precision Pro deployments.
- V4 supersedes the December 2025 V3.2 generation, delivering improved benchmark performance across reasoning, coding, and multilingual tasks.
[IMAGE_PLACEHOLDER_HEADER]
Released on April 27, 2026, DeepSeek V4 has rapidly influenced the open-weight LLM landscape by offering two specialized Mixture-of-Experts (MoE) variants: DeepSeek V4-Pro and DeepSeek V4-Flash. Both models come with a commercially permissive MIT-derived license, empowering developers and organizations to deploy cutting-edge AI locally or in production environments without restrictive legal barriers.
This comprehensive guide offers a thorough, data-driven exploration of DeepSeek V4’s architecture, deployment options, hardware requirements, benchmark performance, and licensing nuances. Whether you’re a developer, data scientist, or AI infrastructure engineer, this article will equip you with the practical knowledge to self-host these models effectively and understand where they stand in the competitive open AI ecosystem of 2026.
DeepSeek V4 Family Overview: Pro vs Flash
[IMAGE_PLACEHOLDER_SECTION_1]
The DeepSeek V4 family consists exclusively of MoE models, marking a strategic shift from previous generations that included dense variants. The two main public instruction-tuned models are hosted on Hugging Face and have rapidly gained traction:
- deepseek-ai/DeepSeek-V4-Pro – The flagship, quality-first MoE model designed to rival or exceed frontier-class APIs like GPT-5.4-Pro and Claude Opus 4.7. It boasts the largest capacity and highest performance, suitable for applications demanding top-notch understanding and generation.
- deepseek-ai/DeepSeek-V4-Flash – A smaller, latency-optimized MoE variant focused on fast inference and operational cost-efficiency. It activates fewer experts, reducing computational load while maintaining solid reasoning and coding capabilities, ideal for real-time applications and edge deployments.
Both models also have corresponding base (non-instruction tuned) versions for advanced users intending to perform custom instruction tuning or domain adaptation:
- deepseek-ai/DeepSeek-V4-Pro-Base
- deepseek-ai/DeepSeek-V4-Flash-Base
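Since all of these checkpoints are distributed through Hugging Face, it is often convenient to pre-download the weights before pointing a serving framework at them. Below is a minimal sketch using the standard huggingface_hub client; the repo ID is one of those listed above, and the local directory is an arbitrary example.

```python
# Pre-download a DeepSeek V4 checkpoint so a serving framework can load it from local disk.
# Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V4-Flash",   # any of the repo IDs listed above
    local_dir="./models/deepseek-v4-flash",    # example target directory (adjust as needed)
)
print(f"Weights downloaded to: {local_path}")
```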
Key Technical Notes for Self-Hosting:
- MoE-only design: Unlike previous versions, V4 does not offer dense counterparts. Each token activates a subset of experts, reducing per-token compute but requiring full expert storage in memory.
- Pro vs Flash trade-offs: V4-Pro is suited for maximum accuracy and versatility, while V4-Flash targets lower latency and smaller hardware footprints without sacrificing core capabilities.
- Commercial license: The MIT-derived license permits commercial use with certain attribution and usage guidelines – essential to review before deployment.
For newcomers to open-source AI models and hosting frameworks, we recommend reviewing our Open-Source AI Hub overview for broader context.
Evolution from DeepSeek V3.2 to V4: What’s New?
DeepSeek V3 and its incremental update V3.2 (released in December 2025) already established a strong presence with millions of downloads and several specialized offshoots such as DeepSeek-V3.2-Speciale and domain-specific models like DeepSeek-OCR-2 and DeepSeek-Math-V2.
DeepSeek V4 represents a fundamental architectural and product strategy shift, characterized by:
- MoE-Only Architecture: While V3.x versions included MoE components, V4 fully commits to an MoE-only approach with a clear bifurcation between Pro and Flash models. This simplifies deployment decisions but requires infrastructure capable of handling expert routing and storage.
- Clear Latency vs Quality SKU Split: Instead of experimental variants, V4 introduces explicit product lines—Pro for highest quality and Flash for latency-sensitive use cases—aligning well with real-world deployment needs.
- Robust Ecosystem Support at Launch: V4 benefits from immediate quantization and serving framework support, including:
  - GGUF format support for llama.cpp, enabling CPU and consumer GPU usage.
  - MLX 4-bit and 8-bit quantizations optimized for Apple Silicon (M2/M3 series).
  - FP8 quantization with SGLang for efficient GPU serving.
  - Advanced mixed-precision and quantization variants for flexible deployment.
- Frontier API Benchmark Positioning: V4 is explicitly designed to compete with state-of-the-art commercial APIs like GPT-5.2, GPT-5.4, and Claude Sonnet 4.6, rather than older GPT-4-tier models.
In summary, existing DeepSeek V3.2 users looking for enhanced quality or performance should consider migrating to V4-Pro or V4-Flash depending on their deployment priorities.
Understanding Mixture-of-Experts (MoE) Architecture in Plain Language
MoE architecture fundamentally alters how transformer models process tokens, with direct implications for performance, memory, and serving stack complexity.
Dense Models: A typical dense transformer model (e.g., 70B parameters) activates all parameters for every token. This is straightforward but computationally expensive.
Mixture-of-Experts Models:
- The model is divided into multiple specialized experts, each a sub-network, usually embedded within feed-forward layers.
- A lightweight router module dynamically selects a small subset of experts (e.g., top-2) to process each token based on its content.
- Only the activated experts run for a given token, reducing computation per token, but all experts must reside in memory for routing.
Operational Implications:
- Memory Footprint: You must load all experts into GPU memory, so memory usage aligns with the total parameter count.
- Compute Efficiency: Actual FLOPs per token are significantly reduced since only a fraction of experts are active, offering better quality-per-FLOP compared to dense models of similar size.
- Serving Frameworks: MoE models demand MoE-aware serving stacks that efficiently handle routing and parallelism. Frameworks like vLLM and SGLang are tailored for such needs (a toy routing example follows this list).
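To make the routing mechanics concrete, here is a toy top-2 MoE feed-forward layer in PyTorch. This is a minimal sketch of the general technique, not DeepSeek's implementation: the expert count, hidden sizes, and gating details are placeholder values.

```python
# Toy Mixture-of-Experts feed-forward layer with top-2 routing (illustrative only;
# sizes and gating details are placeholders, not DeepSeek V4's actual configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        # All experts exist in memory, whether or not a given token uses them.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)  # lightweight gating network
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.router(x)                                   # (num_tokens, num_experts)
        weights, expert_ids = torch.topk(scores, self.top_k, -1)  # pick top-2 experts per token
        weights = F.softmax(weights, dim=-1)                      # normalize over the chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token, so compute scales with
        # top_k rather than with the total number of experts.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 512)        # a batch of 4 token embeddings
print(ToyMoELayer()(tokens).shape)  # torch.Size([4, 512])
```

Note that all eight experts are constructed up front (and would occupy memory), while each token only flows through its two selected experts, which is exactly the memory/compute asymmetry described above.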
DeepSeek V4 Variants:
- V4-Pro: Features more experts and wider expert networks, delivering maximum reasoning and generation quality but requiring heavier infrastructure.
- V4-Flash: Uses fewer or smaller experts, optimized for cost-sensitive, latency-critical environments with faster inference.
Hardware Requirements: Realistic Deployment Scenarios for V4-Pro and V4-Flash
[IMAGE_PLACEHOLDER_SECTION_2]
Accurately sizing hardware for DeepSeek V4 requires considering model size, precision, quantization, and concurrency demands. Below are practical, conservative guidelines based on community experience, model cards, and observed benchmarks.
Key Assumptions
- V4-Pro: A large-scale MoE model with an effective parameter footprint comparable to 400–700 billion parameters, activating fewer than 100 billion parameters per token.
- V4-Flash: Mid-sized MoE optimized for latency and cost, with an effective footprint similar to 100–200 billion parameters and fewer than 30 billion active parameters per token.
- Precision Formats: FP16/BF16 (2 bytes/parameter), FP8 (~1 byte/parameter), and 4-bit (~0.5 bytes/parameter) quantizations impact memory footprint significantly.
- KV Cache: Add approximately 30% memory overhead for key-value caches, especially for long context windows (a back-of-the-envelope sizing sketch follows this list).
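To turn these assumptions into numbers, the short sketch below computes a rough VRAM estimate for weights plus KV cache. The parameter counts used are illustrative midpoints of the ranges above, not official figures.

```python
# Back-of-the-envelope VRAM estimate: weights at a given precision plus ~30% KV-cache overhead.
# Parameter counts here are illustrative midpoints of the assumed ranges, not official figures.
BYTES_PER_PARAM = {"fp16/bf16": 2.0, "fp8": 1.0, "4-bit": 0.5}
KV_CACHE_OVERHEAD = 0.30  # rough allowance for key-value caches at long context

def estimate_vram_gb(total_params_billion: float, precision: str) -> float:
    weights_gb = total_params_billion * BYTES_PER_PARAM[precision]  # 1B params at 1 byte ~ 1 GB
    return weights_gb * (1 + KV_CACHE_OVERHEAD)

for name, params_b in [("V4-Flash (~150B total, assumed)", 150), ("V4-Pro (~550B total, assumed)", 550)]:
    for precision in BYTES_PER_PARAM:
        print(f"{name} at {precision}: ~{estimate_vram_gb(params_b, precision):.0f} GB")
```

Under these assumed sizes, even a 4-bit V4-Flash checkpoint does not fit entirely in a single 24GB card, so the consumer-GPU tier below relies on llama.cpp keeping part of the weights in system RAM and offloading only some layers to VRAM.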
V4-Flash Hardware Tiers
- Single Consumer GPU (e.g., RTX 4090 24GB, RTX 3090 24GB):
  - Supports 4-bit or FP8 quantized models (e.g., sgl-project/DeepSeek-V4-Flash-FP8, GGUF 4-bit).
  - Suitable for low concurrency (1–4 requests) and short context lengths (<2,000 tokens).
  - Token throughput: ~20–40 tokens/s per stream on an RTX 4090 with llama.cpp in 4-bit GGUF mode.
- Single Data Center GPU (A100 40GB, L40S 48GB, H100 80GB):
  - Supports full FP16, BF16, or FP8 precision with moderate to high concurrency (5–20 streams).
  - H100 80GB enables high-concurrency, long-context serving with FP8 quantization.
- Apple Silicon (M2/M3 Pro/Max/Ultra):
  - Runs MLX 4-bit or 8-bit quantized variants (e.g., mlx-community/DeepSeek-V4-Flash-4bit).
  - Good for local experimentation and small-team inference.
V4-Pro Hardware Tiers
- Multi-GPU Data Center (minimum 4× 80GB GPUs):
  - Recommended: 4× A100 80GB, H100 80GB, or AMD MI300-class GPUs for FP16/BF16 full-precision serving.
  - FP8 and 4-bit quantizations can reduce memory requirements, allowing 2× 80GB GPUs, but with constrained concurrency and context length.
  - 4× 48GB GPUs (e.g., L40S) require aggressive quantization and KV cache optimization.
- Consumer GPUs:
  - Not practical for full-precision V4-Pro due to VRAM limitations.
  - V4-Flash is the better option for consumer-grade hardware.
For production-grade deployments, budget additional memory and compute headroom to accommodate concurrency spikes, longer contexts, and future updates.
Self-Hosting DeepSeek V4: Three Practical Frameworks
Self-hosting DeepSeek V4 models requires leveraging frameworks that support MoE architectures, quantization formats, and efficient GPU utilization. We highlight three major paths suitable for different hardware and use cases.
1. vLLM: High-Throughput GPU Serving
vLLM is a production-grade transformer serving framework optimized for efficient batch processing and MoE support. It enables running DeepSeek V4-Pro and V4-Flash on CUDA-enabled GPUs with high throughput and OpenAI-compatible APIs.
```bash
# Install vLLM (requires CUDA 12.x environment)
pip install "vllm>=0.5.0"

# Launch DeepSeek V4-Flash with vLLM on a single GPU
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --port 8000
```
This setup exposes an OpenAI-compatible REST API at http://localhost:8000/v1/chat/completions. For V4-Pro, swap the model ID to deepseek-ai/DeepSeek-V4-Pro and shard the model across GPUs with vLLM's --tensor-parallel-size flag (e.g., --tensor-parallel-size 4 on a four-GPU node); a Python-API equivalent is sketched below.
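If you would rather drive V4-Pro from Python directly instead of through the HTTP server, vLLM's offline LLM API accepts the same parallelism setting. The sketch below is illustrative only; the tensor-parallel degree of 4 simply mirrors the minimum Pro hardware tier described earlier.

```python
# Offline inference with vLLM's Python API, sharding V4-Pro across 4 GPUs via tensor parallelism.
# The tensor_parallel_size value is an assumption matching the 4x 80GB tier above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Pro",
    dtype="bfloat16",
    tensor_parallel_size=4,        # shard weights across 4 GPUs
    max_model_len=8192,
    gpu_memory_utilization=0.9,
)
params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one short paragraph."], params)
print(outputs[0].outputs[0].text)
```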
Sample Python client usage against the HTTP endpoint:
```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Content-Type": "application/json"},
    json={
        "model": "deepseek-ai/DeepSeek-V4-Flash",
        "messages": [
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": "Write a Python function to deduplicate a list while preserving order."},
        ],
        "max_tokens": 256,
        "temperature": 0.2,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```
2. SGLang: MoE and FP8-Optimized Serving
SGLang is tailored for MoE architectures and supports FP8 quantization, making it a natural fit for DeepSeek V4-Flash deployments on GPUs where memory constraints exist.
```bash
# Install SGLang
pip install "sglang[all]>=0.3.0"

# Serve the DeepSeek V4-Flash FP8 checkpoint
python -m sglang.launch_server \
  --model-path sgl-project/DeepSeek-V4-Flash-FP8 \
  --port 8080 \
  --context-length 8192 \
  --tp-size 1
```
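Like vLLM, SGLang exposes an OpenAI-compatible HTTP API once the server is up, so the earlier request pattern carries over with the port and model ID changed. A minimal sketch (assuming the FP8 repo ID above):

```python
import requests

# Same OpenAI-compatible request shape as the vLLM client, pointed at the SGLang port.
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "sgl-project/DeepSeek-V4-Flash-FP8",
        "messages": [{"role": "user", "content": "Summarize the benefits of FP8 serving in two sentences."}],
        "max_tokens": 128,
        "temperature": 0.3,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```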

