⚡ The Brief
- Llama 4 ships as exactly two MoE models: Scout with 16 experts and Maverick with 128 experts, both activating 17 billion parameters per forward pass.
- Scout delivers a 10-million-token context window, the longest in any production open-weight model as of April 2026.
- Maverick spreads roughly 400 billion total parameters across 128 experts and is optimized for reasoning and code generation tasks.
- Both models accept native multimodal input combining text and images without requiring separate vision encoders or adapters.
- The Llama 4 Community License requires commercial negotiation beyond 700 million monthly active users, distinguishing it from permissive licenses.
[IMAGE_PLACEHOLDER_HEADER]
One year after Meta released Llama 3, the AI community witnessed a bold shift with the launch of Llama 4 on April 5, 2025. Meta abandoned the traditional dense transformer architecture, embracing an exclusive Mixture-of-Experts (MoE) model design. Unlike previous generations that offered multiple size variants such as 8B, 70B, or 405B parameters, Llama 4 features only two MoE models: Scout and Maverick. Both models activate exactly 17 billion parameters per forward pass but leverage expert pools of 109 billion and 400 billion total parameters, respectively. Scout runs with 16 experts, whereas Maverick employs 128 experts. Additionally, both models natively accept multimodal inputs, seamlessly combining text and image data without needing separate vision encoders or adapters. Notably, Scout supports an unprecedented 10-million-token context window — the longest available in any open-weight model as of April 2026.
This comprehensive guide serves as a practical reference for deploying Llama 4 in production environments as of April 27, 2026. We delve into the real-world hardware requirements, benchmark results showcasing where Scout and Maverick excel or fall short, three distinct deployment strategies ranging from single-node inference to multi-GPU clusters, license considerations including the critical 700M monthly active users clause, and direct comparisons against competing open-weight models like DeepSeek V4, Qwen 3.5/3.6, and Mistral Large 3. This is a data-driven analysis with no hype — just facts, performance metrics, and configuration examples to help you make informed decisions.
The Llama 4 Family: Scout and Maverick (and Why No 70B)
Meta launched exactly two official Llama 4 inference models on April 5, 2025, with the most recent update on May 22, 2025. Both models are instruction-tuned multimodal transformers available on Hugging Face under the Llama 4 Community License:
- meta-llama/Llama-4-Scout-17B-16E-Instruct — Activates 17B parameters with 16 experts, totaling approximately 109B parameters. Scout is the most widely adopted Llama 4 variant with 388,689 downloads as of this writing. Its standout feature is the massive 10-million-token context window, ideal for document-heavy and long-range reasoning tasks.
- meta-llama/Llama-4-Maverick-17B-128E-Instruct — Also activates 17B parameters but spans 128 experts, totaling around 400B parameters. Maverick is optimized for complex reasoning and coding workloads. The base checkpoint has 38,935 downloads, while the FP8 quantized variant (Maverick-FP8) has 103,065 downloads, reflecting strong demand for reduced-memory inference.
Why did Meta skip dense models like 8B or 70B in Llama 4? The reason lies in the economics of inference. MoE models activate only a subset of their total parameters for each forward pass — 17B out of 109B for Scout, or 17B out of 400B for Maverick. This allows models to achieve the representational capacity of massive parameter counts while incurring compute costs closer to a 17B dense model. For Meta’s massive scale — powering billions of users on Instagram, WhatsApp, and internal AI tools — this efficiency is essential.
Legacy dense models from Llama 3 remain available, such as Meta-Llama-3-8B-Instruct (1.4M downloads) and Meta-Llama-3-70B-Instruct, which continue serving production needs. However, Llama 4 marks a clean architectural departure. Infrastructure designed for dense transformers with straightforward tensor parallelism must adapt to expert routing and MoE paradigms to leverage Llama 4 effectively. For a broader context on open-weight models and architectures, see our open-source AI guide.
[INTERNAL_LINK]
Mixture-of-Experts in Plain English: 17B Active, 109B–400B Total
Mixture-of-Experts (MoE) is a powerful architectural approach that has gained traction since Google’s 2021 Switch Transformer paper. Llama 4 is the first major open-weight family to ship exclusively with MoE. Here’s how it works in practice:
Traditional dense transformers process every token through all model parameters. For instance, a 70B dense model uses all 70 billion parameters for every forward pass. In contrast, MoE models split feed-forward layers into multiple independent “experts” — separate neural network blocks specialized for different aspects of the input. A routing mechanism selects which experts process each token.
In Llama 4 Scout, each MoE layer contains 16 routed experts; Maverick layers contain 128. For every token, the router selects a single routed expert and a shared expert always runs, so only a small slice of the expert weights activates. The model dynamically routes each token to the most relevant expert, reducing compute while preserving capacity.
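To make the routing concrete, here is a minimal NumPy sketch of top-1 routing with a shared expert. The dimensions, ReLU activation, and sigmoid gate are illustrative simplifications for this article, not Meta's implementation details.
# Minimal sketch of MoE routing: a router picks one routed expert per
# token, and a shared expert always runs. Toy sizes, not Llama 4's real dims.
import numpy as np

d_model, d_ff, n_experts = 64, 256, 16
rng = np.random.default_rng(0)
router_w = rng.standard_normal((d_model, n_experts))
experts = [(rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model)))
           for _ in range(n_experts)]
shared_w1, shared_w2 = (rng.standard_normal((d_model, d_ff)),
                        rng.standard_normal((d_ff, d_model)))

def moe_layer(x):                                   # x: (n_tokens, d_model)
    logits = x @ router_w                           # router scores per expert
    top1 = logits.argmax(axis=-1)                   # one routed expert per token
    gate = 1.0 / (1.0 + np.exp(-logits[np.arange(len(x)), top1]))
    out = np.empty_like(x)
    for i, e in enumerate(top1):                    # only the selected expert's weights run
        w1, w2 = experts[e]
        out[i] = gate[i] * (np.maximum(x[i] @ w1, 0.0) @ w2)
    out += np.maximum(x @ shared_w1, 0.0) @ shared_w2   # shared expert runs for every token
    return out

print(moe_layer(rng.standard_normal((4, d_model))).shape)   # (4, 64)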
Parameter and compute characteristics for Scout:
- Total parameters: ~109B (all experts plus shared attention layers)
- Activated parameters per forward pass: 17B (attention layers, the shared expert, and the single routed expert selected per token)
- Memory footprint: All 109B parameters must be loaded into VRAM even though only 17B activate per token
- Compute cost: Proportional to 17B parameters, not 109B
For Maverick:
- Total parameters: ~400B
- Activated parameters: 17B
- Memory footprint: All 400B parameters must be accessible, requiring multi-GPU setups even with FP8 quantization
This architecture creates an interesting trade-off. MoE models are memory-bound, not compute-bound. You need enough VRAM to hold all expert weights, even though many sit idle during each forward pass. The upside is access to representations learned from a massive 400B parameter space while paying inference costs closer to a 17B model. This efficiency is favorable in throughput-sensitive deployments.
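A back-of-envelope calculation makes this concrete: per generated token, compute scales roughly with 2 FLOPs per active parameter, while weight memory scales with total parameters. The constants below are rules of thumb, not measured numbers.
# Why MoE is memory-bound: compute tracks *active* params, memory tracks *total* params.
GB = 1e9
models = {
    "Scout":     {"total": 109e9, "active": 17e9},
    "Maverick":  {"total": 400e9, "active": 17e9},
    "70B dense": {"total": 70e9,  "active": 70e9},
}
for name, m in models.items():
    gflops_per_token = 2 * m["active"] / 1e9     # very rough forward-pass estimate
    weight_gb_bf16 = m["total"] * 2 / GB         # 2 bytes per parameter in BF16
    print(f"{name:>9}: ~{gflops_per_token:.0f} GFLOPs/token, ~{weight_gb_bf16:.0f} GB of BF16 weights")
At these rough numbers, Scout and Maverick both cost about as much compute per token as a 17B dense model while needing roughly 220GB and 800GB of weight memory respectively, which is exactly the pattern the hardware numbers below reflect.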
[IMAGE_PLACEHOLDER_SECTION_1]
Hardware Requirements: The Honest Numbers
Let’s discuss the real hardware requirements based on community and production experience over the past year running Llama 4 on various infrastructures, from single H100 GPUs to 8-GPU clusters:
Scout (109B total, 17B active):
- FP16/BF16 full precision: ~220GB VRAM minimum, typically 3× A100-80GB or 3× H100-80GB with tensor parallelism
- FP8 quantized: ~110GB VRAM, achievable on 2× H100-80GB GPUs
- INT4 quantized (AWQ/GPTQ): ~55-60GB VRAM, fits on a single A100-80GB or H100-80GB with careful memory management but with reduced throughput
- 10M token context window KV cache: At full 10M token context in FP16, the KV cache alone exceeds 500GB. In practice, deployments limit context windows to 128K–512K tokens and reserve 10M tokens for specialized retrieval or long document analysis.
Maverick (400B total, 17B active):
- FP16/BF16 full precision: ~800GB VRAM, requiring 10× H100-80GB GPUs; a single 8-GPU DGX H100 node (640GB) cannot hold the full-precision weights
- FP8 quantized (Maverick-FP8): ~400GB VRAM, achievable on 5× H100-80GB with expert parallelism
- INT4 quantized: ~200GB VRAM, still needs at least 3× H100-80GB GPUs
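The VRAM figures above are essentially per-parameter arithmetic for the weights alone; KV cache, activations, and framework overhead come on top. A quick sketch that reproduces them, assuming 2 bytes per parameter for BF16, 1 for FP8, and 0.5 for INT4:
# Weight-memory estimate per precision, and the minimum count of 80GB GPUs
# for the weights alone (KV cache and runtime overhead excluded).
import math

BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}
GPU_VRAM_GB = 80

for name, total_params in [("Scout", 109e9), ("Maverick", 400e9)]:
    for prec, bytes_pp in BYTES_PER_PARAM.items():
        gb = total_params * bytes_pp / 1e9
        gpus = math.ceil(gb / GPU_VRAM_GB)
        print(f"{name:>8} {prec:>4}: ~{gb:5.0f} GB weights -> at least {gpus} x 80GB GPUs")
Treat the GPU counts as lower bounds: they cover weights only, and real deployments need additional headroom for the KV cache and activations.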
Cloud inference costs as of April 2026:
- H100-80GB on-demand (AWS P5-class or equivalent): $3.50–4.00 per GPU per hour
- Scout FP8 (2× H100): ~$7–8/hour for inference serving
- Maverick FP8 (5× H100): ~$18–20/hour for inference serving
- A100-80GB on-demand: $2.50–3.00 per GPU per hour (still viable for Scout INT4 quantized deployment)
API-based access through providers like Together or Fireworks runs approximately $0.10–0.30 per million input tokens for Scout and $0.30–0.80 per million for Maverick, varying by provider and volume. The break-even point for self-hosting versus API usage depends heavily on your request volume, latency requirements, and operational expertise.
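Using the figures above, a rough break-even sketch for Scout looks like the following; the prices are the approximate ranges quoted in this section, and the utilization assumption is something you should replace with your own measurements.
# Rough self-host vs. API break-even for Scout, using the ranges quoted above.
self_host_per_hour = 7.5          # 2x H100 for Scout FP8, midpoint of $7-8/hour
api_per_million_tok = 0.20        # midpoint of the $0.10-0.30 range for Scout

hours_per_month = 730             # assumes the cluster runs (and is billed) 24/7
monthly_self_host = self_host_per_hour * hours_per_month     # ~$5,475/month

break_even_mtok = monthly_self_host / api_per_million_tok    # in millions of tokens
print(f"Fixed self-hosting cost: ~${monthly_self_host:,.0f}/month")
print(f"Break-even volume:       ~{break_even_mtok / 1e3:.0f}B tokens/month")
# Whether a 2x H100 deployment can actually serve that volume depends on your
# achieved throughput and utilization, which is why the break-even point varies.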
[INTERNAL_LINK]
Three Real Deployment Paths for Llama 4
Based on extensive community and enterprise deployment experience, there are three viable paths to run Llama 4 models effectively. Each path optimizes for different constraints such as hardware availability, throughput, and latency.
Path 1: vLLM with Tensor Parallelism (Recommended for Most Teams)
vLLM added native support for Llama 4 MoE models in version 0.8.3 and is the default serving framework for most deployments. It handles expert routing automatically, supports FP8 quantization out of the box, and offers the best balance of ease of use and performance.
#!/bin/bash
# Deploy Llama-4-Scout with vLLM on 2× H100-80GB
# Requires vLLM >= 0.8.3 and CUDA 12.1+
pip install "vllm>=0.8.3"
# Launch server with tensor parallelism across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--max-model-len 131072 \
--gpu-memory-utilization 0.95 \
--enable-chunked-prefill \
--port 8000
# For FP8 quantization (reduces memory, fits on fewer GPUs):
# add the flag --quantization fp8 (keep --dtype bfloat16)
# Test the deployment
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
"messages": [{"role": "user", "content": "Explain MoE architectures in 3 sentences."}],
"max_tokens": 256
}'
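Once the server is running, any OpenAI-compatible client can talk to it. A minimal Python example using the official openai package; the base_url and placeholder API key match the local server above and are not provider credentials.
# Query the local vLLM server through its OpenAI-compatible endpoint.
# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # the vLLM server started above
    api_key="EMPTY",                       # vLLM ignores the key unless --api-key is set
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Explain MoE architectures in 3 sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)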
Path 2: TensorRT-LLM for Maximum Throughput
NVIDIA’s TensorRT-LLM framework delivers the highest throughput for Llama 4 but requires additional setup and compilation time. The 2026-Q1 release added full MoE support with optimized expert kernels. Expect 30–50% higher throughput than vLLM on equivalent hardware. Be prepared for 2–4 hours of model engine compilation before deployment.
Path 3: llama.cpp for Single-GPU and CPU Inference
The llama.cpp project integrated Llama 4 MoE support in August 2025. Using GGUF-quantized Scout weights (4-bit), you can run inference on a single NVIDIA RTX 4090 (24GB VRAM, with part of the expert weights offloaded to system RAM) or on Apple Silicon Macs such as the M2 Ultra with 192GB of unified memory. Expect modest throughput of 5–15 tokens per second depending on quantization and offload split. This path suits development, experimentation, and low-throughput internal tools, but it is not production-viable for high-volume serving.
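For scripting against a local GGUF checkpoint, the llama-cpp-python bindings are the usual route. A sketch, assuming a hypothetical 4-bit Scout GGUF filename and a build of the library that includes the Llama 4 support described above:
# Local inference over a GGUF quantization of Scout via llama-cpp-python.
# pip install llama-cpp-python   (build with GPU support to enable partial offload)
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf",  # hypothetical filename
    n_gpu_layers=20,    # offload as many layers as fit in 24GB; the rest stay in system RAM
    n_ctx=8192,         # keep the context modest on a single consumer GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the MoE trade-off in two sentences."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])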
[IMAGE_PLACEHOLDER_SECTION_2]
Multimodal: Native Text + Image Input
Both Llama 4 Scout and Maverick models natively accept interleaved text and image inputs. This is not a post-hoc addition or separate vision encoder adapter — the image encoder is trained jointly with the language model, enabling seamless multimodal reasoning.
The image encoder employs a Vision Transformer (ViT)-style architecture, converting images into visual tokens embedded in the same space as text tokens. Images are resized and tiled dynamically, with each tile consuming approximately 256 tokens of context.
Practical considerations for multimodal inputs:
- A single 1024×1024 image consumes roughly 1,000–1,500 tokens of context
- Multiple images per prompt are supported up to the maximum context window
- Image quality impacts understanding — highly compressed or noisy JPEGs degrade accuracy
- Scout’s 10M token context window enables workflows involving hundreds of images and large documents
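In practice, image input goes through the same chat API as text. A sketch of an interleaved text-and-image request against an OpenAI-compatible endpoint such as the vLLM server from the deployment section; the local URL and image path are placeholders.
# Send a text + image prompt to an OpenAI-compatible multimodal endpoint.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("chart.png", "rb") as f:                       # placeholder image file
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What trend does this chart show?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)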
This native multimodal capability positions Llama 4 competitively against models like Gemini 3.1 Flash and Claude Sonnet 4.6 for tasks such as document understanding, chart reading, and visual question answering. On the MMMU benchmark (multimodal reasoning), Scout scores 61.2 and Maverick scores 69.8, while Claude Opus 4.7 leads with 74.1 and GPT-5 scores 72.3.
[INTERNAL_LINK]
Long Context: Scout’s 10M-Token Window
Scout’s 10-million-token context window is the longest available in any open-weight model as of April 2026. To contextualize this scale, 10 million tokens roughly equal 7.5 million words or about 15,000 pages of dense text — enough to fit the entire Lord of the Rings trilogy, the complete works of Shakespeare, and the King James Bible in a single prompt with tokens to spare.
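Those figures come from the usual rules of thumb of roughly 0.75 words per token and about 500 words per dense page; a two-line sanity check:
# Sanity-check the 10M-token comparison using common rules of thumb.
tokens = 10_000_000
words = tokens * 0.75        # ~0.75 words per token for English prose
pages = words / 500          # ~500 words per dense page
print(f"~{words / 1e6:.1f}M words, ~{pages:,.0f} pages")   # ~7.5M words, ~15,000 pages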
Ach

