⚡ The Brief
- Llama 4 ships as exactly two MoE models: Scout with 16 experts and Maverick with 128 experts, both activating 17 billion parameters per forward pass.
- Scout delivers a 10-million-token context window, the longest in any production open-weight model as of April 2026.
- Maverick spreads roughly 400 billion total parameters across 128 experts and is optimized for reasoning and code generation tasks.
- Both models accept native multimodal input combining text and images without requiring separate vision encoders or adapters.
- The Llama 4 Community License requires commercial negotiation beyond 700 million monthly active users, distinguishing it from permissive licenses.
[IMAGE_PLACEHOLDER_HEADER]
One year after Meta released Llama 3, the AI community witnessed a bold shift with the launch of Llama 4 on April 5, 2025. Meta abandoned the traditional dense transformer architecture, embracing an exclusive Mixture-of-Experts (MoE) model design. Unlike previous generations that offered multiple size variants such as 8B, 70B, or 405B parameters, Llama 4 features only two MoE models: Scout and Maverick. Both models activate exactly 17 billion parameters per forward pass but leverage expert pools of 109 billion and 400 billion total parameters, respectively. Scout runs with 16 experts, whereas Maverick employs 128 experts. Additionally, both models natively accept multimodal inputs, seamlessly combining text and image data without needing separate vision encoders or adapters. Notably, Scout supports an unprecedented 10-million-token context window — the longest available in any open-weight model as of April 2026.
This comprehensive guide serves as a practical reference for deploying Llama 4 in production environments as of April 27, 2026. We delve into the real-world hardware requirements, benchmark results showcasing where Scout and Maverick excel or fall short, three distinct deployment strategies ranging from single-node inference to multi-GPU clusters, license considerations including the critical 700M monthly active users clause, and direct comparisons against competing open-weight models like DeepSeek V4, Qwen 3.5/3.6, and Mistral Large 3. This is a data-driven analysis with no hype — just facts, performance metrics, and configuration examples to help you make informed decisions.
The Llama 4 Family: Scout and Maverick (and Why No 70B)
Meta launched exactly two official Llama 4 inference models on April 5, 2025, with the most recent update on May 22, 2025. Both models are instruction-tuned multimodal transformers available on Hugging Face under the Llama 4 Community License:
- meta-llama/Llama-4-Scout-17B-16E-Instruct — Activates 17B parameters with 16 experts, totaling approximately 109B parameters. Scout is the most widely adopted Llama 4 variant with 388,689 downloads as of this writing. Its standout feature is the massive 10-million-token context window, ideal for document-heavy and long-range reasoning tasks.
- meta-llama/Llama-4-Maverick-17B-128E-Instruct — Also activates 17B parameters but spans 128 experts, totaling around 400B parameters. Maverick is optimized for complex reasoning and coding workloads. The base checkpoint has 38,935 downloads, while the FP8 quantized variant (Maverick-FP8) has 103,065 downloads, reflecting strong demand for reduced-memory inference.
Why did Meta skip dense models like 8B or 70B in Llama 4? The reason lies in the economics of inference. MoE models activate only a subset of their total parameters for each forward pass — 17B out of 109B for Scout, or 17B out of 400B for Maverick. This allows models to achieve the representational capacity of massive parameter counts while incurring compute costs closer to a 17B dense model. For Meta’s massive scale — powering billions of users on Instagram, WhatsApp, and internal AI tools — this efficiency is essential.
Legacy dense models from Llama 3 remain available, such as Meta-Llama-3-8B-Instruct (1.4M downloads) and Meta-Llama-3-70B-Instruct, which continue serving production needs. However, Llama 4 marks a clean architectural departure. Infrastructure designed for dense transformers with straightforward tensor parallelism must adapt to expert routing and MoE paradigms to leverage Llama 4 effectively. For a broader context on open-weight models and architectures, see our open-source AI guide.
[INTERNAL_LINK]
Mixture-of-Experts in Plain English: 17B Active, 109B–400B Total
Mixture-of-Experts (MoE) is a powerful architectural approach that has gained traction since Google’s 2021 Switch Transformer paper. Llama 4 is the first major open-weight family to ship exclusively with MoE. Here’s how it works in practice:
Traditional dense transformers process every token through all model parameters. For instance, a 70B dense model uses all 70 billion parameters for every forward pass. In contrast, MoE models split feed-forward layers into multiple independent “experts” — separate neural network blocks specialized for different aspects of the input. A routing mechanism selects which experts process each token.
In Llama 4 Scout, each MoE layer contains 16 routed experts; Maverick layers contain 128. For every token, the router selects a single routed expert and a shared expert always runs, so only a small slice of the expert weights activates. The model dynamically routes each token to the most relevant expert, reducing compute while preserving capacity.
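To make the routing concrete, here is a minimal NumPy sketch of top-1 routing with a shared expert. The dimensions, ReLU activation, and sigmoid gate are illustrative simplifications for this article, not Meta's implementation details.
# Minimal sketch of MoE routing: a router picks one routed expert per
# token, and a shared expert always runs. Toy sizes, not Llama 4's real dims.
import numpy as np

d_model, d_ff, n_experts = 64, 256, 16
rng = np.random.default_rng(0)
router_w = rng.standard_normal((d_model, n_experts))
experts = [(rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model)))
           for _ in range(n_experts)]
shared_w1, shared_w2 = (rng.standard_normal((d_model, d_ff)),
                        rng.standard_normal((d_ff, d_model)))

def moe_layer(x):                                   # x: (n_tokens, d_model)
    logits = x @ router_w                           # router scores per expert
    top1 = logits.argmax(axis=-1)                   # one routed expert per token
    gate = 1.0 / (1.0 + np.exp(-logits[np.arange(len(x)), top1]))
    out = np.empty_like(x)
    for i, e in enumerate(top1):                    # only the selected expert's weights run
        w1, w2 = experts[e]
        out[i] = gate[i] * (np.maximum(x[i] @ w1, 0.0) @ w2)
    out += np.maximum(x @ shared_w1, 0.0) @ shared_w2   # shared expert runs for every token
    return out

print(moe_layer(rng.standard_normal((4, d_model))).shape)   # (4, 64)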
Parameter and compute characteristics for Scout:
- Total parameters: ~109B (all experts plus shared attention layers)
- Activated parameters per forward pass: 17B (attention layers, the shared expert, and the single routed expert selected per token)
- Memory footprint: All 109B parameters must be loaded into VRAM even though only 17B activate per token
- Compute cost: Proportional to 17B parameters, not 109B
For Maverick:
- Total parameters: ~400B
- Activated parameters: 17B
- Memory footprint: All 400B parameters must be accessible, requiring multi-GPU setups even with FP8 quantization
This architecture creates an interesting trade-off. MoE models are memory-bound, not compute-bound. You need enough VRAM to hold all expert weights, even though many sit idle during each forward pass. The upside is access to representations learned from a massive 400B parameter space while paying inference costs closer to a 17B model. This efficiency is favorable in throughput-sensitive deployments.
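A back-of-envelope calculation makes this concrete: per generated token, compute scales roughly with 2 FLOPs per active parameter, while weight memory scales with total parameters. The constants below are rules of thumb, not measured numbers.
# Why MoE is memory-bound: compute tracks *active* params, memory tracks *total* params.
GB = 1e9
models = {
    "Scout":     {"total": 109e9, "active": 17e9},
    "Maverick":  {"total": 400e9, "active": 17e9},
    "70B dense": {"total": 70e9,  "active": 70e9},
}
for name, m in models.items():
    gflops_per_token = 2 * m["active"] / 1e9     # very rough forward-pass estimate
    weight_gb_bf16 = m["total"] * 2 / GB         # 2 bytes per parameter in BF16
    print(f"{name:>9}: ~{gflops_per_token:.0f} GFLOPs/token, ~{weight_gb_bf16:.0f} GB of BF16 weights")
At these rough numbers, Scout and Maverick both cost about as much compute per token as a 17B dense model while needing roughly 220GB and 800GB of weight memory respectively, which is exactly the pattern the hardware numbers below reflect.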
[IMAGE_PLACEHOLDER_SECTION_1]
Hardware Requirements: The Honest Numbers
Let’s discuss the real hardware requirements based on community and production experience over the past year running Llama 4 on various infrastructures, from single H100 GPUs to 8-GPU clusters:
Scout (109B total, 17B active):
- FP16/BF16 full precision: ~220GB VRAM minimum, typically 3× A100-80GB or 3× H100-80GB with tensor parallelism
- FP8 quantized: ~110GB VRAM, achievable on 2× H100-80GB GPUs
- INT4 quantized (AWQ/GPTQ): ~55-60GB VRAM, fits on a single A100-80GB or H100-80GB with careful memory management but with reduced throughput
- 10M token context window KV cache: At full 10M token context in FP16, the KV cache alone exceeds 500GB. In practice, deployments limit context windows to 128K–512K tokens and reserve 10M tokens for specialized retrieval or long document analysis.
Maverick (400B total, 17B active):
- FP16/BF16 full precision: ~800GB VRAM, requiring 10× H100-80GB GPUs; a single 8-GPU DGX H100 node (640GB) cannot hold the full-precision weights
- FP8 quantized (Maverick-FP8): ~400GB VRAM, achievable on 5× H100-80GB with expert parallelism
- INT4 quantized: ~200GB VRAM, still needs at least 3× H100-80GB GPUs
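The VRAM figures above are essentially per-parameter arithmetic for the weights alone; KV cache, activations, and framework overhead come on top. A quick sketch that reproduces them, assuming 2 bytes per parameter for BF16, 1 for FP8, and 0.5 for INT4:
# Weight-memory estimate per precision, and the minimum count of 80GB GPUs
# for the weights alone (KV cache and runtime overhead excluded).
import math

BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}
GPU_VRAM_GB = 80

for name, total_params in [("Scout", 109e9), ("Maverick", 400e9)]:
    for prec, bytes_pp in BYTES_PER_PARAM.items():
        gb = total_params * bytes_pp / 1e9
        gpus = math.ceil(gb / GPU_VRAM_GB)
        print(f"{name:>8} {prec:>4}: ~{gb:5.0f} GB weights -> at least {gpus} x 80GB GPUs")
Treat the GPU counts as lower bounds: they cover weights only, and real deployments need additional headroom for the KV cache and activations.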
Cloud inference costs as of April 2026:
- H100-80GB on-demand (AWS P5-class or equivalent): $3.50–4.00 per GPU per hour
- Scout FP8 (2× H100): ~$7–8/hour for inference serving
- Maverick FP8 (5× H100): ~$18–20/hour for inference serving
- A100-80GB on-demand: $2.50–3.00 per GPU per hour (still viable for Scout INT4 quantized deployment)
API-based access through providers like Together or Fireworks runs approximately $0.10–0.30 per million input tokens for Scout and $0.30–0.80 per million for Maverick, varying by provider and volume. The break-even point for self-hosting versus API usage depends heavily on your request volume, latency requirements, and operational expertise.
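Using the figures above, a rough break-even sketch for Scout looks like the following; the prices are the approximate ranges quoted in this section, and the utilization assumption is something you should replace with your own measurements.
# Rough self-host vs. API break-even for Scout, using the ranges quoted above.
self_host_per_hour = 7.5          # 2x H100 for Scout FP8, midpoint of $7-8/hour
api_per_million_tok = 0.20        # midpoint of the $0.10-0.30 range for Scout

hours_per_month = 730             # assumes the cluster runs (and is billed) 24/7
monthly_self_host = self_host_per_hour * hours_per_month     # ~$5,475/month

break_even_mtok = monthly_self_host / api_per_million_tok    # in millions of tokens
print(f"Fixed self-hosting cost: ~${monthly_self_host:,.0f}/month")
print(f"Break-even volume:       ~{break_even_mtok / 1e3:.0f}B tokens/month")
# Whether a 2x H100 deployment can actually serve that volume depends on your
# achieved throughput and utilization, which is why the break-even point varies.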
[INTERNAL_LINK]
Three Real Deployment Paths for Llama 4
Based on extensive community and enterprise deployment experience, there are three viable paths to run Llama 4 models effectively. Each path optimizes for different constraints such as hardware availability, throughput, and latency.
Path 1: vLLM with Tensor Parallelism (Recommended for Most Teams)
vLLM added native support for Llama 4 MoE models in version 0.8.3 and is the default serving framework for most deployments. It handles expert routing automatically, supports FP8 quantization out of the box, and offers the best balance of ease of use and performance.
#!/bin/bash
# Deploy Llama-4-Scout with vLLM on 2× H100-80GB
# Requires vLLM >= 0.8.3 and CUDA 12.1+
pip install "vllm>=0.8.3"
# Launch server with tensor parallelism across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--max-model-len 131072 \
--gpu-memory-utilization 0.95 \
--enable-chunked-prefill \
--port 8000
# For FP8 quantization (reduces memory, fits on fewer GPUs):
# add the flag --quantization fp8 (keep --dtype bfloat16)
# Test the deployment
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
"messages": [{"role": "user", "content": "Explain MoE architectures in 3 sentences."}],
"max_tokens": 256
}'
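Once the server is running, any OpenAI-compatible client can talk to it. A minimal Python example using the official openai package; the base_url and placeholder API key match the local server above and are not provider credentials.
# Query the local vLLM server through its OpenAI-compatible endpoint.
# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # the vLLM server started above
    api_key="EMPTY",                       # vLLM ignores the key unless --api-key is set
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Explain MoE architectures in 3 sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)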
Path 2: TensorRT-LLM for Maximum Throughput
NVIDIA’s TensorRT-LLM framework delivers the highest throughput for Llama 4 but requires additional setup and compilation time. The 2026-Q1 release added full MoE support with optimized expert kernels. Expect 30–50% higher throughput than vLLM on equivalent hardware. Be prepared for 2–4 hours of model engine compilation before deployment.
Path 3: llama.cpp for Single-GPU and CPU Inference
The llama.cpp project integrated Llama 4 MoE support in August 2025. Using GGUF-quantized Scout weights (4-bit), you can run inference on a single NVIDIA RTX 4090 (24GB VRAM, with part of the expert weights offloaded to system RAM) or on Apple Silicon Macs such as the M2 Ultra with 192GB of unified memory. Expect modest throughput of 5–15 tokens per second depending on quantization and offload split. This path suits development, experimentation, and low-throughput internal tools, but it is not production-viable for high-volume serving.
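For scripting against a local GGUF checkpoint, the llama-cpp-python bindings are the usual route. A sketch, assuming a hypothetical 4-bit Scout GGUF filename and a build of the library that includes the Llama 4 support described above:
# Local inference over a GGUF quantization of Scout via llama-cpp-python.
# pip install llama-cpp-python   (build with GPU support to enable partial offload)
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf",  # hypothetical filename
    n_gpu_layers=20,    # offload as many layers as fit in 24GB; the rest stay in system RAM
    n_ctx=8192,         # keep the context modest on a single consumer GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the MoE trade-off in two sentences."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])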
[IMAGE_PLACEHOLDER_SECTION_2]
Multimodal: Native Text + Image Input
Both Llama 4 Scout and Maverick models natively accept interleaved text and image inputs. This is not a post-hoc addition or separate vision encoder adapter — the image encoder is trained jointly with the language model, enabling seamless multimodal reasoning.
The image encoder employs a Vision Transformer (ViT)-style architecture, converting images into visual tokens embedded in the same space as text tokens. Images are resized and tiled dynamically, with each tile consuming approximately 256 tokens of context.
Practical considerations for multimodal inputs:
- A single 1024×1024 image consumes roughly 1,000–1,500 tokens of context
- Multiple images per prompt are supported up to the maximum context window
- Image quality impacts understanding — highly compressed or noisy JPEGs degrade accuracy
- Scout’s 10M token context window enables workflows involving hundreds of images and large documents
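In practice, image input goes through the same chat API as text. A sketch of an interleaved text-and-image request against an OpenAI-compatible endpoint such as the vLLM server from the deployment section; the local URL and image path are placeholders.
# Send a text + image prompt to an OpenAI-compatible multimodal endpoint.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("chart.png", "rb") as f:                       # placeholder image file
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What trend does this chart show?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)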
This native multimodal capability positions Llama 4 competitively against models like Gemini 3.1 Flash and Claude Sonnet 4.6 for tasks such as document understanding, chart reading, and visual question answering. On the MMMU benchmark (multimodal reasoning), Scout scores 61.2 and Maverick scores 69.8, while Claude Opus 4.7 leads with 74.1 and GPT-5 scores 72.3.
[INTERNAL_LINK]
Long Context: Scout’s 10M-Token Window
Scout’s 10-million-token context window is the longest available in any open-weight model as of April 2026. To contextualize this scale, 10 million tokens roughly equal 7.5 million words or about 15,000 pages of dense text — enough to fit the entire Lord of the Rings trilogy, the complete works of Shakespeare, and the King James Bible in a single prompt with tokens to spare.
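Those figures come from the usual rules of thumb of roughly 0.75 words per token and about 500 words per dense page; a two-line sanity check:
# Sanity-check the 10M-token comparison using common rules of thumb.
tokens = 10_000_000
words = tokens * 0.75        # ~0.75 words per token for English prose
pages = words / 500          # ~500 words per dense page
print(f"~{words / 1e6:.1f}M words, ~{pages:,.0f} pages")   # ~7.5M words, ~15,000 pages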
Ach

