Run Google's Diffusion LLM Locally — DiffusionGemma
A 26B MoE model that generates 256-token blocks in parallel — 4x faster than autoregressive. Fits on a single 24GB GPU with quantization.
What Is DiffusionGemma
DiffusionGemma generates text like an image diffusion model: instead of typing one word at a time left-to-right, it fills in a whole 256-token block at once through iterative denoising — imagine a paragraph that "comes into focus" over multiple passes rather than being typed letter by letter.
It's a 26B-parameter Mixture of Experts (MoE) model built on the Gemma 4 backbone that activates only 3.8B parameters during inference. The paradigm shift isn't model size or dataset — it's the generation algorithm. Standard autoregressive LLMs (GPT, Llama, Qwen, Gemma) are causal — each token depends only on previous tokens. DiffusionGemma uses bidirectional attention during generation, meaning every token in a 256-token canvas can attend to every other token simultaneously. This enables self-correction: if the model becomes less confident about a token mid-generation, it can "re-noise" and replace it — something autoregressive models cannot do because they commit each token permanently.
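The commit-and-re-noise loop described above can be illustrated with a toy sketch. This is not DiffusionGemma's actual algorithm or API. `toy_predict` is a hypothetical stand-in for the denoiser that already knows the target text, and the threshold and confidence ranges are arbitrary values chosen only to make the behaviour visible.

```python
import random

MASK = "<mask>"

def toy_predict(block, target, rng):
    """Hypothetical stand-in for the denoiser: proposes a (token, confidence)
    pair for every position in one parallel pass."""
    filled = sum(tok != MASK for tok in block) / max(len(block), 1)
    proposals = []
    for i in range(len(block)):
        # The more of the block is already filled in, the likelier a good guess.
        correct = rng.random() < 0.5 + 0.5 * filled
        if correct:
            proposals.append((target[i], rng.uniform(0.6, 1.0)))
        else:
            proposals.append(("???", rng.uniform(0.0, 0.5)))
    return proposals

def denoise_block(target, max_steps=50, threshold=0.55, seed=0):
    """Refine a fully masked block until no masks remain or max_steps is hit."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    for step in range(1, max_steps + 1):
        for i, (token, confidence) in enumerate(toy_predict(block, target, rng)):
            # Commit confident proposals; re-mask ("re-noise") everything else,
            # including tokens that were committed in an earlier pass.
            block[i] = token if confidence >= threshold else MASK
        if MASK not in block:
            return block, step
    return block, max_steps

if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    block, steps = denoise_block(words)
    print(steps, " ".join(block))
```

Every position is re-evaluated on every pass, so a token committed early can still be dropped later. An autoregressive decoder has no equivalent step.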
Google released it under Apache 2.0 on June 10, 2026. Weights are on HuggingFace at google/diffusiongemma-26B-A4B-it. It supports text, image, and video inputs (multimodal), a 256K token context window, and 140+ languages.
Why DiffusionGemma Matters
1. It changes the inference bottleneck
Autoregressive models are memory-bandwidth-bound — the GPU spends most of its time loading weights just to generate 1 token at a time. DiffusionGemma drafts and refines 256 tokens in parallel, keeping tensor cores busy and hitting 700–1,000+ tokens/s on dedicated GPUs.
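The bandwidth bottleneck can be made concrete with back-of-envelope arithmetic. The sketch below assumes single-stream decoding where every generated token streams the active weights from VRAM once, and it ignores KV-cache reads and kernel overhead. The 1,000 GB/s bandwidth and 4.5 bits per weight are illustrative values, not measurements.

```python
def weights_gb(params_billion, bits_per_weight):
    """Size of the weights in GB (10^9 bytes) at a given precision."""
    return params_billion * bits_per_weight / 8

def decode_ceiling_tokens_per_s(bandwidth_gb_s, weights_read_gb_per_token):
    """Upper bound on one-token-at-a-time decode speed when each token
    requires reading the active weights from memory once."""
    return bandwidth_gb_s / weights_read_gb_per_token

if __name__ == "__main__":
    # Assumed example: dense 32B model, ~4.5 bits/weight, ~1,000 GB/s of bandwidth.
    dense_32b = weights_gb(32, 4.5)  # 18.0 GB read per generated token
    print(round(decode_ceiling_tokens_per_s(1000, dense_32b), 1))  # 55.6
```

Generating a whole block per pass spreads that weight read across many tokens, which is why the limit moves from bandwidth toward compute.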
This speed comes with a quality tradeoff. Google is explicit: "DiffusionGemma's overall output quality is lower than standard Gemma 4." The value prop is NOT "better LLM" — it's "4x faster at acceptable quality for specific workloads."
2. Runs in roughly 16–18GB of VRAM with quantization
A 26B-class model fitting on a single consumer GPU at 4-bit, running at potentially 200–700+ tokens/s, opens a practical speed tier that didn't exist before. Comparable autoregressive 25–30B models need 24GB+ and run at 50–80 tokens/s.
Not all hardware benefits equally. Apple Silicon Macs and low-compute GPUs (RTX 3060, 4060) won't see the speedup: on unified-memory Macs the bottleneck shifts back to memory bandwidth, and low-compute cards lack the tensor throughput to denoise a whole block quickly.
3. Apache 2.0 — truly permissive
Unlike Llama's custom license or Gemma 3's restrictive terms, Apache 2.0 lets you build products on this, fine-tune it, and distribute it commercially, subject only to the license's attribution and notice requirements. The real constraint isn't the license; it's that this is a new paradigm with less mature tooling.
4. Local-first design
DiffusionGemma is optimized for single-user, low-concurrency local inference. In high-QPS cloud serving, autoregressive models can batch users efficiently — so DiffusionGemma's advantage paradoxically narrows in cloud deployment. For individual developers or small teams running locally, this is where it shines.
How to Run DiffusionGemma Locally
Prerequisites: Hardware
| Quantization | VRAM | Tested GPUs | Expected Speed |
|---|---|---|---|
| 4-bit (Q4_K_M) | 15–17 GB | RTX 4090, 5090, 3090 (24GB) | ~200–400 tokens/s |
| 8-bit (Q8_0) | 27–29 GB | RTX 5090 (32GB), A6000 | ~500–700+ tokens/s |
| FP16 / BF16 | 52 GB | H100, A100, H200, dual 4090 | 1,000+ tokens/s |
| NVFP4 (NVIDIA 4-bit) | ~15 GB | Blackwell (5090, RTX PRO) | 700+ tokens/s |
Cards with limited compute (RTX 3060, RTX 4060, laptop GPUs) and Apple Silicon Macs will see significantly reduced speedups.
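The VRAM column can be roughly reproduced from parameter count and bits per weight. This is a sketch for sanity-checking the table, not an official sizing tool. It counts weights only (KV cache and activations come on top), and the 4.85 bits per weight for Q4_K_M is an approximate average that I am assuming, since that format mixes block types.

```python
# Approximate bits per weight. Q8_0 stores 32 int8 values plus one fp16 scale
# per block (8.5 bits/weight); the Q4_K_M figure is a rough average (assumption).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "BF16": 16.0}

def estimate_weight_gb(params_billion, quant):
    """Weights-only footprint in GB (10^9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{estimate_weight_gb(26, quant):.1f} GB")
    # Q4_K_M: ~15.8 GB
    # Q8_0: ~27.6 GB
    # BF16: ~52.0 GB
```

All 26B parameters must be resident even though only 3.8B are active per token, so the footprint follows the total parameter count.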
Method 1: vLLM (Recommended)
vLLM has day-zero support with optimized block denoising kernels. This is the fastest path — and the only one with independently verified speed numbers (1,288 tokens/s on H200).
pip install "vllm>=0.12.0"

from vllm import LLM, SamplingParams

# Load the model on a single GPU with a 64K-token context limit.
llm = LLM(
    model="google/diffusiongemma-26B-A4B-it",
    trust_remote_code=True,
    tensor_parallel_size=1,
    max_model_len=65536,
)
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=2048,
)
# generate() takes a list of prompts and returns one result per prompt.
outputs = llm.generate(["Explain how diffusion language models work."], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
Method 2: llama.cpp (Bleeding Edge)
llama.cpp support is via unmerged PR #24427. Clone the repository and check out the PR branch:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/24427/head:diffusiongemma
git checkout diffusiongemma
# Build with CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
# Run with GGUF quantized weights
./build/bin/llama-cli \
-m DiffusionGemma-26B-A4B-it-Q4_K_M.gguf \
-p "Explain how diffusion language models work." \
-n 512 \
-ngl 99
This is bleeding edge — expect bugs. Check PR #24427 for latest status before using in production.
Method 3: Unsloth
pip install unsloth
pip install flash-attn --no-build-isolation  # optional

from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="google/diffusiongemma-26B-A4B-it",
    max_seq_length=65536,
    load_in_4bit=True,  # fits in ~16GB
)
Unsloth provides a hardware-requirements table at unsloth.ai/docs/models/diffusiongemma with VRAM estimates for all quantization levels.
Method 4: HuggingFace Transformers
Simplest path for a quick first test:
pip install "transformers>=4.55.0" accelerate

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "google/diffusiongemma-26B-A4B-it"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Move inputs to wherever device_map placed the model instead of hard-coding "cuda".
inputs = tokenizer("Explain diffusion language models in one paragraph.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Method 5: Ollama (Community Workaround)
Ollama support is coming via llama.cpp PR #24427. Until the PR merges, you can use the bridge workaround:
- Build llama.cpp with diffusion support (PR #24427 branch)
- Download or convert a GGUF model
- Create a Modelfile and run with ollama create
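For the last step, a minimal Modelfile can be generated as sketched below. `FROM` and `PARAMETER` are standard Modelfile instructions, but the parameter values and the `diffusiongemma` model tag are illustrative assumptions. The source does not confirm whether stock Ollama can load this GGUF before the PR merges, so treat this as a starting point only.

```python
def render_modelfile(gguf_path, temperature=0.7, num_ctx=8192):
    """Return the text of a minimal Ollama Modelfile pointing at a local GGUF."""
    lines = [
        f"FROM {gguf_path}",
        f"PARAMETER temperature {temperature}",
        f"PARAMETER num_ctx {num_ctx}",
    ]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Save the printed text as a file named "Modelfile", then run:
    #   ollama create diffusiongemma -f Modelfile
    print(render_modelfile("./DiffusionGemma-26B-A4B-it-Q4_K_M.gguf"), end="")
```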
Quantization for Consumer GPUs
NVFP4 (Blackwell GPUs only: RTX 5090 / RTX PRO 6000)
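# Assumes model_id and the imports from the Transformers example above.
# NOTE: nvfp4 is not defined in this snippet and is not a built-in torch dtype;
# check the model card for the supported NVFP4 loading path before running.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=nvfp4,
    device_map="auto",
    trust_remote_code=True,
)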
Bitsandbytes 4-bit (Ampere+: RTX 3090, 4090)
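import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/diffusiongemma-26B-A4B-it"
# 4-bit weights with BF16 compute; double quantization also compresses the scales.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)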
GGUF Q4_K_M (llama.cpp path)
Download community-converted GGUF files from HuggingFace. Expect roughly 16GB of VRAM and ~200–400 tokens/s on an RTX 4090 (community estimate).
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| trust_remote_code=True required | Custom diffusion attention modules not in standard transformers | Add trust_remote_code=True to all from_pretrained() calls |
| OOM on 16GB GPU at 4-bit | Bitsandbytes 4-bit estimates optimistic for 26B-class | Try GGUF Q4_K_M via llama.cpp PR #24427, or use device_map="sequential" |
| KeyError: 'denoising_steps' | Using standard model.generate() without diffusion params | Check model card for num_denoising_steps, denoising_temperature |
| Slow on RTX 3060/4060/laptop GPU | Compute-limited cards can't saturate tensor cores | Speed advantage requires high-CUDA-core GPUs. Consider autoregressive models. |
| Apple Silicon: no speedup | Unified memory shifts bottleneck back to bandwidth | Google explicitly warns about this. Expect autoregressive-comparable performance. |
Real Performance & Honest Assessment
Speed Benchmarks
| Hardware | Tokens/s | Quantization | Source |
|---|---|---|---|
| H200 | 1,288 | BF16 (vLLM) | vLLM blog — independently verified |
| H100 | ~1,000+ | BF16 | Google (self-reported) — not independently confirmed |
| RTX 5090 | ~700+ | NVFP4 | Google (self-reported) — not independently confirmed |
| RTX 4090 | ~200–400 | Q4_K_M (GGUF) | Community estimate — no published benchmarks yet |
| RTX 3090 | ~150–300 | Q4_K_M (GGUF) | Community estimate — no published benchmarks yet |
Speed honesty note: Only the H200 1,288 tokens/s figure has independent verification (from vLLM's own blog). The 1,000+ H100 and 700+ RTX 5090 numbers are Google's self-reported benchmarks. Community benchmarks will emerge over the next 1–2 weeks.
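Until published benchmarks arrive, you can measure throughput on your own hardware. The helper below is a generic sketch that is not tied to any backend. `generate` is a hypothetical callable you supply, and it must return the number of tokens it produced.

```python
import time

def measure_tokens_per_s(generate, prompt, runs=3):
    """Median tokens/s over several timed runs.

    generate(prompt) must return the number of tokens it generated.
    """
    generate(prompt)  # warm-up run, excluded from timing
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    # Middle element of the sorted rates (upper median when runs is even).
    return sorted(rates)[len(rates) // 2]

if __name__ == "__main__":
    def fake_generate(prompt):
        time.sleep(0.05)  # stand-in for real inference
        return 100

    print(f"{measure_tokens_per_s(fake_generate, 'hello'):.0f} tokens/s")
```

With vLLM, the callable could wrap `llm.generate` and return `len(outputs[0].outputs[0].token_ids)`. Report the quantization, prompt length, and output length alongside any number you share.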
Quality vs. Autoregressive Models
Where it's competitive
- Short-form text generation (1–2 paragraphs)
- Summarization & factual extraction
- Translation (140+ languages)
- Data augmentation / synthetic text
- Chat / basic Q&A
Where it falls short
- Complex reasoning (math, logic, multi-hop)
- Long-form creative writing (3+ paragraphs)
- Code generation (multi-function/multi-file)
- Nuanced instruction following
Current Limitations
Quality drop is real and permanent
Google is honest: DiffusionGemma is not and will not be "Gemma 4 quality at 4x speed." It's a different tradeoff. Accept it or choose a different model.
Tooling is bleeding edge
llama.cpp support is an unmerged PR. Ollama has no native support yet. Community GGUF quantizations may vary in quality. The ecosystem will mature, but in June 2026, you're an early adopter.
Hardware pickiness
The speed advantage collapses on low-compute GPUs (3060, 4060, laptops) and Apple Silicon. Check your hardware before getting excited about those 700+ tokens/s numbers.
Fine-tuning is uncharted territory
No established LoRA/QLoRA recipes, no community adapter ecosystem. Tuning a diffusion model is conceptually different from autoregressive — you're tuning denoising behavior, not next-token prediction.
Use Cases
Local LLM Enthusiast with a 24GB GPU
On an RTX 4090 or 3090, DiffusionGemma at 4-bit fits in ~16GB VRAM and delivers potentially 200–400 tokens/s — a speed tier no autoregressive 20B+ model can touch on consumer hardware. For comparison: Qwen 2.5 32B at Q4 runs at ~40–60 tokens/s on a 4090.
Synthetic Data / Data Augmentation
When you need to generate thousands of text samples and inference intelligence doesn't need to be SOTA — just speed and hardware efficiency. Block-level parallel generation makes it ideal for batch synthetic data where volume matters more than per-sample reasoning depth.
Real-Time Interactive Applications
Chat interfaces, live transcription summarization, code autocomplete, content moderation — any scenario where sub-100ms latency changes the user experience. Even a 50 tokens/s autoregressive model has noticeable lag. DiffusionGemma at 700+ tokens/s makes iteration feel instantaneous.
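The latency difference is simple arithmetic. This sketch ignores prompt processing and any per-block startup cost, and the 300-token reply length is an arbitrary example.

```python
def response_seconds(n_tokens, tokens_per_s):
    """Time to generate n_tokens at a steady generation rate."""
    return n_tokens / tokens_per_s

if __name__ == "__main__":
    for rate in (50, 300, 700):
        print(f"{rate:>4} tok/s -> {response_seconds(300, rate):.2f} s for a 300-token reply")
    #   50 tok/s -> 6.00 s for a 300-token reply
    #  300 tok/s -> 1.00 s for a 300-token reply
    #  700 tok/s -> 0.43 s for a 300-token reply
```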
Local-First / Privacy-Sensitive Deployment
Building a product that must run fully offline (legal, healthcare, defense)? Apache 2.0 means no license negotiation. 4-bit fits on a single consumer GPU. You don't need top-tier reasoning — you need fast, private, on-device text processing.
When NOT to Use DiffusionGemma
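- You're on Apple Silicon or a low-compute GPU (RTX 3060, 4060, laptops): you won't see the speed advantage. Use Llama 4 or Qwen 3.6 instead.
- You need complex reasoning, code generation, or long-form writing: the quality gap vs. autoregressive models of similar size is significant.
- You're serving many concurrent users in the cloud: traditional batched autoregressive inference may be more cost-effective.
- You need stable, production-tested tooling: the ecosystem isn't ready. Wait 2–3 months.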
FAQ
What exactly is a "diffusion language model"? How is it different from GPT/Llama?
Autoregressive models generate one token at a time, left-to-right, each token depending on all previous ones. A diffusion model works on 256-token blocks simultaneously — it starts with random noise, then iteratively "denoises" toward coherent text, letting every token in the block attend to every other token bidirectionally.
Think of it as the difference between typing a sentence word by word (autoregressive) vs. giving a blurry photo of a sentence and letting it sharpen into focus over multiple passes (diffusion). The key consequence: autoregressive models can't go back and fix tokens they already generated. Diffusion models can.
Can I run this on my Apple Silicon Mac (M3/M4 Max/Ultra)?
You can run it, but you probably won't get the 4x speedup. Google's blog explicitly warns that unified-memory architectures may not see the same acceleration. On Apple Silicon, expect performance similar to autoregressive models of comparable size. If you're on Mac, stick with Llama or Qwen.
How does the quality compare to Llama 4 or Qwen 3.6?
Google is transparent: "DiffusionGemma's overall output quality is lower than standard Gemma 4." For short-form generation and summarization, quality is competitive. For complex reasoning, long-form writing, and code, autoregressive models win by a meaningful margin. Exact benchmark comparisons vs. Llama 4 and Qwen 3.6 will emerge as the community runs them.
Is there Ollama support?
No. llama.cpp support exists as PR #24427 (unmerged as of June 11, 2026). Ollama typically integrates llama.cpp — once PR #24427 merges, Ollama support will follow. Timeline: weeks to months, not days.
Can I fine-tune DiffusionGemma?
Theoretically yes, practically not yet. The architecture is based on Gemma 4, so the backbone weights can be adapted. But diffusion fine-tuning (tuning denoising behavior vs. next-token prediction) requires different loss functions and training recipes. No established LoRA/QLoRA pipelines exist. If fine-tuning is in your plan, wait 2–3 months.
What GPU do I actually need to see the speedup?
RTX 3090 or better, or the Blackwell generation (RTX 5090, RTX PRO 6000) for NVFP4 native 4-bit. RTX 3060, 4060, and laptop GPUs have insufficient compute to saturate tensor cores for block denoising. Rule of thumb: if your GPU can't do BF16 at >100 TFLOPS, the speed advantage diminishes significantly.
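The rule of thumb above and the hardware table can be combined into a quick check. This is a sketch only. The VRAM thresholds are the upper bounds from the table and leave no headroom for long contexts, and you must supply your GPU's BF16 TFLOPS figure from its spec sheet.

```python
# (label, VRAM needed in GB per the hardware table), largest first
QUANT_VRAM_GB = [("BF16", 52), ("Q8_0", 29), ("Q4_K_M", 17)]

def plan_setup(vram_gb, bf16_tflops, unified_memory=False):
    """Return (largest quantization that fits or None, whether a speedup is likely)."""
    quant = next((name for name, need in QUANT_VRAM_GB if vram_gb >= need), None)
    speedup_likely = bf16_tflops > 100 and not unified_memory
    return quant, speedup_likely

if __name__ == "__main__":
    # Hypothetical 24 GB discrete card rated at 150 BF16 TFLOPS.
    print(plan_setup(24, 150))                        # ('Q4_K_M', True)
    # Hypothetical 64 GB unified-memory machine rated at 30 BF16 TFLOPS.
    print(plan_setup(64, 30, unified_memory=True))    # ('BF16', False)
```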
Are the 1,000+ tokens/s H100 and 700+ tokens/s RTX 5090 numbers real?
The only independently verified number is vLLM's 1,288 tokens/s on H200. The 1,000+ H100 and 700+ RTX 5090 figures are Google's self-reported benchmarks — conditions not publicly documented. Community benchmarks will emerge within 1–2 weeks. Treat Google's numbers as "best-case under optimized conditions" until independent testing confirms.
Can I use this commercially?
Yes. Apache 2.0 is one of the most permissive open-source licenses. You can use it in commercial products, modify it, distribute it, and you don't need to open-source your modifications. Unlike Llama's restrictive terms (>700M monthly active users = need Meta's permission) or Gemma's source-available terms, there are no usage-based restrictions.
When should I use DiffusionGemma vs. sticking with Qwen/Llama/Gemma?
Use DiffusionGemma if: (1) you're on a high-compute GPU (4090+), (2) throughput matters more than top-tier quality, (3) you're doing synthetic data, real-time chat, or draft generation. Stick with autoregressive if: (1) you're on Apple Silicon or a lower-end GPU, (2) you need complex reasoning, code, or long-form writing, (3) you need stable, production-tested tooling with Ollama/llama.cpp support.