Google DeepMind · Apache 2.0

Run Google's Diffusion LLM Locally — DiffusionGemma

A 26B MoE model that generates 256-token blocks in parallel, roughly 4x faster than autoregressive decoding on high-compute GPUs. Fits on a single 24GB GPU with quantization.

1,000+ tokens/s on H100

700+ tokens/s on RTX 5090

18 GB VRAM with quantization

What Is DiffusionGemma

DiffusionGemma generates text like an image diffusion model: instead of typing one word at a time left-to-right, it fills in a whole 256-token block at once through iterative denoising — imagine a paragraph that "comes into focus" over multiple passes rather than being typed letter by letter.

It's a 26B-parameter Mixture of Experts (MoE) model built on the Gemma 4 backbone that activates only 3.8B parameters during inference. The paradigm shift isn't model size or dataset — it's the generation algorithm. Standard autoregressive LLMs (GPT, Llama, Qwen, Gemma) are causal — each token depends only on previous tokens. DiffusionGemma uses bidirectional attention during generation, meaning every token in a 256-token canvas can attend to every other token simultaneously. This enables self-correction: if the model becomes less confident about a token mid-generation, it can "re-noise" and replace it — something autoregressive models cannot do because they commit each token permanently.
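To make the "re-noise and replace" idea concrete, here is a toy sketch in plain Python. It is not DiffusionGemma's algorithm: the `target` list stands in for what a real model would predict, the confidences are random numbers rather than model probabilities, and `toy_block_denoise`, `renoise_below`, and `MASK` are hypothetical names invented for illustration. What it does show is the shape of the loop: every position in the block is revisited on every pass, and a position that was filled in earlier can be masked again if confidence drops.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a noised position


def toy_block_denoise(target, steps=4, renoise_below=0.3, seed=0):
    """Toy illustration of block-parallel denoising with re-noising.

    Returns the canvas after each pass. Confidences are random, so this
    only demonstrates control flow, not real model behavior.
    """
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = []
    for step in range(steps):
        last_pass = step == steps - 1
        # Every position is updated in the same pass (block-parallel).
        for i in range(len(canvas)):
            confidence = rng.random()
            if last_pass or confidence >= renoise_below:
                canvas[i] = target[i]  # commit or re-commit this position
            else:
                canvas[i] = MASK       # re-noise a low-confidence position
        history.append(list(canvas))
    return history


target = "the cat sat on the mat".split()
for number, canvas in enumerate(toy_block_denoise(target), start=1):
    print(number, " ".join(canvas))
```

An autoregressive loop would instead append one token per iteration and never revisit earlier positions.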

Google released it under Apache 2.0 on June 10, 2026. Weights are on HuggingFace at google/diffusiongemma-26B-A4B-it. It supports text, image, and video inputs (multimodal), a 256K token context window, and 140+ languages.

Why DiffusionGemma Matters

1. It changes the inference bottleneck

Autoregressive models are memory-bandwidth-bound — the GPU spends most of its time loading weights just to generate 1 token at a time. DiffusionGemma drafts and refines 256 tokens in parallel, keeping tensor cores busy and hitting 700–1,000+ tokens/s on dedicated GPUs.
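A rough way to see the bandwidth limit: in single-stream autoregressive decoding, each new token requires streaming the active weights from VRAM once, so memory bandwidth divided by bytes read per token gives a ceiling on tokens/s. The sketch below is back-of-envelope arithmetic, not a benchmark. The 18 GB figure for a dense ~30B model at 4-bit is an assumption, 1,008 GB/s is the RTX 4090's published memory bandwidth, and `decode_ceiling_tokens_per_s` is a hypothetical helper name.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s, weights_read_gb):
    """Upper bound on single-stream decode speed when every token
    requires reading `weights_read_gb` of weights from memory."""
    return bandwidth_gb_s / weights_read_gb


# Assumption: dense ~30B model at 4-bit occupies about 18 GB.
ceiling = decode_ceiling_tokens_per_s(bandwidth_gb_s=1008, weights_read_gb=18)
print(f"{ceiling:.0f} tokens/s")  # → 56 tokens/s
```

That ceiling sits in the same range as the 50–80 tokens/s quoted on this page for comparable autoregressive models. A block-parallel model spreads each weight read across many tokens, which is why its limit moves from bandwidth to compute.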

This speed comes with a quality tradeoff. Google is explicit: "DiffusionGemma's overall output quality is lower than standard Gemma 4." The value prop is NOT "better LLM" — it's "4x faster at acceptable quality for specific workloads."

2. Runs on 18GB VRAM with quantization

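A 26B-class model that fits on a single consumer GPU at 4-bit and runs at potentially 200–700+ tokens/s opens a practical speed tier that didn't exist before. (The hardware table below lists 15–17 GB for Q4_K_M; the 18 GB headline figure presumably leaves headroom for context.) Comparable autoregressive 25–30B models need 24GB+ and run at 50–80 tokens/s.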

Not all hardware benefits equally. Apple Silicon Macs and low-compute GPUs (RTX 3060, 4060) won't see the speedup — the compute-bound advantage disappears when the bottleneck shifts back to memory bandwidth.

3. Apache 2.0 — truly permissive

Unlike Llama's custom license or Gemma 3's restrictive terms, Apache 2.0 means you can build products on this, fine-tune it, and distribute it commercially with no usage restrictions, subject only to the license's attribution and notice requirements. The real constraint isn't the license; it's that this is a new paradigm with less mature tooling.

4. Local-first design

DiffusionGemma is optimized for single-user, low-concurrency local inference. In high-QPS cloud serving, autoregressive models can batch users efficiently — so DiffusionGemma's advantage paradoxically narrows in cloud deployment. For individual developers or small teams running locally, this is where it shines.

How to Run DiffusionGemma Locally

Prerequisites: Hardware

Quantization	VRAM	Tested GPUs	Expected Speed
4-bit (Q4_K_M)	15–17 GB	RTX 4090, 5090, 3090 (24GB)	~200–400 tokens/s
8-bit (Q8_0)	27–29 GB	RTX 5090 (32GB), A6000	~500–700+ tokens/s
FP16 / BF16	52 GB	H100, A100, H200, dual 4090	1,000+ tokens/s
NVFP4 (NVIDIA 4-bit)	~15 GB	Blackwell (5090, RTX PRO)	700+ tokens/s

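Cards with limited compute (RTX 3060, RTX 4060, laptop GPUs) and Apple Silicon Macs will see significantly reduced speedups. Note also that two RTX 4090s provide 48 GB combined, which is below the 52 GB listed for BF16, so that configuration would need some CPU offloading.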
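The VRAM column can be sanity-checked with weights-only arithmetic: parameters times bits per weight, divided by 8. The sketch below uses the nominal bit widths, and `weight_memory_gb` is a hypothetical helper name. Real footprints run higher than these numbers because quantized formats store per-block scales alongside the weights, and the runtime adds KV cache and activation buffers on top.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(26, bits):.0f} GB")
# → BF16: 52 GB
# → 8-bit: 26 GB
# → 4-bit: 13 GB
```

The BF16 result matches the 52 GB row exactly; the 8-bit and 4-bit rows in the table sit a few GB above the nominal values, which is consistent with that overhead.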

Method 1: vLLM (Recommended)

vLLM has day-zero support with optimized block denoising kernels. This is the fastest path — and the only one with independently verified speed numbers (1,288 tokens/s on H200).

# Quote the requirement so the shell does not treat ">" as a redirect
pip install "vllm>=0.12.0"

from vllm import LLM, SamplingParams

llm = LLM(
    model="google/diffusiongemma-26B-A4B-it",
    trust_remote_code=True,
    tensor_parallel_size=1,
    max_model_len=65536,
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=2048,
)

outputs = llm.generate(["Explain how diffusion language models work."], sampling_params)
for output in outputs:
    print(output.outputs[0].text)

Method 2: llama.cpp (Bleeding Edge)

llama.cpp support currently lives in unmerged PR #24427. Clone the repository and check out the PR branch:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/24427/head:diffusiongemma
git checkout diffusiongemma

# Build with CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Run with GGUF quantized weights
./build/bin/llama-cli \
  -m DiffusionGemma-26B-A4B-it-Q4_K_M.gguf \
  -p "Explain how diffusion language models work." \
  -n 512 \
  -ngl 99

This is bleeding edge — expect bugs. Check PR #24427 for latest status before using in production.

Method 3: Unsloth

pip install unsloth
pip install flash-attn --no-build-isolation  # optional

from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="google/diffusiongemma-26B-A4B-it",
    max_seq_length=65536,
    load_in_4bit=True,  # fits in ~16GB
)

Unsloth provides a hardware-requirements table at unsloth.ai/docs/models/diffusiongemma with VRAM estimates for all quantization levels.

Method 4: HuggingFace Transformers

Simplest path for a quick first test:

# Quote the requirement so the shell does not treat ">" as a redirect
pip install "transformers>=4.55.0" accelerate

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "google/diffusiongemma-26B-A4B-it"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Move inputs to the device where device_map="auto" placed the model
inputs = tokenizer("Explain diffusion language models in one paragraph.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Method 5: Ollama (Community Workaround)

Ollama support is coming via llama.cpp PR #24427. Until the PR merges, you can use the bridge workaround:

1. Build llama.cpp with diffusion support (PR #24427 branch).
2. Download or convert a GGUF model.
3. Create a Modelfile and register it with `ollama create`.

Quantization for Consumer GPUs

NVFP4 (Blackwell GPUs only: RTX 5090 / RTX PRO 6000)

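# NOTE: `nvfp4` is an undefined name in this snippet, so the call raises
# NameError as written. Treat it as a placeholder and check the model card
# for the supported NVFP4 loading path.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=nvfp4,  # placeholder, see note above
    device_map="auto",
    trust_remote_code=True,
)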

Bitsandbytes 4-bit (Ampere+: RTX 3090, 4090)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/diffusiongemma-26B-A4B-it"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

GGUF Q4_K_M (llama.cpp path)

Download community-converted GGUF files from HuggingFace. Expect roughly 16GB of VRAM and ~200–400 tokens/s on an RTX 4090 (a community estimate, with no published benchmarks yet).

Troubleshooting

Error	Cause	Fix
`trust_remote_code=True` required	Custom diffusion attention modules not in standard transformers	Add `trust_remote_code=True` to all `from_pretrained()` calls
OOM on 16GB GPU at 4-bit	Bitsandbytes 4-bit estimates optimistic for 26B-class	Try GGUF Q4_K_M via llama.cpp PR #24427, or use `device_map="sequential"`
`KeyError: 'denoising_steps'`	Using standard `model.generate()` without diffusion params	Check model card for `num_denoising_steps`, `denoising_temperature`
Slow on RTX 3060/4060/laptop GPU	Compute-limited cards can't saturate tensor cores	Speed advantage requires high-CUDA-core GPUs. Consider autoregressive models.
Apple Silicon: no speedup	Unified memory shifts bottleneck back to bandwidth	Google explicitly warns about this. Expect autoregressive-comparable performance.

Real Performance & Honest Assessment

Speed Benchmarks

Hardware	Tokens/s	Quantization	Source
H200	1,288	BF16 (vLLM)	vLLM blog — independently verified
H100	~1,000+	BF16	Google (self-reported) — not independently confirmed
RTX 5090	~700+	NVFP4	Google (self-reported) — not independently confirmed
RTX 4090	~200–400	Q4_K_M (GGUF)	Community estimate — no published benchmarks yet
RTX 3090	~150–300	Q4_K_M (GGUF)	Community estimate — no published benchmarks yet

Speed honesty note: Only the H200 figure of 1,288 tokens/s has been verified independently of Google (it comes from vLLM's own blog). The 1,000+ H100 and 700+ RTX 5090 numbers are Google's self-reported benchmarks. Community benchmarks are expected over the next 1–2 weeks.
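If you want your own number rather than waiting for community results, a small timing harness is enough. The sketch below is backend-agnostic: `measure_tokens_per_s` is a hypothetical helper, and `generate` is any callable you supply that takes a prompt and returns the number of tokens it produced. It reports the median over several runs so a single lucky run does not inflate the result.

```python
import statistics
import time


def measure_tokens_per_s(generate, prompt, runs=3):
    """Median tokens/s over `runs` calls.

    `generate(prompt)` must return the number of tokens generated;
    it is a wrapper you write around your own backend.
    """
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            rates.append(n_tokens / elapsed)
    return statistics.median(rates)
```

With the vLLM setup shown earlier, the wrapper could count the generated token IDs on the first completion of the returned output. Run one warm-up call first so model loading and kernel compilation do not land in the measurement.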

Quality vs. Autoregressive Models

Where it's competitive

Short-form text generation (1–2 paragraphs)
Summarization & factual extraction
Translation (140+ languages)
Data augmentation / synthetic text
Chat / basic Q&A

Where it falls short

Complex reasoning (math, logic, multi-hop)
Long-form creative writing (3+ paragraphs)
Code generation (multi-function/multi-file)
Nuanced instruction following

Current Limitations

Quality drop is real and permanent

Google is honest: DiffusionGemma is not and will not be "Gemma 4 quality at 4x speed." It's a different tradeoff. Accept it or choose a different model.

Tooling is bleeding edge

llama.cpp support is an unmerged PR. Ollama has no native support yet; the only route is the community workaround built on that PR branch. Community GGUF quantizations may vary in quality. The ecosystem will mature, but in June 2026, you're an early adopter.

Hardware pickiness

The speed advantage collapses on low-compute GPUs (3060, 4060, laptops) and Apple Silicon. Check your hardware before getting excited about those 700+ tokens/s numbers.

Fine-tuning is uncharted territory

No established LoRA/QLoRA recipes, no community adapter ecosystem. Tuning a diffusion model is conceptually different from autoregressive — you're tuning denoising behavior, not next-token prediction.

Use Cases

Local LLM Enthusiast with a 24GB GPU

On an RTX 4090 or 3090, DiffusionGemma at 4-bit fits in ~16GB VRAM and delivers potentially 200–400 tokens/s — a speed tier no autoregressive 20B+ model can touch on consumer hardware. For comparison: Qwen 2.5 32B at Q4 runs at ~40–60 tokens/s on a 4090.

Synthetic Data / Data Augmentation

When you need to generate thousands of text samples and inference intelligence doesn't need to be SOTA — just speed and hardware efficiency. Block-level parallel generation makes it ideal for batch synthetic data where volume matters more than per-sample reasoning depth.

Real-Time Interactive Applications

Chat interfaces, live transcription summarization, code autocomplete, content moderation — any scenario where sub-100ms latency changes the user experience. Even a 50 tokens/s autoregressive model has noticeable lag. DiffusionGemma at 700+ tokens/s makes iteration feel instantaneous.
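To put the lag claim in numbers, divide response length by throughput. This is simple arithmetic using the rates quoted above; it ignores prompt processing and any per-block startup cost, and `response_time_s` is a hypothetical helper name.

```python
def response_time_s(n_tokens, tokens_per_s):
    """Seconds needed to produce n_tokens at a steady generation rate."""
    return n_tokens / tokens_per_s


for rate in (50, 700):
    print(f"{rate} tokens/s -> {response_time_s(200, rate):.2f} s for 200 tokens")
# → 50 tokens/s -> 4.00 s for 200 tokens
# → 700 tokens/s -> 0.29 s for 200 tokens
```

At 700 tokens/s, a 100 ms budget covers about 70 tokens, which is enough for a short completion but not a full paragraph.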

Local-First / Privacy-Sensitive Deployment

Building a product that must run fully offline (legal, healthcare, defense)? Apache 2.0 means no license negotiation. 4-bit fits on a single consumer GPU. You don't need top-tier reasoning — you need fast, private, on-device text processing.

When NOT to Use DiffusionGemma

Apple Silicon (M1/M2/M3/M4)

You won't see the speed advantage. Use Llama 4 or Qwen 3.6 instead.

Complex reasoning or code gen

The quality gap vs. autoregressive models of similar size is significant.

Cloud API serving many concurrent users

Traditional batched autoregressive inference may be more cost-effective.

Need fine-tuning now

The ecosystem isn't ready. Wait 2–3 months.

FAQ

What exactly is a "diffusion language model"? How is it different from GPT/Llama?

Autoregressive models generate one token at a time, left-to-right, each token depending on all previous ones. A diffusion model works on 256-token blocks simultaneously — it starts with random noise, then iteratively "denoises" toward coherent text, letting every token in the block attend to every other token bidirectionally.

Think of it as the difference between typing a sentence word by word (autoregressive) vs. giving a blurry photo of a sentence and letting it sharpen into focus over multiple passes (diffusion). The key consequence: autoregressive models can't go back and fix tokens they already generated. Diffusion models can.

Can I run this on my Apple Silicon Mac (M3/M4 Max/Ultra)?

You can run it, but you probably won't get the 4x speedup. Google's blog explicitly warns that unified-memory architectures may not see the same acceleration. On Apple Silicon, expect performance similar to autoregressive models of comparable size. If you're on Mac, stick with Llama or Qwen.

How does the quality compare to Llama 4 or Qwen 3.6?

Google is transparent: "DiffusionGemma's overall output quality is lower than standard Gemma 4." For short-form generation and summarization, quality is competitive. For complex reasoning, long-form writing, and code, autoregressive models win by a meaningful margin. Exact benchmark comparisons vs. Llama 4 and Qwen 3.6 will emerge as the community runs them.

Is there Ollama support?

Not natively. llama.cpp support exists as PR #24427 (unmerged as of June 11, 2026), and the community workaround described above depends on that branch. Ollama typically integrates llama.cpp, so once PR #24427 merges, Ollama support should follow. Timeline: weeks to months, not days.

Can I fine-tune DiffusionGemma?

Theoretically yes, practically not yet. The architecture is based on Gemma 4, so the backbone weights can be adapted. But diffusion fine-tuning (tuning denoising behavior vs. next-token prediction) requires different loss functions and training recipes. No established LoRA/QLoRA pipelines exist. If fine-tuning is in your plan, wait 2–3 months.

What GPU do I actually need to see the speedup?

RTX 3090 or better, or the Blackwell generation (RTX 5090, RTX PRO 6000) for NVFP4 native 4-bit. RTX 3060, 4060, and laptop GPUs have insufficient compute to saturate tensor cores for block denoising. Rule of thumb: if your GPU can't do BF16 at >100 TFLOPS, the speed advantage diminishes significantly.

Are the 1,000+ tokens/s H100 and 700+ tokens/s RTX 5090 numbers real?

The only independently verified number is vLLM's 1,288 tokens/s on H200. The 1,000+ H100 and 700+ RTX 5090 figures are Google's self-reported benchmarks — conditions not publicly documented. Community benchmarks will emerge within 1–2 weeks. Treat Google's numbers as "best-case under optimized conditions" until independent testing confirms.

Can I use this commercially?

Yes. Apache 2.0 is one of the most permissive open-source licenses. You can use it in commercial products, modify it, distribute it, and you don't need to open-source your modifications. Unlike Llama's restrictive terms (>700M monthly active users = need Meta's permission) or Gemma's source-available terms, there are no usage-based restrictions.

When should I use DiffusionGemma vs. sticking with Qwen/Llama/Gemma?

Use DiffusionGemma if: (1) you're on a high-compute GPU (4090+), (2) throughput matters more than top-tier quality, (3) you're doing synthetic data, real-time chat, or draft generation. Stick with autoregressive if: (1) you're on Apple Silicon or a lower-end GPU, (2) you need complex reasoning, code, or long-form writing, (3) you need stable, production-tested tooling with Ollama/llama.cpp support.
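The checklist above can be written down as a small function. This is only a restatement of the page's rules of thumb, including the ~100 BF16 TFLOPS threshold from the GPU question; `recommend_model` and its parameters are hypothetical names, and the inputs are things you supply about your own setup.

```python
def recommend_model(bf16_tflops, apple_silicon=False,
                    needs_reasoning_or_code=False,
                    needs_stable_tooling=False):
    """Return (choice, reasons) following the checklist on this page."""
    reasons = []
    if apple_silicon:
        reasons.append("Apple Silicon: speedup not expected")
    elif bf16_tflops <= 100:
        reasons.append("GPU at or below ~100 BF16 TFLOPS: speedup diminishes")
    if needs_reasoning_or_code:
        reasons.append("complex reasoning, code, or long-form writing")
    if needs_stable_tooling:
        reasons.append("needs production-tested tooling")
    choice = "autoregressive" if reasons else "DiffusionGemma"
    return choice, reasons


# Illustrative inputs only; look up your own GPU's BF16 throughput.
print(recommend_model(bf16_tflops=150))
print(recommend_model(bf16_tflops=150, needs_reasoning_or_code=True))
```

Any single reason tips the recommendation to an autoregressive model, which mirrors how the list is phrased: the DiffusionGemma case only holds when none of the blockers apply.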