Ollama Integration · Work in Progress

Run DiffusionGemma with Ollama

Ollama support is coming via llama.cpp PR #24427. Until then, here's the bridge workaround to run DiffusionGemma through Ollama today.

Current Ollama Support Status

Direct Ollama support is not available yet. DiffusionGemma requires a custom diffusion decoding pipeline that has not been merged into llama.cpp mainline. The key blocker is PR #24427, which adds diffusion model support to llama.cpp and is still unmerged. Once it lands, Ollama (which builds on llama.cpp) can integrate it.

Timeline: weeks to months. The PR is under active review as of June 2026. Follow the PR for updates.

Workaround: Run DiffusionGemma Through Ollama Today

While waiting for official support, you can bridge DiffusionGemma into Ollama using a custom Modelfile that points Ollama at a GGUF converted with the llama.cpp diffusion branch.

Step 1: Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows — download from https://ollama.com/download
# Or use WSL2 + Linux install above

Step 2: Build llama.cpp with Diffusion Support

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
# Fetch the diffusion PR into a local branch and check it out
git fetch origin pull/24427/head:diffusion-support
git checkout diffusion-support

# Configure and build from the repository root
# Drop -DGGML_CUDA=ON if you do not have an NVIDIA GPU
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
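
Before moving on, it can help to confirm that the build actually produced binaries. The helper below is a minimal sketch: `check_llama_build` is a hypothetical name, and it assumes the PR branch still emits `llama-cli` and `llama-server` under `bin/` inside the build directory, as mainline llama.cpp does. Adjust the list if the branch names them differently.

```shell
# Report whether a llama.cpp build directory contains the expected binaries.
# Assumption: the diffusion branch keeps mainline's binary names and layout.
check_llama_build() {
  build_dir=${1:-build}
  missing=0
  for bin in llama-cli llama-server; do
    if [ -x "$build_dir/bin/$bin" ]; then
      echo "ok: $build_dir/bin/$bin"
    else
      echo "missing: $build_dir/bin/$bin" >&2
      missing=1
    fi
  done
  return $missing
}
```

Call it with the path to the build directory, for example `check_llama_build /path/to/llama.cpp/build`. A non-zero exit status means at least one binary was not found.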

Step 3: Download a GGUF Model

# Check /gguf/ for available GGUF downloads (see our GGUF guide for links)
# Place the model in Ollama's model directory
mkdir -p ~/.ollama/models/
# The Modelfile in Step 4 uses a relative FROM path, so keep it in the same directory as the .gguf
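
A truncated or mislabeled download is a common source of load errors, so a quick sanity check on the file is worthwhile. The sketch below relies on the fact that GGUF files start with the four ASCII bytes `GGUF`. `is_gguf` is a hypothetical helper name, and the check says nothing about whether the file carries the diffusion architecture metadata.

```shell
# Succeeds if the file exists and starts with the GGUF magic bytes.
# This only validates the container format, not the model architecture.
is_gguf() {
  [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}
```

For example, `is_gguf ./diffusiongemma-26b-q4_k_m.gguf && echo "looks like GGUF"` prints the message only when the header matches.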

Step 4: Create a Modelfile

# Modelfile for DiffusionGemma
FROM ./diffusiongemma-26b-q4_k_m.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_predict 256

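# Diffusion-specific parameters
# These are not among Ollama's documented PARAMETER names; they assume a
# build with diffusion support and may be rejected otherwise
PARAMETER diffusion_steps 8
PARAMETER block_size 256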

TEMPLATE """{{ .Prompt }}"""
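
Because the `FROM` line above uses a relative path, Ollama resolves it relative to the Modelfile's location. The following sketch checks that the referenced weights exist before you run `ollama create`. `modelfile_weights` is a hypothetical helper, and it assumes `FROM` names a file path rather than a registry model name.

```shell
# Print the resolved path of the weights named in a Modelfile's FROM line,
# or fail if the line is absent or the file does not exist.
# Assumption: FROM points at a local file, not a model name such as "gemma".
modelfile_weights() {
  modelfile=$1
  from=$(awk 'toupper($1) == "FROM" { print $2; exit }' "$modelfile")
  [ -n "$from" ] || { echo "no FROM line in $modelfile" >&2; return 1; }
  from=${from#./}
  case $from in
    /*) path=$from ;;                          # absolute path: use as-is
    *)  path=$(dirname "$modelfile")/$from ;;  # relative to the Modelfile
  esac
  [ -f "$path" ] || { echo "missing weights: $path" >&2; return 1; }
  echo "$path"
}
```

Running `modelfile_weights ./Modelfile` prints the resolved path on success and reports the missing file on stderr otherwise.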

Step 5: Create and Run the Model

ollama create diffusiongemma -f Modelfile
ollama run diffusiongemma

Troubleshooting

"model format not supported" error

This means your GGUF file doesn't include the diffusion architecture flag. Make sure you're using a GGUF converted from the diffusion branch of llama.cpp, not the mainline release.

OOM (Out of Memory) on 16GB GPU

Use Q4_0 or Q4_K_M quantization. At Q4 the 26B model needs about 16 GB of VRAM, which leaves almost no headroom on a 16 GB card, so close other GPU processes first.
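
To see where the ~16 GB figure comes from, here is a rough estimate of weight storage alone. `weights_gb` is a hypothetical helper. The 4.5 bits per weight for Q4_0 follows from its block layout (32 four-bit weights plus one 16-bit scale). The 4.85 figure for Q4_K_M is an approximate, commonly quoted average and should be treated as an assumption. KV cache and activations come on top of these numbers.

```shell
# Estimate weight storage in GB (10^9 bytes) for a quantized model.
# $1 = parameter count in billions, $2 = average bits per weight
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 26 4.5    # Q4_0: prints 14.6
weights_gb 26 4.85   # Q4_K_M (approximate bits per weight): prints 15.8
```

Either way the weights alone come close to filling a 16 GB card, which is why freeing VRAM from other processes matters.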

Slow generation on CPU-only

DiffusionGemma's 4x speed advantage only applies with GPU acceleration. On CPU-only, performance will be comparable to or slower than autoregressive models of similar size.

Ollama says "model not found" after create

# Verify the model was created
ollama list

# If it is missing, check that the Modelfile path is correct
ollama create diffusiongemma -f ./Modelfile
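
If you want to script this check, the sketch below scans `ollama list` output for a model name. `has_model` is a hypothetical helper, and it assumes the usual layout: a header row followed by rows whose first column is `name:tag`.

```shell
# Succeeds if the given model name appears in `ollama list` output on stdin.
# Assumption: line 1 is a header and column 1 has the form name:tag.
has_model() {
  awk -v name="$1" '
    NR > 1 { split($1, parts, ":"); if (parts[1] == name) found = 1 }
    END { exit found ? 0 : 1 }
  '
}
```

For example, `ollama list | has_model diffusiongemma || ollama create diffusiongemma -f ./Modelfile` re-creates the model only when it is not registered.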