llama.cpp · PR #24427
Run DiffusionGemma with llama.cpp
llama.cpp is one of the most widely used local LLM runtimes. DiffusionGemma support is still in progress in PR #24427; here's how to build llama.cpp from the PR branch and run inference today.
Build llama.cpp with Diffusion Support
Prerequisites
# Ubuntu / Debian
sudo apt install build-essential cmake git
# macOS
xcode-select --install
brew install cmake
# Verify cmake
cmake --version # needs 3.18+
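If you would rather have a script check the version for you, the comparison can be sketched as below. This is a minimal sketch: `version_ge`, `cmake_version`, and `check_cmake` are hypothetical helper names, and the 3.18 minimum is taken from the comment above.

```shell
# Return 0 if dotted version $1 >= dotted version $2 (numeric parts only).
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        len = (n > m) ? n : m
        for (i = 1; i <= len; i++) {
            xi = (i <= n) ? x[i] + 0 : 0
            yi = (i <= m) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Print the installed cmake version, or nothing if cmake is not on PATH.
# Assumes the first line of output looks like "cmake version X.Y.Z".
cmake_version() {
    cmake --version 2>/dev/null | awk 'NR == 1 { print $3 }'
}

# check_cmake [MINIMUM]: report whether cmake meets the minimum (default 3.18).
check_cmake() {
    required="${1:-3.18}"
    found="$(cmake_version)"
    if [ -z "$found" ]; then
        echo "cmake not found on PATH"
        return 1
    fi
    if version_ge "$found" "$required"; then
        echo "cmake $found is new enough (>= $required)"
    else
        echo "cmake $found is older than $required"
        return 1
    fi
}
```

Run `check_cmake` after sourcing the snippet; it returns a non-zero status when cmake is missing or too old, so it can gate the rest of a setup script.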
Clone & Checkout PR Branch
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git fetch origin pull/24427/head:diffusion-support
git checkout diffusion-support
Build with CUDA (NVIDIA GPU)
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
# Verify
./bin/llama-cli --version
Build with Metal (Apple Silicon)
mkdir build && cd build
cmake .. -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
make -j$(sysctl -n hw.logicalcpu)
Build CPU-only (fallback)
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc) # on macOS, use $(sysctl -n hw.logicalcpu) instead of $(nproc)
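The three build variants above differ only in the backend flag and in how the job count is obtained, so they can be folded into one helper. The sketch below is one possible way to do that. `job_count`, `backend_flag`, and `build_llama` are hypothetical names, and treating "nvcc is on PATH" as "build with CUDA" is an assumption that may not suit every machine.

```shell
# Print the number of parallel build jobs. nproc is not available on a stock
# macOS install, so fall back to sysctl and finally to 1.
job_count() {
    nproc 2>/dev/null || sysctl -n hw.logicalcpu 2>/dev/null || echo 1
}

# Print the cmake backend flag: CUDA if nvcc is on PATH, Metal on macOS,
# and an empty string for a CPU-only build.
backend_flag() {
    if command -v nvcc >/dev/null 2>&1; then
        echo "-DGGML_CUDA=ON"
    elif [ "$(uname -s)" = "Darwin" ]; then
        echo "-DGGML_METAL=ON"
    else
        echo ""
    fi
}

# Configure and build from the repository root. The body runs in a subshell,
# so the caller's working directory is left unchanged.
build_llama() (
    mkdir -p build && cd build || exit 1
    # backend_flag is deliberately unquoted so an empty flag adds no argument.
    cmake .. $(backend_flag) -DCMAKE_BUILD_TYPE=Release &&
        make -j"$(job_count)"
)
```

Call `build_llama` from the llama.cpp checkout. Check what `backend_flag` prints first if you want to confirm which backend was detected.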
Run Inference
Basic text generation
# Run from the repository root (cd .. if you are still inside build/)
./build/bin/llama-cli \
-m diffusiongemma-26b-q4_k_m.gguf \
-p "Explain how diffusion models work for text generation:" \
-n 256 \
--diffusion-steps 8 \
--temp 0.7
Interactive chat mode
./build/bin/llama-cli \
-m diffusiongemma-26b-q4_k_m.gguf \
--interactive \
--diffusion-steps 8 \
--temp 0.7
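A small wrapper can catch the two most common mistakes, a wrong model path and a missing binary, before llama-cli is launched. The following is a sketch under assumptions: `run_prompt` and the `LLAMA_CLI` variable are hypothetical names, and the flags are the same ones used in the commands above (`--diffusion-steps` comes from the PR branch and may change before it is merged).

```shell
# Assumed location of the binary when working from the repository root.
LLAMA_CLI="${LLAMA_CLI:-./build/bin/llama-cli}"

# run_prompt MODEL PROMPT [STEPS]
# Checks that the model and the binary exist, then runs one generation
# with 256 tokens and a default of 8 diffusion steps.
run_prompt() {
    model="$1"
    prompt="$2"
    steps="${3:-8}"
    if [ ! -f "$model" ]; then
        echo "model file not found: $model" >&2
        return 1
    fi
    if [ ! -x "$LLAMA_CLI" ]; then
        echo "llama-cli not found at $LLAMA_CLI (build it first)" >&2
        return 1
    fi
    "$LLAMA_CLI" -m "$model" -p "$prompt" -n 256 \
        --diffusion-steps "$steps" --temp 0.7
}
```

For example, `run_prompt diffusiongemma-26b-q4_k_m.gguf "Explain diffusion models:" 4` runs the same generation with four steps instead of eight, which makes it easy to compare output quality across step counts.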
Performance Expectations
| Hardware | Quantization | Tokens/s (est.) |
|---|---|---|
| RTX 4090 | Q4_K_M | 500–700 |
| RTX 3090 | Q4_K_M | 300–500 |
| M3 Max (Metal) | Q4_0 | 100–200 |
| CPU (8+ cores) | Q4_0 | 5–15 |
Note: These are rough estimates from community testing, not controlled benchmarks. The diffusion speed advantage is most visible on high-end GPUs; on CPU-only builds the 4x speed benefit is lost.
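Because the figures above are estimates, it is worth working out the throughput on your own hardware. The arithmetic is just tokens divided by elapsed seconds, as in this sketch (`tokens_per_sec` and `seconds_for` are hypothetical helper names):

```shell
# tokens_per_sec TOKENS SECONDS
# Prints throughput with one decimal place; fails if SECONDS is not positive.
tokens_per_sec() {
    awk -v t="$1" -v s="$2" 'BEGIN {
        if (s <= 0) exit 1
        printf "%.1f\n", t / s
    }'
}

# seconds_for TOKENS RATE
# Prints the expected generation time at RATE tokens per second.
seconds_for() {
    awk -v t="$1" -v r="$2" 'BEGIN {
        if (r <= 0) exit 1
        printf "%.2f\n", t / r
    }'
}

# Example usage, timing a 256-token run by wall clock:
#   start=$(date +%s)
#   ./build/bin/llama-cli -m model.gguf -p "..." -n 256 --diffusion-steps 8
#   end=$(date +%s)
#   tokens_per_sec 256 $((end - start))
```

As a worked example, `tokens_per_sec 256 0.5` prints `512.0`, and `seconds_for 256 10` prints `25.60`, so a 256-token reply at a CPU-class 10 tokens/s takes about 26 seconds. Wall-clock timing includes model loading, so it understates the steady-state rate, especially for short generations.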
Troubleshooting
"Unsupported model architecture" error
You built from the llama.cpp main branch, which does not include DiffusionGemma support yet. Check out the diffusion-support branch fetched from PR #24427 and rebuild.
Build fails with "cublas not found"
# Verify CUDA toolkit is installed
nvcc --version
# Set CUDA path if needed
export CUDA_PATH=/usr/local/cuda
cmake .. -DGGML_CUDA=ON -DCUDAToolkit_ROOT="$CUDA_PATH"
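If you are not sure where the toolkit is installed, the lookup can be scripted. This is a sketch: `find_cuda_root` is a hypothetical helper, and /usr/local/cuda is only the conventional Linux install location, so the toolkit may live elsewhere on your system.

```shell
# Print the CUDA toolkit root directory, or return 1 if none is found.
# Search order: $CUDA_PATH, then nvcc on PATH, then /usr/local/cuda.
find_cuda_root() {
    if [ -n "${CUDA_PATH:-}" ] && [ -x "${CUDA_PATH}/bin/nvcc" ]; then
        echo "$CUDA_PATH"
    elif command -v nvcc >/dev/null 2>&1; then
        # nvcc lives in <root>/bin, so strip the last two path components.
        dirname "$(dirname "$(command -v nvcc)")"
    elif [ -x /usr/local/cuda/bin/nvcc ]; then
        echo /usr/local/cuda
    else
        return 1
    fi
}

# Example usage from inside build/:
#   root="$(find_cuda_root)" && cmake .. -DGGML_CUDA=ON -DCUDAToolkit_ROOT="$root"
```

If the function returns a non-zero status, no nvcc was found in any of the three places, which usually means the CUDA toolkit (not just the driver) still needs to be installed.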
Segfault on first token
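The GGUF file may be corrupted or produced by an incompatible converter version. Re-download it from the official source, or re-convert it with the converter script from the same branch you built (current llama.cpp ships it as convert_hf_to_gguf.py rather than convert.py).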
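A quick first check for a bad download is the file header: GGUF files begin with the four ASCII bytes `GGUF`. The sketch below tests only that magic value (`is_gguf` is a hypothetical helper name). Passing it does not prove the file is intact, since a truncated download still has a valid header.

```shell
# Return 0 if FILE exists and starts with the 4-byte GGUF magic.
is_gguf() {
    [ -f "$1" ] || return 1
    # Strip NUL bytes so the shell does not warn about binary data.
    magic="$(head -c 4 "$1" 2>/dev/null | tr -d '\000')"
    [ "$magic" = "GGUF" ]
}

# Example usage:
#   is_gguf diffusiongemma-26b-q4_k_m.gguf || echo "not a GGUF file, re-download it"
```

If the header check passes but the segfault persists, compare the file size against the size listed at the download source before re-converting.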