llama.cpp · PR #24427

Run DiffusionGemma with llama.cpp

llama.cpp is one of the most widely used local LLM runtimes. DiffusionGemma support is still in progress in PR #24427, so it is not on the main branch yet. Here's how to build from the PR branch and run inference today.

Build llama.cpp with Diffusion Support

Prerequisites

# Ubuntu / Debian
sudo apt install build-essential cmake git

# macOS
xcode-select --install
brew install cmake

# Verify cmake
cmake --version  # needs 3.18+
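If you would rather script the version check than read it by eye, here is a minimal sketch. The helper name version_ge is made up for this example, and it assumes your sort supports -V (GNU coreutils and recent macOS do).

```shell
# Succeeds if version $1 is greater than or equal to version $2.
# version_ge is a hypothetical helper name; relies on `sort -V`.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

if command -v cmake >/dev/null 2>&1; then
  # First line of output looks like: "cmake version 3.28.3"
  have=$(cmake --version | awk 'NR==1 {print $3}')
  if version_ge "$have" 3.18; then
    echo "cmake $have is new enough"
  else
    echo "cmake $have is too old; need 3.18+"
  fi
else
  echo "cmake not found on PATH"
fi
```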

Clone & Checkout PR Branch

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git fetch origin pull/24427/head:diffusion-support
git checkout diffusion-support
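Before building, it can help to confirm the checkout actually landed on the PR branch. A small sketch, assuming the local branch name diffusion-support used above; current_branch is a helper name invented for this example.

```shell
# Print the branch checked out in directory $1 (default: current directory).
# Prints nothing and fails on a detached HEAD or outside a git repository.
current_branch() {
  git -C "${1:-.}" symbolic-ref --short HEAD 2>/dev/null
}

# Run from inside the llama.cpp clone.
if [ "$(current_branch)" = "diffusion-support" ]; then
  echo "on the PR branch"
else
  echo "not on diffusion-support; run: git checkout diffusion-support"
fi
```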

Build with CUDA (NVIDIA GPU)

mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# Verify
./bin/llama-cli --version

Build with Metal (Apple Silicon)

mkdir build && cd build
cmake .. -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
make -j$(sysctl -n hw.logicalcpu)

Build CPU-only (fallback)

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
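The three builds differ only in one CMake flag and in how the core count is read. Note that nproc is not available on a stock macOS install, so the CPU-only commands need a substitute there. The sketch below prints the commands for the current machine instead of running them. The helper names are hypothetical, and detecting CUDA by looking for nvcc on PATH is a heuristic, not something the PR prescribes.

```shell
# Number of parallel build jobs; getconf works on both Linux and macOS.
job_count() {
  getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

# Pick the backend flag used in the build sections above.
backend_flag() {
  if command -v nvcc >/dev/null 2>&1; then
    echo "-DGGML_CUDA=ON"          # NVIDIA toolkit present
  elif [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
    echo "-DGGML_METAL=ON"         # Apple Silicon
  else
    echo ""                        # CPU-only: no extra flag
  fi
}

# Print (do not run) the commands to execute inside build/.
echo "cmake .. $(backend_flag) -DCMAKE_BUILD_TYPE=Release"
echo "make -j$(job_count)"
```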

Run Inference

Basic text generation

# Run from the repository root (cd .. if you are still inside build/)
./build/bin/llama-cli \
  -m diffusiongemma-26b-q4_k_m.gguf \
  -p "Explain how diffusion models work for text generation:" \
  -n 256 \
  --diffusion-steps 8 \
  --temp 0.7

Interactive chat mode

./build/bin/llama-cli \
  -m diffusiongemma-26b-q4_k_m.gguf \
  --interactive \
  --diffusion-steps 8 \
  --temp 0.7
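Both commands use the path ./build/bin/llama-cli, which only resolves from the repository root. The build steps leave you inside build/, where the binary is at ./bin/llama-cli instead. A small sketch that finds the binary from either directory; find_llama_cli is a name invented for this example.

```shell
# Print the path to llama-cli relative to the current directory, or fail.
find_llama_cli() {
  for candidate in ./build/bin/llama-cli ./bin/llama-cli; do
    if [ -x "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  echo "llama-cli not found; build it first or cd to the repository root" >&2
  return 1
}

# Usage: LLAMA_CLI=$(find_llama_cli) && "$LLAMA_CLI" --version
```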

Performance Expectations

Hardware          Quantization   Tokens/s (est.)
RTX 4090          Q4_K_M         500–700
RTX 3090          Q4_K_M         300–500
M3 Max (Metal)    Q4_0           100–200
CPU (8+ cores)    Q4_0           5–15

Note: These are rough estimates from community testing, not benchmarks. The diffusion speed advantage is most visible on high-end GPUs; CPU-only inference loses the 4x benefit.
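To compare your own hardware against these estimates, divide the number of generated tokens by the elapsed time. A minimal sketch of that arithmetic; tokens_per_sec is a hypothetical helper, and the 256 tokens in 0.5 s are made-up inputs, not a measurement.

```shell
# tokens_per_sec TOKENS SECONDS -> throughput with one decimal place
tokens_per_sec() {
  awk -v n="$1" -v s="$2" 'BEGIN { if (s > 0) printf "%.1f\n", n / s; else print "n/a" }'
}

# Illustrative numbers only: 256 tokens in 0.5 s of wall-clock time
tokens_per_sec 256 0.5   # prints 512.0
```

If you time the whole llama-cli invocation, the elapsed time also includes model loading, so short runs will understate the real generation speed.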

Troubleshooting

"Unsupported model architecture" error

You're on the llama.cpp main branch, which does not include DiffusionGemma support yet. Check out the diffusion-support branch from PR #24427 and rebuild.

Build fails with "cublas not found"

# Verify CUDA toolkit is installed
nvcc --version
# Set CUDA path if needed
export CUDA_PATH=/usr/local/cuda
cmake .. -DGGML_CUDA=ON -DCUDAToolkit_ROOT=$CUDA_PATH
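If you are not sure where the toolkit lives, the sketch below looks for bin/nvcc under a few candidate roots and prints the matching cmake command. find_cuda_root is a helper invented for this example. /usr/local/cuda comes from the commands above, but /opt/cuda is only a guess at a common alternative, so adjust the list for your distribution.

```shell
# Print a CUDA toolkit root that contains bin/nvcc, or fail.
# Candidate list is an assumption; CUDA_PATH is checked first if set.
find_cuda_root() {
  for root in "${CUDA_PATH:-}" /usr/local/cuda /opt/cuda; do
    if [ -n "$root" ] && [ -x "$root/bin/nvcc" ]; then
      echo "$root"
      return 0
    fi
  done
  return 1
}

if root=$(find_cuda_root); then
  echo "cmake .. -DGGML_CUDA=ON -DCUDAToolkit_ROOT=$root"
else
  echo "no CUDA toolkit found; install it or set CUDA_PATH"
fi
```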

Segfault on first token

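The GGUF file might be corrupted or produced by an incompatible converter version. Re-download it from the official source, or re-convert it with the converter script from the same PR branch (convert_hf_to_gguf.py in current llama.cpp trees; the older convert.py has been retired).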
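A quick sanity check before re-downloading: every GGUF file starts with the four ASCII bytes GGUF. The sketch below tests only that header. It catches truncated downloads or an HTML error page saved under a .gguf name, but it cannot detect a converter mismatch. is_gguf is a helper name made up for this example.

```shell
# Succeeds if the file is readable and starts with the 4-byte GGUF magic.
is_gguf() {
  [ -r "$1" ] && [ "$(head -c 4 "$1" | tr -d '\0')" = "GGUF" ]
}

model=diffusiongemma-26b-q4_k_m.gguf
if is_gguf "$model"; then
  echo "$model has a GGUF header"
else
  echo "$model is missing or not a GGUF file"
fi
```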