DiffusionGemma GGUF Guide
GGUF is the standard format for running LLMs locally with llama.cpp and Ollama. Download pre-converted files or convert from PyTorch yourself.
Pre-Converted GGUF Downloads
Community-converted GGUF files are available on Hugging Face. The most common quantization levels are:
| Quant | Size | VRAM Needed | Quality |
|---|---|---|---|
| Q4_0 | ~14GB | 16GB | Good — small quality loss |
| Q4_K_M | ~15GB | 16–18GB | Best balance — recommended |
| Q5_K_M | ~17GB | 20GB | Better quality, needs more VRAM |
| Q8_0 | ~26GB | 28GB+ | Near-perfect, but too large for a single 24GB card such as a 3090/4090 |
Recommended: Start with Q4_K_M. It needs 16–18GB of VRAM, so it fits on a 24GB GPU with headroom to spare, and it offers the best quality-to-size ratio in the table. On a 16GB GPU, use Q4_0.
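To check a specific card against the table, the VRAM column can be written down as a small lookup. This is only a sketch: the function names `min_vram_gb` and `fits_quant` are made up for illustration, and the thresholds copy the upper bound of each "VRAM Needed" cell above, not measured values.

```shell
#!/usr/bin/env bash
# Minimum VRAM in GB per quant, copied from the table above
# (upper bound of the "VRAM Needed" column; estimates, not measurements).
min_vram_gb() {
    case "$1" in
        Q4_0)   echo 16 ;;
        Q4_K_M) echo 18 ;;
        Q5_K_M) echo 20 ;;
        Q8_0)   echo 28 ;;
        *)      return 1 ;;   # unknown quant name
    esac
}

# fits_quant QUANT VRAM_GB -> exit status 0 if the quant fits, 1 otherwise
fits_quant() {
    local need
    need=$(min_vram_gb "$1") || return 1
    [ "$2" -ge "$need" ]
}
```

For example, `fits_quant Q4_K_M 24 && echo "fits"` prints `fits`, while `fits_quant Q8_0 24` returns a non-zero status.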
Convert from PyTorch to GGUF
Step 1: Download the Original Model
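# Option A: download with the Hugging Face CLI
huggingface-cli download google/diffusiongemma-26b-A4B-it \
  --local-dir ./diffusiongemma-26b
# Option B: clone with git-lfs into the same directory name, so the later steps match
git lfs install
git clone https://huggingface.co/google/diffusiongemma-26b-A4B-it diffusiongemma-26b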
Step 2: Convert to FP16 GGUF
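# Assumes the llama.cpp checkout sits next to ./diffusiongemma-26b
cd llama.cpp
# Install the Python dependencies the conversion script needs
pip install -r requirements.txt
# Write an unquantized FP16 GGUF into the llama.cpp directory
python convert_hf_to_gguf.py \
  ../diffusiongemma-26b \
  --outtype f16 \
  --outfile diffusiongemma-26b-f16.gguf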
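The FP16 file is much larger than the quantized files in the table, so it is worth checking free disk space before converting. The sketch below assumes roughly 26 billion parameters (inferred from the model name, not stated in this guide) at 2 bytes per weight, which gives about 52GB. The helper names are hypothetical.

```shell
#!/usr/bin/env bash
# free_gib DIR -> free space in whole GiB on the filesystem holding DIR
free_gib() {
    df -Pk "$1" | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}

# f16_size_gb PARAMS_IN_BILLIONS -> approximate F16 size in GB (2 bytes per weight)
f16_size_gb() {
    echo $(( $1 * 2 ))
}

# check_space DIR PARAMS_IN_BILLIONS -> status 0 if DIR has room for the F16 file.
# Compares GiB free against GB needed, which errs slightly on the safe side.
check_space() {
    [ "$(free_gib "$1")" -ge "$(f16_size_gb "$2")" ]
}
```

Running `check_space . 26 || echo "not enough disk space"` from the llama.cpp directory flags the problem before the conversion starts.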
Step 3: Quantize to 4-bit
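# Requires a built llama.cpp (the binary lives in build/bin).
# Arguments: input GGUF, output GGUF, quantization type
./build/bin/llama-quantize \
  diffusiongemma-26b-f16.gguf \
  diffusiongemma-26b-q4_k_m.gguf \
  Q4_K_M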
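The same FP16 file can be reused to produce several quantization levels. A minimal sketch, assuming the `llama-quantize` path used above; the helpers `quant_outfile` and `quantize_all` are hypothetical names, and the output naming just mirrors the `-q4_k_m.gguf` pattern from this step.

```shell
#!/usr/bin/env bash
# quant_outfile BASE TYPE -> output name with the type lowercased,
# e.g. BASE-q4_k_m.gguf
quant_outfile() {
    printf '%s-%s.gguf\n' "$1" "$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
}

# quantize_all F16_FILE BASE TYPE... -> run llama-quantize once per type,
# stopping at the first failure
quantize_all() {
    local src=$1 base=$2 qtype
    shift 2
    for qtype in "$@"; do
        ./build/bin/llama-quantize "$src" "$(quant_outfile "$base" "$qtype")" "$qtype" || return 1
    done
}
```

For example, `quantize_all diffusiongemma-26b-f16.gguf diffusiongemma-26b Q4_0 Q4_K_M Q5_K_M` writes one file per listed type.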
Step 4: Verify the GGUF
# Smoke test: load the quantized model and generate 32 tokens from a short prompt
./build/bin/llama-cli \
  -m diffusiongemma-26b-q4_k_m.gguf \
  -p "Hello world" \
  -n 32
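A cheaper check can run before loading the model at all: every GGUF file begins with the four ASCII bytes `GGUF`. The sketch below only tests that magic value, so it catches truncated or mislabeled downloads but says nothing about whether the tensors are valid. The function name `is_gguf` is hypothetical.

```shell
#!/usr/bin/env bash
# is_gguf FILE -> status 0 if FILE exists and starts with the 4-byte magic "GGUF"
is_gguf() {
    [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}
```

For example, `is_gguf diffusiongemma-26b-q4_k_m.gguf || echo "not a GGUF file"`.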
Compatible Engines
Support for these GGUF files varies by inference engine:
| Engine | Status | Notes |
|---|---|---|
| llama.cpp | Works (PR branch) | PR #24427, diffusion-support branch |
| Ollama | Coming soon | Depends on llama.cpp PR merge |
| LM Studio | Coming soon | Depends on llama.cpp upstream |
| GPT4All | Not yet | No timeline |
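Since llama.cpp support is listed as living in a PR branch, the binaries used in the steps above have to be built from that PR. One way to do this is sketched below. It assumes the PR is filed against the `ggml-org/llama.cpp` repository on GitHub, and it uses `diffusion-support` only as the local branch name. The helper names are hypothetical.

```shell
#!/usr/bin/env bash
# pr_refspec PR_NUMBER LOCAL_BRANCH -> refspec that fetches a GitHub PR head
# into a local branch
pr_refspec() {
    printf 'pull/%s/head:%s\n' "$1" "$2"
}

# build_pr PR_NUMBER LOCAL_BRANCH -> clone llama.cpp, check out the PR, build it.
# The body runs in a subshell so the caller's working directory is unchanged.
build_pr() (
    git clone https://github.com/ggml-org/llama.cpp &&
    cd llama.cpp &&
    git fetch origin "$(pr_refspec "$1" "$2")" &&
    git checkout "$2" &&
    cmake -B build &&
    cmake --build build --config Release
)
```

Calling `build_pr 24427 diffusion-support` should leave the tools under `llama.cpp/build/bin`. For NVIDIA GPUs, adding `-DGGML_CUDA=ON` to the first `cmake` call enables the CUDA backend.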