DiffusionGemma GGUF Guide

GGUF is the standard format for running LLMs locally with llama.cpp and Ollama. Download pre-converted files or convert from PyTorch yourself.

Pre-Converted GGUF Downloads

Community-converted GGUF files are available on Hugging Face. The table below lists the most common quantization levels, with approximate file sizes and the VRAM each one needs:

Quant	Size	VRAM Needed	Quality
Q4_0	~14GB	16GB	Good — small quality loss
Q4_K_M	~15GB	16–18GB	Best balance — recommended
Q5_K_M	~17GB	20GB	Better quality, needs more VRAM
Q8_0	~26GB	28GB+	Near-perfect; exceeds a single 24GB 3090/4090, so needs multi-GPU or CPU offload

Recommended: Start with Q4_K_M. It needs 16–18GB of VRAM, so it fits on a 24GB GPU with room to spare, and it offers the best quality-to-size ratio. If you have a 16GB GPU, use Q4_0.
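
The table can also be read mechanically. The sketch below lists every quant whose "VRAM Needed" figure fits a given card. `quants_that_fit` is a hypothetical helper, not part of llama.cpp, and it assumes the upper bound of each range in the table (18GB for Q4_K_M, 28GB for Q8_0).

```shell
#!/usr/bin/env bash
# quants_that_fit: hypothetical helper, not a llama.cpp tool.
# Each row is "quant:VRAM needed in GB", copied from the table in this guide
# (upper bound of each range).
quants_that_fit() {
  local vram_gb=$1 row name need
  for row in Q4_0:16 Q4_K_M:18 Q5_K_M:20 Q8_0:28; do
    name=${row%%:*}   # text before the colon
    need=${row##*:}   # text after the colon
    if [ "$vram_gb" -ge "$need" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# List the options for a 24GB card, one per line.
quants_that_fit 24
```

A larger quant that technically fits leaves less memory for context, so Q4_K_M remains a reasonable starting point even when Q5_K_M appears in the list.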

Convert from PyTorch to GGUF

Step 1: Download the Original Model

# From HuggingFace
huggingface-cli download google/diffusiongemma-26b-A4B-it \
  --local-dir ./diffusiongemma-26b

# Or use git-lfs (clone into the directory name that Step 2 expects)
git lfs install
git clone https://huggingface.co/google/diffusiongemma-26b-A4B-it ./diffusiongemma-26b

Step 2: Convert to FP16 GGUF

# Run from the llama.cpp checkout; assumes it sits next to ./diffusiongemma-26b
cd llama.cpp
python convert_hf_to_gguf.py \
  ../diffusiongemma-26b \
  --outtype f16 \
  --outfile diffusiongemma-26b-f16.gguf
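
The FP16 file is much larger than any of the quantized ones, so it can help to check free disk space before converting. This is a rough sketch: `f16_size_gb` is a hypothetical helper, the 26B parameter count is an assumption taken from the "26b" in the model name, and the estimate counts 2 bytes per weight while ignoring metadata.

```shell
#!/usr/bin/env bash
# f16_size_gb: hypothetical helper. Rough FP16 size in GB for a model with
# the given parameter count in billions (2 bytes per weight, metadata ignored).
f16_size_gb() {
  echo $(( $1 * 2 ))
}

# Assumption: about 26B parameters, inferred from the model name.
need_gb=$(f16_size_gb 26)

# Free space in the current directory, converted from 1K blocks to whole GB.
free_gb=$(df -Pk . | awk 'NR==2 { print int($4 / 1024 / 1024) }')

if [ "$free_gb" -lt "$need_gb" ]; then
  echo "warning: about ${need_gb}GB needed for the FP16 file, ${free_gb}GB free" >&2
fi
```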

Step 3: Quantize to 4-bit

./build/bin/llama-quantize \
  diffusiongemma-26b-f16.gguf \
  diffusiongemma-26b-q4_k_m.gguf \
  Q4_K_M
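
The same FP16 file can be quantized to each level from the download table in one pass. The loop below is a sketch under the naming used in this guide; `quant_outfile` is a hypothetical helper, and the loop only prints what it would do when the binary or the FP16 file is missing.

```shell
#!/usr/bin/env bash
# quant_outfile: hypothetical helper that derives the output name used in this guide,
# e.g. diffusiongemma-26b-f16.gguf + Q4_K_M -> diffusiongemma-26b-q4_k_m.gguf
quant_outfile() {
  local src=$1 qtype=$2 suffix
  suffix=$(printf '%s' "$qtype" | tr '[:upper:]' '[:lower:]')
  printf '%s-%s.gguf\n' "${src%-f16.gguf}" "$suffix"
}

src=diffusiongemma-26b-f16.gguf
for qtype in Q4_0 Q4_K_M Q5_K_M Q8_0; do
  out=$(quant_outfile "$src" "$qtype")
  # Only quantize when the binary and the FP16 file are actually present.
  if [ -x ./build/bin/llama-quantize ] && [ -f "$src" ]; then
    ./build/bin/llama-quantize "$src" "$out" "$qtype"
  else
    echo "would run: llama-quantize $src $out $qtype"
  fi
done
```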

Step 4: Verify the GGUF

./build/bin/llama-cli \
  -m diffusiongemma-26b-q4_k_m.gguf \
  -p "Hello world" \
  -n 32
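
Before loading the model, a cheaper check is to confirm the file starts with the 4-byte GGUF magic (`GGUF`), which catches truncated or mislabeled downloads. `is_gguf` is a hypothetical helper, not a llama.cpp tool, and it checks only the magic bytes, not the rest of the header or the tensors.

```shell
#!/usr/bin/env bash
# is_gguf: hypothetical helper. Succeeds if the file begins with the ASCII bytes "GGUF".
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

f=diffusiongemma-26b-q4_k_m.gguf
if [ -f "$f" ]; then
  if is_gguf "$f"; then
    echo "$f: GGUF magic found"
  else
    echo "$f: not a GGUF file" >&2
  fi
fi
```

A file that passes this check can still fail to load, so the `llama-cli` run above remains the real test.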

Compatible Engines

Support for DiffusionGemma GGUF files varies by inference engine:

Engine	Status	Notes
llama.cpp	Works (PR branch)	PR #24427, diffusion-support branch
Ollama	Coming soon	Depends on llama.cpp PR merge
LM Studio	Coming soon	Depends on llama.cpp upstream
GPT4All	Not yet	No timeline