DiffusionGemma GGUF Guide

GGUF is the model file format used by llama.cpp and by tools built on it, such as Ollama. You can either download a pre-converted GGUF file or convert the original PyTorch checkpoint yourself.

Pre-Converted GGUF Downloads

Community-converted GGUF files are available on Hugging Face. The most common quantization levels are:

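| Quant | File size | VRAM needed | Quality |
|---|---|---|---|
| Q4_0 | ~14GB | 16GB | Good; small quality loss |
| Q4_K_M | ~15GB | 16–18GB | Best balance; recommended |
| Q5_K_M | ~17GB | 20GB | Better quality, needs more VRAM |
| Q8_0 | ~26GB | 28GB+ | Near-lossless; exceeds a single 24GB card such as a 3090/4090 |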

Recommended: Start with Q4_K_M. It fits on a 24GB GPU with room to spare and offers the best quality-to-size ratio of the options above. On a 16GB GPU, use Q4_0 instead.
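The recommendation above can be turned into a small check against the GPU actually installed. This is a minimal sketch, not an official tool: the function names `recommend_quant` and `mib_to_gb` are made up for illustration, and the 18GB and 16GB cut-offs are assumptions read off the "VRAM needed" column (taking the upper end of the 16–18GB range for Q4_K_M).

```shell
# Map total VRAM (whole GB) to the quantization suggested in this guide.
# Thresholds are assumptions taken from the table, not measured values.
recommend_quant() {
  if [ "$1" -ge 18 ]; then
    echo "Q4_K_M"
  elif [ "$1" -ge 16 ]; then
    echo "Q4_0"
  else
    echo "none"
  fi
}

# nvidia-smi reports memory in MiB; round to the nearest whole GB.
mib_to_gb() {
  echo $(( ($1 + 512) / 1024 ))
}

# Only query the GPU if nvidia-smi is installed (first GPU only).
if command -v nvidia-smi >/dev/null 2>&1; then
  mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
  echo "Suggested quant: $(recommend_quant "$(mib_to_gb "$mib")")"
fi
```

A result of `none` means none of the listed files is expected to fit entirely in VRAM on that card.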

Convert from PyTorch to GGUF

Step 1: Download the Original Model

# From HuggingFace
huggingface-cli download google/diffusiongemma-26b-A4B-it \
  --local-dir ./diffusiongemma-26b

# Or use git-lfs (clone into the directory name the later steps expect)
git lfs install
git clone https://huggingface.co/google/diffusiongemma-26b-A4B-it diffusiongemma-26b
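Before converting, it can help to confirm the download is complete. The sketch below assumes the checkpoint follows the usual Hugging Face layout (a `config.json` plus one or more `.safetensors` shards), which this guide does not state explicitly. The function name `check_model_dir` and the 1024-byte cut-off for spotting git-lfs pointer stubs are illustrative choices.

```shell
# Return 0 if the directory looks like a fully downloaded checkpoint.
# Assumes config.json plus *.safetensors weights (typical HF layout).
check_model_dir() {
  dir=$1
  if [ ! -f "$dir/config.json" ]; then
    echo "missing config.json in $dir" >&2
    return 1
  fi
  found=0
  for f in "$dir"/*.safetensors; do
    [ -f "$f" ] || continue
    size=$(( $(wc -c < "$f") ))
    # A clone made without git-lfs leaves tiny text stubs instead of weights.
    if [ "$size" -lt 1024 ]; then
      echo "$f is only $size bytes; likely a git-lfs pointer" >&2
      return 1
    fi
    found=1
  done
  if [ "$found" -eq 0 ]; then
    echo "no .safetensors files in $dir" >&2
    return 1
  fi
  return 0
}

if check_model_dir ./diffusiongemma-26b; then
  echo "checkpoint looks complete"
fi
```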

Step 2: Convert to FP16 GGUF

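# Use a llama.cpp checkout that supports this model (see Compatible Engines)
cd llama.cpp
# Install the converter's Python dependencies (once)
pip install -r requirements.txt
python convert_hf_to_gguf.py \
  ../diffusiongemma-26b \
  --outtype f16 \
  --outfile diffusiongemma-26b-f16.gguf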
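The FP16 intermediate file is much larger than any of the quantized files, so it is worth checking free disk space first. FP16 stores two bytes per weight, which gives roughly 2GB per billion parameters. The sketch below assumes a parameter count of 26 billion, inferred from the model name rather than stated in this guide; `f16_size_gb` and `free_gb` are hypothetical helper names.

```shell
# FP16 uses 2 bytes per weight: size in GB is about 2 x parameters (billions).
f16_size_gb() {
  echo $(( $1 * 2 ))
}

# Free space, in whole GB, on the filesystem that holds the given path.
free_gb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

need=$(f16_size_gb 26)   # 26 is read off the model name (an assumption)
have=$(free_gb .)
if [ "$have" -lt "$need" ]; then
  echo "Only ${have}GB free; the FP16 GGUF needs roughly ${need}GB" >&2
else
  echo "OK: ${have}GB free, roughly ${need}GB needed"
fi
```

The estimate ignores metadata and the quantized output, so some headroom beyond the reported figure is advisable.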

Step 3: Quantize to 4-bit

# Requires a built llama.cpp; the binary lives in build/bin after a CMake build
./build/bin/llama-quantize \
  diffusiongemma-26b-f16.gguf \
  diffusiongemma-26b-q4_k_m.gguf \
  Q4_K_M
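The same FP16 file can be quantized to several levels for comparison. The sketch below only prints the `llama-quantize` commands rather than running them, so they can be reviewed first. The helper name `quantize_cmd` and the output naming scheme (lowercased quant name as a suffix, matching the file name used in Step 3) are illustrative assumptions.

```shell
# Print the llama-quantize command for one quant level.
# Output name: replace the "-f16.gguf" suffix with "-<quant in lowercase>.gguf".
quantize_cmd() {
  src=$1
  quant=$2
  base=${src%-f16.gguf}
  suffix=$(printf '%s' "$quant" | tr '[:upper:]' '[:lower:]')
  printf './build/bin/llama-quantize %s %s-%s.gguf %s\n' \
    "$src" "$base" "$suffix" "$quant"
}

# One command per quantization level listed in this guide.
for q in Q4_0 Q4_K_M Q5_K_M Q8_0; do
  quantize_cmd diffusiongemma-26b-f16.gguf "$q"
done
```

Piping the printed lines to `sh` from the llama.cpp directory would run them in sequence.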

Step 4: Verify the GGUF

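# Load the model and generate 32 tokens; a load error or garbled output
# points to a failed conversion or quantization
./build/bin/llama-cli \
  -m diffusiongemma-26b-q4_k_m.gguf \
  -p "Hello world" \
  -n 32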
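A quicker check that does not load the model is to look at the file header: every GGUF file begins with the four ASCII bytes `GGUF`. The sketch below tests only that magic value, so it catches truncated or mislabeled files but says nothing about whether the weights are usable. The function name `is_gguf` is made up for illustration.

```shell
# Succeed if the file starts with the 4-byte GGUF magic.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

f=diffusiongemma-26b-q4_k_m.gguf
if is_gguf "$f"; then
  echo "$f: GGUF header found"
else
  echo "$f: missing or not a GGUF file" >&2
fi
```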

Compatible Engines

Support for this model's GGUF files varies by inference engine:

| Engine | Status | Notes |
|---|---|---|
| llama.cpp | Works (PR branch) | PR #24427, diffusion-support branch |
| Ollama | Coming soon | Depends on llama.cpp PR merge |
| LM Studio | Coming soon | Depends on llama.cpp upstream |
| GPT4All | Not yet | No timeline |