GGUF Quantization: Q4_K_M, Q5_K_M, Q6_K or Q8_0 — How to Choose Without Wrecking Quality

The practical guide to picking your GGUF quant in 2026: bits per weight, perplexity impact, imatrix, and a VRAM/quality table. A Llama 3.1 8B drops from 32 GB in F32 to 4.9 GB in Q4_K_M.

You download a model from Hugging Face and hit an endless list: Q4_K_M, Q5_K_M, Q6_K, Q8_0, not to mention IQ4_XS and friends like Q3_K_S. Fifteen files of the same model, the largest more than twice the size of the smallest, and zero clear explanation. The result: people grab the biggest one "to be safe," saturate their VRAM, and the model crawls. Or the opposite: they grab the smallest, and the model gets dumb without anyone understanding why.

This article takes the topic apart without the useless jargon. The goal: in 2 minutes, you know exactly which quant to download for your machine and your use case, backed by real numbers.

What does quantizing actually mean?

An LLM stores its weights in 16-bit floating point (FP16/BF16) straight out of training. For Llama 3.1 8B, that's about 16 GB (32 GB in F32). Quantization means reducing the number of bits per weight: 8, 6, 5, 4, even 2 bits. Fewer bits = smaller file, less VRAM, but a loss of precision on each weight.
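
The arithmetic behind those figures is simple enough to sketch. This is a rough estimate only: the 8.03 billion parameter count is an assumption for Llama 3.1 8B, and real files add a little metadata on top.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8.03e9  # assumed parameter count for Llama 3.1 8B

for name, bpw in [("F32", 32), ("FP16/BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name:>9}: {model_size_gb(N_PARAMS, bpw):5.1f} GB")
```

The same function works with fractional bits per weight, which is what real quants actually use.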

The GGUF format (GPT-Generated Unified Format), created by the llama.cpp project, is the single-file standard bundling quantized weights, metadata, and tokenizer. It's what Ollama, LM Studio, and llama.cpp use. If you run local AI, you handle GGUF, full stop.

Bits per weight: don't trust the number in the name

Trap number one: believing "Q4" means exactly 4 bits per weight. Wrong. K-quants (the K in Q4_K_M) use a super-block structure with double quantization and a layer-aware approach: attention and output layers get more bits, feed-forward layers get fewer. So effective bits per weight (bpw) end up non-integer:

  • Q4_K_M: ~4.9 effective bpw
  • Q5_K_M: ~5.7 bpw
  • Q6_K: ~6.6 bpw
  • Q8_0: 8.5 bpw (8-bit values plus one 16-bit scale per 32-weight block; near-lossless)
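
To see where the fractional bits come from, you can divide the bytes in one quantization block by the number of weights it holds. The sketch below assumes the block layouts found in llama.cpp's ggml source at the time of writing (byte counts are my reading of those structs, so check the source if exact figures matter).

```python
# Block layouts as (weights per block, bytes per block).
# Assumption: byte counts follow the block structs in llama.cpp's ggml source.
BLOCKS = {
    "Q4_0": (32, 2 + 16),                  # fp16 scale + 32 x 4-bit values
    "Q8_0": (32, 2 + 32),                  # fp16 scale + 32 x 8-bit values
    "Q4_K": (256, 2 + 2 + 12 + 128),       # fp16 d, fp16 dmin, packed 6-bit scales/mins, 4-bit values
    "Q5_K": (256, 2 + 2 + 12 + 32 + 128),  # same as Q4_K plus one extra high bit per weight
    "Q6_K": (256, 128 + 64 + 16 + 2),      # low 4 bits, high 2 bits, 8-bit scales, fp16 d
}


def bits_per_weight(weights: int, nbytes: int) -> float:
    """Storage cost of one block, expressed in bits per weight."""
    return nbytes * 8 / weights


for name, layout in BLOCKS.items():
    print(f"{name}: {bits_per_weight(*layout):.4f} bpw")
```

A pure Q4_K block comes out at 4.5 bpw. As far as I understand it, the _M mixes store some tensors in a higher-precision type, which is why a Q4_K_M file lands closer to 4.9 bpw overall.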

The _S / _M / _L suffix (Small / Medium / Large) tunes the size/fidelity trade-off at the same nominal bit depth. The allocation strategy matters more than the raw bit count. Killer proof: Q4_K_M (perplexity delta +0.0535 vs FP16) absolutely crushes the old legacy Q4_0 (+0.2499) on a similar bit budget, thanks to a radically smarter allocation. Practical consequence: avoid the legacy Q4_0 / Q4_1 / Q5_0 formats; they have little reason to exist anymore.

The real impact on quality (perplexity)

Perplexity measures how "surprised" the model is by a reference text: lower is better. Taking Q4_K_M as the 100 % baseline, relative quality retention looks like this:

  • Q2_K: ~85 % — sharp drop, avoid unless hardware-desperate
  • Q3_K_M: ~90 %
  • Q4_K_M: 100 % (practical reference)
  • Q5_K_M: ~101.5 %
  • Q6_K: ~102 %
  • Q8_0: ~103 % — virtually indistinguishable from FP16

Q8_0 deserves a special mention: its perplexity increase against FP16 is roughly 0.01 points (e.g. 6.00 → 6.01). It's essentially lossless, and its simple dequantization path makes it fast for CPU inference. But look at the scale: between Q5_K_M and Q8_0, you gain ~1.5 % quality for ~50 % more size. Diminishing returns are brutal at the top end.
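
For reference, perplexity is just the exponential of the average negative log-likelihood per token. Here is a minimal sketch; the log-probabilities are hypothetical values made up for illustration, not measurements from any model.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


# Hypothetical natural-log probabilities assigned to the same five tokens.
fp16_logprobs = [-1.2, -0.4, -2.9, -1.8, -2.6]
quant_logprobs = [-1.3, -0.4, -3.0, -1.8, -2.7]

delta = perplexity(quant_logprobs) - perplexity(fp16_logprobs)
print(f"perplexity delta vs FP16: {delta:+.4f}")
```

A quantized model that assigns slightly lower probability to the right tokens shows up as a positive delta, which is the number quoted in quant comparison tables.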

The perplexity trap: not all use cases are equal

Here's what the tables don't tell you. Perplexity is a statistical average over general text. But quantization doesn't degrade all tasks uniformly: it hits code and math/STEM harder than conversation, and above all it amplifies the model's pre-existing weaknesses rather than creating new ones evenly.

In concrete terms: for general chat assistant work, Q4_K_M goes unnoticed. For code generation or demanding mathematical reasoning, step up to Q5_K_M or Q6_K — the difference becomes tangible on the tricky cases. And always validate on your own prompts: a perplexity benchmark is only a guide, never a guarantee on your real task.
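
Validating on your own prompts does not need a framework. The sketch below is one possible shape for it, assuming you supply a `generate(prompt)` function per quant (for example a thin wrapper around your local runtime). The `fake_q4` and `fake_q6` stand-ins and their answers are invented so the example runs without a model; they say nothing about real model behaviour.

```python
from typing import Callable, Dict, List, Tuple

# Each case pairs a prompt with a check that returns True for an acceptable answer.
Case = Tuple[str, Callable[[str], bool]]


def pass_rate(generate: Callable[[str], str], cases: List[Case]) -> float:
    """Fraction of cases whose generated answer passes its check."""
    passed = sum(1 for prompt, check in cases if check(generate(prompt)))
    return passed / len(cases)


def compare_quants(generators: Dict[str, Callable[[str], str]],
                   cases: List[Case]) -> Dict[str, float]:
    """Run the same cases through every quant and collect pass rates."""
    return {name: pass_rate(gen, cases) for name, gen in generators.items()}


# Hypothetical stand-ins for real model calls (answers are made up).
def fake_q4(prompt: str) -> str:
    return {"2+2=": "4", "17*23=": "381"}.get(prompt, "")


def fake_q6(prompt: str) -> str:
    return {"2+2=": "4", "17*23=": "391"}.get(prompt, "")


cases = [
    ("2+2=", lambda a: a.strip() == "4"),
    ("17*23=", lambda a: a.strip() == "391"),
]
results = compare_quants({"Q4_K_M": fake_q4, "Q6_K": fake_q6}, cases)
print(results)
```

Swap the stand-ins for real calls and fill `cases` with the tricky prompts from your own workload; a few dozen cases already tell you more than a generic benchmark.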

Imatrix: the free lever everyone ignores

The importance matrix (imatrix) is a calibration technique that identifies which weights matter most. You run calibration data through the model (the llama-imatrix tool), and the quantizer keeps the "important" weights more carefully via the --imatrix flag.
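
The idea can be illustrated with a toy example: when choosing a block's scale, weight each squared rounding error by how much that weight matters. This is a simplified sketch of the principle, not llama.cpp's actual algorithm, and the importance values are hypothetical.

```python
def quantize(block, scale, qmax=7):
    """Round each weight to a signed integer grid in [-qmax - 1, qmax]."""
    return [max(-qmax - 1, min(qmax, round(x / scale))) for x in block]


def weighted_error(block, importance, scale, qmax=7):
    """Sum of squared reconstruction errors, weighted by importance."""
    q = quantize(block, scale, qmax)
    return sum(w * (x - scale * qi) ** 2 for x, w, qi in zip(block, importance, q))


def best_scale(block, importance, qmax=7, steps=64):
    """Grid-search the scale that minimizes the importance-weighted error."""
    amax = max(abs(x) for x in block)
    candidates = [amax / qmax * (0.5 + i / steps) for i in range(steps + 1)]
    return min(candidates, key=lambda s: weighted_error(block, importance, s, qmax))


block = [0.31, -0.07, 0.92, -0.45, 0.12, 0.58, -0.83, 0.02]
uniform = [1.0] * len(block)
importance = [0.1, 5.0, 0.1, 0.1, 4.0, 0.1, 0.1, 3.0]  # hypothetical calibration result

s_plain = best_scale(block, uniform)     # no imatrix: every weight counts equally
s_imat = best_scale(block, importance)   # imatrix: protect the weights that matter

print("weighted error, plain scale  :", weighted_error(block, importance, s_plain))
print("weighted error, imatrix scale:", weighted_error(block, importance, s_imat))
```

Both scales come from the same candidate grid, so the importance-aware choice can never do worse on the weights that matter. The bit budget is identical; only the allocation of error changes.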

Two misconceptions to kill:

  • Imatrix is not exclusive to IQ-quants. You can (and should) apply it to classic K-quants too. It's free quality. If imatrix and i-quants appeared at the same time, it's mostly because the first i-quant was a 2-bit one, unusable without calibration.
  • You can't tell from the filename whether an imatrix was used. Check the model card on Hugging Face. bartowski quants, for example, are systematically imatrix-based.

Golden rule: below 4 bits (and especially ≤ 3 bits), imatrix is essential. Above that, it remains a nice bonus. As for fears of language bias from calibration: tests in llama.cpp discussion #5263 show that any dataset does better than a "vanilla" quantization with no imatrix.

K-quants vs IQ-quants: quality per byte against speed

IQ-quants (i-quants) are the next generation, built around the imatrix. Their strength: quality per byte. Concrete numbers on Llama 3.1 8B (llama.cpp figures): IQ4_XS = ~4.46 bpw / 4.17 GiB versus Q4_K_M = ~4.89 bpw / 4.58 GiB. IQ4_XS is more compact and slightly faster at generation, but slower at prompt processing, and above all more sensitive to calibration quality.
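
Sizes here are in GiB (2^30 bytes), while Hugging Face listings use decimal GB, so the same file shows up with two different numbers. A small conversion sketch, assuming roughly 8.03 billion parameters for Llama 3.1 8B:

```python
GIB = 2**30   # bytes in one gibibyte
GB = 10**9    # bytes in one decimal gigabyte


def gib_to_gb(size_gib: float) -> float:
    """Convert a size in GiB to decimal GB."""
    return size_gib * GIB / GB


def bpw_from_size(size_gib: float, n_params: float) -> float:
    """Effective bits per weight implied by a file size."""
    return size_gib * GIB * 8 / n_params


N_PARAMS = 8.03e9  # assumed parameter count for Llama 3.1 8B

for name, size_gib in [("IQ4_XS", 4.17), ("Q4_K_M", 4.58)]:
    print(f"{name}: {gib_to_gb(size_gib):.2f} GB, "
          f"~{bpw_from_size(size_gib, N_PARAMS):.2f} bpw")
```

Dividing the file size by the parameter count is also a quick sanity check on any bpw figure you read in a model card.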

The real deciding factor is often hardware: i-quants compress more but decode slower on CPU; K-quants often give better tokens/sec on consumer hardware, old GPUs, Macs, or pure CPU inference. The community guidance is simple: test both on your machine. On that note, understanding the relationship between VRAM, RAM and model size changes everything when calibrating your expectations.

Real sizes and choosing by hardware

Here are the concrete sizes of bartowski Llama 3.1 8B quants on Hugging Face — the reference everyone downloads:

  • Q4_K_M: 4.92 GB, the sensible default
  • Q5_K_M: 5.73 GB, the quality/size sweet spot
  • Q6_K: 6.6 GB, balanced near-lossless
  • Q8_0: 8.54 GB, near-lossless, but ~1.7× Q4_K_M
  • (F32 reference: 32.1 GB)

Decision tree by your VRAM (keeping headroom for context/KV-cache):

  • < 8 GB VRAM → Q3_K_M or IQ4_XS (with imatrix)
  • 8–12 GB VRAM → Q4_K_M (the default choice for most)
  • 12–16 GB VRAM → Q5_K_M or Q6_K (recommended for code/math)
  • 16–24 GB VRAM → Q8_0 if you want near-lossless
  • 24 GB+ → Q6_K remains an excellent balanced default for serving 7B–14B; keep F16 for truly critical cases
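
The decision tree above translates directly into a lookup function. This is only a transcription of the list for an 8B-class model; treating each lower bound as inclusive is my assumption, since the ranges in the list overlap at their edges.

```python
def suggest_quant(vram_gb: float) -> str:
    """Map available VRAM (GB) to the quant suggested in the list above."""
    if vram_gb < 8:
        return "Q3_K_M or IQ4_XS (with imatrix)"
    if vram_gb < 12:
        return "Q4_K_M"
    if vram_gb < 16:
        return "Q5_K_M or Q6_K"
    if vram_gb < 24:
        return "Q8_0"
    return "Q6_K (F16 for truly critical cases)"


for vram in (6, 8, 12, 16, 24):
    print(f"{vram:>2} GB -> {suggest_quant(vram)}")
```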

Good to know: Ollama downloads Q4_K_M by default. A plain ollama pull llama3.1:8b gives you Q4_K_M. For another quant, pull its explicit tag, for example ollama pull llama3.1:8b-instruct-q8_0 (check the model's tag list in the Ollama library for the exact names). And to grab a specific file from Hugging Face: huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF --include "*Q5_K_M.gguf" --local-dir ./. If you orchestrate this with modern clients, see our Ollama integrations (Codex, Claude, OpenCLAW).
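
If you script your downloads, the Hugging Face command above is easy to parametrize by quant. A small sketch: `hf_download_cmd` is a hypothetical helper name, and it only builds the argument list using the flags quoted above; it does not run anything by itself.

```python
import shlex


def hf_download_cmd(repo: str, quant: str, local_dir: str = "./") -> list:
    """Build the huggingface-cli argument list for one quant of a GGUF repo."""
    return [
        "huggingface-cli", "download", repo,
        "--include", f"*{quant}.gguf",
        "--local-dir", local_dir,
    ]


cmd = hf_download_cmd("bartowski/Meta-Llama-3.1-8B-Instruct-GGUF", "Q5_K_M")
print(shlex.join(cmd))
# To actually download, pass the list to subprocess.run(cmd, check=True).
```

Passing the arguments as a list avoids shell-quoting problems with the `*` wildcard.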

The verdict

No suspense, here's the honest hierarchy:

  • Q4_K_M = the universal default. Roughly 70 % less VRAM than FP16, 92–95 % of quality retained. If you hesitate, pick this.
  • Q5_K_M = the sweet spot. +15–20 % VRAM for real safety margin on code and reasoning. My default choice whenever the VRAM is there.
  • Q6_K = balanced near-lossless. Ideal for serving 7B–14B on 24–48 GB when you want peace of mind.
  • Q8_0 = for near-lossless purists. But ~1.7× the size of Q4_K_M for ~3 % more quality: reserve it for cases where every token counts (reference datasets, distillation).

The deep lesson: don't chase the last percentage points of perplexity at the cost of your VRAM. The right reflex in 2026 is to take the biggest quant that fits comfortably with your context, prefer imatrix files, and validate on your real prompts. The rest is cosmetics.
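
"Fits comfortably with your context" can be made concrete by budgeting for the KV cache. The sketch below reuses the bartowski file sizes quoted earlier and assumes the Llama 3.1 8B architecture as I understand it (32 layers, 8 KV heads, head dimension 128) with an FP16 cache. The 1.5 GB `overhead_gb` allowance for compute buffers and the desktop is a hypothetical figure; tune it for your setup.

```python
# File sizes (GB) for bartowski's Llama 3.1 8B quants, as quoted in this article.
QUANT_SIZES_GB = {"Q4_K_M": 4.92, "Q5_K_M": 5.73, "Q6_K": 6.6, "Q8_0": 8.54}


def kv_cache_gb(ctx_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache size: K and V each hold n_layers * n_kv_heads * head_dim values per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * ctx_tokens / 1e9


def biggest_fitting_quant(vram_gb, ctx_tokens, overhead_gb=1.5):
    """Largest quant whose file fits after reserving KV cache and overhead, or None."""
    budget = vram_gb - kv_cache_gb(ctx_tokens) - overhead_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget]
    return max(fitting)[1] if fitting else None


for vram, ctx in [(6, 8192), (8, 8192), (12, 8192), (16, 32768)]:
    print(f"{vram} GB VRAM, {ctx} tokens of context -> {biggest_fitting_quant(vram, ctx)}")
```

Under these assumptions the cache costs about 1 GB per 8K tokens of context, so a long context window can eat the headroom you thought you had for a bigger quant.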

Going further: size your machine with our VRAM/RAM calculation for local LLMs, compare hardware platforms with AMD Strix Halo (Ryzen AI Max 395) and Mac Studio M4 Max vs M3 Ultra, and find out which models to download in our overview of new local LLM models 2026 on Ollama.

Morgann Riu

Cybersecurity and Linux administration expert. I help companies secure and optimize their critical infrastructures.