"Will this model fit on my card?" It's THE question of local AI, and most answers out there are wrong — because they only look at the model weights and forget the rest. Result: a model that "should fit" saturates your GPU the moment the context gets longer.
This article breaks down the real formula, component by component, with actual numbers and usable tables. By the end, you'll be able to figure out in your head whether an LLM fits in your VRAM — KV cache included.
The 4 consumption buckets
An LLM's VRAM isn't just the weights. You pay for four things:
- Model weights — the parameters themselves. The biggest bucket, but not the only one.
- The KV cache — the attention state, which grows linearly with context length. The main trap.
- Activations — temporary tensors during the forward pass.
- Framework overhead — CUDA, the inference backend: typically 0.5 to 1 GB.
Quick rule: an LLM consumes about 2 GB of VRAM per billion parameters at FP16, or ~0.5 GB/billion at INT4, plus 15–20% on top for KV cache, activations and overhead.
Weights by quantization
At FP16, the usual unquantized baseline, each parameter takes 2 bytes. A 70B therefore occupies 140 GB and a 7B occupies 14 GB: that's the floor before any optimization. Quantization melts that figure:
- FP16: 2 bytes/param → 7B = 14 GB
- Q8: 1 byte/param → 7B = 7 GB
- Q4: ~0.5 byte/param → 7B = 3.5 GB
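The arithmetic behind this list fits in a few lines. This is a minimal sketch: the function and dictionary names are made up for illustration, and 1 GB is taken as 10^9 bytes.

```python
# Bytes per parameter for the three precisions listed above.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}

def weights_gb(params_billions: float, precision: str) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes), ignoring format metadata."""
    # billions of params x bytes per param = GB, since the 1e9 factors cancel
    return params_billions * BYTES_PER_PARAM[precision]

for size in (7, 70):
    row = ", ".join(f"{p}: {weights_gb(size, p):g} GB" for p in BYTES_PER_PARAM)
    print(f"{size}B -> {row}")
```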
The Q4_K_M case (the king format)
In the GGUF ecosystem (Ollama, LM Studio, llama.cpp), the dominant format is Q4_K_M. Beware: its real bits-per-weight is higher than the "4" suggests, because each block also stores metadata (scale factors and zero-points). Expect ~4.5 to 4.8 effective bits/weight, i.e. ~0.56 to 0.60 GB per billion parameters:
- 7B in Q4_K_M ≈ 4 GB of weights
- 32B in Q4_K_M ≈ 18 GB
- 70B in Q4_K_M ≈ 40 GB
On quality, Q4_K_M is the sweet spot: it retains ~95% of full-precision quality for ~4× less memory. Below Q4, degradation becomes noticeable, especially on reasoning and code.
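To sanity-check the effective bit width of a given quantized model, you can convert in both directions between bits per weight and GB per billion parameters. The sketch below uses hypothetical helper names and feeds in the rounded sizes from the list above.

```python
def gb_per_billion(bits_per_weight: float) -> float:
    """GB of weights per billion parameters at a given effective bit width."""
    return bits_per_weight / 8

def effective_bits(weights_size_gb: float, params_billions: float) -> float:
    """Effective bits per weight implied by a weights size in GB."""
    return weights_size_gb * 8 / params_billions

print(gb_per_billion(4.5), gb_per_billion(4.8))

# Rounded Q4_K_M sizes from the list above: (billions of params, GB).
for params, size in ((7, 4), (32, 18), (70, 40)):
    print(f"{params}B at {size} GB -> {effective_bits(size, params):.2f} bits/weight")
```

All three sizes land inside the 4.5 to 4.8 bits/weight band. That is the check to run on any file you download: divide its size in bits by the parameter count.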
The KV cache: the hidden consumer
It's often what decides whether a model fits, especially at long context. During generation, the model stores key/value tensors for every layer, at every token position in the context. The formula:
KV_cache ≈ 2 × n_layers × n_KV_heads × head_dim × context_length × bytes_per_element
The growth is brutal with context. Some real examples:
- 8B: ~0.3 GB at 2K context → ~5 GB at 32K → ~20 GB at 128K (the KV cache alone!)
- 70B: ~0.7 GB at 2K → over 42 GB at 128K
In other words: at long context, the KV cache can exceed the model weights themselves. For a Llama 3.1 70B at 128K, the KV cache adds ~40 GB: about 29% on top of the 140 GB of FP16 weights, and roughly as much as the Q4_K_M weights themselves. That's the classic mistake: you size for the weights, forget the cache, and crash at 32K tokens.
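The formula above translates directly into code. In this sketch the architecture values (layer count, KV heads, head dimension) are assumptions taken from the published Llama 3.1 8B and 70B configurations; check your own model's config before trusting the output. The results land in the same ballpark as the figures quoted above without matching them to the digit, since those are rounded.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_length, bytes_per_element=2):
    """KV cache size in bytes for one sequence."""
    # Leading 2 = one tensor for keys plus one for values.
    return 2 * n_layers * n_kv_heads * head_dim * context_length * bytes_per_element

# Assumed architecture values (Llama 3.1 configs, grouped-query attention).
MODELS = {
    "8B": dict(n_layers=32, n_kv_heads=8, head_dim=128),
    "70B": dict(n_layers=80, n_kv_heads=8, head_dim=128),
}

for name, arch in MODELS.items():
    for ctx in (2048, 32768, 131072):
        gb = kv_cache_bytes(context_length=ctx, **arch) / 1e9
        print(f"{name} @ {ctx:>6} tokens: {gb:5.1f} GB (f16 cache)")
```

Every factor in the formula is fixed by the model except context length, which is why the cache scales linearly with the context you request.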
Quantizing the KV cache (with caution)
You can quantize the cache itself: in Ollama, switching OLLAMA_KV_CACHE_TYPE from f16 to q8_0 roughly halves its size, and q4_0 cuts it to under a third. But beware: the KV cache is more sensitive to quantization than the weights. Quality degrades faster, so reserve this for squeezing out the last few GB you need.
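To see what each cache type buys you, scale the f16 cache size by the storage cost per element. This sketch assumes ggml's block layout for q8_0 and q4_0 (32 elements sharing one 16-bit scale), which is why the savings are slightly less than a clean 2× and 4×. The 5 GB input is the 8B figure at 32K quoted earlier.

```python
# Storage cost per cached element, in bits. The q8_0 and q4_0 values assume
# ggml's block layout: 32 elements plus one 16-bit scale per block.
BITS_PER_ELEMENT = {
    "f16": 16.0,
    "q8_0": (32 * 8 + 16) / 32,  # 8.5 bits
    "q4_0": (32 * 4 + 16) / 32,  # 4.5 bits
}

def quantized_kv_gb(f16_cache_gb: float, cache_type: str) -> float:
    """Cache size after switching from f16 to the given cache type."""
    return f16_cache_gb * BITS_PER_ELEMENT[cache_type] / 16.0

for cache_type in BITS_PER_ELEMENT:
    print(f"{cache_type}: {quantized_kv_gb(5.0, cache_type):.2f} GB")
```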
Architecture also changes the game: DeepSeek's MLA compresses the KV cache ~28× versus standard attention, and some heavily optimized MoE models need only ~1.2 GB of cache for 64K of context. The choice of model matters as much as its size.
Practical totals in Q4_K_M (8K context)
Weights + KV cache + overhead, for realistic use:
- 7–8B: ~6–7 GB
- 13–14B: ~10–12 GB
- 32B: ~21–23 GB
- 70B: ~43–46 GB
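Pocket formula: VRAM ≈ (billions of parameters × 0.5) + 2 GB. Example: Llama 3.1 8B in Q4_K_M = (8 × 0.5) + 2 = 6 GB. Treat it as a floor: it matches the list above for 7–8B models but runs low for bigger ones (37 GB for a 70B against the 43–46 GB listed).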
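A quick way to see where the pocket formula holds is to run it against the totals listed above. In this sketch the helper name is hypothetical, and 8, 14, 32 and 70 are representative sizes picked for each row.

```python
def pocket_vram_gb(params_billions: float) -> float:
    """Pocket formula: 0.5 GB per billion parameters plus a flat 2 GB."""
    return params_billions * 0.5 + 2

# Ranges copied from the list above (Q4_K_M, 8K context), in GB.
# The 8 / 14 / 32 / 70 keys are representative sizes picked for illustration.
LISTED_TOTALS = {8: (6, 7), 14: (10, 12), 32: (21, 23), 70: (43, 46)}

for params, (low, high) in LISTED_TOTALS.items():
    estimate = pocket_vram_gb(params)
    if estimate < low:
        verdict = f"{low - estimate:g} GB under"
    elif estimate > high:
        verdict = f"{estimate - high:g} GB over"
    else:
        verdict = "within range"
    print(f"{params}B: pocket {estimate:g} GB vs listed {low}-{high} GB ({verdict})")
```

The gap widens with model size because the flat 2 GB does not grow with the KV cache, and 0.5 GB per billion is below the ~0.56 GB that Q4_K_M really costs.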
How much VRAM for which model?
- 8 GB: a 7B Q4_K_M fits comfortably up to ~8K context. Stretching to 32K requires a quantized KV cache and tight overhead, since an f16 cache at 32K plus ~4 GB of weights already exceeds 8 GB.
- 16–24 GB: 20–35B models (Gemma 3 27B, Qwen3 32B, GPT-OSS 20B, DeepSeek R1 32B) in Q4_K_M; the 32B models need the full 24 GB.
- 48 GB and up: required for 70B+ (Llama 3.3 70B, Qwen2.5 72B). Two RTX 3090s (2×24 GB) are enough, and NVLink is not required.
Four traps to avoid
- MoE models still need all the VRAM. A widespread misconception is that only the active experts have to be loaded. In fact, all parameters must be in memory. Mixtral 8x7B holds 46.7B parameters even though only ~13B are active per token. The active count makes the model faster, not smaller.
- File size is not the truth. The runtime footprint exceeds the .gguf size: a 15.6 GB file can request over 20 GB of real VRAM at 64K context.
- Offloading kills performance. Offloading a 70B's excess layers to RAM drops speed from 25+ tok/s to 3–5 tok/s. Better a smaller model that fits entirely.
- Plan for headroom. For long-context or high-concurrency production, add 30–50% VRAM beyond the base size to absorb KV cache growth.
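Two of these traps are easy to put into numbers. The sketch below sizes Mixtral 8x7B by total versus active parameters, using the ~0.56 GB per billion Q4_K_M figure from earlier, then applies the 30–50% headroom rule to an example base size. The helper names and the 20 GB base are illustrative assumptions.

```python
GB_PER_BILLION_Q4_K_M = 0.56  # approximate figure used in this article

def q4km_weights_gb(params_billions: float) -> float:
    """Approximate Q4_K_M weight size in GB."""
    return params_billions * GB_PER_BILLION_Q4_K_M

def with_headroom(base_gb: float, headroom: float) -> float:
    """Base footprint plus a fractional safety margin (0.3 = 30%)."""
    return base_gb * (1 + headroom)

# Mixtral 8x7B: every expert must be resident, so size by TOTAL parameters.
print(f"sized by total (46.7B): {q4km_weights_gb(46.7):.1f} GB")
print(f"sized by active (13B):  {q4km_weights_gb(13):.1f} GB  <- the misconception")

# Headroom rule applied to a hypothetical 20 GB base footprint.
for margin in (0.3, 0.5):
    print(f"20 GB base + {margin:.0%} headroom = {with_headroom(20, margin):.0f} GB")
```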
In short
To know whether a model fits: take 0.56 GB × billions of parameters (weights in Q4_K_M), add the KV cache (which explodes with context) and ~1 GB of overhead. If the total exceeds your VRAM, you have three levers, in order of preference: drop to a lower-bit quantization, reduce the context, or quantize the KV cache. And never forget: a model that doesn't fit entirely in VRAM doesn't "run slower," it crawls.
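The whole recipe can be sketched as a single check. This example is illustrative only: the function names are made up, the 5 GB cache at 32K is the 8B figure quoted earlier, and it pulls just one lever (reducing the context) on an 8 GB card.

```python
def total_vram_gb(params_billions, kv_cache_gb, gb_per_billion=0.56, overhead_gb=1.0):
    """Weights (Q4_K_M by default) + KV cache + fixed overhead, in GB."""
    return params_billions * gb_per_billion + kv_cache_gb + overhead_gb

def fits(total_gb, vram_gb):
    return total_gb <= vram_gb

VRAM_GB = 8.0
KV_GB_AT_32K = 5.0  # 8B figure quoted earlier, f16 cache

for context_k in (32, 16, 8):
    kv_gb = KV_GB_AT_32K * context_k / 32  # cache is linear in context length
    total = total_vram_gb(8, kv_gb)
    status = "fits" if fits(total, VRAM_GB) else "too big"
    print(f"8B @ {context_k}K: {total:.2f} GB -> {status}")
```

The same function covers the other two levers: pass a smaller gb_per_billion for a lower-bit quantization, or a smaller kv_cache_gb for a quantized cache.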