Local fine-tuning fuels fantasies: training "your own" model, injecting your document base, and getting a homemade assistant that knows everything by heart. The 2026 reality is more nuanced, and far more accessible on the hardware side than people think. With QLoRA and Unsloth, fine-tuning a 7B model fits on an RTX 3060 12 GB. But before launching any training run, you need to understand what fine-tuning can do and, above all, what it cannot.
This article gives the real numbers: VRAM per model size, dataset size, frameworks, and the honest verdict on the question that always comes up: fine-tuning or RAG?
LoRA and QLoRA: why it fits on a gaming card
A "full" fine-tune (all parameters) of a 7B model requires holding the weights, the gradients AND the optimizer states in VRAM simultaneously: 60 to 70 GB minimum. Out of reach for a consumer GPU.
LoRA (Low-Rank Adaptation) changes everything: you freeze the base model and only train small low-rank "adapter" matrices grafted onto the attention layers. You go from billions of trainable parameters to a few million.
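To make "a few million" concrete, here is a rough parameter count. It is a sketch under stated assumptions: a Llama-style 7B layout (hidden size 4096, 32 layers, square attention projections), rank 16, and adapters on the four attention projections only. None of these values come from a specific model card, so adjust them for your base model.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed architecture (illustrative, Llama-style 7B).
HIDDEN = 4096
N_LAYERS = 32
RANK = 16
ADAPTED_MODULES_PER_LAYER = 4  # q, k, v and o projections

n_adapters = N_LAYERS * ADAPTED_MODULES_PER_LAYER

# Trainable parameters with LoRA.
trainable = n_adapters * lora_params(HIDDEN, HIDDEN, RANK)

# Parameters in the same projection matrices if they were trained directly.
dense = n_adapters * HIDDEN * HIDDEN

print(f"LoRA trainable parameters: {trainable:,}")
print(f"Same matrices, fully trained: {dense:,}")
print(f"Ratio: {dense // trainable}x fewer parameters to train")
```

Under these assumptions the adapters hold about 16.8 million parameters, versus roughly 2.1 billion for the attention projections they sit on.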
QLoRA pushes further: the base model is loaded in 4-bit NormalFloat (NF4), cutting weight memory by roughly 75 % relative to 16-bit weights, while the LoRA adapters train in BF16. Only the adapter gradients and optimizer states are stored. The quality cost? In most benchmarks, only a 1 to 2 % gap versus full fine-tuning, a small price for a roughly 8x reduction in VRAM (about 8 GB instead of 60 to 70 GB).
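In code, a QLoRA setup usually comes down to two configuration objects. The sketch below uses the Hugging Face `transformers` and `peft` libraries, which is one common way to do it and not necessarily what Unsloth does internally. The NF4 and BF16 settings follow the description above; the rank, alpha, dropout, double quantization and target modules are illustrative assumptions, not recommendations from this article.

```python
# Settings kept as plain dicts so they can be inspected or logged
# without importing the heavy libraries.
QUANT_KWARGS = {
    "load_in_4bit": True,               # base weights stored in 4 bits
    "bnb_4bit_quant_type": "nf4",       # NormalFloat 4-bit
    "bnb_4bit_use_double_quant": True,  # assumption: also quantize the quantization constants
}

LORA_KWARGS = {
    "r": 16,                # assumption: adapter rank
    "lora_alpha": 32,       # assumption: scaling factor
    "lora_dropout": 0.05,   # assumption
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
}


def build_qlora_configs():
    """Build the quantization and LoRA config objects.

    Imports are done lazily: torch, transformers and peft must be installed
    (plus bitsandbytes when the model is actually loaded).
    """
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16
        **QUANT_KWARGS,
    )
    lora_config = LoraConfig(**LORA_KWARGS)
    return quant_config, lora_config
```

The quantization config is then passed as `quantization_config` when loading the base model, and the LoRA config to `peft.get_peft_model`.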
How much VRAM, concretely?
Here is the real memory breakdown of a QLoRA run on a 7B model:
- Model weights (NF4): ~3.5 GB
- LoRA adapters (BF16): ~0.1 GB
- Optimizer states: ~0.4 GB
- Gradients + activations: ~4.0 GB
- Total: ~8 GB
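The arithmetic behind this breakdown is easy to reproduce. In the sketch below, only the weight figure is derived (parameters times bits per weight); the adapter, optimizer and activation figures are copied from the list above, since they depend on rank, batch size and sequence length. It ignores the small overhead of quantization constants, so treat the result as an estimate.

```python
GB = 1e9  # decimal gigabytes, matching the round figures above


def weight_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory taken by the model weights alone, in GB."""
    return n_params * bits_per_weight / 8 / GB


N_PARAMS = 7e9  # a 7B model

breakdown_gb = {
    "weights_nf4": weight_gb(N_PARAMS, 4),
    "lora_adapters": 0.1,            # taken from the breakdown above
    "optimizer_states": 0.4,         # taken from the breakdown above
    "gradients_activations": 4.0,    # taken from the breakdown above
}
total_gb = sum(breakdown_gb.values())

# Saving on the weights when going from 16-bit to 4-bit storage.
saving = 1 - weight_gb(N_PARAMS, 4) / weight_gb(N_PARAMS, 16)

for name, size in breakdown_gb.items():
    print(f"{name:>24}: {size:4.1f} GB")
print(f"{'total':>24}: {total_gb:4.1f} GB")
print(f"4-bit weights save {saving:.0%} versus 16-bit weights")
```

Swapping `N_PARAMS` for 13e9 gives 6.5 GB of weights for a 13B model, before adapters and activations.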
By model size, the 2026 recommendations are as follows:
- 7B with QLoRA: 8 to 12 GB of VRAM. An RTX 3060 12 GB is enough. An RTX 4080 16 GB is comfortable.
- 13B with QLoRA: trains without drama on a 24 GB GPU (RTX 3090 or 4090).
- Full fine-tuning 7B: 60-70 GB — forget it on consumer hardware.
Caveat: these numbers assume the GPU does only this. If you run inference in parallel, add the corresponding budget. Also plan for 32 GB of system RAM for a 7B, 64 GB for 13B+, and an NVMe SSD for data loading. To size your machine properly, our VRAM/RAM calculation guide for local LLMs details the method.
Unsloth vs Axolotl: which one to pick?
Two frameworks dominate local fine-tuning. The choice is not a matter of taste.
Unsloth — king of single-GPU
Unsloth rewrites PyTorch backpropagation steps as hand-optimized Triton kernels. Result on an identical 8B benchmark: 3.2 hours versus 5.8 hours for Axolotl, i.e. about 1.8x faster, with up to 70-80 % less VRAM compared to a FlashAttention 2 baseline and no measurable accuracy degradation. It runs on a broad range of GPUs, from GTX 1070 to H100, and supports Llama 4, Qwen 3, DeepSeek-R1 and Phi-4.
Decisive limit: Unsloth does not support multi-GPU.
Axolotl — production and multi-GPU
Axolotl relies on a version-controllable YAML config, best-in-class data preprocessing and multi-GPU support. It is slower (abstraction layers on top of Hugging Face Transformers) but reproducible and reliable in a pipeline. Subtlety: Axolotl can use Unsloth as a backend to recover the speed.
The verdict:
- Solo, single-GPU, limited VRAM → Unsloth, no hesitation.
- Cluster, production pipeline, DPO/RLHF alignment → Axolotl.
- Hybrid workflow: Unsloth for SFT (compute-bound), Axolotl for DPO (which doubles VRAM with a reference model).
The dataset: where 90 % of fine-tunes fail
Most failures are not hardware problems but data problems. LoRA, having few parameters, is extremely sensitive to noise and inconsistencies. The golden rule: 500 clean examples beat 5,000 noisy ones.
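A first cleaning pass costs very little. The sketch below assumes records with `instruction`, `input` and `output` fields; the function name and the rules (drop empty examples, normalize whitespace, remove exact duplicates) are illustrative choices, a minimal baseline and not a complete quality filter.

```python
def clean_examples(examples: list[dict]) -> list[dict]:
    """Drop empty and duplicate examples, normalizing whitespace on the way."""
    seen = set()
    kept = []
    for example in examples:
        # Collapse runs of whitespace so near-identical prompts compare equal.
        instruction = " ".join(example.get("instruction", "").split())
        context = " ".join(example.get("input", "").split())
        output = example.get("output", "").strip()

        # An example without a prompt or without an answer teaches nothing.
        if not instruction or not output:
            continue

        # Deduplicate on the prompt (case-insensitive), keeping the first answer seen.
        key = (instruction.lower(), context.lower())
        if key in seen:
            continue
        seen.add(key)

        kept.append({"instruction": instruction, "input": context, "output": output})
    return kept


raw = [
    {"instruction": "Summarize  the text.", "input": "Some text", "output": "A summary."},
    {"instruction": "summarize the text.", "input": "Some text", "output": "Another summary."},
    {"instruction": "Translate to French.", "input": "Hello", "output": "   "},
]
print(len(raw), "->", len(clean_examples(raw)))
```

Conflicting answers to the same prompt, as in the first two records, are exactly the kind of inconsistency a low-parameter adapter handles badly; here the first one wins, but reviewing such collisions by hand is often worth it.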
How many examples by task?
- Classification or simple formatting: 500 to 1,000 examples are enough to see a measurable improvement.
- Complex instruction-following / new domain: 3,000 to 10,000 high-quality examples.
- Beyond 10,000: diminishing returns, unless the domain is very broad.
On format, two standards: Alpaca (fields instruction / input / output, ideal for single-turn) and ShareGPT/ChatML (messages role / content, for multi-turn conversational assistants).
Two well-documented traps: generating your data with another LLM often amounts to teaching your model to imitate that LLM rather than a real skill; and piling on epochs hurts — on Alpaca, going from 1 to 2 epochs degraded performance.
Fine-tuning or RAG? The honest verdict
This is THE question, and the 2025-2026 research is clear: RAG has become the standard for injecting knowledge. Several studies (notably "Fine-Tuning or Retrieval?") show a significant advantage for RAG over fine-tuning on factual and current-events tasks.
Why? LoRA's low-rank constraint is built for form, not facts. A detailed analysis of instruction tuning shows that LoRA mainly learns "response initiation" and style tokens, extracting most of the content from knowledge already present in the base model. Injecting medical or statistical facts spreads across too many dimensions to fit into low-rank updates, hence unstable training and incomplete outputs. Add catastrophic forgetting and the reversal curse, which RAG (in-context learning) avoids.
When fine-tuning wins:
- Style, tone, persona: making the model speak a certain way, adopting a brand voice.
- Strict output format: clean JSON, enforced structure, a homemade API format.
- Domain terminology and behavioral triggers.
- Shortening prompts: the behavior lives in the weights, no need to re-explain everything on every request.
When RAG wins:
- Factual knowledge, proprietary documents, data that changes.
- Need for traceability and citable sources.
- No retraining on every corpus update.
The pragmatic decision hierarchy: 1) prompt engineering first; 2) RAG for knowledge; 3) LoRA/QLoRA only when you need to change the model's behavior. And nothing forbids the hybrid: a model fine-tuned for style, plugged into a RAG pipeline for the facts.
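The RAG half of that hybrid comes down to fetching relevant passages and putting them in the prompt, with their identifiers, so the model can cite them. The sketch below is deliberately a toy: retrieval is plain keyword overlap standing in for a real embedding index, and the function names, document layout and prompt wording are all assumptions. The resulting prompt would then go to whichever model you serve, fine-tuned or not.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercased word set; a stand-in for a real embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[dict], k: int = 2) -> list[dict]:
    """Return up to k documents sharing at least one word with the question."""
    query = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(query & tokenize(doc["text"])),
        reverse=True,
    )
    return [doc for doc in ranked[:k] if query & tokenize(doc["text"])]


def build_prompt(question: str, documents: list[dict], k: int = 2) -> str:
    """Assemble a prompt that carries the retrieved sources and their ids."""
    hits = retrieve(question, documents, k)
    sources = "\n".join(f"[{doc['id']}] {doc['text']}" for doc in hits)
    return (
        "Answer using only the sources below and cite their ids.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


corpus = [
    {"id": "warranty", "text": "The warranty lasts 24 months."},
    {"id": "shipping", "text": "Shipping takes three days."},
]
print(build_prompt("How long is the warranty?", corpus))
```

Updating the knowledge means editing `corpus`, not retraining: that is the traceability and freshness argument in practice.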
Choosing your base model
To fine-tune locally, start from a permissively licensed 7B-8B model: Llama 3.1 8B, Mistral 7B or Qwen 2.5 7B. The Mistral 3 family under Apache 2.0 is an excellent starting point on the European and licensing front. To see what is available today, our roundup of new local LLM models in 2026 takes stock. Once the adapter is trained, you can merge it and serve it through Ollama like any local model.
Realistic expectations
Local fine-tuning is neither magic nor out of reach. It is hardware-accessible (a 7B on a sub-400 € GPU) but data-demanding. It excels at shaping behavior, style and format. It fails at memorizing facts reliably, and for that RAG remains king. The best reflex in 2026: try the prompt, then RAG, and only reach for LoRA when you genuinely want to change the model's personality, not its encyclopedia.