Artificial Intelligence 18/06/2026 5 min read

Local RAG with Ollama: an assistant that reads YOUR documents, 100% offline

Build a privacy-first RAG assistant on your own documents: embeddings, vector DB, chunking and a local LLM. With qwen3-embedding hitting 70.58 on multilingual MTEB, local finally rivals commercial APIs.

Even an excellent LLM knows nothing about your contracts, your internal procedures or your meeting notes. RAG (Retrieval-Augmented Generation) closes that gap: you index your documents, retrieve the relevant passages for each question, and inject them into the model's context. Done right, you get an assistant that cites your real sources — and with Ollama, everything stays on your machine. Zero leaks, zero subscription, zero Internet.

This is an engineer's guide, not a marketing demo. We'll lay out the real 2026 numbers, the pitfalls that break a RAG pipeline, and an architecture that actually holds up offline.

Why local RAG got serious in 2026

Until recently, the "local is worse" argument rested mostly on embeddings, the vectors that turn text into semantic coordinates. That argument no longer holds: the qwen3-embedding family is the first to compete head-on with commercial APIs, with the 8B model scoring 70.58 on the multilingual MTEB leaderboard. For non-English content, that's decisive.

The classics remain solid: nomic-embed-text (73.8M pulls on Ollama, MTEB English score ~62.4, native 8,192-token context) and mxbai-embed-large (score ~64.7, but context capped at 512 tokens). In real RAG testing the gap is small: nomic-embed-text wins on short, direct questions (63.75% retrieval accuracy vs 57.5% for mxbai-embed-large), while mxbai-embed-large takes the lead on long, context-heavy ones.
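
Published retrieval-accuracy figures like these only tell you so much, so it helps to measure the same metric on your own questions. Here is a minimal sketch, assuming you have a small hand-labelled set of (question, expected chunk id) pairs and a retriever function of your own. The names hit_rate_at_k, eval_set and retrieve are hypothetical, not part of Ollama or any vector DB.

```python
from typing import Callable, Sequence


def hit_rate_at_k(
    eval_set: Sequence[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected chunk id appears in the top-k results.

    eval_set: (question, expected_chunk_id) pairs labelled by hand (assumption).
    retrieve: your own function returning the ids of the top-k chunks.
    """
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for question, expected_id in eval_set:
        if expected_id in retrieve(question, k):
            hits += 1
    return hits / len(eval_set)
```

Run it once per embedding model on the same question set: a few dozen questions drawn from your real documents are usually more informative than a leaderboard gap of a couple of points.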

The architecture in 5 building blocks

A complete offline RAG pipeline fits into five stages:

Ingestion: extract text from your PDFs, Markdown, DOCX, etc.
Chunking: split into digestible pieces for search.
Embeddings: vectorize each chunk via an Ollama model.
Vector DB: store and index the vectors (Chroma or Qdrant).
Retrieval + generation: fetch the top-k chunks and pass them to a local LLM.

Setup is trivial: ollama pull nomic-embed-text for embeddings, ollama pull qwen3:8b (or llama3.1:8b) for generation, and a local vector DB. Everything goes through the http://localhost:11434 API.
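
To make "everything goes through the API" concrete, here is a minimal standard-library sketch of the two Ollama calls. It assumes the /api/embed and /api/generate endpoints and their JSON fields as I understand Ollama's REST API, so check them against your installed version. The helper names build_request, post, embed and generate are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address, as in the text


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the local Ollama API (nothing is sent yet)."""
    return urllib.request.Request(
        OLLAMA_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post(path: str, payload: dict, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.load(resp)


def embed(texts: list[str], model: str = "nomic-embed-text") -> list[list[float]]:
    """Return one vector per input text (assumed response field: 'embeddings')."""
    return post("/api/embed", {"model": model, "input": texts})["embeddings"]


def generate(prompt: str, model: str = "qwen3:8b") -> str:
    """Return the full completion in one piece (streaming disabled)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return post("/api/generate", payload)["response"]
```

With the Ollama server running and both models pulled, embed(["hello"]) should return a single vector and generate("Say hi") a short string. Nothing here leaves localhost.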

Chunking: where 80% of RAGs fail

Splitting is the most underestimated parameter. Too large, and chunks drown the signal; too small, and they lose context. The 2026 consensus is clear:

Size: 256 to 512 tokens covers most cases. Drop to 128-256 for precise factual questions, raise to 512-1024 for analytical or summary tasks. Watch the "context cliff" identified around 2,500 tokens, where quality drops.
Overlap: 10 to 20% of chunk size (~50-100 tokens for 512). NVIDIA measured 15% as optimal on FinanceBench. But a January 2026 study showed that on some corpora overlap adds nothing and just bloats the index: test on YOUR data.
Strategy: recursive splitting (paragraphs → sentences) is the sane default. Chroma's research reports 85-90% recall for recursive splitting at 400 tokens, versus 91-92% for semantic splitting, but the semantic approach forces you to embed every sentence, a far higher compute cost.
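
The size, overlap and strategy rules above can be sketched as a small recursive splitter. This is a sketch under loud assumptions: tokens are approximated by whitespace-separated words (a real tokenizer will count differently), and count_tokens, split_recursive and chunk are hypothetical names, not a library API.

```python
import re


def count_tokens(text: str) -> int:
    """Rough token count: whitespace-separated words (an approximation)."""
    return len(text.split())


def split_recursive(text: str, max_tokens: int) -> list[str]:
    """Split into paragraphs, then sentences, then word windows as a last resort."""
    pieces: list[str] = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if count_tokens(para) <= max_tokens:
            pieces.append(para)
            continue
        for sent in re.split(r"(?<=[.!?])\s+", para):
            words = sent.split()
            # A single sentence longer than the budget is cut by words.
            for i in range(0, len(words), max_tokens):
                pieces.append(" ".join(words[i:i + max_tokens]))
    return pieces


def chunk(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Pack pieces into chunks of at most max_tokens words.

    Each new chunk starts with the last `overlap` words of the previous one.
    Paragraph breaks inside a chunk are flattened to single spaces.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    budget = max_tokens - overlap  # room left for new words after the overlap
    chunks: list[str] = []
    current: list[str] = []
    for piece in split_recursive(text, budget):
        words = piece.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Calling chunk(text, max_tokens=512, overlap=64) gives roughly 12% overlap, inside the 10-20% range quoted above. Setting overlap=0 lets you check on your own corpus whether overlap helps at all.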

Critical trap: nomic-embed-text natively supports 8,192 tokens, but its Ollama model card lists a 2,048-token context. If your chunks exceed 2,048 tokens and you have not raised num_ctx, they get silently truncated. You think you're indexing a whole chunk; you're only indexing its first 2,048 tokens.
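
One way to guard against this trap is to raise num_ctx explicitly and ask Ollama to fail instead of truncating. The sketch below assumes the "options" and "truncate" fields of the /api/embed request behave as I understand them from Ollama's API documentation, so verify that on your version. The pre-flight check uses a crude 3-characters-per-token estimate, which is purely a hypothetical heuristic, and the function names are mine.

```python
def embed_payload(
    texts: list[str],
    model: str = "nomic-embed-text",
    num_ctx: int = 8192,
) -> dict:
    """Request body for POST /api/embed with an explicit context size."""
    return {
        "model": model,
        "input": texts,
        # Assumed behaviour: with truncate=False the server returns an error
        # when an input exceeds the context instead of cutting it silently.
        "truncate": False,
        "options": {"num_ctx": num_ctx},
    }


def oversized_chunks(
    chunks: list[str],
    limit_tokens: int = 2048,
    chars_per_token: float = 3.0,
) -> list[int]:
    """Indexes of chunks whose estimated token count exceeds limit_tokens.

    chars_per_token is a rough guess, not a tokenizer: tune it for your language.
    """
    return [i for i, c in enumerate(chunks) if len(c) / chars_per_token > limit_tokens]
```

Running oversized_chunks over your corpus before indexing costs nothing and tells you whether the truncation trap applies to you at all. With the 256 to 512-token chunks recommended above, it normally should not.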

Chroma or Qdrant? The real trade-off

Both run locally, but they don't play in the same league:

Chroma: the simplest. Runs in memory or in a tiny Docker container, with the most mature LangChain/LlamaIndex integrations. Perfect for prototyping. Real ceiling: past a few hundred thousand vectors, performance gets inconsistent.
Qdrant: written in Rust, low memory footprint, predictable p99 latency, excellent at metadata filtering (filter by source, date, author without blowing up latency). Handles millions of vectors on a small VPS at $30-50/month.

Verdict: Chroma to get started and validate your pipeline in an afternoon. Qdrant as soon as you target production, serious filtering, or a corpus over ~100,000 chunks. The community rule "Chroma for the prototype, Qdrant for production" shows up too often to ignore. My advice: if you know it will grow, start straight on Qdrant to avoid a forced migration.
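
To see what either engine does for you, here is a brute-force stand-in for filtered vector search in pure Python: filter on metadata first, then rank the survivors by cosine similarity. This is deliberately not the Chroma or Qdrant API. Real engines use approximate indexes rather than scanning every vector, and the point layout ("id", "vector", "payload") is an assumption made for the example.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query_vec: list[float],
    points: list[dict],
    k: int = 5,
    where: Optional[dict] = None,
) -> list[tuple[str, float]]:
    """Return the k best (id, score) pairs among points matching `where`.

    points: dicts with "id", "vector" and "payload" keys (hypothetical layout).
    where: exact-match metadata filter, e.g. {"source": "contract.pdf"}.
    """
    where = where or {}
    candidates = [
        p for p in points
        if all(p["payload"].get(key) == val for key, val in where.items())
    ]
    scored = [(p["id"], cosine(query_vec, p["vector"])) for p in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

A linear scan like this is fine for a few thousand chunks and makes a handy baseline for checking that your vector DB returns the same neighbours. Beyond that, indexing and filtering performance are exactly what the Chroma versus Qdrant choice is about.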

The generation LLM: size it by VRAM

Retrieval brings the chunks; the LLM writes the answer. Pick the model to fit your hardware, and remember that the retrieved chunks consume the context window, so the KV cache grows with every chunk you inject:

6-8 GB VRAM: Llama 3.1 8B or Qwen3 8B at Q4_K_M, 40+ tokens/s. The sweet spot for most users.
10-12 GB: Gemma 3 12B or Qwen3 14B, comfortable extended context.
24 GB: Qwen3 32B or Gemma 3 27B (fits on an RTX 4090 with tight context).
48 GB+: Llama 3.3 70B, the RAG reference (128K context, minimal hallucination on retrieved content, 95% of the 405B's quality at 17% of the parameters).

Two free optimizations: enable Flash Attention (less VRAM, more speed, zero quality loss) and, if needed, KV cache quantization at Q8_0 (halves cache memory). Avoid Q4_0 on the KV cache: the quality loss becomes visible. To properly budget weights + cache + overhead, see our guide on VRAM and RAM: calculating what runs an LLM locally.
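
For a first-order budget, the usual back-of-the-envelope arithmetic is: weights take parameters × bits per weight / 8, and the KV cache takes 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per value. The sketch below applies it to Llama 3.1 8B, whose published configuration has 32 layers, 8 KV heads and a head dimension of 128. The 1 byte per value used for Q8_0 and the 1 GiB overhead are simplifying assumptions, and the function names are hypothetical.

```python
GIB = 2**30


def weights_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: float = 2.0,  # 2.0 for f16; about 1.0 for Q8_0 (approximation)
) -> float:
    """KV cache size: one key and one value tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def total_gib(weights: float, kv_cache: float, overhead_gib: float = 1.0) -> float:
    """Weights + cache + a flat overhead (the overhead value is a placeholder guess)."""
    return (weights + kv_cache) / GIB + overhead_gib


if __name__ == "__main__":
    f16 = kv_cache_bytes(32, 8, 128, 8192)
    q8 = kv_cache_bytes(32, 8, 128, 8192, bytes_per_value=1.0)
    print(f"KV cache f16 at 8,192 tokens: {f16 / GIB:.2f} GiB")  # 1.00 GiB
    print(f"KV cache Q8_0 (approx.):      {q8 / GIB:.2f} GiB")   # 0.50 GiB
```

In Ollama, both optimizations are, to my knowledge, switched on through server environment variables (OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE). Check the documentation of your version before relying on them.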

Putting it all together, in practice

The request flow: the question is embedded with the same model as the documents (never a different one, or the vector spaces are incompatible); Qdrant or Chroma returns the top-k chunks (k=4 to 6 is a good start; avoid k=2, which is too restrictive); you then build a prompt along the lines of "here are the excerpts, answer only from them and cite your sources"; finally, the local LLM generates the answer.
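
The flow above can be sketched as one function that takes the three steps as plain callables, so it works with any embedding call, vector DB and LLM client. The chunk layout ("source", "text"), the exact prompt wording and the extra "say so if the answer is missing" instruction are assumptions of this sketch, not something the pipeline requires.

```python
from typing import Callable


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number each excerpt so the model can cite it as [n]."""
    excerpts = "\n\n".join(
        f"[{i}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Here are excerpts from the user's documents.\n"
        "Answer only from these excerpts and cite your sources as [n].\n"
        "If the excerpts do not contain the answer, say so.\n\n"
        f"{excerpts}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(
    question: str,
    embed: Callable[[str], list[float]],
    search: Callable[[list[float], int], list[dict]],
    generate: Callable[[str], str],
    k: int = 5,
) -> str:
    """Embed the question, fetch the top-k chunks, then generate from them."""
    query_vec = embed(question)    # must be the same model used at indexing time
    chunks = search(query_vec, k)  # top-k chunks with their source metadata
    return generate(build_prompt(question, chunks))
```

Passing the three steps as arguments keeps the flow testable offline: swap in fake functions to check the prompt, then plug in the real Ollama and vector DB calls.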

For tooling, LangChain and LlamaIndex are the glue, but for a hand-coded assistant you only need three HTTP calls (embed, search, generate). And if you want to wire this RAG into your editor or an agent, look at our Ollama integrations with Codex, Claude and OpenClaw locally. For picking the generation model, our roundup of the new local LLM models of 2026 on Ollama stays current.

The honest verdict

Local RAG in 2026 is no longer a second-best compromise. With qwen3-embedding on par with commercial APIs, Qdrant absorbing millions of vectors for the price of a VPS, and 8-14B LLMs answering at 40+ tokens/s, you get a private assistant that's free to run and fully offline. The only real work isn't the infra — it's the chunking and evaluation on your data. Nobody can do that for you, and that's exactly where quality is won.

Going further: size your machine with our comparison of AMD Strix Halo Ryzen AI Max+ 395 for local AI, the NVIDIA DGX Spark on the desktop verdict, and on the European open-source side, Mistral 3 under Apache 2.

Morgann Riu

Cybersecurity and Linux administration expert. I help companies secure and optimize their critical infrastructures.
