Local LLM Runtimes in 2026: llama.cpp, Ollama, vLLM, LM Studio, TGI, Which One to Pick?

An honest comparison of local LLM inference engines in 2026: vLLM hits ~793 tok/s under concurrent load versus ~41 for Ollama, yet at a single user the gap drops below 10%. When to use each.

"Which engine should I run my LLMs on locally?" has become the single most common question from teams graduating out of the POC stage. The short answer fits in one sentence: it depends entirely on your load profile. At a single user, the five runtimes we compare here hit the same matmul kernels and land within a few percent of each other. The moment you push concurrency, the gap explodes and only two or three engines hold up. This article cuts through the marketing with 2026 numbers.

The big 2026 shift: TGI is dead

Let's start with the news that simplifies the whole debate. On March 21, 2026, Hugging Face moved Text Generation Inference (TGI) into maintenance mode. The project now explicitly redirects new users to vLLM, SGLang, llama.cpp and MLX. Your existing TGI deployments keep working, but for any new project, TGI is no longer in the recommendation path. We keep it in the comparison because it is still massively deployed, and because it had one strength vLLM still struggles to match (see below), but the verdict is clear: don't start fresh on it.

At a single user, everyone ties (almost)

This is the point marketing benchmarks always forget. At batch size 1, on a short prompt, vLLM and Ollama land within 2-10% of each other for the same model and quantization. On an RTX 4090 with a 24B model, all five runtimes hover around 30 tok/s. That makes sense: at batch 1, everyone ends up on the same GPU kernels.

An independent benchmark on Llama 3.1 8B (RTX 4090, single user) puts Ollama at ~62 tok/s in Q4_K_M, vLLM at 71 tok/s in FP16 and 68 in AWQ. The 13% gap is as much about quantization as architecture. The takeaway: if you serve a handful of sequential requests, the simplest engine wins, because it brings no real throughput gain. That is exactly the case for local dev, prototyping, and single-user internal tools.

Under concurrent load, vLLM crushes everything

The picture flips entirely once you stack simultaneous requests. This is where vLLM's continuous batching and PagedAttention make the difference.

  • 10 concurrent users (Llama 3.1 8B FP16): vLLM aggregates requests and sustains ~485 total tok/s, while Ollama stays at ~148 tok/s (near-sequential FIFO queue). A 3.3x factor from a single architectural choice.
  • 50 users: vLLM holds ~920 tok/s, Ollama plateaus at ~155 tok/s.
  • The most-cited Red Hat benchmark: vLLM at 793 tok/s versus 41 tok/s for Ollama on the same hardware, a roughly 19x gap.

On the peer-reviewed side (arXiv, 2511.17593), on LLaMA-2-7B at 100 concurrent requests, vLLM reaches 15,243 tok/s versus 4,156 for TGI (3.67x), and the gap widens to 24x under extreme load (200 requests). On a 70B with 4-GPU tensor parallelism, the advantage drops to 2.1x — communication overhead narrows the gap.

Why does Ollama collapse? It implements neither PagedAttention nor continuous batching. Past 5-8 simultaneous requests, P95 latency spikes and the queue grows. Ollama allocates the KV cache statically per request; vLLM splits it into non-contiguous pages allocated on demand, cutting memory waste by 19-27% and keeping the GPU at 85-92% utilization.

vLLM's blind spot: very long prompts

An honest detail too often glossed over: before going into maintenance, TGI v3 processed ~3x more tokens and up to 13x faster than vLLM on very long prompts (>200,000 tokens) thanks to prefix caching. A reply served in 27.5s by vLLM dropped to ~2s under TGI v3. If your workload is conversations with a giant history, check your engine's prefix caching before concluding.

CPU and Apple Silicon: home turf for llama.cpp and LM Studio

On pure CPU, llama.cpp is the fastest: dependency-free C/C++ implementation, GGUF quantization, the broadest hardware support on the market (CUDA, ROCm/HIP, Vulkan, Metal, even Moore Threads via MUSA). It is the engine for machines with no dedicated GPU, for the edge, for hybrid CPU+GPU where part of the model lives in VRAM and the rest in system RAM. For the memory trade-offs, see our VRAM/RAM math for running a local LLM.

On Apple Silicon, Apple's MLX backend is typically 30 to 50% faster than llama.cpp on Metal, because it talks directly to the GPU and Neural Engine. LM Studio made it its flagship argument: a unified MLX engine, a Hugging Face model browser that shows estimated VRAM and quant level before download, offline RAG, MCP support, and an OpenAI-compatible API server on localhost:1234. For the record, LM Studio and Ollama both wrap llama.cpp under the hood — the throughput gap between them is usually under 5%. This comparison matters whatever your hardware: Mac Studio M4 Max vs M3 Ultra for unified memory, AMD Strix Halo Ryzen AI Max 395 on the iGPU side, or NVIDIA DGX Spark for the desktop.

Quantization: the right format for the right engine

An often-ignored 2026 rule of thumb: GGUF runs in llama.cpp, not in vLLM. In vLLM, GGUF preserves quality well (6.74 perplexity) but caps at ~93 tok/s. On GPU, the best formats for vLLM are:

  • FP8 (W8A8) on Hopper/Ada (H100, RTX 40xx): -2x memory, up to +1.6x throughput, minimal accuracy loss. The default on modern hardware.
  • AWQ INT4 with Marlin kernels: the best quality/speed trade-off. Marlin delivers up to 10.9x speedup on AWQ and 2.6x on GPTQ. The Marlin-AWQ combo hits 741 tok/s with the best quality (51.8% Pass@1) in tests.
  • GPTQ INT4: slightly below AWQ on quality, but a huge ecosystem of pre-quantized models.

On the llama.cpp/GGUF side, the sweet spot remains Q4_K_M for CPU decode. To fit a ~50B on 16 GB of VRAM, drop to IQ3_S or Q3_K_M with CPU offload; for near-lossless, Q8_0. And remember: bits aren't everything — a well-designed Q4_K_M beats a naive legacy 5-bit format.

OpenAI-compatible API: portability as insurance

Good news: all five expose an OpenAI-compatible endpoint (/v1/chat/completions, /v1/completions). Migrating from one engine to another often touches only a BASE_URL environment variable. The remaining friction is model format (Ollama uses tags like llama3:8b, vLLM uses Hugging Face repo IDs like meta-llama/Llama-3.1-8B-Instruct; you can't point vLLM at a .gguf) and completeness: vLLM's implementation is the most complete (logprobs, structured output via guided decoding). To wire all this into your dev tools, see integrating Ollama with Codex, Claude and OpenClaw locally.

Verdict: the right tool for the right tier

  • Ollama — a model running in under 5 minutes, unbeatable DX, containerizable headless. Perfect for dev/staging and internal tools. Does not scale past ~5-8 users.
  • LM Studio — the GUI on-ramp: a Mac with MLX without touching a terminal, or Windows/AMD iGPU where its Vulkan offload beats Ollama's CUDA-only fast path. Caveat: the GUI app is closed-source (a blocker for sovereign/audited deployments — Jan.ai as an alternative).
  • llama.cpp — CPU, edge, hybrid CPU+GPU, full control of every flag, exotic hardware. The low-level Swiss Army knife.
  • vLLM — the production default. When in doubt, it's the one: the most deployed, the most hardware-compatible, 3-20x Ollama's throughput under concurrency. The only real blind spot = prompts >200k tokens.
  • TGI — avoid for any new project (in maintenance since March 2026).

The winning pattern for mature teams: Ollama/LM Studio locally and in CI, vLLM in production, all behind the same OpenAI protocol. One pitfall to watch: quantization drift. Your dev Ollama is probably running Q4 4-bit, your prod vLLM FP16 or FP8 — behavior differs subtly. Pin the same quant on both sides. And always challenge these numbers: benchmark on your own hardware, your own model, your real concurrency.

Going further: our pick of the new local LLM models for 2026, the Mistral 3, the European Apache 2 open source model breakdown, and our guide to self-hosting DeepSeek as open source to pick the model before the engine.

Did you enjoy this article?

Comments

Morgann Riu

Cybersecurity and Linux administration expert. I help companies secure and optimize their critical infrastructures.

Back to the blog

Checklist Sécurité Linux

30 points essentiels pour sécuriser un serveur Linux. Recevez aussi les nouveaux tutoriels par email.

Pas de spam. Désabonnement en 1 clic.