In just a few months, the language model landscape has been turned upside down. DeepSeek, a Chinese lab still largely unknown in early 2025, released models that rival GPT-4 and Claude across many benchmarks while being fully open source. This democratization is a game-changer for system administrators and developers who want to control their own AI infrastructure without depending on proprietary APIs.
This article takes stock of the open source LLM ecosystem in 2026, compares the main options available, and walks you step by step through hosting your own model on your server.
DeepSeek: the earthquake from China
DeepSeek made a major impact with two flagship models that reshuffled the deck.
DeepSeek V3: the formidable generalist
DeepSeek V3 is a massive Mixture-of-Experts (MoE) model with 671 billion parameters, of which only 37 billion are activated for each token. This architecture delivers performance close to GPT-4 while drastically reducing inference costs.
The key characteristics of V3:
- 671B total parameters, 37B activated per token (MoE architecture)
- 128K-token context window
- Performance comparable to GPT-4o on most benchmarks
- Reported cost of the final training run of just 5.5 million dollars (versus hundreds of millions for equivalent Western models)
- Permissive open source license allowing commercial use
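To make the MoE figures above concrete, here is a minimal sketch of the share of parameters that is actually used for each token. The function name `moe_active_percent` is hypothetical, chosen for this illustration only.

```shell
# Share of parameters activated per token in a Mixture-of-Experts model.
# Usage: moe_active_percent <active_params_billions> <total_params_billions>
# (hypothetical helper, not part of any DeepSeek tooling)
moe_active_percent() {
    awk -v active="$1" -v total="$2" 'BEGIN { printf "%.1f\n", active / total * 100 }'
}

# DeepSeek V3: 37B active out of 671B total
moe_active_percent 37 671
```

Only a small fraction of the weights takes part in each forward pass, which is where the inference savings come from. All 671B parameters still have to be held in memory, so the hardware footprint remains that of a very large model.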
DeepSeek R1: reasoning within everyone's reach
DeepSeek R1 is DeepSeek's reasoning model, comparable to OpenAI's o1 family. Its distinctive feature is that it "thinks out loud," breaking down complex problems into intermediate steps before producing a final answer.
R1 excels in particular at mathematics, programming and logical reasoning. And unlike o1, it is fully open source, which means you can host and adapt it without restriction.
The rise of open source Chinese LLMs
DeepSeek is not an isolated case. The Chinese open source LLM ecosystem has grown considerably stronger, challenging American and European dominance.
Qwen (Alibaba) offers models ranging from 0.5B to 72B parameters, with variants specialized for code (Qwen Coder) and vision (Qwen VL). Yi (01.AI), founded by Kai-Fu Lee, provides high-performing models under the Apache 2.0 license. These labs benefit from access to enormous volumes of training data in Chinese and English, as well as massive investment.
This competition is good for the entire community: it pushes performance higher while making models more accessible. For a European system administrator, it means a wider choice and higher-quality models to host locally.
Open source LLM comparison in 2026
Here is an overview of the main models available for self-hosting, along with their strengths and preferred use cases.
DeepSeek R1 and V3
- Strengths: unbeatable performance-to-cost ratio, excellent at reasoning (R1), efficient MoE architecture (V3)
- Weaknesses: large size for V3 (requires a multi-GPU cluster for the full model), newer community
- Ideal for: complex reasoning, code analysis, tasks that require step-by-step thinking
Meta Llama 3.x
- Strengths: mature ecosystem, large community, excellent documentation, models available from 1B to 405B
- Weaknesses: more restrictive license than the Chinese competitors (Community License with a 700M monthly-users threshold)
- Ideal for: general-purpose use, chatbots, content generation, rapid prototyping
Mistral and Mixtral
- Strengths: high-quality models from a French lab, excellent size-to-performance ratio, MoE architecture with Mixtral, strong command of French
- Weaknesses: narrower lineup than Meta or DeepSeek
- Ideal for: French-language applications, deployments on limited hardware
Qwen 2.5
- Strengths: very broad range of sizes (0.5B to 72B), specialized variants (code, vision, audio), Apache 2.0 license
- Weaknesses: performance slightly below the best at comparable sizes
- Ideal for: multimodal applications, code assistants, embedded deployments (small models)
Why host your own LLM?
Using a proprietary API like those from OpenAI or Anthropic is the simplest solution, but self-hosting offers decisive advantages in certain contexts.
Data sovereignty and confidentiality
When you send a request to a cloud API, your data travels through the provider's servers. For sensitive data (medical, legal, industrial), this can raise GDPR compliance or professional-secrecy concerns. With a self-hosted LLM, your data never leaves your own infrastructure.
Control and customization
A self-hosted model can be fine-tuned on your own data, adapted to your domain vocabulary, and configured precisely to your needs. You control the inference parameters, context management and the model's behavior down to the smallest detail.
Availability and latency
No dependence on a third-party service's availability. No rate limiting. No network round trip to a distant datacenter: requests stay on your local network, so network overhead drops to a few milliseconds and response time depends mainly on your GPU.
Hardware requirements: GPU, VRAM and quantization
Hardware choice is the most defining decision for self-hosting. VRAM (the GPU's video memory) is the main limiting factor.
Rule of thumb for VRAM
In FP16 (half precision), a model requires roughly 2 GB of VRAM per billion parameters. A 7B-parameter model therefore needs about 14 GB of VRAM, and a 70B model about 140 GB.
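The rule of thumb can be sketched as a small shell function. This is a rough estimate under stated assumptions: it counts the weights only, and the KV cache and runtime overhead come on top. The name `estimate_weights_gb` is hypothetical.

```shell
# Rough VRAM needed for the model weights alone, in GB.
# Usage: estimate_weights_gb <params_in_billions> <bits_per_weight>
# Assumption: weights only; KV cache and runtime overhead are not modeled.
estimate_weights_gb() {
    awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

estimate_weights_gb 7 16     # 7B model in FP16 (16 bits per weight)
estimate_weights_gb 70 16    # 70B model in FP16
```

Passing a lower number of bits per weight gives the corresponding estimate for a reduced-precision model, for example `estimate_weights_gb 70 4`.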
Quantization: fitting an elephant into a Smart car
Quantization reduces the precision of the model's weights to shrink its memory footprint. Three main formats coexist in 2026:
- GGUF (llama.cpp): the most versatile format, supporting hybrid CPU+GPU execution. Quantizations from Q2_K (2 bits) to Q8_0 (8 bits). Ideal for setups with little VRAM.
- GPTQ: GPU-only quantization, very fast at inference. Requires a compatible GPU. Available in 4-bit and 8-bit.
- AWQ (Activation-Aware Weight Quantization): more recent, offering a better quality/compression trade-off than GPTQ. Preferred for models deployed in production.
Recommended hardware configurations
Here are typical configurations depending on the size of model you're targeting:
- 7-8B models (Q4): 6 GB VRAM minimum. An NVIDIA RTX 3060 12 GB or RTX 4060 Ti 16 GB is enough. Cost: roughly 300 to 400 euros used.
- 13-14B models (Q4): 10 GB VRAM minimum. RTX 3090 24 GB or RTX 4070 Ti Super recommended. Cost: 500 to 900 euros.
- 32-34B models (Q4): 20 GB VRAM minimum. An RTX 3090 24 GB is possible but tight. Ideally an RTX 4090 24 GB or two GPUs in parallel.
- 70B models (Q4): 40 GB VRAM minimum. Requires multi-GPU or professional GPUs (A6000 48 GB, A100 80 GB). Cost: 2000 euros and up.
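The thresholds in this list can be turned into a quick lookup. The sketch below simply encodes the minimum VRAM figures given above; `suggest_model_class` is a hypothetical helper and the cut-offs are indicative, not hard limits.

```shell
# Map available VRAM (whole GB) to the largest Q4 model class
# from the list above. Thresholds are the article's minimums.
# Usage: suggest_model_class <vram_gb>
suggest_model_class() {
    if   [ "$1" -ge 40 ]; then echo "70B"
    elif [ "$1" -ge 20 ]; then echo "32-34B"
    elif [ "$1" -ge 10 ]; then echo "13-14B"
    elif [ "$1" -ge 6 ];  then echo "7-8B"
    else echo "below 7B"
    fi
}

suggest_model_class 24   # e.g. an RTX 3090 or RTX 4090
suggest_model_class 12   # e.g. an RTX 3060 12 GB
```

On a machine with the NVIDIA driver installed, the total VRAM can be read with `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, which reports MiB; divide by 1024 before passing the value in.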
Deployment tools: Ollama for local inference, vLLM and llama.cpp
Ollama: simplicity first
Ollama is the most approachable tool to get started. It automatically handles model downloads (pre-quantized by default) and exposes a local HTTP API with OpenAI-compatible endpoints.
# Installing Ollama on Linux
curl -fsSL https://ollama.com/install.sh | sh
# Download and run distilled DeepSeek R1 (7B)
ollama run deepseek-r1:7b
# Download Mistral 7B
ollama run mistral
# List installed models
ollama list
# Query Ollama's native REST API
# Ollama listens by default on http://localhost:11434
# (the OpenAI-compatible routes are served under /v1)
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "Explain the principle of least privilege in IT security.",
  "stream": false
}'
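Ollama also serves OpenAI-style routes under `/v1`, which is what most OpenAI client libraries expect. The sketch below builds a minimal chat payload and shows, in comments, how it could be sent. The helper `chat_payload` is hypothetical, and it assumes the model name and prompt contain no double quotes or backslashes, since no JSON escaping is performed.

```shell
# Build a minimal chat-completions payload.
# Usage: chat_payload <model> <prompt>
# Assumption: neither argument contains double quotes or backslashes
# (this sketch does no JSON escaping).
chat_payload() {
    printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}\n' "$1" "$2"
}

chat_payload "deepseek-r1:7b" "Hello, who are you?"

# With Ollama running locally, send it to the OpenAI-compatible endpoint:
# curl http://localhost:11434/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$(chat_payload "deepseek-r1:7b" "Hello, who are you?")"
```

For prompts with arbitrary content, a JSON-aware tool is a safer way to build the payload than `printf`.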
vLLM: performance in production
vLLM is an inference engine optimized for throughput. Its PagedAttention system manages GPU memory far more efficiently than naive solutions, allowing it to serve more concurrent requests.
# Installing vLLM
pip install vllm
# Launch an OpenAI-compatible inference server
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --dtype auto \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --port 8000
# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "temperature": 0.7
  }'
llama.cpp: maximum flexibility
llama.cpp is the founding project of LLM inference on consumer hardware. Written in C/C++, it offers unmatched flexibility and supports hybrid CPU+GPU execution, which makes it possible to run models larger than VRAM alone would allow.
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
# Launch a server with a GGUF model
./build/bin/llama-server \
  -m ./models/deepseek-r1-7b-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 35 \
  -c 4096
# -ngl 35: number of layers offloaded to the GPU
# -c 4096: context size
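When the model does not fit entirely in VRAM, a starting value for `-ngl` can be estimated from the GGUF file size. This is a rough sketch, assuming all layers are the same size and that the VRAM budget you pass already leaves room for the KV cache; `estimate_ngl` and the numbers in the example are hypothetical.

```shell
# Estimate how many layers fit in a given VRAM budget.
# Usage: estimate_ngl <model_file_mib> <total_layers> <vram_budget_mib>
# Assumptions: layers are equally sized; the budget excludes KV cache
# and other runtime buffers.
estimate_ngl() {
    awk -v size="$1" -v layers="$2" -v budget="$3" 'BEGIN {
        per_layer = size / layers          # approximate MiB per layer
        n = int(budget / per_layer)        # whole layers that fit
        if (n > layers) n = layers         # cannot offload more than exist
        print n
    }'
}

# Hypothetical example: 4500 MiB file, 32 layers, 3000 MiB of VRAM to spend
estimate_ngl 4500 32 3000
```

Treat the result as a first guess: start from it, watch the memory reported by `nvidia-smi`, and adjust `-ngl` up or down.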
Docker deployment for LLM inference
For a clean, reproducible deployment, Docker is the recommended solution. Check out our Docker tutorial if you're new to containerization.
# docker-compose.yml for Ollama with a web interface
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - openwebui_data:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped
volumes:
  ollama_data:
  openwebui_data:
# Launch the stack
docker compose up -d
# Download a model inside the Ollama container
docker exec ollama ollama pull deepseek-r1:7b
docker exec ollama ollama pull mistral
# Check that everything works
curl http://localhost:11434/api/tags
GPU access from containers requires the nvidia-container-toolkit on the host. Without it, models will run on CPU only, which will be extremely slow. See our Docker tutorial for the GPU configuration.
Cost analysis: cloud API vs self-hosting
The choice between a cloud API and self-hosting depends on your usage volume.
Cloud API (indicative prices, early 2026)
- GPT-4o: roughly 2.50 dollars per million input tokens, 10 dollars output
- Claude Sonnet: roughly 3 dollars per million input tokens, 15 dollars output
- DeepSeek API: roughly 0.14 dollar per million input tokens, 0.28 dollar output (the cheapest on the market)
Self-hosting (amortized cost)
- RTX 4090 (24 GB): roughly 1800 euros. Lets you run 7-14B models with fast inference.
- Power consumption: roughly 350W under load, i.e. about 50 euros per month under continuous operation (at French electricity rates).
- Amortized over 3 years: roughly 50 euros per month for the GPU + 50 euros of electricity = 100 euros per month all in.
The break-even point is generally around 50 to 100 million tokens per month, though it shifts with the API you compare against and your ratio of input to output tokens. Below that, a cloud API (especially the DeepSeek API) is more economical. Above it, self-hosting becomes clearly advantageous.
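The arithmetic behind these figures can be sketched as follows. Several inputs are assumptions made for this illustration and not figures from the article: an electricity rate of 0.20 euro per kWh, a 30-day month, an even split between input and output tokens, and euro/dollar parity. The function names are hypothetical.

```shell
# Monthly self-hosting cost: hardware amortization plus electricity.
# Usage: selfhost_monthly <gpu_price_eur> <months> <watts> <eur_per_kwh>
selfhost_monthly() {
    awk -v price="$1" -v months="$2" -v watts="$3" -v rate="$4" 'BEGIN {
        hardware = price / months
        power = watts / 1000 * 24 * 30 * rate   # kWh over a 30-day month
        printf "%.0f\n", hardware + power
    }'
}

# Break-even volume, in millions of tokens per month, against an API.
# Usage: breakeven_mtok <monthly_cost> <price_in> <price_out> <input_share>
# input_share is the fraction of tokens that are input (0 to 1).
breakeven_mtok() {
    awk -v cost="$1" -v pin="$2" -v pout="$3" -v share="$4" 'BEGIN {
        blended = pin * share + pout * (1 - share)   # price per million tokens
        printf "%.0f\n", cost / blended
    }'
}

selfhost_monthly 1800 36 350 0.20    # RTX 4090 amortized over 3 years
breakeven_mtok 100 2.50 10 0.5       # against the GPT-4o prices above
breakeven_mtok 100 0.14 0.28 0.5     # against the DeepSeek API prices above
```

Under these assumptions the break-even volume varies widely with the reference API, so it is worth rerunning the calculation with your own prices and token mix before deciding. The sketch also ignores the quality gap between a self-hosted 7-14B model and a frontier API model.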
Self-hosting security: points of vigilance
Hosting an LLM on your server brings new security responsibilities.
Never expose the API directly on the Internet
# BAD PRACTICE: Ollama exposed on all interfaces
# OLLAMA_HOST=0.0.0.0 ollama serve
# GOOD PRACTICE: listen on localhost only (Ollama's default)
# and use a reverse proxy with authentication
# Nginx configuration as a reverse proxy
# /etc/nginx/sites-available/ollama
# server {
#     listen 443 ssl;
#     server_name llm.mydomain.com;
#     # Certificate paths are placeholders, adjust them to your setup
#     ssl_certificate     /etc/ssl/certs/llm.mydomain.com.pem;
#     ssl_certificate_key /etc/ssl/private/llm.mydomain.com.key;
#
#     auth_basic "LLM Access";
#     auth_basic_user_file /etc/nginx/.htpasswd;
#
#     location / {
#         proxy_pass http://127.0.0.1:11434;
#         proxy_set_header Host $host;
#     }
# }
# Create the password file
sudo apt install apache2-utils
sudo htpasswd -c /etc/nginx/.htpasswd admin
Firewall and network isolation
Apply the same principles as for any sensitive service. Our UFW guide covers the configuration in detail. Here is a summary for an LLM server:
# Block everything by default
sudo ufw default deny incoming
sudo ufw default deny outgoing
# SSH only
sudo ufw allow in 22/tcp
# HTTPS for the reverse proxy
sudo ufw allow in 443/tcp
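# Allow updates and model downloads
sudo ufw allow out 443/tcp
# Many apt mirrors are served over plain HTTP
sudo ufw allow out 80/tcp
# DNS (UDP, with TCP as fallback for large responses)
sudo ufw allow out 53/udp
sudo ufw allow out 53/tcp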
sudo ufw enable
Resource monitoring
# Monitor GPU usage in real time
watch -n 1 nvidia-smi
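# Alert script if VRAM exceeds 90% (save it as a separate file and run it from cron)
#!/bin/bash
# Placeholder address, replace with your own
ALERT_EMAIL="admin@example.com"
# Query the first GPU only, so the arithmetic below gets a single value
VRAM_USAGE=$(nvidia-smi -i 0 --query-gpu=memory.used --format=csv,noheader,nounits)
VRAM_TOTAL=$(nvidia-smi -i 0 --query-gpu=memory.total --format=csv,noheader,nounits)
PERCENT=$((VRAM_USAGE * 100 / VRAM_TOTAL))
if [ "$PERCENT" -gt 90 ]; then
    echo "ALERT: VRAM at ${PERCENT}%" | mail -s "GPU Alert" "$ALERT_EMAIL"
fi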
Conclusion
The arrival of DeepSeek and the rise of open source LLMs have profoundly changed the self-hosting equation. It is now realistic to host a professional-grade model on a server costing under 2000 euros, with performance sufficient for most enterprise use cases.
For system administrators familiar with Docker and Linux administration, deploying a local LLM is no longer an insurmountable challenge. Ollama has made the process as simple as a docker compose up.
The questions to ask yourself before getting started: how many tokens do you process per month? Is your data so sensitive that it cannot travel through a cloud API? Do you have the GPU hardware you need, or the budget to acquire it? If you answer "a lot," "yes" and "yes," self-hosting is for you.
Start small with a 7B model on Ollama, validate your use case, then scale up gradually. And don't forget to secure your deployment: an LLM exposed on the Internet without authentication is an invitation to abuse.