DeepSeek and Open Source LLMs: Hosting Your Own AI in 2026

DeepSeek, Llama, Mistral: open source LLMs now rival proprietary models. A guide to understanding the ecosystem and hosting your own AI.

In just a few months, the language model landscape has been turned upside down. DeepSeek, a Chinese lab still largely unknown in early 2025, released models that rival GPT-4 and Claude across many benchmarks while being fully open source. This democratization is a game-changer for system administrators and developers who want to control their own AI infrastructure without depending on proprietary APIs.

This article takes stock of the open source LLM ecosystem in 2026, compares the main options available, and walks you step by step through hosting your own model on your server.

DeepSeek: the earthquake from China

DeepSeek made a major impact with two flagship models that reshuffled the deck.

DeepSeek V3: the formidable generalist

DeepSeek V3 is a massive Mixture-of-Experts (MoE) model with 671 billion parameters, of which only 37 billion are activated for each token. This architecture delivers performance close to GPT-4 while drastically reducing inference costs.

The key characteristics of V3:

  • 671B total parameters, 37B activated per token (MoE architecture)
  • 128K-token context window
  • Performance comparable to GPT-4o on most benchmarks
  • Reported training cost of roughly 5.5 million dollars for the final training run (versus hundreds of millions for equivalent Western models)
  • Permissive open source license allowing commercial use
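
To put the MoE figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the two parameter counts quoted above and the usual approximation of 2 bytes per parameter in FP16; the function name is purely illustrative.

```python
def moe_summary(total_params_b: float, active_params_b: float) -> dict:
    """Summarize an MoE model from its total and per-token active parameters (in billions)."""
    return {
        # Share of the weights actually used to compute each token
        "active_fraction": active_params_b / total_params_b,
        # Every expert must still be stored: about 2 bytes per parameter in FP16
        "fp16_weights_gb": total_params_b * 2,
    }

summary = moe_summary(671, 37)
print(f"Active per token: {summary['active_fraction']:.1%}")
print(f"FP16 weights: about {summary['fp16_weights_gb']:.0f} GB")
```

Only a small share of the weights is used for each token, which is where the inference savings come from. The full set of experts still has to be held in memory, which is why the complete V3 model remains out of reach for a single consumer GPU.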

DeepSeek R1: reasoning within everyone's reach

DeepSeek R1 is DeepSeek's reasoning model, comparable to OpenAI's o1 family. Its distinctive feature is that it "thinks out loud," breaking down complex problems into intermediate steps before producing a final answer.

R1 excels in particular at mathematics, programming and logical reasoning. And unlike o1, it is fully open source, which means you can host and adapt it without restriction.
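
In practice, R1 and its distilled variants usually emit this reasoning between `<think>` and `</think>` tags ahead of the final answer. The sketch below separates the two parts. The function name and the sample text are illustrative, and the exact output format may vary with the serving tool and its version.

```python
import re

# Non-greedy match so that only the first reasoning block is captured
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(raw_output: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw R1-style completion."""
    match = THINK_BLOCK.search(raw_output)
    if match is None:
        # No reasoning block found: treat the whole text as the answer
        return "", raw_output.strip()
    reasoning = match.group(1).strip()
    answer = raw_output[match.end():].strip()
    return reasoning, answer

sample = "<think>\n2 + 2 is 4, then times 3 is 12.\n</think>\n\nThe result is 12."
reasoning, answer = split_reasoning(sample)
print(answer)
```

Keeping the reasoning separate makes it easy to log it for debugging while showing only the final answer to end users.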

Good to know: DeepSeek has also released distilled versions of R1 (1.5B, 7B, 8B, 14B, 32B, 70B parameters) based on Qwen and Llama architectures. These lighter versions are ideal for self-hosting on modest hardware.

The rise of open source Chinese LLMs

DeepSeek is not an isolated case. The Chinese open source LLM ecosystem has grown considerably stronger, challenging American and European dominance.

Qwen (Alibaba) offers models ranging from 0.5B to 72B parameters, with variants specialized for code (Qwen Coder) and vision (Qwen VL). Yi (01.AI), founded by Kai-Fu Lee, provides high-performing models under the Apache 2.0 license. These labs benefit from access to enormous volumes of training data in Chinese and English, as well as massive investment.

This competition is good for the entire community: it pushes performance higher while making models more accessible. For a European system administrator, it means a wider choice and higher-quality models to host locally.

Open source LLM comparison in 2026

Here is an overview of the main models available for self-hosting, along with their strengths and preferred use cases.

DeepSeek R1 and V3

  • Strengths: unbeatable performance-to-cost ratio, excellent at reasoning (R1), efficient MoE architecture (V3)
  • Weaknesses: large size for V3 (requires a multi-GPU cluster for the full model), newer community
  • Ideal for: complex reasoning, code analysis, tasks that require step-by-step thinking

Meta Llama 3.x

  • Strengths: mature ecosystem, large community, excellent documentation, models available from 1B to 405B
  • Weaknesses: more restrictive license than the Chinese competitors (Community License with a 700M monthly-users threshold)
  • Ideal for: general-purpose use, chatbots, content generation, rapid prototyping

Mistral and Mixtral

  • Strengths: high-quality models from a French company, excellent size-to-performance ratio, MoE architecture with Mixtral, strong command of the French language
  • Weaknesses: narrower lineup than Meta or DeepSeek
  • Ideal for: French-language applications, tasks requiring a good level of French, deployments on limited hardware

Qwen 2.5

  • Strengths: very broad range of sizes (0.5B to 72B), specialized variants (code, vision, audio), Apache 2.0 license
  • Weaknesses: performance slightly below the best at comparable sizes
  • Ideal for: multimodal applications, code assistants, embedded deployments (small models)

Why host your own LLM?

Using a proprietary API like those from OpenAI or Anthropic is the simplest solution, but self-hosting offers decisive advantages in certain contexts.

Data sovereignty and confidentiality

When you send a request to a cloud API, your data travels through the provider's servers. For sensitive data (medical, legal, industrial), this can raise GDPR compliance or professional-secrecy concerns. With a self-hosted LLM, your data never leaves your own infrastructure.

Control and customization

A self-hosted model can be fine-tuned on your own data, adapted to your domain vocabulary, and configured precisely to your needs. You control the inference parameters, context management and the model's behavior down to the smallest detail.

Availability and latency

No dependence on a third-party service's availability. No rate limiting. No round trip to a distant datacenter: requests stay on your local network, so network latency becomes negligible and response time depends only on your own hardware.

Hardware requirements: GPU, VRAM and quantization

Hardware choice is the most defining decision for self-hosting. VRAM (the GPU's video memory) is the main limiting factor.

Rule of thumb for VRAM

In FP16 (half precision), a model requires roughly 2 GB of VRAM per billion parameters. A 7B-parameter model therefore needs about 14 GB of VRAM, and a 70B model about 140 GB.
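
The rule of thumb is easy to turn into a small helper. This is a minimal sketch that counts the weights only: the KV cache and activations come on top, so treat the result as a lower bound. The function name is an illustrative choice.

```python
def weights_vram_gb(params_billions: float, bits_per_weight: float = 16) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billions * bytes_per_weight

for size in (7, 14, 70):
    print(f"{size}B in FP16: ~{weights_vram_gb(size):.0f} GB")
```

Passing a lower `bits_per_weight` (for example 8 or 4) gives a first estimate of what the same model needs at reduced precision.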

Quantization: fitting an elephant into a Smart car

Quantization reduces the precision of the model's weights to shrink its memory footprint. Three main formats coexist in 2026:

  • GGUF (llama.cpp): the most versatile format, supporting hybrid CPU+GPU execution. Quantizations from Q2_K (2 bits) to Q8_0 (8 bits). Ideal for setups with little VRAM.
  • GPTQ: GPU-only quantization, very fast at inference. Requires a compatible GPU. Available in 4-bit and 8-bit.
  • AWQ (Activation-Aware Weight Quantization): more recent, offering a better quality/compression trade-off than GPTQ. Preferred for models deployed in production.

Recommendation: for general use, Q4_K_M quantization in GGUF format offers the best trade-off between quality and size. The quality loss is small compared to the FP16 model, while the required VRAM drops by a factor of roughly three to four.
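
To make the idea of "reducing precision" concrete, here is a toy sketch of symmetric round-to-nearest quantization to 4-bit integers with a single shared scale. It is far simpler than the block-wise schemes actually used by GGUF, GPTQ or AWQ and is only meant as an illustration; the function names and sample weights are made up.

```python
def quantize_symmetric(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Map floats to signed integers of the given width, sharing one scale factor."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]

weights = [0.70, -0.33, 0.12, 0.02]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max error: {max_error:.3f}")
```

Each weight is stored as a small integer plus a shared scale, and the rounding error stays within half a quantization step. Real formats keep the error lower by using one scale per small block of weights rather than one for the whole tensor.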

Recommended hardware configurations

Here are typical configurations depending on the size of model you're targeting:

  • 7-8B models (Q4): 6 GB VRAM minimum. An NVIDIA RTX 3060 12 GB or RTX 4060 Ti 16 GB is enough. Cost: roughly 300 to 400 euros used.
  • 13-14B models (Q4): 10 GB VRAM minimum. RTX 3090 24 GB or RTX 4070 Ti Super recommended. Cost: 500 to 900 euros.
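  • 32-34B models (Q4): 20 GB VRAM minimum. A 24 GB card such as the RTX 3090 works but is tight, since little room remains for context. An RTX 4090 offers the same 24 GB with faster inference; for more headroom, use two GPUs in parallel.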
  • 70B models (Q4): 40 GB VRAM minimum. Requires multi-GPU or professional GPUs (A6000 48 GB, A100 80 GB). Cost: 2000 euros and up.
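
The list above can be encoded as a small lookup to check quickly what a given card can host. This sketch simply reuses the minimum VRAM figures quoted above; the dictionary and function names are illustrative.

```python
# Minimum VRAM (GB) per model class at Q4, as listed above
MIN_VRAM_GB_Q4 = {"7-8B": 6, "13-14B": 10, "32-34B": 20, "70B": 40}

def models_that_fit(vram_gb: float) -> list[str]:
    """Return the model classes whose Q4 minimum fits in the given VRAM."""
    return [name for name, needed in MIN_VRAM_GB_Q4.items() if vram_gb >= needed]

print(models_that_fit(12))  # for example an RTX 3060 12 GB
print(models_that_fit(24))  # for example an RTX 3090 or RTX 4090
```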

Deployment tools: Ollama for local inference, vLLM and llama.cpp

Ollama: simplicity first

Ollama is the most approachable tool to get started. It automatically downloads models, already quantized by default, and exposes an OpenAI-compatible API alongside its native one.

# Installing Ollama on Linux
curl -fsSL https://ollama.com/install.sh | sh

# Download and run distilled DeepSeek R1 (7B)
ollama run deepseek-r1:7b

# Download and run Mistral 7B
ollama run mistral

# List installed models
ollama list

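# Query the native Ollama API (an OpenAI-compatible API is also served under /v1)
# Ollama listens by default on http://localhost:11434
# "stream": false returns a single JSON object instead of a stream of chunks
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "Explain the principle of least privilege in IT security.",
  "stream": false
}'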
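
The same call can be made from a script. Below is a minimal sketch using only the Python standard library against Ollama's native `/api/generate` endpoint; the helper names are illustrative, and it assumes the default address shown above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama address

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Prepare a non-streaming POST request for Ollama's native generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text (needs a running Ollama server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running, `generate("deepseek-r1:7b", "Explain the principle of least privilege in IT security.")` should return the model's full answer as a string.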

vLLM: performance in production

vLLM is an inference engine optimized for throughput. Its PagedAttention system manages GPU memory far more efficiently than naive solutions, allowing it to serve more concurrent requests.

# Installing vLLM
pip install vllm

# Launch an OpenAI-compatible inference server
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --dtype auto \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --port 8000

# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "temperature": 0.7
  }'
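
Because the server speaks the OpenAI chat completions format, a client only needs to build the same JSON body and read `choices[0].message.content` from the response. The following standard-library sketch shows one way to do it; the function names are illustrative and the base URL is assumed to match the port chosen above.

```python
import json
import urllib.request

def build_chat_payload(model: str, user_message: str, temperature: float = 0.7) -> dict:
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }

def extract_reply(response_body: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion response."""
    return response_body["choices"][0]["message"]["content"]

def chat(base_url: str, model: str, user_message: str, timeout: float = 120.0) -> str:
    """Send one chat request (needs the inference server to be running)."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, user_message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.loads(response.read()))
```

For example, `chat("http://localhost:8000", "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", "Hello, who are you?")` mirrors the curl test. Since the format is shared, the same client should also work against other OpenAI-compatible servers by changing the base URL.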

llama.cpp: maximum flexibility

llama.cpp is the founding project of LLM inference on consumer hardware. Written in C/C++, it offers unmatched flexibility and supports hybrid CPU+GPU execution, which makes it possible to run models larger than VRAM alone would allow.

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Launch a server with a GGUF model
./build/bin/llama-server \
  -m ./models/deepseek-r1-7b-q4_k_m.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  -ngl 35 \
  -c 4096

# --host 127.0.0.1: listen on localhost only (put a reverse proxy in front for remote access)
# -ngl 35: number of layers offloaded to the GPU
# -c 4096: context size
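
Choosing a value for `-ngl` is mostly a matter of how many layers fit in the available VRAM. The sketch below gives a rough starting point under two simplifying assumptions: all layers are about the same size, and a fixed reserve is kept for the context and working buffers. The function name, the reserve value and the example figures are hypothetical, so adjust them after watching actual memory usage.

```python
def layers_to_offload(model_file_gb: float, n_layers: int,
                      free_vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Estimate a starting -ngl value, assuming layers of roughly equal size."""
    per_layer_gb = model_file_gb / n_layers
    # Keep some VRAM aside for the KV cache and compute buffers
    budget_gb = max(0.0, free_vram_gb - reserve_gb)
    return min(n_layers, int(budget_gb // per_layer_gb))

# Hypothetical example: a 20 GB GGUF file with 64 layers on a 12 GB card
print(layers_to_offload(20, 64, 12))
```

If the server runs out of memory at startup, lower the value; if VRAM is left unused, raise it. Layers that are not offloaded run on the CPU, which is what makes hybrid execution possible.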

Docker deployment for LLM inference

For a clean, reproducible deployment, Docker is the recommended solution. Check out our Docker tutorial if you're new to containerization.

# docker-compose.yml for Ollama with a web interface
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    ports:
      # Publish on localhost only: ports published by Docker are not filtered by UFW
      - "127.0.0.1:11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - openwebui_data:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  openwebui_data:

# Launch the stack
docker compose up -d

# Download a model inside the Ollama container
docker exec ollama ollama pull deepseek-r1:7b
docker exec ollama ollama pull mistral

# Check that everything works
curl http://localhost:11434/api/tags

Docker GPU prerequisite: to use your NVIDIA GPU inside Docker, you must install the nvidia-container-toolkit. Without it, containers cannot access the GPU: at best, models fall back to the CPU, which is extremely slow. See our Docker tutorial for the GPU configuration.
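
The `/api/tags` check can be automated, for instance in a deployment script that verifies the expected models were pulled. This sketch parses the JSON body returned by that endpoint; the sample body is trimmed to the `name` field, and the expected model list is a made-up example.

```python
import json

def installed_models(tags_body: str) -> list[str]:
    """Extract model names from the JSON returned by Ollama's /api/tags endpoint."""
    data = json.loads(tags_body)
    return [model["name"] for model in data.get("models", [])]

def missing_models(tags_body: str, expected: list[str]) -> list[str]:
    """Return the expected models that are not installed yet."""
    present = set(installed_models(tags_body))
    return [name for name in expected if name not in present]

# Trimmed sample response (a real one carries more fields per model)
sample = '{"models": [{"name": "deepseek-r1:7b"}, {"name": "mistral:latest"}]}'
print(missing_models(sample, ["deepseek-r1:7b", "mistral:latest", "qwen2.5:7b"]))
```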

Cost analysis: cloud API vs self-hosting

The choice between a cloud API and self-hosting depends on your usage volume.

Cloud API (indicative prices, early 2026)

  • GPT-4o: roughly 2.50 dollars per million input tokens, 10 dollars output
  • Claude Sonnet: roughly 3 dollars per million input tokens, 15 dollars output
  • DeepSeek API: roughly 0.14 dollar per million input tokens, 0.28 dollar output (the cheapest on the market)

Self-hosting (amortized cost)

  • RTX 4090 (24 GB): roughly 1800 euros. Lets you run 7-14B models with fast inference.
  • Power consumption: roughly 350W under load, i.e. about 50 euros per month under continuous operation (at French electricity rates).
  • Amortized over 3 years: roughly 50 euros per month for the GPU + 50 euros of electricity = 100 euros per month all in.

As an order of magnitude, the break-even point is around 50 to 100 million tokens per month, though it shifts considerably with the API you compare against. Below that, a cloud API (especially the DeepSeek API) is more economical. Above it, self-hosting becomes clearly advantageous.
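
To redo this estimate with your own numbers, the sketch below computes a break-even volume from the indicative prices above. It rests on several loud assumptions that are not in the figures themselves: 75% of tokens are input tokens, one euro is treated as one dollar, and the self-hosted model is assumed to be an acceptable substitute able to absorb the volume. The function names are illustrative.

```python
def blended_price(input_price: float, output_price: float, input_share: float = 0.75) -> float:
    """API price per million tokens for a given input/output mix (assumed 75% input)."""
    return input_price * input_share + output_price * (1 - input_share)

def break_even_million_tokens(monthly_fixed_cost: float, price_per_million: float) -> float:
    """Monthly volume (in millions of tokens) at which API spend equals the fixed cost."""
    return monthly_fixed_cost / price_per_million

SELF_HOSTING_MONTHLY = 100.0  # GPU amortization + electricity, from the estimate above
API_PRICES = {  # (input, output) per million tokens, indicative
    "GPT-4o": (2.50, 10.0),
    "Claude Sonnet": (3.0, 15.0),
    "DeepSeek API": (0.14, 0.28),
}

for name, (input_price, output_price) in API_PRICES.items():
    volume = break_even_million_tokens(SELF_HOSTING_MONTHLY, blended_price(input_price, output_price))
    print(f"{name}: break-even near {volume:.0f}M tokens/month")
```

The threshold is very sensitive to the API used as the reference and to the share of output tokens, so it is worth rerunning the calculation with your real traffic mix before buying hardware.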

Self-hosting security: points of vigilance

Hosting an LLM on your server brings new security responsibilities.

Never expose the API directly on the Internet

# BAD PRACTICE: Ollama exposed on all interfaces
# OLLAMA_HOST=0.0.0.0 ollama serve

# GOOD PRACTICE: listen on localhost only
# and use a reverse proxy with authentication

# Nginx configuration as a reverse proxy
# /etc/nginx/sites-available/ollama
# server {
#     listen 443 ssl;
#     server_name llm.mydomain.com;
#
#     auth_basic "LLM Access";
#     auth_basic_user_file /etc/nginx/.htpasswd;
#
#     location / {
#         proxy_pass http://127.0.0.1:11434;
#         proxy_set_header Host $host;
#     }
# }

# Create the password file
sudo apt install apache2-utils
sudo htpasswd -c /etc/nginx/.htpasswd admin
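
Once the reverse proxy enforces basic authentication, clients must send an `Authorization` header with each request. Here is a minimal standard-library sketch of how that header is built; the helper names and the credentials used in the example are placeholders, and the hostname is the one from the Nginx configuration above.

```python
import base64
import urllib.request

def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

def authenticated_request(url: str, username: str, password: str, body: bytes) -> urllib.request.Request:
    """Prepare a POST request carrying the Basic auth credentials."""
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": basic_auth_header(username, password),
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Placeholder credentials for illustration only
request = authenticated_request("https://llm.mydomain.com/api/generate", "admin", "secret", b"{}")
print(request.get_header("Authorization"))
```

Basic authentication only encodes the credentials, it does not encrypt them, which is why the proxy must be reached over HTTPS.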

Firewall and network isolation

Apply the same principles as for any sensitive service. Our UFW guide covers the configuration in detail. Here is a summary for an LLM server:

# Block everything by default
sudo ufw default deny incoming
sudo ufw default deny outgoing

# SSH only
sudo ufw allow in 22/tcp

# HTTPS for the reverse proxy
sudo ufw allow in 443/tcp

# Allow updates and model downloads
sudo ufw allow out 443/tcp
sudo ufw allow out 53/udp
# Package mirrors configured over plain HTTP also need port 80
sudo ufw allow out 80/tcp

sudo ufw enable

Resource monitoring

# Monitor GPU usage in real time
watch -n 1 nvidia-smi

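# Alert script if VRAM exceeds 90% (save it as its own file; the shebang must be its first line)
#!/bin/bash
ALERT_EMAIL="admin@example.com"  # placeholder: replace with your own address
# One "used, total" line per GPU, so multi-GPU servers are handled too
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits |
while IFS=', ' read -r VRAM_USAGE VRAM_TOTAL; do
    PERCENT=$((VRAM_USAGE * 100 / VRAM_TOTAL))
    if [ "$PERCENT" -gt 90 ]; then
        echo "ALERT: VRAM at ${PERCENT}%" | mail -s "GPU Alert" "$ALERT_EMAIL"
    fi
done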

Conclusion

The arrival of DeepSeek and the rise of open source LLMs have profoundly changed the self-hosting equation. It is now realistic to host a professional-grade model on a server costing under 2000 euros, with performance sufficient for most enterprise use cases.

For system administrators familiar with Docker and Linux administration, deploying a local LLM is no longer an insurmountable challenge. Ollama has made the process as simple as a docker compose up.

The questions to ask yourself before getting started: how many tokens do you process per month? Is your data so sensitive that it cannot travel through a cloud API? Do you have the GPU hardware you need, or the budget to acquire it? If you answer "a lot," "yes" and "yes," self-hosting is for you.

Start small with a 7B model on Ollama, validate your use case, then scale up gradually. And don't forget to secure your deployment: an LLM exposed on the Internet without authentication is an invitation to abuse.

Morgann Riu

Cybersecurity and Linux administration expert. I help companies secure and optimize their critical infrastructures.
