Docker and Docker Compose installed on your server. Basic Linux administration knowledge (systemd, YAML configuration files). For installing Docker, see the Docker installation guide.
Why Prometheus? Pull vs Push, the real architecture
Monitoring a Linux infrastructure long came down to Nagios and its plugins, or Zabbix and its agent. In their agent-driven modes (Nagios passive checks, Zabbix active agents), each agent sends its data to a central collector: a push model. Prometheus reverses this paradigm with the pull model: the Prometheus server periodically queries each target on its /metrics endpoint.
This reversal is not cosmetic. It fundamentally changes how monitoring is operated:
- Detecting dead instances: if Prometheus cannot scrape a target, the up metric drops to 0. With push, a silent instance is hard to tell apart from one that simply has nothing to report.
- Centralized configuration: all targets are defined in Prometheus, not scattered across each agent. A single YAML file shows the entire infrastructure.
- Simple debugging: the /metrics endpoint is plain text readable with curl, with no intermediate agent to debug.
- Standardized ecosystem: the Prometheus exposition format is the basis of the OpenMetrics standard and is exposed by hundreds of applications and exporters (Nginx, PostgreSQL, Redis, Kubernetes, etc.).
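To make the pull model concrete, here is a minimal Python sketch of what a scrape amounts to: fetch a plain-text page, parse one sample per line, and record up as 1 or 0 depending on whether the fetch succeeded. The names parse_metrics, scrape and SAMPLE are illustrative, and the parser assumes lines without timestamps; it is not how Prometheus itself is implemented.

```python
def parse_metrics(text):
    """Parse a minimal subset of the Prometheus text format into a dict."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        # Assumption: no trailing timestamp, so the value is the last field.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def scrape(fetch):
    """Return (up, samples): up is 1 if the fetch succeeded, else 0."""
    try:
        return 1, parse_metrics(fetch())
    except OSError:  # connection refused, timeout, DNS failure...
        return 0, {}


# Hypothetical /metrics payload
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""

up, samples = scrape(lambda: SAMPLE)
print(up, samples)
```

A target that refuses the connection yields up = 0 with no samples, which is exactly the signal the InstanceDown alert later relies on.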
The four components of the stack
┌─────────────────┐   scrape /metrics    ┌──────────────────┐    alerts    ┌─────────────────┐
│ Node Exporter   │ ◄─────────────────── │ Prometheus       │ ──────────►  │ AlertManager    │
│ (port 9100)     │                      │ (port 9090)      │              │ (port 9093)     │
│ Nginx Exporter  │ ◄─────────────────── │ local TSDB       │              │ Slack / Email   │
│ (port 9113)     │                      │ PromQL           │              │ PagerDuty       │
│ Blackbox Exp.   │ ◄─────────────────── │ Alerting rules   │              └─────────────────┘
│ (port 9115)     │                      └──────────────────┘
└─────────────────┘                               │
                                                  │ datasource
                                                  ▼
                                         ┌──────────────────┐
                                         │ Grafana          │
                                         │ (port 3000)      │
                                         │ Dashboards       │
                                         │ Alerting         │
                                         └──────────────────┘
- Prometheus: collects (scrapes) metrics, stores them in a local time-series database (TSDB), and evaluates alerting rules every 15 seconds.
- Exporters: translate system or application metrics into the Prometheus format. Node Exporter for Linux, Nginx Exporter for Nginx, Blackbox Exporter for HTTP/SSL probes.
- AlertManager: receives alerts from Prometheus, deduplicates them, groups them, and routes them to the right channels with silence and inhibition handling.
- Grafana: visualization interface that queries Prometheus via PromQL to display interactive dashboards. It stores no metrics and collects none itself.
Installation: complete Docker Compose stack
Deploying the full stack with Docker Compose is the most reproducible method. A single file describes the entire monitoring infrastructure.
File structure
mkdir -p ~/monitoring/{prometheus,grafana,alertmanager}
mkdir -p ~/monitoring/prometheus/{rules,targets}
mkdir -p ~/monitoring/grafana/{provisioning/datasources,provisioning/dashboards,dashboards}
cd ~/monitoring
Complete Docker Compose
# ~/monitoring/docker-compose.yml
version: '3.8'
networks:
monitoring:
driver: bridge
volumes:
prometheus_data:
driver: local
grafana_data:
driver: local
alertmanager_data:
driver: local
services:
# ─── Prometheus ────────────────────────────────────────────────
prometheus:
image: prom/prometheus:v2.53.0
container_name: prometheus
restart: unless-stopped
user: "65534:65534" # nobody:nobody, no root
ports:
- "127.0.0.1:9090:9090" # Listen on localhost only
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./prometheus/rules:/etc/prometheus/rules:ro
- ./prometheus/targets:/etc/prometheus/targets:ro # Read by the file_sd job in prometheus.yml
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--storage.tsdb.retention.size=10GB'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
- '--web.console.templates=/usr/share/prometheus/consoles'
networks:
- monitoring
healthcheck:
test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
interval: 30s
timeout: 10s
retries: 3
start_period: 30s
# ─── Node Exporter ─────────────────────────────────────────────
node-exporter:
image: prom/node-exporter:v1.8.2
container_name: node-exporter
restart: unless-stopped
pid: host # Access to host metrics
ports:
- "127.0.0.1:9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
- '--collector.systemd' # Needs access to the host D-Bus socket from the container; remove if unused
networks:
- monitoring
# ─── AlertManager ──────────────────────────────────────────────
alertmanager:
image: prom/alertmanager:v0.27.0
container_name: alertmanager
restart: unless-stopped
user: "65534:65534"
ports:
- "127.0.0.1:9093:9093"
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
- alertmanager_data:/alertmanager
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
- '--web.external-url=http://localhost:9093'
- '--cluster.listen-address=' # Disable clustering for a single instance
networks:
- monitoring
depends_on:
- prometheus
# ─── Grafana ───────────────────────────────────────────────────
grafana:
image: grafana/grafana:11.1.0
container_name: grafana
restart: unless-stopped
user: "472:472"
ports:
- "127.0.0.1:3000:3000"
environment:
GF_SECURITY_ADMIN_USER: admin
GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-changeme}
GF_USERS_ALLOW_SIGN_UP: "false"
GF_ANALYTICS_REPORTING_ENABLED: "false"
GF_SERVER_ROOT_URL: https://grafana.example.com
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning:ro
- ./grafana/dashboards:/var/lib/grafana/dashboards:ro
networks:
- monitoring
depends_on:
- prometheus
healthcheck:
test: ["CMD-SHELL", "wget --quiet --tries=1 --spider http://localhost:3000/api/health || exit 1"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
# ─── Blackbox Exporter ─────────────────────────────────────────
blackbox-exporter:
image: prom/blackbox-exporter:v0.25.0
container_name: blackbox-exporter
restart: unless-stopped
ports:
- "127.0.0.1:9115:9115"
volumes:
- ./prometheus/blackbox.yml:/etc/blackbox_exporter/config.yml:ro
networks:
- monitoring
Note that all ports are bound to 127.0.0.1 only. Never expose Prometheus, AlertManager or Node Exporter directly on the Internet. Use an Nginx reverse proxy with TLS to access Grafana from outside.
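As a quick sanity check of that rule, the following Python sketch flags port mappings that are not bound to the loopback address. It assumes the short "host_ip:host_port:container_port" syntax with IPv4 addresses only; exposed_ports is a hypothetical helper, not part of Docker or Compose.

```python
def exposed_ports(ports):
    """Return the port mappings that are NOT bound to 127.0.0.1."""
    risky = []
    for mapping in ports:
        parts = mapping.split(":")
        # Three parts means an explicit host IP was given. Shorter forms
        # ("9090:9090" or "9090") publish the port on all interfaces.
        if len(parts) != 3 or parts[0] != "127.0.0.1":
            risky.append(mapping)
    return risky


print(exposed_ports([
    "127.0.0.1:9090:9090",  # loopback only
    "9100:9100",            # all interfaces
    "0.0.0.0:3000:3000",    # explicitly all interfaces
]))
```

Only the first mapping passes; the other two would be reachable from any network the host is attached to.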
Prometheus configuration
prometheus.yml: scrape_configs and service discovery
# ~/monitoring/prometheus/prometheus.yml
global:
scrape_interval: 15s # Collect every 15 seconds
evaluation_interval: 15s # Evaluate rules every 15s
scrape_timeout: 10s # Timeout per scrape
# Labels added to all metrics from this instance
external_labels:
datacenter: 'paris-1'
environment: 'production'
# Loading the alerting rules
rule_files:
- "/etc/prometheus/rules/*.yml"
# AlertManager configuration
alerting:
alertmanagers:
- static_configs:
- targets: ["alertmanager:9093"]
timeout: 10s
scrape_configs:
# ── Prometheus itself ──────────────────────────────────────────
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# ── Node Exporter ──────────────────────────────────────────────
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
labels:
hostname: 'monitoring-server'
role: 'monitoring'
# ── Remote servers (static targets) ────────────────────────────
- job_name: 'servers'
scrape_interval: 30s # Override per job
static_configs:
- targets:
- '192.168.1.10:9100'
- '192.168.1.11:9100'
- '192.168.1.12:9100'
labels:
environment: 'production'
role: 'web'
- targets:
- '192.168.1.20:9100'
- '192.168.1.21:9100'
labels:
environment: 'production'
role: 'database'
# ── Blackbox Exporter (HTTP probes) ───────────────────────────
- job_name: 'blackbox-http'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- 'https://example.com'
- 'https://api.example.com/health'
- 'https://grafana.example.com'
relabel_configs:
# Copy the target URL into the ?target= parameter
- source_labels: [__address__]
target_label: __param_target
# Use the URL as the "instance" label
- source_labels: [__param_target]
target_label: instance
# Redirect the scrape to the Blackbox Exporter
- target_label: __address__
replacement: blackbox-exporter:9115
# ── Blackbox Exporter (SSL expiry) ────────────────────────────
- job_name: 'blackbox-ssl'
metrics_path: /probe
params:
module: [ssl_expiry]
static_configs:
- targets:
- 'https://example.com' # The http prober needs the scheme: a bare host:443 is probed over plain http://
- 'https://api.example.com'
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
# ── File-based service discovery (for dynamic infra) ──────────
- job_name: 'dynamic-servers'
file_sd_configs:
- files:
- /etc/prometheus/targets/*.json
refresh_interval: 1m # Reload the list every minute
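The file-based discovery job reads JSON files made of target groups, each with a targets list and an optional labels map. The sketch below generates such a document in Python; the host addresses and the role label are hypothetical, and the result would need to be written to a file matching the glob above, in a directory that is mounted into the Prometheus container.

```python
import json


def build_file_sd(groups):
    """groups: iterable of (targets, labels) pairs -> file_sd JSON document."""
    return json.dumps(
        [{"targets": list(targets), "labels": dict(labels)} for targets, labels in groups],
        indent=2,
    )


# Hypothetical hosts produced by an inventory or provisioning tool
doc = build_file_sd([
    (["192.168.1.30:9100", "192.168.1.31:9100"], {"role": "cache"}),
])
print(doc)
```

Generating the file from an inventory keeps prometheus.yml untouched when servers come and go: Prometheus picks up the new list without a reload.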
Blackbox Exporter configuration
# ~/monitoring/prometheus/blackbox.yml
modules:
# HTTP check accepting status codes 200, 201 and 204
http_2xx:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
valid_status_codes: [200, 201, 204]
method: GET
follow_redirects: true
preferred_ip_protocol: "ip4"
tls_config:
insecure_skip_verify: false
# SSL expiry check only
ssl_expiry:
prober: http
timeout: 5s
http:
method: HEAD
fail_if_not_ssl: true
tls_config:
insecure_skip_verify: false
# ICMP ping
icmp:
prober: icmp
timeout: 5s
icmp:
preferred_ip_protocol: "ip4"
# TCP port check
tcp_connect:
prober: tcp
timeout: 5s
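The blackbox jobs in prometheus.yml rely on three relabeling steps to turn a listed URL into a probe request against one of these modules. The Python sketch below mimics those steps on a plain dict to show the URL that ends up being scraped; relabel_for_blackbox is an illustrative function, and the label handling is simplified compared with Prometheus' real relabeling engine.

```python
from urllib.parse import urlencode


def relabel_for_blackbox(target, module, exporter="blackbox-exporter:9115"):
    """Apply the three relabel steps and return (instance label, scrape URL)."""
    labels = {"__address__": target, "__param_module": module}
    labels["__param_target"] = labels["__address__"]  # 1. URL becomes ?target=
    labels["instance"] = labels["__param_target"]     # 2. URL becomes the instance label
    labels["__address__"] = exporter                  # 3. scrape the exporter instead
    query = urlencode({
        "module": labels["__param_module"],
        "target": labels["__param_target"],
    })
    return labels["instance"], f"http://{labels['__address__']}/probe?{query}"


instance, url = relabel_for_blackbox("https://example.com", "http_2xx")
print(instance)
print(url)
```

The same URL can be requested by hand with curl against the exporter, which is a convenient way to debug a module before wiring it into a job.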
PromQL: essential queries, annotated
PromQL operates on two kinds of selectors: instant vectors (an instantaneous value) and range vectors (a set of values over a window, written [5m]). The golden rule: always use rate() or increase() on counters (metrics suffixed with _total), never a raw value.
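Why a raw counter value is misleading can be shown with a small Python sketch of a rate computation that tolerates counter resets. simple_rate is a hypothetical, simplified stand-in: the real rate() also extrapolates to the edges of the window, which is omitted here.

```python
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at
        # zero: everything counted since the restart is the increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# The counter resets between t=30 and t=45 (process restart)
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(samples))
```

Comparing only the first and last raw values would suggest a negative trend here, while the per-second rate stays positive across the restart.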
5 essential queries for monitoring a Linux server
# ─── 1. CPU usage as a percentage ─────────────────────────────────────────────
# Principle: 100% - % of time spent in "idle" mode
# rate() computes the per-second rate of change over 5 minutes (smooths spikes)
# avg by (instance) aggregates all CPU cores per server
100 - (
avg by (instance) (
rate(node_cpu_seconds_total{mode="idle"}[5m])
) * 100
)
# ─── 2. Memory used as a percentage ───────────────────────────────────────────
# MemAvailable estimates the memory usable by new processes, reclaimable caches included
# More reliable than (MemTotal - MemFree), which counts Linux caches as used memory
(1 - (
node_memory_MemAvailable_bytes /
node_memory_MemTotal_bytes
)) * 100
# ─── 3. Disk space used as a percentage ───────────────────────────────────────
# Filtered on the root filesystem, adjust to your mount points
# node_filesystem_avail_bytes = space available to non-root users
(1 - (
node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}
/
node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}
)) * 100
# ─── 4. Inbound network traffic (bits/s) ──────────────────────────────────────
# irate() uses the last two points to detect instantaneous spikes
# Multiply by 8 to convert bytes → bits
# Adjust "eth0" to your main network interface
irate(node_network_receive_bytes_total{device="eth0"}[5m]) * 8
# ─── 5. 95th percentile of HTTP request duration ──────────────────────────────
# histogram_quantile reconstructs percentiles from the buckets
# Requires an exporter that exposes histograms (nginx, traefik, etc.)
# the "le" label means "less than or equal to" (the bucket's upper bound)
histogram_quantile(0.95,
sum by (le, job) (
rate(http_request_duration_seconds_bucket[5m])
)
)
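The percentile query can be hard to picture, so here is a Python sketch of the bucket interpolation behind it, assuming cumulative buckets sorted by upper bound, ending with +Inf, and 0 < q <= 1. The bucket counts are invented for illustration and the function is a simplified model, not Prometheus' actual code.

```python
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count), last bound is +Inf."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Quantile lies beyond the last finite bucket: return its bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count


# Hypothetical request-duration buckets (seconds, cumulative counts)
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))
```

The result depends on the bucket boundaries chosen by the exporter: with wide buckets the interpolated percentile is only an estimate.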
Advanced queries: topk, prediction, error rate
# Top 5 processes by CPU consumption
topk(5,
sum by (groupname) (
rate(namedprocess_namegroup_cpu_seconds_total[5m])
)
)
# Prediction: will the disk be full within 48 hours?
# predict_linear projects the trend of the last 6 hours 48h into the future
# The expression returns a result only when the projected free space is below 0
predict_linear(
node_filesystem_avail_bytes{mountpoint="/"}[6h],
48 * 3600
) < 0
# HTTP error rate (4xx + 5xx) as a percentage
(
sum(rate(nginx_http_requests_total{status=~"[45].."}[5m]))
/
sum(rate(nginx_http_requests_total[5m]))
) * 100
# Availability over 24h (for an SLO dashboard)
avg_over_time(up{job="servers"}[24h]) * 100
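The disk prediction query is a linear regression under the hood. The Python sketch below fits a least-squares line through a series and evaluates it 48 hours after the last sample; it is a simplified model (Prometheus projects from the evaluation time rather than the last sample), and the sample data, expressed in GB, is made up.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value), evaluated seconds_ahead after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


# Hypothetical disk losing 1 GB of free space per hour, sampled hourly for 6h
samples = [(hour * 3600, 40.0 - hour) for hour in range(7)]
predicted = predict_linear(samples, 48 * 3600)
print(predicted, "GB free in 48h; alert condition:", predicted < 0)
```

With 34 GB left and a steady loss of 1 GB per hour, the projection goes negative well before 48 hours, so the condition holds.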
Prometheus alerting rules
# ~/monitoring/prometheus/rules/node_alerts.yml
groups:
- name: instance_availability
interval: 15s # Override of the global evaluation_interval
rules:
# ── Instance unreachable ─────────────────────────────────────────────────
- alert: InstanceDown
expr: up == 0
for: 2m # Fires only if DOWN for 2 minutes
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} unreachable"
description: "The job {{ $labels.job }} on {{ $labels.instance }} has not responded for 2 minutes. Check the state of the server and the service."
runbook: "https://wiki.example.com/runbooks/instance-down"
- name: system_resources
rules:
# ── High CPU ─────────────────────────────────────────────────────────────
- alert: HighCpuUsage
expr: |
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
for: 5m
labels:
severity: critical
annotations:
summary: "Critical CPU on {{ $labels.instance }}"
description: "CPU usage of {{ printf \"%.1f\" $value }}% on {{ $labels.instance }} for 5 minutes."
- alert: HighCpuUsageWarning
expr: |
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU on {{ $labels.instance }}"
description: "CPU usage of {{ printf \"%.1f\" $value }}% on {{ $labels.instance }} for 10 minutes."
# ── Critical memory ──────────────────────────────────────────────────────
- alert: HighMemoryUsage
expr: |
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
for: 5m
labels:
severity: critical
annotations:
summary: "Critical memory on {{ $labels.instance }}"
description: "{{ printf \"%.1f\" $value }}% of memory used on {{ $labels.instance }}."
# ── Disk > 85% ────────────────────────────────────────────────────────────
- alert: DiskSpaceWarning
expr: |
(1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"} /
node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}) * 100 > 85
for: 10m
labels:
severity: warning
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "{{ printf \"%.1f\" $value }}% of the disk used on {{ $labels.instance }} (/ partition)."
- alert: DiskSpaceCritical
expr: |
(1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"} /
node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}) * 100 > 95
for: 5m
labels:
severity: critical
annotations:
summary: "Disk almost full on {{ $labels.instance }}"
description: "{{ printf \"%.1f\" $value }}% of the disk used on {{ $labels.instance }}. Urgent action required."
# ── Disk full within 48h ─────────────────────────────────────────────────
- alert: DiskWillFillIn48h
expr: |
predict_linear(node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}[6h], 48 * 3600) < 0
for: 1h
labels:
severity: warning
annotations:
summary: "Disk predicted full within 48h on {{ $labels.instance }}"
description: "Based on the trend of the last 6 hours, the disk on {{ $labels.instance }} will be full in less than 48 hours."
- name: ssl_certificates
rules:
# ── SSL certificate expiring within 30 days ──────────────────────────────
- alert: SslCertificateExpiringSoon
expr: |
(probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
for: 1h
labels:
severity: warning
annotations:
summary: "SSL certificate expiring soon for {{ $labels.instance }}"
description: "The SSL certificate for {{ $labels.instance }} expires in {{ printf \"%.0f\" $value }} days."
- alert: SslCertificateExpired
expr: |
(probe_ssl_earliest_cert_expiry - time()) / 86400 < 7
for: 1h
labels:
severity: critical
annotations:
summary: "Critical SSL certificate for {{ $labels.instance }}"
description: "The SSL certificate for {{ $labels.instance }} expires in {{ printf \"%.0f\" $value }} days. Immediate action required."
- name: http_availability
rules:
# ── HTTP endpoint unreachable ────────────────────────────────────────────
- alert: HttpEndpointDown
expr: probe_success == 0
for: 3m
labels:
severity: critical
annotations:
summary: "HTTP endpoint unreachable: {{ $labels.instance }}"
description: "The URL {{ $labels.instance }} has not responded for 3 minutes."
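The for: clause in these rules delays firing until the expression has been true continuously for the given duration. A minimal Python sketch of that state machine follows, assuming one boolean per rule evaluation; alert_state is an illustrative helper, not Prometheus code.

```python
def alert_state(condition_history, interval_s, for_s):
    """condition_history: booleans from successive evaluations, oldest first."""
    active_for = None  # seconds since the condition first became true
    state = "inactive"
    for is_true in condition_history:
        if not is_true:
            # Any false evaluation resets the timer.
            active_for, state = None, "inactive"
            continue
        active_for = 0 if active_for is None else active_for + interval_s
        state = "firing" if active_for >= for_s else "pending"
    return state


# InstanceDown: evaluated every 15s, for: 2m
print(alert_state([True] * 8, 15, 120))
print(alert_state([True] * 9, 15, 120))
print(alert_state([True] * 8 + [False, True], 15, 120))
```

A single successful scrape in the middle of an outage resets the timer, which is why short for: values tend to suit flapping targets better than long ones.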
AlertManager: routes, receivers, inhibitions
# ~/monitoring/alertmanager/alertmanager.yml
global:
resolve_timeout: 5m
# Global SMTP configuration
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: '[email protected]'
smtp_auth_username: '[email protected]'
smtp_auth_password: 'your-app-password'
smtp_require_tls: true
# ─── Routing tree ───────────────────────────────────────────────────────────────
route:
# Group by alertname + datacenter + job to avoid duplicates
group_by: ['alertname', 'datacenter', 'job']
group_wait: 30s # Wait 30s to group the initial alerts
group_interval: 5m # Interval between notifications for an active group
repeat_interval: 4h # Repeat if the alert persists
receiver: 'email-ops' # Default receiver
routes:
# Critical alerts → Slack as priority + frequent repeat
- match:
severity: critical
receiver: 'slack-critical'
group_wait: 10s # Less wait for critical alerts
repeat_interval: 30m
continue: true # Continue to the other routes (email too)
# Critical alerts → email as well
- match:
severity: critical
receiver: 'email-ops'
# Warning alerts → separate Slack channel
- match:
severity: warning
receiver: 'slack-warning'
repeat_interval: 6h
# ─── Receivers ────────────────────────────────────────────────────────────────
receivers:
- name: 'email-ops'
email_configs:
- to: '[email protected]'
send_resolved: true
headers:
# email_configs has no subject/body fields: the subject is set through headers
Subject: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }} - {{ .CommonAnnotations.summary }}'
X-Priority: '1'
# Disable the default HTML body so that the text body below is displayed
html: ''
text: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Severity: {{ .Labels.severity }}
Instance: {{ .Labels.instance }}
Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
- name: 'slack-critical'
slack_configs:
- api_url: 'https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/xxxxxxxxxxxxxxxxxxxxxxxx'
channel: '#critical-alerts'
username: 'AlertManager'
icon_emoji: ':rotating_light:'
title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
text: |
{{ range .Alerts }}
*{{ .Annotations.summary }}*
{{ .Annotations.description }}
*Instance:* {{ .Labels.instance }}
*Datacenter:* {{ .Labels.datacenter }}
{{ end }}
color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
send_resolved: true
- name: 'slack-warning'
slack_configs:
- api_url: 'https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/xxxxxxxxxxxxxxxxxxxxxxxx'
channel: '#monitoring-alerts'
username: 'AlertManager'
icon_emoji: ':warning:'
title: '[WARNING] {{ .CommonLabels.alertname }}'
text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}\n{{ end }}" # Double quotes so that YAML turns \n into a newline
send_resolved: true
# ─── Inhibitions ──────────────────────────────────────────────────────────────
inhibit_rules:
# If the instance is DOWN, suppress the CPU/RAM/Disk alerts for the same instance
- source_match:
alertname: 'InstanceDown'
target_match_re:
alertname: '(HighCpuUsage(Warning)?|HighMemoryUsage|DiskSpaceWarning|DiskSpaceCritical)' # Anchored regex: every alert name must be listed in full
equal: ['instance']
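# If a critical alert exists, suppress the warnings with the same alertname.
# This only takes effect when the warning and critical rules share one alertname
# (rules with distinct names such as DiskSpaceWarning / DiskSpaceCritical are not matched).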
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
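How an inhibition rule decides whether to mute an alert can be sketched in a few lines of Python. The model below assumes exact matching for source_match, anchored regex matching for target_match_re, and equality of the labels listed in equal; is_inhibited and the rule literal are illustrative, not AlertManager's implementation.

```python
import re


def is_inhibited(target, firing, rule):
    """Return True if a firing source alert mutes the target alert under this rule."""
    target_ok = all(
        re.fullmatch(pattern, target.get(label, ""))
        for label, pattern in rule["target_match_re"].items()
    )
    if not target_ok:
        return False
    for source in firing:
        source_ok = all(source.get(k) == v for k, v in rule["source_match"].items())
        same_labels = all(source.get(k) == target.get(k) for k in rule["equal"])
        if source_ok and same_labels:
            return True
    return False


rule = {
    "source_match": {"alertname": "InstanceDown"},
    "target_match_re": {"alertname": "(HighCpuUsage|HighMemoryUsage)"},
    "equal": ["instance"],
}
firing = [{"alertname": "InstanceDown", "instance": "192.168.1.10:9100"}]

print(is_inhibited({"alertname": "HighCpuUsage", "instance": "192.168.1.10:9100"}, firing, rule))
print(is_inhibited({"alertname": "HighCpuUsage", "instance": "192.168.1.11:9100"}, firing, rule))
print(is_inhibited({"alertname": "HighCpuUsageWarning", "instance": "192.168.1.10:9100"}, firing, rule))
```

Because the pattern is anchored, an alert named HighCpuUsageWarning is not covered by the sketch's rule unless the pattern lists it explicitly, and the equal list keeps an outage on one server from muting alerts on another.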
Grafana: dashboards, variables and provisioning as code
Automatically provisioned datasource
Grafana provisioning lets you deploy datasources and dashboards without going through the web interface. Configuration as code, versioned in Git.
# ~/monitoring/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
uid: prometheus-uid # Fixed UID to reference in dashboards
editable: false
jsonData:
httpMethod: POST
prometheusType: Prometheus
prometheusVersion: 2.53.0
timeInterval: 15s
queryTimeout: 60s
Dashboard provisioning
# ~/monitoring/grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: 'Production'
type: file
disableDeletion: true # Prevents deletion via the UI
updateIntervalSeconds: 30 # Reloads the files every 30s
allowUiUpdates: false # UI changes do not persist
options:
path: /var/lib/grafana/dashboards
Dashboard as code: simplified Node Exporter
{
"title": "Infrastructure Overview",
"uid": "infra-overview",
"tags": ["production", "node-exporter"],
"refresh": "30s",
"time": { "from": "now-3h", "to": "now" },
"templating": {
"list": [
{
"name": "instance",
"type": "query",
"label": "Server",
"datasource": "Prometheus",
"query": "label_values(node_uname_info, instance)",
"refresh": 2,
"multi": false,
"includeAll": true,
"allValue": ".*"
}
]
},
"panels": [
{
"type": "stat",
"title": "CPU Usage",
"gridPos": { "x": 0, "y": 0, "w": 6, "h": 4 },
"targets": [{
"expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode='idle', instance=~'$instance'}[5m])) * 100)",
"legendFormat": "{{ instance }}"
}],
"options": { "reduceOptions": { "calcs": ["lastNotNull"] } },
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 75 },
{ "color": "red", "value": 90 }
]
}
}
}
}
]
}
Importing the Node Exporter Full dashboard (ID 1860)
Grafana provides a catalog of community dashboards. The 1860 dashboard (Node Exporter Full) covers all system metrics in a single import:
- Go to Dashboards > New > Import
- Enter the ID 1860 in the "Import via grafana.com" field
- Select your Prometheus datasource from the dropdown menu
- Click Import
Other useful dashboards: 13978 (Node Exporter Quickstart), 7587 (Docker monitoring), 9614 (Nginx), 9628 (PostgreSQL).
Dashboard variables
Variables turn a static dashboard into an interactive, multi-server tool. Create an instance variable via Dashboard Settings > Variables > New variable:
- Type: Query
- Query: label_values(node_uname_info, instance)
- Refresh: On time range change
- Multi-value: enabled to compare several servers
Then use $instance in all your queries: node_cpu_seconds_total{instance=~"$instance"}. The filter applies to all panels simultaneously.
Automatic annotations from AlertManager
Grafana annotations let you overlay alert events on graphs. Configure an automatic annotation in the dashboard settings:
- Source: Prometheus
- Query: ALERTS{alertstate="firing"}
- Title: {{alertname}}
- Tags: {{severity}}
Additional exporters
Nginx Exporter
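# Enable stub_status in Nginx
# /etc/nginx/conf.d/stub_status.conf
# The exporter runs in a container and reaches the host through the Docker bridge,
# so Nginx cannot listen on loopback only; access is restricted with allow/deny instead.
server {
listen 8080;
location /nginx_status {
stub_status on;
access_log off;
allow 127.0.0.1;
allow 172.16.0.0/12; # Assumption: default Docker bridge address range, adjust to your network
deny all;
}
}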
# Add to docker-compose.yml
nginx-exporter:
image: nginx/nginx-prometheus-exporter:1.1.0
container_name: nginx-exporter
restart: unless-stopped
ports:
- "127.0.0.1:9113:9113"
command:
- --nginx.scrape-uri=http://host-gateway:8080/nginx_status
networks:
- monitoring
extra_hosts:
- "host-gateway:host-gateway"
PostgreSQL Exporter
# Add to docker-compose.yml
postgres-exporter:
image: prometheuscommunity/postgres-exporter:v0.15.0
container_name: postgres-exporter
restart: unless-stopped
environment:
DATA_SOURCE_NAME: "postgresql://exporter:password@postgres:5432/postgres?sslmode=disable"
ports:
- "127.0.0.1:9187:9187"
networks:
- monitoring
-- Create the PostgreSQL user for the exporter
CREATE USER exporter WITH PASSWORD 'password';
ALTER USER exporter SET SEARCH_PATH TO exporter,pg_catalog;
GRANT CONNECT ON DATABASE postgres TO exporter;
GRANT pg_monitor TO exporter;
Adding the exporters to Prometheus
# Add to prometheus.yml / scrape_configs
- job_name: 'nginx'
static_configs:
- targets: ['nginx-exporter:9113']
- job_name: 'postgresql'
static_configs:
- targets: ['postgres-exporter:9187']
Startup and operations
# Start the whole stack
cd ~/monitoring
docker compose up -d
# Check the state of all services
docker compose ps
# View logs in real time
docker compose logs -f prometheus
# Validate the Prometheus configuration before reload
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml
# Validate the alerting rules
docker compose exec prometheus promtool check rules /etc/prometheus/rules/node_alerts.yml
# Reload Prometheus without restarting (hot reload)
curl -X POST http://localhost:9090/-/reload
# Check the active targets
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -E "health|job|instance"
# List active alerts in AlertManager
curl -s http://localhost:9093/api/v2/alerts | python3 -m json.tool
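The grep pipeline over /api/v1/targets gets noisy with many targets. As an alternative, here is a Python sketch that counts targets per job and health state and lists the unhealthy ones. It relies on the activeTargets, labels, health, scrapeUrl and lastError fields of the API response; fetch_targets is defined but not called, and SAMPLE is invented data.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_targets(base_url="http://localhost:9090"):
    """Download and decode /api/v1/targets (not called in this sketch)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def summarize(payload):
    """Count active targets per (job, health) pair."""
    return Counter(
        (t["labels"]["job"], t["health"]) for t in payload["data"]["activeTargets"]
    )


def unhealthy(payload):
    """List (scrape URL, last error) for every target that is not up."""
    return [
        (t["scrapeUrl"], t.get("lastError", ""))
        for t in payload["data"]["activeTargets"]
        if t["health"] != "up"
    ]


# Hypothetical API response, trimmed to the fields used above
SAMPLE = {"data": {"activeTargets": [
    {"labels": {"job": "node"}, "health": "up",
     "scrapeUrl": "http://node-exporter:9100/metrics", "lastError": ""},
    {"labels": {"job": "servers"}, "health": "down",
     "scrapeUrl": "http://192.168.1.10:9100/metrics", "lastError": "connection refused"},
]}}

print(summarize(SAMPLE))
print(unhealthy(SAMPLE))
```

Against a live server, summarize(fetch_targets()) would give the same overview in one call.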
High availability: Thanos or Grafana Mimir
Prometheus alone has two limitations in critical production: a single instance (SPOF), and retention limited by local disk capacity (15-30 days recommended). Thanos and Grafana Mimir solve both problems.
Thanos architecture
Prometheus instance 1 ──── Thanos Sidecar ────┐
                                              │
Prometheus instance 2 ──── Thanos Sidecar ────┤──► Thanos Query ──► Grafana
                                              │      (dedup)
Prometheus HA pair        S3 / Minio ◄────────┘
                     (long-term retention)        Thanos Store
                                                  (query S3)
Deploying the Thanos Sidecar
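# Docker Compose extension for Thanos
# The sidecar only uploads blocks when Prometheus compaction is disabled: add
# --storage.tsdb.min-block-duration=2h and --storage.tsdb.max-block-duration=2h
# to the prometheus service, and give each replica a distinct "replica" external label.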
thanos-sidecar:
image: thanosio/thanos:v0.35.0
container_name: thanos-sidecar
command:
- sidecar
- --prometheus.url=http://prometheus:9090
- --tsdb.path=/prometheus
- --grpc-address=0.0.0.0:10901
- --http-address=0.0.0.0:10902
- --objstore.config-file=/etc/thanos/s3.yml
volumes:
- prometheus_data:/prometheus
- ./thanos/s3.yml:/etc/thanos/s3.yml:ro
networks:
- monitoring
thanos-query:
image: thanosio/thanos:v0.35.0
container_name: thanos-query
command:
- query
- --http-address=0.0.0.0:9091
- --endpoint=thanos-sidecar:10901
- --query.replica-label=replica
ports:
- "127.0.0.1:9091:9091"
networks:
- monitoring
# ~/monitoring/thanos/s3.yml
type: S3
config:
bucket: monitoring-thanos
endpoint: s3.eu-west-3.amazonaws.com
region: eu-west-3
access_key: AKIAXXXXXXXXXXXXXXXX
secret_key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Grafana Mimir offers features comparable to Thanos (HA, S3 storage, long-term retention) and can run in a monolithic mode that is simpler to deploy. It ingests metrics through Prometheus remote_write. A good fit if you are getting started with Prometheus HA.
Retention and sizing
| Targets | Metrics/instance | Recommended RAM | Disk (30d) |
|---|---|---|---|
| 5 servers | ~1,000 | 512 MB | 5 GB |
| 20 servers | ~1,000 | 2 GB | 20 GB |
| 100 servers | ~1,000 | 8 GB | 100 GB |
| 1,000 servers | ~1,000 | 32 GB | → Thanos/Mimir |
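The disk column can be cross-checked with the usual rule of thumb: retention in seconds, times ingested samples per second, times bytes per sample. The Python sketch below applies it, assuming roughly 2 bytes per sample after compression, a 15s scrape interval, and the ~1,000 series per server from the table; disk_bytes is an illustrative helper.

```python
def disk_bytes(servers, series_per_server, scrape_interval_s, retention_days,
               bytes_per_sample=2.0):
    """Estimate TSDB disk usage: retention x samples per second x bytes per sample."""
    samples_per_second = servers * series_per_server / scrape_interval_s
    return retention_days * 86400 * samples_per_second * bytes_per_sample


for servers in (5, 20, 100):
    gb = disk_bytes(servers, 1000, 15, 30) / 1e9
    print(f"{servers} servers: ~{gb:.1f} GB of samples over 30 days")
```

The estimates come out below the figures in the table, which leaves headroom for the write-ahead log, compaction and series churn.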
Troubleshooting
Prometheus is not scraping a target
# List the targets and their state
curl -s http://localhost:9090/api/v1/targets | \
python3 -c "import sys,json; [print(t['scrapeUrl'], t['health'], t.get('lastError','')) for t in json.load(sys.stdin)['data']['activeTargets']]"
# Manually test the target endpoint
curl -v http://192.168.1.10:9100/metrics | head -20
# Check the Prometheus logs
docker compose logs prometheus --tail=50 | grep -i error
# Test connectivity from the Prometheus container
docker compose exec prometheus wget -qO- http://node-exporter:9100/metrics | head -5
Grafana is not loading data
# Check that Prometheus responds
curl -s 'http://localhost:9090/api/v1/query?query=up' | python3 -m json.tool
# Test the datasource from the Grafana UI
# Connections > Data sources > Prometheus > Save & test
# Check the Grafana logs
docker compose logs grafana --tail=50 | grep -i error
AlertManager is not receiving alerts
# Check the Prometheus -> AlertManager connection
curl -s http://localhost:9090/api/v1/alertmanagers | python3 -m json.tool
# See the pending alerts in Prometheus
curl -s http://localhost:9090/api/v1/alerts | python3 -m json.tool
# Check the AlertManager config
docker compose exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml
# Test sending a manual alert
curl -XPOST http://localhost:9093/api/v2/alerts -H "Content-Type: application/json" -d '[{
"labels": {"alertname": "TestAlert", "severity": "warning"},
"annotations": {"summary": "Test from curl"}
}]'
Your monitoring infrastructure is complete. Start by importing the 1860 dashboard into Grafana to immediately get a comprehensive view of your Linux servers. Then gradually add your alerting rules and your business exporters as needed.
Conclusion
You now have a production-grade monitoring stack that covers the entire observability cycle:
- Prometheus collects and stores metrics from all your servers via a reliable pull model, with automatic detection of downed instances.
- Node Exporter + Blackbox Exporter expose system metrics and monitor your HTTP endpoints and SSL certificates.
- PromQL enables expressive queries to compute CPU, RAM, disk, error rate and latency percentiles.
- Grafana visualizes everything in interactive dashboards with variables, annotations and native alerting. The 1860 dashboard covers most system-level needs right after import.
- AlertManager routes alerts intelligently with grouping, inhibitions and silences to avoid alert storms.
- Thanos or Mimir extend the stack for high availability and long-term retention as the infrastructure grows.
To go further, explore Loki (log centralization, same Grafana stack), Tempo (distributed tracing) and Pyroscope (continuous profiling) to complement metrics with logs, traces and profiles.