Devops
Difficulty: Intermediate
19 min read

Prometheus & Grafana: Complete Monitoring for Your Linux Infrastructure | Morgann Riu

Complete Prometheus & Grafana 2026 guide: Docker Compose, PromQL, AlertManager, dashboards, exporters, Thanos HA. Monitor your entire Linux infrastructure.

Prerequisites
Docker and Docker Compose installed on your server. Basic Linux administration knowledge (systemd, YAML configuration files). For installing Docker, see the Docker installation guide.

Why Prometheus? Pull vs Push, the real architecture

For a long time, monitoring a Linux infrastructure meant Nagios and its plugins, or Zabbix and its agent. In many deployments these tools follow a push model: each agent sends its data to a central collector. Prometheus reverses this paradigm with the pull model: the Prometheus server periodically queries each target on its /metrics endpoint.

This reversal is not cosmetic. It fundamentally changes how monitoring is operated:

  • Detecting dead instances: if Prometheus cannot scrape a target, the up metric drops to 0. With push, a silent instance is hard to tell apart from one that simply has nothing to report.
  • Centralized configuration: all targets are defined in Prometheus, not scattered across each agent. A single YAML file to see the entire infrastructure.
  • Simple debugging: the /metrics endpoint is plain text readable with curl, with no intermediate agent to debug.
  • Standardized ecosystem: the Prometheus exposition format is the basis of the OpenMetrics standard and is exposed by hundreds of applications and exporters (Nginx, PostgreSQL, Redis, Kubernetes, etc.).
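
To make the "plain text readable with curl" point concrete, here is a minimal sketch of what a /metrics response looks like, built by hand in Python. The metric name demo_http_requests_total and the render_metric helper are assumptions made up for this illustration; a real application would normally use an official client library instead.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels, value) pairs, where labels is a dict.
    Label values are assumed to need no escaping (no quotes, backslashes or newlines).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# demo_http_requests_total is a hypothetical metric used only for illustration.
exposition = render_metric(
    "demo_http_requests_total",
    "Total HTTP requests handled.",
    "counter",
    [({"method": "get", "code": "200"}, 1027), ({"method": "post", "code": "500"}, 3)],
)
print(exposition)
```

Each scrape simply downloads a document of this shape, which is why a failing target can usually be diagnosed with curl alone.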

The four components of the stack


┌─────────────────┐   scrape /metrics   ┌──────────────────┐   alerts     ┌─────────────────┐
│  Node Exporter  │ ◄─────────────────── │   Prometheus     │ ──────────►  │  AlertManager   │
│  (port 9100)    │                      │   (port 9090)    │              │  (port 9093)    │
│  Nginx Exporter │ ◄─────────────────── │   local TSDB     │              │  Slack / Email  │
│  (port 9113)    │                      │   PromQL         │              │  PagerDuty      │
│  Blackbox Exp.  │ ◄─────────────────── │   Alerting rules │              └─────────────────┘
│  (port 9115)    │                      └──────────────────┘
└─────────────────┘                               │
                                                  │ datasource
                                                  ▼
                                       ┌──────────────────┐
                                       │     Grafana       │
                                       │   (port 3000)    │
                                       │   Dashboards     │
                                       │   Alerting       │
                                       └──────────────────┘

  • Prometheus: collects (scrapes) metrics, stores them in a local time-series database (TSDB), and evaluates alerting rules at a fixed interval (every 15 seconds in this guide's configuration).
  • Exporters: translate system or application metrics into the Prometheus format. Node Exporter for Linux, Nginx Exporter for Nginx, Blackbox Exporter for HTTP/SSL probes.
  • AlertManager: receives alerts from Prometheus, deduplicates them, groups them, and routes them to the right channels with silence and inhibition handling.
  • Grafana: visualization interface that queries Prometheus via PromQL to display interactive dashboards. It neither collects nor stores metrics; it only keeps its own configuration (dashboards, users, data sources).

Installation: complete Docker Compose stack

Deploying the full stack with Docker Compose is the most reproducible method. A single file describes the entire monitoring infrastructure.

File structure

mkdir -p ~/monitoring/{prometheus,grafana,alertmanager}
mkdir -p ~/monitoring/prometheus/rules
mkdir -p ~/monitoring/grafana/{provisioning/datasources,provisioning/dashboards,dashboards}
cd ~/monitoring

Complete Docker Compose

# ~/monitoring/docker-compose.yml
version: '3.8'

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
    driver: local
  grafana_data:
    driver: local
  alertmanager_data:
    driver: local

services:

  # ─── Prometheus ────────────────────────────────────────────────
  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
    restart: unless-stopped
    user: "65534:65534"           # nobody:nobody, no root
    ports:
      - "127.0.0.1:9090:9090"    # Listen on localhost only
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=10GB'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'    # Exposes delete and snapshot endpoints; remove if you do not need them
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    networks:
      - monitoring
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

  # ─── Node Exporter ─────────────────────────────────────────────
  node-exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node-exporter
    restart: unless-stopped
    pid: host                    # Access to host metrics
    ports:
      - "127.0.0.1:9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
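      - '--collector.systemd'      # Needs access to the host's D-Bus socket; remove this flag if the collector logs connection errors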
    networks:
      - monitoring

  # ─── AlertManager ──────────────────────────────────────────────
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    user: "65534:65534"
    ports:
      - "127.0.0.1:9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=http://localhost:9093'
      - '--cluster.listen-address='      # Disable clustering for a single instance
    networks:
      - monitoring
    depends_on:
      - prometheus

  # ─── Grafana ───────────────────────────────────────────────────
  grafana:
    image: grafana/grafana:11.1.0
    container_name: grafana
    restart: unless-stopped
    user: "472:472"
    ports:
      - "127.0.0.1:3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-changeme}
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_ANALYTICS_REPORTING_ENABLED: "false"
      GF_SERVER_ROOT_URL: https://grafana.example.com
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    networks:
      - monitoring
    depends_on:
      - prometheus
    healthcheck:
      test: ["CMD-SHELL", "wget --quiet --tries=1 --spider http://localhost:3000/api/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

  # ─── Blackbox Exporter ─────────────────────────────────────────
  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox-exporter
    restart: unless-stopped
    ports:
      - "127.0.0.1:9115:9115"
    volumes:
      - ./prometheus/blackbox.yml:/etc/blackbox_exporter/config.yml:ro
    networks:
      - monitoring

Network security
Note that all ports are bound to 127.0.0.1 only. Never expose Prometheus, AlertManager or Node Exporter directly on the Internet. Use an Nginx reverse proxy with TLS to access Grafana from outside.

Prometheus configuration

prometheus.yml: scrape_configs and service discovery

# ~/monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s          # Collect every 15 seconds
  evaluation_interval: 15s      # Evaluate rules every 15s
  scrape_timeout: 10s           # Timeout per scrape

  # Labels added to all metrics from this instance
  external_labels:
    datacenter: 'paris-1'
    environment: 'production'

# Loading the alerting rules
rule_files:
  - "/etc/prometheus/rules/*.yml"

# AlertManager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
      timeout: 10s

scrape_configs:
  # ── Prometheus itself ──────────────────────────────────────────
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # ── Node Exporter ──────────────────────────────────────────────
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          hostname: 'monitoring-server'
          role: 'monitoring'

  # ── Remote servers (static targets) ────────────────────────────
  - job_name: 'servers'
    scrape_interval: 30s        # Override per job
    static_configs:
      - targets:
          - '192.168.1.10:9100'
          - '192.168.1.11:9100'
          - '192.168.1.12:9100'
        labels:
          environment: 'production'
          role: 'web'
      - targets:
          - '192.168.1.20:9100'
          - '192.168.1.21:9100'
        labels:
          environment: 'production'
          role: 'database'

  # ── Blackbox Exporter (HTTP probes) ───────────────────────────
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - 'https://example.com'
          - 'https://api.example.com/health'
          - 'https://grafana.example.com'
    relabel_configs:
      # Copy the target URL into the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Use the URL as the "instance" label
      - source_labels: [__param_target]
        target_label: instance
      # Redirect the scrape to the Blackbox Exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # ── Blackbox Exporter (SSL expiry) ────────────────────────────
  - job_name: 'blackbox-ssl'
    metrics_path: /probe
    params:
      module: [ssl_expiry]
    static_configs:
      - targets:
          - 'example.com:443'
          - 'api.example.com:443'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

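  # ── File-based service discovery (for dynamic infra) ──────────
  # The Compose file above does not mount this directory: add
  # ./prometheus/targets:/etc/prometheus/targets:ro to the prometheus volumes.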
  - job_name: 'dynamic-servers'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
        refresh_interval: 1m    # Reload the list every minute
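
The files read by file_sd_configs are JSON lists of target groups, each with a "targets" list and an optional "labels" map. As a hedged sketch, the script below shows one way to generate such a file from an inventory. The addresses, the labels and the helper names (build_file_sd, write_atomically) are hypothetical. The file is written through a temporary file and a rename so that Prometheus never reads a half-written document.

```python
import json
import os
import tempfile


def build_file_sd(groups):
    """Turn (targets, labels) pairs into the list-of-groups shape used by file_sd."""
    return [{"targets": list(targets), "labels": dict(labels)} for targets, labels in groups]


def write_atomically(path, document):
    """Write JSON to a temporary file in the same directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(document, handle, indent=2)
    os.replace(tmp_path, path)


# Hypothetical inventory: two cache servers and one queue server.
inventory = [
    (["192.168.1.30:9100", "192.168.1.31:9100"], {"environment": "production", "role": "cache"}),
    (["192.168.1.40:9100"], {"environment": "production", "role": "queue"}),
]
document = build_file_sd(inventory)
print(json.dumps(document, indent=2))
```

Calling write_atomically("targets/cache.json", document) from a cron job or a deployment hook would then be enough for Prometheus to pick up the new targets at the next refresh, assuming the directory is mounted in the container.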

Blackbox Exporter configuration

# ~/monitoring/prometheus/blackbox.yml
modules:
  # HTTP check expecting a 200, 201 or 204 status code
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200, 201, 204]
      method: GET
      follow_redirects: true
      preferred_ip_protocol: "ip4"
      tls_config:
        insecure_skip_verify: false

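  # SSL expiry check only
  # TCP prober with TLS, so that host:port targets such as example.com:443 work
  # (the http prober would prepend http:// to a target without a scheme)
  ssl_expiry:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
      tls_config:
        insecure_skip_verify: false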

  # ICMP ping
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"

  # TCP port check
  tcp_connect:
    prober: tcp
    timeout: 5s
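
The relabel_configs used by the two blackbox jobs in prometheus.yml are easier to follow when replayed step by step. The sketch below mimics the three rules on a plain Python dict and rebuilds the URL that Prometheus ends up scraping. It is an illustration of the label flow only, not Prometheus's relabeling engine, and the function names are made up for this example.

```python
from urllib.parse import urlencode


def apply_blackbox_relabeling(target, exporter_address="blackbox-exporter:9115"):
    """Replay the three relabeling rules of the blackbox jobs on one target."""
    labels = {"__address__": target}
    # Rule 1: copy the target URL into the ?target= parameter
    labels["__param_target"] = labels["__address__"]
    # Rule 2: use the probed URL as the "instance" label
    labels["instance"] = labels["__param_target"]
    # Rule 3: redirect the scrape itself to the Blackbox Exporter
    labels["__address__"] = exporter_address
    return labels


def scrape_url(labels, module, metrics_path="/probe"):
    """Build the URL that is actually requested for this target."""
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}{metrics_path}?{query}"


labels = apply_blackbox_relabeling("https://example.com")
print(labels["instance"])
print(scrape_url(labels, "http_2xx"))
```

The important point is that the instance label keeps the probed URL, while the HTTP request goes to the exporter.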

PromQL: essential queries, annotated

PromQL operates on two kinds of selectors: instant vectors (an instantaneous value) and range vectors (a set of values over a window, written [5m]). The golden rule: always use rate() or increase() on counters (metrics suffixed with _total), never a raw value.
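
Why never a raw value? A counter only goes up, and it restarts from zero whenever the process restarts. The following sketch shows a simplified per-second rate that compensates for such resets. It is an approximation for illustration: the real rate() also extrapolates to the edges of the window, which this version does not.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter over a window.

    samples is a list of (timestamp_seconds, counter_value) ordered by time.
    A drop in value is treated as a counter reset (restart from zero).
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset: everything counted since the restart is new
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical window: the process restarts between t=30 and t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
naive = (window[-1][1] - window[0][1]) / (window[-1][0] - window[0][0])
print("naive difference:", naive)
print("reset-aware rate:", simple_rate(window))
```

The naive difference is negative because of the restart, while the reset-aware version still reports a sensible rate.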

5 essential queries for monitoring a Linux server

# ─── 1. CPU usage as a percentage ─────────────────────────────────────────────
# Principle: 100% - % of time spent in "idle" mode
# rate() computes the per-second rate of change over 5 minutes (smooths spikes)
# avg by (instance) aggregates all CPU cores per server
100 - (
  avg by (instance) (
    rate(node_cpu_seconds_total{mode="idle"}[5m])
  ) * 100
)

# ─── 2. Memory used as a percentage ───────────────────────────────────────────
# MemAvailable includes reclaimable memory (caches) = truly free memory
# More reliable than (MemTotal - MemFree), which ignores Linux caches
(1 - (
  node_memory_MemAvailable_bytes /
  node_memory_MemTotal_bytes
)) * 100

# ─── 3. Disk space used as a percentage ───────────────────────────────────────
# Filtered on the root filesystem, adjust to your mount points
# node_filesystem_avail_bytes = space available to non-root users
(1 - (
  node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}
  /
  node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}
)) * 100

# ─── 4. Inbound network traffic (bits/s) ──────────────────────────────────────
# irate() uses the last two points to detect instantaneous spikes
# Multiply by 8 to convert bytes → bits
# Adjust "eth0" to your main network interface
irate(node_network_receive_bytes_total{device="eth0"}[5m]) * 8

# ─── 5. 95th percentile of HTTP request duration ──────────────────────────────
# histogram_quantile reconstructs percentiles from the buckets
# Requires a service or exporter that exposes histograms (Traefik, an instrumented
# application, etc.); the stub_status-based Nginx exporter does not
# the "le" label means "less than or equal to" (the bucket's upper bound)
histogram_quantile(0.95,
  sum by (le, job) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)
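
Query 5 relies on cumulative buckets, which can be puzzling at first. The sketch below shows the idea behind histogram_quantile under simplifying assumptions: buckets are (upper bound, cumulative count) pairs, the lowest bucket starts at 0, and values are assumed to be spread evenly inside a bucket. The bucket values are invented for the example.

```python
import math


def quantile_from_buckets(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count); the last bound is +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count


# Hypothetical request durations: 100 requests spread over four buckets.
demo_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(quantile_from_buckets(0.95, demo_buckets))
```

Because of the interpolation, the result is only as precise as the bucket boundaries: tight buckets around your latency objective give better percentiles.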

Advanced queries: topk, prediction, error rate

# Top 5 process groups by CPU consumption
# (requires process-exporter, which is not part of the stack deployed above)
topk(5,
  sum by (groupname) (
    rate(namedprocess_namegroup_cpu_seconds_total[5m])
  )
)

# Prediction: will the disk be full within 48 hours?
# predict_linear projects the trend of the last 6 hours 48 hours ahead
# The expression returns a series only when the projected free space is below 0
predict_linear(
  node_filesystem_avail_bytes{mountpoint="/"}[6h],
  48 * 3600
) < 0

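# HTTP error rate (4xx + 5xx) as a percentage
# Assumes an exporter that labels requests with a status code; the stub_status-based
# Nginx exporter shown later in this guide does not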
(
  sum(rate(nginx_http_requests_total{status=~"[45].."}[5m]))
  /
  sum(rate(nginx_http_requests_total[5m]))
) * 100

# Availability over 24h (for an SLO dashboard)
avg_over_time(up{job="servers"}[24h]) * 100
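
The disk prediction above is a plain linear regression. As a rough sketch of the arithmetic, the function below fits a least-squares line through the samples and evaluates it some seconds after the last one. The sample values are invented, and the function name linear_prediction is only for this illustration.

```python
def linear_prediction(samples, horizon_seconds):
    """Fit a least-squares line through (timestamp, value) samples and
    return the projected value horizon_seconds after the last sample."""
    if len(samples) < 2:
        raise ValueError("at least two samples are required")
    count = len(samples)
    mean_t = sum(t for t, _ in samples) / count
    mean_v = sum(v for _, v in samples) / count
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + horizon_seconds) + intercept


# Hypothetical disk: 40 GB free, losing 1 GB per hour, sampled hourly for 6 hours.
free_bytes = [(hour * 3600, 40e9 - 1e9 * hour) for hour in range(7)]
projected = linear_prediction(free_bytes, 48 * 3600)
print("projected free bytes in 48h:", projected)
print("alert condition (< 0):", projected < 0)
```

A negative projection means the trend line crosses zero before the horizon, which is exactly what the `< 0` comparison in the query tests.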

Prometheus alerting rules

# ~/monitoring/prometheus/rules/node_alerts.yml
groups:
  - name: instance_availability
    interval: 15s              # Override of the global evaluation_interval
    rules:
      # ── Instance unreachable ─────────────────────────────────────────────────
      - alert: InstanceDown
        expr: up == 0
        for: 2m                # Fires only if DOWN for 2 minutes
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} unreachable"
          description: "The job {{ $labels.job }} on {{ $labels.instance }} has not responded for 2 minutes. Check the state of the server and the service."
          runbook: "https://wiki.example.com/runbooks/instance-down"

  - name: system_resources
    rules:
      # ── High CPU ─────────────────────────────────────────────────────────────
      - alert: HighCpuUsage
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU on {{ $labels.instance }}"
          description: "CPU usage of {{ printf \"%.1f\" $value }}% on {{ $labels.instance }} for 5 minutes."

      - alert: HighCpuUsageWarning
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage of {{ printf \"%.1f\" $value }}% on {{ $labels.instance }} for 10 minutes."

      # ── Critical memory ──────────────────────────────────────────────────────
      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical memory on {{ $labels.instance }}"
          description: "{{ printf \"%.1f\" $value }}% of memory used on {{ $labels.instance }}."

      # ── Disk > 85% ────────────────────────────────────────────────────────────
      - alert: DiskSpaceWarning
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"} /
               node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "{{ printf \"%.1f\" $value }}% of the disk used on {{ $labels.instance }} (/ partition)."

      - alert: DiskSpaceCritical
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"} /
               node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}) * 100 > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk almost full on {{ $labels.instance }}"
          description: "{{ printf \"%.1f\" $value }}% of the disk used on {{ $labels.instance }}. Urgent action required."

      # ── Disk full within 48h ─────────────────────────────────────────────────
      - alert: DiskWillFillIn48h
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}[6h], 48 * 3600) < 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Disk predicted full within 48h on {{ $labels.instance }}"
          description: "Based on the trend of the last 6 hours, the disk on {{ $labels.instance }} will be full in less than 48 hours."

  - name: ssl_certificates
    rules:
      # ── SSL certificate expiring within 30 days ──────────────────────────────
      - alert: SslCertificateExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expiring soon for {{ $labels.instance }}"
          description: "The SSL certificate for {{ $labels.instance }} expires in {{ printf \"%.0f\" $value }} days."

      - alert: SslCertificateExpiringCritical    # Fires when fewer than 7 days remain
        expr: |
          (probe_ssl_earliest_cert_expiry - time()) / 86400 < 7
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Critical SSL certificate for {{ $labels.instance }}"
          description: "The SSL certificate for {{ $labels.instance }} expires in {{ printf \"%.0f\" $value }} days. Immediate action required."

  - name: http_availability
    rules:
      # ── HTTP endpoint unreachable ────────────────────────────────────────────
      - alert: HttpEndpointDown
        expr: probe_success == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "HTTP endpoint unreachable: {{ $labels.instance }}"
          description: "The URL {{ $labels.instance }} has not responded for 3 minutes."
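
The `for:` clause in these rules is what separates a brief blip from a real incident. The sketch below replays that behavior on a list of evaluation results: an alert stays pending until its condition has held for the whole duration, and any evaluation where the condition is false resets it. This is a simplified model written for this guide, not Prometheus code.

```python
def alert_states(condition_by_eval, for_seconds, interval_seconds):
    """Return the alert state after each rule evaluation.

    condition_by_eval is a list of booleans, one per evaluation.
    """
    states = []
    active_since = None
    for index, condition in enumerate(condition_by_eval):
        now = index * interval_seconds
        if not condition:
            # The condition stopped holding: the pending timer is reset
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# InstanceDown: "for: 2m" with an evaluation every 15 seconds.
steady_outage = alert_states([True] * 10, for_seconds=120, interval_seconds=15)
flapping = alert_states([True] * 5 + [False] + [True] * 5, for_seconds=120, interval_seconds=15)
print(steady_outage)
print(flapping)
```

The flapping target never reaches the firing state, which is the point of choosing a `for:` duration longer than your usual transient failures.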

AlertManager: routes, receivers, inhibitions

# ~/monitoring/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

  # Global SMTP configuration
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alertmanager@example.com'            # placeholder, use your sender address
  smtp_auth_username: 'alertmanager@example.com'   # placeholder, use your SMTP login
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

# ─── Routing tree ───────────────────────────────────────────────────────────────
route:
  # Group by alertname, datacenter and job to avoid duplicate notifications
  group_by: ['alertname', 'datacenter', 'job']
  group_wait: 30s              # Wait 30s to group the initial alerts
  group_interval: 5m           # Interval between notifications for an active group
  repeat_interval: 4h          # Repeat if the alert persists
  receiver: 'email-ops'        # Default receiver

  routes:
    # Critical alerts → Slack as priority + frequent repeat
    - match:
        severity: critical
      receiver: 'slack-critical'
      group_wait: 10s           # Less wait for critical alerts
      repeat_interval: 30m
      continue: true            # Continue to the other routes (email too)

    # Critical alerts → email as well
    - match:
        severity: critical
      receiver: 'email-ops'

    # Warning alerts → separate Slack channel
    - match:
        severity: warning
      receiver: 'slack-warning'
      repeat_interval: 6h

# ─── Receivers ────────────────────────────────────────────────────────────────
receivers:
  - name: 'email-ops'
    email_configs:
      - to: 'ops@example.com'    # placeholder, use your team's address
        subject: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }} - {{ .CommonAnnotations.summary }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Severity: {{ .Labels.severity }}
          Instance: {{ .Labels.instance }}
          Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
          {{ end }}
        send_resolved: true
        headers:
          X-Priority: '1'

  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/xxxxxxxxxxxxxxxxxxxxxxxx'
        channel: '#critical-alerts'
        username: 'AlertManager'
        icon_emoji: ':rotating_light:'
        title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          *Instance:* {{ .Labels.instance }}
          *Datacenter:* {{ .Labels.datacenter }}
          {{ end }}
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        send_resolved: true

  - name: 'slack-warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/xxxxxxxxxxxxxxxxxxxxxxxx'
        channel: '#monitoring-alerts'
        username: 'AlertManager'
        icon_emoji: ':warning:'
        title: '[WARNING] {{ .CommonLabels.alertname }}'
        # Double quotes so that \n is a real line break (single quotes keep it literal)
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}\n{{ end }}"
        send_resolved: true

# ─── Inhibitions ──────────────────────────────────────────────────────────────
inhibit_rules:
  # If the instance is DOWN, suppress the CPU/RAM/Disk alerts for the same instance
  - source_match:
      alertname: 'InstanceDown'
    target_match_re:
      alertname: '(HighCpuUsage|HighCpuUsageWarning|HighMemoryUsage|DiskSpaceWarning|DiskSpaceCritical|DiskWillFillIn48h)'
    equal: ['instance']

  # If a critical alert fires, suppress its warning counterpart on the same instance.
  # The rules above use a different alertname per severity, so each pair is listed explicitly.
  - source_match:
      alertname: 'HighCpuUsage'
    target_match:
      alertname: 'HighCpuUsageWarning'
    equal: ['instance']

  - source_match:
      alertname: 'DiskSpaceCritical'
    target_match:
      alertname: 'DiskSpaceWarning'
    equal: ['instance']
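
The effect of `continue: true` in the routing tree is easier to see by walking it by hand. The sketch below models the three child routes of this alertmanager.yml as Python dicts and returns the receivers an alert would reach. It handles a single level of routes with exact-match labels only, which is an assumption of this example; AlertManager itself walks nested routes and supports regex matchers.

```python
def route_matches(route, labels):
    """True when every label in the route's match block equals the alert's label."""
    return all(labels.get(key) == value for key, value in route.get("match", {}).items())


def receivers_for(root, labels):
    """Return the receivers notified for an alert, honoring 'continue'."""
    matched = []
    for child in root.get("routes", []):
        if route_matches(child, labels):
            matched.append(child["receiver"])
            if not child.get("continue", False):
                break  # The first match wins unless the route says to continue
    # No child matched: the alert falls back to the root receiver
    return matched or [root["receiver"]]


# Same structure as the route: section of alertmanager.yml.
routing_tree = {
    "receiver": "email-ops",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "slack-critical", "continue": True},
        {"match": {"severity": "critical"}, "receiver": "email-ops"},
        {"match": {"severity": "warning"}, "receiver": "slack-warning"},
    ],
}
print(receivers_for(routing_tree, {"alertname": "InstanceDown", "severity": "critical"}))
print(receivers_for(routing_tree, {"alertname": "DiskSpaceWarning", "severity": "warning"}))
print(receivers_for(routing_tree, {"alertname": "Custom", "severity": "info"}))
```

Without `continue: true` on the first route, critical alerts would stop at Slack and never reach the email receiver.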

Grafana: dashboards, variables and provisioning as code

Automatically provisioned datasource

Grafana provisioning lets you deploy datasources and dashboards without going through the web interface. Configuration as code, versioned in Git.

# ~/monitoring/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    uid: prometheus-uid         # Fixed UID to reference in dashboards
    editable: false
    jsonData:
      httpMethod: POST
      prometheusType: Prometheus
      prometheusVersion: 2.53.0
      timeInterval: 15s
      queryTimeout: 60s

Dashboard provisioning

# ~/monitoring/grafana/provisioning/dashboards/default.yml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: 'Production'
    type: file
    disableDeletion: true       # Prevents deletion via the UI
    updateIntervalSeconds: 30   # Reloads the files every 30s
    allowUiUpdates: false       # UI changes do not persist
    options:
      path: /var/lib/grafana/dashboards

Dashboard as code: simplified Node Exporter

{
  "title": "Infrastructure Overview",
  "uid": "infra-overview",
  "tags": ["production", "node-exporter"],
  "refresh": "30s",
  "time": { "from": "now-3h", "to": "now" },
  "templating": {
    "list": [
      {
        "name": "instance",
        "type": "query",
        "label": "Server",
        "datasource": "Prometheus",
        "query": "label_values(node_uname_info, instance)",
        "refresh": 2,
        "multi": false,
        "includeAll": true,
        "allValue": ".*"
      }
    ]
  },
  "panels": [
    {
      "type": "stat",
      "title": "CPU Usage",
      "gridPos": { "x": 0, "y": 0, "w": 6, "h": 4 },
      "targets": [{
        "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode='idle', instance=~'$instance'}[5m])) * 100)",
        "legendFormat": "{{ instance }}"
      }],
      "options": { "reduceOptions": { "calcs": ["lastNotNull"] } },
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 75 },
              { "color": "red", "value": 90 }
            ]
          }
        }
      }
    }
  ]
}

Importing the Node Exporter Full dashboard (ID 1860)

Grafana provides a catalog of community dashboards. The 1860 dashboard (Node Exporter Full) covers all system metrics in a single import:

  1. Go to Dashboards > New > Import
  2. Enter the ID 1860 in the "Import via grafana.com" field
  3. Select your Prometheus datasource from the dropdown menu
  4. Click Import

Other useful dashboards: 13978 (Node Exporter Quickstart), 7587 (Docker monitoring), 9614 (Nginx), 9628 (PostgreSQL).

Dashboard variables

Variables turn a static dashboard into an interactive, multi-server tool. Create an instance variable via Dashboard Settings > Variables > New variable:

  • Type: Query
  • Query: label_values(node_uname_info, instance)
  • Refresh: On time range change
  • Multi-value: enabled to compare several servers

Then use $instance in all your queries: node_cpu_seconds_total{instance=~"$instance"}. The filter applies to all panels simultaneously.

Automatic annotations from AlertManager

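Grafana annotations let you overlay alert events on graphs. The ALERTS series is generated by Prometheus itself for every pending or firing alerting rule, so the annotation queries the Prometheus datasource rather than AlertManager. Configure an automatic annotation in the dashboard settings: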

  • Source: Prometheus
  • Query: ALERTS{alertstate="firing"}
  • Title: {{alertname}}
  • Tags: {{severity}}

Additional exporters

Nginx Exporter

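# Enable stub_status in Nginx
# /etc/nginx/conf.d/stub_status.conf
# The exporter below runs in a container and reaches the host through the Docker
# gateway address, so the listener cannot be bound to 127.0.0.1 only.
# 172.16.0.0/12 covers Docker's usual default bridge ranges; adjust it to your
# network and keep port 8080 closed on the public firewall.
server {
    listen 8080;
    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        allow 172.16.0.0/12;
        deny all;
    }
}

# Add to docker-compose.yml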
  nginx-exporter:
    image: nginx/nginx-prometheus-exporter:1.1.0
    container_name: nginx-exporter
    restart: unless-stopped
    ports:
      - "127.0.0.1:9113:9113"
    command:
      - --nginx.scrape-uri=http://host-gateway:8080/nginx_status    # 1.x releases expect double-dash flags
    networks:
      - monitoring
    extra_hosts:
      - "host-gateway:host-gateway"

PostgreSQL Exporter

# Add to docker-compose.yml
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:v0.15.0
    container_name: postgres-exporter
    restart: unless-stopped
    environment:
      DATA_SOURCE_NAME: "postgresql://exporter:password@postgres:5432/postgres?sslmode=disable"
    ports:
      - "127.0.0.1:9187:9187"
    networks:
      - monitoring

-- Create the PostgreSQL user for the exporter
CREATE USER exporter WITH PASSWORD 'password';
ALTER USER exporter SET SEARCH_PATH TO exporter,pg_catalog;
GRANT CONNECT ON DATABASE postgres TO exporter;
GRANT pg_monitor TO exporter;

Adding the exporters to Prometheus

# Add to prometheus.yml / scrape_configs
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']

  - job_name: 'postgresql'
    static_configs:
      - targets: ['postgres-exporter:9187']

Startup and operations

# Start the whole stack
cd ~/monitoring
docker compose up -d

# Check the state of all services
docker compose ps

# View logs in real time
docker compose logs -f prometheus

# Validate the Prometheus configuration before reload
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml

# Validate the alerting rules
docker compose exec prometheus promtool check rules /etc/prometheus/rules/node_alerts.yml

# Reload Prometheus without restarting (hot reload)
curl -X POST http://localhost:9090/-/reload

# Check the active targets
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -E "health|job|instance"

# List active alerts in AlertManager
curl -s http://localhost:9093/api/v2/alerts | python3 -m json.tool
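
When the number of targets grows, the grep on /api/v1/targets becomes hard to read. Here is a small sketch that summarizes the same API response per job and lists the unhealthy targets. The fetch_targets and summarize_targets names are made up for this example, and the sample payload only contains the fields the script uses.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_targets(base_url="http://localhost:9090"):
    """Download the targets API response from a running Prometheus."""
    with urlopen(f"{base_url}/api/v1/targets") as response:
        return json.load(response)


def summarize_targets(payload):
    """Count targets per (job, health) and list the ones that are not up."""
    counts = Counter()
    unhealthy = []
    for target in payload["data"]["activeTargets"]:
        job = target["labels"].get("job", "unknown")
        counts[(job, target["health"])] += 1
        if target["health"] != "up":
            unhealthy.append((target["scrapeUrl"], target.get("lastError", "")))
    return counts, unhealthy


# Hand-written sample in the shape of the API response, for illustration.
sample_payload = {
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "node"}, "scrapeUrl": "http://node-exporter:9100/metrics",
         "health": "up", "lastError": ""},
        {"labels": {"job": "servers"}, "scrapeUrl": "http://192.168.1.10:9100/metrics",
         "health": "down", "lastError": "connection refused"},
    ]},
}
counts, unhealthy = summarize_targets(sample_payload)
for (job, health), number in sorted(counts.items()):
    print(f"{job:10s} {health:5s} {number}")
for url, error in unhealthy:
    print("DOWN:", url, error)
```

Against a live server, replace sample_payload with fetch_targets().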

High availability: Thanos or Grafana Mimir

Prometheus alone has two limitations in critical production: a single instance (SPOF), and retention limited by local disk capacity (15-30 days recommended). Thanos and Grafana Mimir solve both problems.

Thanos architecture


Prometheus instance 1 ──── Thanos Sidecar ────┐
                                               │
Prometheus instance 2 ──── Thanos Sidecar ────┤──► Thanos Query ──► Grafana
                                               │       (dedup)
Prometheus HA pair         S3 / Minio ◄────────┘
                           (long-term retention)   Thanos Store
                                                   (query S3)

Deploying the Thanos Sidecar

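# Docker Compose extension for Thanos
# Before the sidecar uploads blocks, Prometheus must run with
# --storage.tsdb.min-block-duration=2h and --storage.tsdb.max-block-duration=2h,
# and each replica needs a distinct "replica" external label for deduplication.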
  thanos-sidecar:
    image: thanosio/thanos:v0.35.0
    container_name: thanos-sidecar
    command:
      - sidecar
      - --prometheus.url=http://prometheus:9090
      - --tsdb.path=/prometheus
      - --grpc-address=0.0.0.0:10901
      - --http-address=0.0.0.0:10902
      - --objstore.config-file=/etc/thanos/s3.yml
    volumes:
      - prometheus_data:/prometheus
      - ./thanos/s3.yml:/etc/thanos/s3.yml:ro
    networks:
      - monitoring

  thanos-query:
    image: thanosio/thanos:v0.35.0
    container_name: thanos-query
    command:
      - query
      - --http-address=0.0.0.0:9091
      - --endpoint=thanos-sidecar:10901
      - --query.replica-label=replica
    ports:
      - "127.0.0.1:9091:9091"
    networks:
      - monitoring

# ~/monitoring/thanos/s3.yml
type: S3
config:
  bucket: monitoring-thanos
  endpoint: s3.eu-west-3.amazonaws.com
  region: eu-west-3
  access_key: AKIAXXXXXXXXXXXXXXXX
  secret_key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Grafana Mimir: the simplified alternative
Grafana Mimir covers the same needs as Thanos (HA, S3 storage, long-term retention) and can run in a monolithic mode that is simpler to deploy. It ingests the metrics that Prometheus sends through remote_write. A good fit if you are getting started with Prometheus HA.

Retention and sizing

Targets         Metrics/instance   Recommended RAM   Disk (30d)
5 servers       ~1,000             512 MB            5 GB
20 servers      ~1,000             2 GB              20 GB
100 servers     ~1,000             8 GB              100 GB
1,000 servers   ~1,000             32 GB             → Thanos/Mimir
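
To sanity-check the disk column for your own setup, the Prometheus documentation gives a rule of thumb: retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample after compression. The sketch below applies it with 2 bytes per sample and a 15-second scrape interval, both assumptions you should adjust. The raw estimate comes out lower than the table, which leaves headroom for the WAL, compaction and series churn.

```python
def estimate_disk_bytes(servers, series_per_server, scrape_interval_s,
                        retention_days, bytes_per_sample=2.0):
    """Rule-of-thumb TSDB size: retention x ingestion rate x bytes per sample."""
    samples_per_second = servers * series_per_server / scrape_interval_s
    retention_seconds = retention_days * 86400
    return samples_per_second * retention_seconds * bytes_per_sample


for servers in (5, 20, 100):
    size_gb = estimate_disk_bytes(servers, 1000, 15, 30) / 1e9
    print(f"{servers:4d} servers: about {size_gb:.1f} GB for 30 days")
```

Doubling the scrape interval halves the estimate, which is often the cheapest lever before shortening retention.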

Troubleshooting

Prometheus is not scraping a target

# List the targets and their state
curl -s http://localhost:9090/api/v1/targets | \
  python3 -c "import sys,json; [print(t['scrapeUrl'], t['health'], t.get('lastError','')) for t in json.load(sys.stdin)['data']['activeTargets']]"

# Manually test the target endpoint
curl -v http://192.168.1.10:9100/metrics | head -20

# Check the Prometheus logs
docker compose logs prometheus --tail=50 | grep -i error

# Test connectivity from the Prometheus container
docker compose exec prometheus wget -qO- http://node-exporter:9100/metrics | head -5

Grafana is not loading data

# Check that Prometheus responds
curl -s 'http://localhost:9090/api/v1/query?query=up' | python3 -m json.tool

# Test the datasource from the Grafana UI
# Connections > Data sources > Prometheus > Save & test
# (Configuration > Data Sources in older Grafana versions)

# Check the Grafana logs
docker compose logs grafana --tail=50 | grep -i error

AlertManager is not receiving alerts

# Check the Prometheus -> AlertManager connection
curl -s http://localhost:9090/api/v1/alertmanagers | python3 -m json.tool

# See the pending alerts in Prometheus
curl -s http://localhost:9090/api/v1/alerts | python3 -m json.tool

# Check the AlertManager config
docker compose exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml

# Test sending a manual alert
curl -XPOST http://localhost:9093/api/v2/alerts -H "Content-Type: application/json" -d '[{
  "labels": {"alertname": "TestAlert", "severity": "warning"},
  "annotations": {"summary": "Test from curl"}
}]'

Operational stack
Your monitoring stack is now in place. Start by importing the 1860 dashboard into Grafana to immediately get a comprehensive view of your Linux servers. Then gradually add your alerting rules and your business exporters as needed.

Conclusion

You now have a production-grade monitoring stack that covers the entire observability cycle:

  • Prometheus collects and stores metrics from all your servers via a reliable pull model, with automatic detection of downed instances.
  • Node Exporter + Blackbox Exporter expose system metrics and monitor your HTTP endpoints and SSL certificates.
  • PromQL enables expressive queries to compute CPU, RAM, disk, error rate and latency percentiles.
  • Grafana visualizes everything in interactive dashboards with variables, annotations and native alerting. The 1860 dashboard covers most system metrics right after import.
  • AlertManager routes alerts intelligently with grouping, inhibitions and silences to avoid alert storms.
  • Thanos or Mimir extend the stack for high availability and long-term retention as the infrastructure grows.

To go further, explore Loki (log centralization, same Grafana stack), Tempo (distributed tracing) and Pyroscope (continuous profiling) to complete the three pillars of observability: metrics, logs and traces.

Written by

Morgann Riu

Cybersecurity and Linux administration expert. I share my knowledge through free tutorials and training to help system administrators and developers secure their infrastructures.
