What is the difference between Prometheus and Zabbix for monitoring?

Prometheus is designed for the cloud-native and Kubernetes world: pull model, standardized text-based metrics format, expressive PromQL, native integration with Docker and Kubernetes. It excels in dynamic environments where targets appear and disappear. Zabbix is a traditional monitoring solution that supports both push and pull, ships with a powerful native agent for Windows and Linux, and offers a more complete web interface out of the box. Prometheus is the clear choice in containerized and cloud-native infrastructures; Zabbix remains relevant in mixed Windows/Linux environments or for teams that want a turnkey solution without building their own Grafana stack.

Why does Prometheus use a pull model rather than push?

The pull model offers several structural advantages. Prometheus controls the collection frequency, which makes it possible to detect unreachable instances (the "up" metric drops to 0). The monitoring configuration is centralized in Prometheus rather than scattered across every agent. You can manually validate the /metrics endpoints from any machine without deploying an agent. The pull model is also simpler to secure: only Prometheus needs network access to the targets. The trade-off is that short-lived jobs (batch, cron) that run between two scrapes are invisible: for those cases, the Pushgateway lets you push metrics before the job ends.

How does PromQL work and what are the most important functions?

PromQL operates on two data types: instant vectors (an instantaneous value per time series) and range vectors (a set of values over a time window, written [5m]). The essential functions are rate() and irate() to compute the per-second rate of change of monotonic counters (rate() smooths over the window, irate() uses the last two points to detect spikes), increase() to obtain the total increase over a period, histogram_quantile() to compute percentiles from a histogram, topk() and bottomk() to select the N highest or lowest series, and the sum/avg/max/min by (label) aggregations to group series by dimension. The basic rule: always use rate() or increase() on counters (suffix _total), never a direct difference.
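To make the counter rule concrete, here is a minimal Python sketch of the idea behind increase() and rate(): add up the positive deltas between consecutive samples and treat any decrease as a counter reset. The function names and sample values are illustrative assumptions, and the sketch skips the extrapolation to the window boundaries that Prometheus performs.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_increase(samples: List[Sample]) -> float:
    """Total increase across the samples, tolerating counter resets."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The counter went backwards: the process restarted and the
            # counter began again from zero, so the new value is all increase.
            total += curr
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second rate between the first and the last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


# Hypothetical samples scraped every 15 s; the counter resets before the 4th scrape.
samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 20.0), (60, 50.0)]
naive_difference = samples[-1][1] - samples[0][1]

print("naive last - first:", naive_difference)
print("reset-aware increase:", counter_increase(samples))
print("reset-aware rate per second:", simple_rate(samples))
```

The naive difference is negative because of the restart, which is exactly why a direct subtraction on a counter is unreliable.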

How does AlertManager avoid alert storms?

AlertManager has three complementary mechanisms. Grouping (group_by) aggregates alerts sharing the same labels into a single notification: if 20 instances go down at the same time, you receive a single grouped message. group_wait defines how long to wait before sending the first notification for a new group (to aggregate alerts that arrive together), group_interval defines how long to wait before notifying about new alerts added to a group that has already been notified, and repeat_interval controls how often a notification is re-sent while an alert stays unresolved. Inhibitions (inhibit_rules) let you suppress lower-priority alerts when a related higher-priority alert is already firing on the same instance (don't send "high CPU" if the instance is already "DOWN"). Silences let you temporarily mute alerts during planned maintenance.
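The grouping step can be pictured with a short Python sketch. It is only a model of the group_by idea, not AlertManager's implementation: alerts are plain label dictionaries, and the function name and sample alerts are assumptions made for the illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]  # an alert is represented by its label set


def group_alerts(alerts: List[Alert], group_by: List[str]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Bucket alerts by the values of the group_by labels."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        # A missing label is treated as an empty value.
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical outage: 20 servers go down together, plus one unrelated CPU alert.
alerts = [
    {"alertname": "InstanceDown", "job": "servers", "instance": f"192.168.1.{i}:9100"}
    for i in range(10, 30)
]
alerts.append({"alertname": "HighCpuUsage", "job": "node", "instance": "node-exporter:9100"})

groups = group_alerts(alerts, group_by=["alertname", "job"])
for key, members in groups.items():
    print(key, "->", len(members), "alert(s) in one notification")
```

Twenty-one alerts collapse into two notifications because the instance label is deliberately left out of group_by.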

What is Thanos and when should it be used instead of Prometheus alone?

Thanos addresses the limitations of Prometheus for large infrastructures. It adds high availability (several Prometheus instances with metric deduplication), long-term retention by storing TSDB blocks on S3, GCS or Azure Blob Storage (retention bounded only by your object storage budget, versus the 15-day default for Prometheus), and a unified global view of several Prometheus instances via Thanos Query. You should consider it when you have more than one Prometheus instance (multi-datacenter, multi-cluster), when 15-30 days of retention is not enough (compliance, capacity planning), or when the metric volume exceeds the capacity of a single server. Grafana Mimir is an open-source alternative from Grafana Labs offering similar features; which of the two is simpler to operate depends on your environment and team.

How do you secure access to Prometheus and Grafana in production?

For Prometheus, enable native basic authentication via --web.config.file with bcrypt-hashed passwords, or place an Nginx/Traefik reverse proxy in front with Let's Encrypt TLS. Ports 9090, 9093 and 9100 should never be exposed on the Internet: use a firewall (UFW) to restrict access to internal IPs only. For Grafana, enable HTTPS, disable allow_sign_up and anonymous access in grafana.ini, configure OAuth SSO (GitHub, Google, Keycloak) to centralize authentication, and use the role system (Admin, Editor, Viewer) for access control. To trace who changed what, enable Grafana audit logging (an Enterprise and Cloud feature) or rely on dashboard version history. In multi-tenant production, Grafana supports organizations to isolate teams.

What is the Blackbox Exporter and how do you monitor SSL certificate expiry?

The Blackbox Exporter is a Prometheus exporter that performs external probes: HTTP checks (status code, content, SSL certificate), TCP, DNS and ICMP. To monitor SSL certificates, it exposes the metric probe_ssl_earliest_cert_expiry, which gives the certificate's expiry timestamp in Unix seconds. A classic alert is: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30, which fires when fewer than 30 days remain before expiry. This lets you monitor all your domains from a single exporter and be alerted well before expiry, without relying on manual checks. The Blackbox Exporter also tests the external availability of your services, complementing the Node Exporter which monitors the internal side.
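The arithmetic behind that alert expression is easy to check by hand. The following Python sketch reproduces it with a fixed "now" so the result is reproducible; the function names and timestamps are assumptions chosen for the example.

```python
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(expiry_unix: float, now: Optional[float] = None) -> float:
    """Mirror of (probe_ssl_earliest_cert_expiry - time()) / 86400."""
    current = time.time() if now is None else now
    return (expiry_unix - current) / SECONDS_PER_DAY


def expiring_soon(expiry_unix: float, now: Optional[float] = None, threshold_days: float = 30) -> bool:
    """True when fewer than threshold_days remain, like the alert condition."""
    return days_until_expiry(expiry_unix, now) < threshold_days


# Hypothetical values: a certificate that expires 20 days after "now".
now = 1_700_000_000
expiry = now + 20 * SECONDS_PER_DAY

print("days left:", days_until_expiry(expiry, now))
print("alert fires:", expiring_soon(expiry, now))
```

A negative result means the certificate has already expired, which the same `< 30` condition also catches.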

Prometheus & Grafana: Complete Monitoring for Your Linux Infrastructure | Morgann Riu

Prerequisites
Docker and Docker Compose installed on your server. Basic Linux administration knowledge (systemd, YAML configuration files). For installing Docker, see the Docker installation guide.

Why Prometheus? Pull vs Push, the real architecture

For a long time, monitoring a Linux infrastructure meant Nagios and its plugins, or Zabbix and its agent. These tools are often deployed in a push style, where each agent sends its data to a central collector (both can also poll their targets). Prometheus reverses this paradigm and makes pull the default: the Prometheus server periodically queries each target on its /metrics endpoint.

This reversal is not cosmetic. It fundamentally changes how monitoring is operated:

Detecting dead instances: if Prometheus cannot scrape a target, the up metric drops to 0. With push, a silent instance is hard to tell apart from one that simply has nothing to send.
Centralized configuration: all targets are defined in Prometheus, not scattered across each agent. A single YAML file to see the entire infrastructure.
Simple debugging: the /metrics endpoint is plain text readable with curl, with no intermediate agent to debug.
Standardized ecosystem: the Prometheus exposition format served as the basis for the OpenMetrics standard and is supported, natively or through exporters, by hundreds of applications (Nginx, PostgreSQL, Redis, Kubernetes, etc.).
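Because the /metrics output is plain text, a few lines of Python are enough to read it. The parser below is a simplified sketch for illustration: it handles the common `name{labels} value` lines and skips comments, but it is not a complete implementation of the exposition format, and the sample payload is made up.

```python
import re
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# metric name, optional {label block}, then the value (a trailing timestamp is ignored)
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> List[Sample]:
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and # HELP / # TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical excerpt of what `curl http://localhost:9100/metrics` might return.
payload = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_cpu_seconds_total{cpu="0",mode="user"} 56.25
up 1
"""

for name, labels, value in parse_metrics(payload):
    print(name, labels, value)
```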

The four components of the stack


┌─────────────────┐   scrape /metrics   ┌──────────────────┐   alerts     ┌─────────────────┐
│  Node Exporter  │ ◄─────────────────── │   Prometheus     │ ──────────►  │  AlertManager   │
│  (port 9100)    │                      │   (port 9090)    │              │  (port 9093)    │
│  Nginx Exporter │ ◄─────────────────── │   local TSDB     │              │  Slack / Email  │
│  (port 9113)    │                      │   PromQL         │              │  PagerDuty      │
│  Blackbox Exp.  │ ◄─────────────────── │   Alerting rules │              └─────────────────┘
│  (port 9115)    │                      └──────────────────┘
└─────────────────┘                               │
                                                  │ datasource
                                                  ▼
                                       ┌──────────────────┐
                                       │     Grafana       │
                                       │   (port 3000)    │
                                       │   Dashboards     │
                                       │   Alerting       │
                                       └──────────────────┘

Prometheus: collects (scrapes) metrics, stores them in a local time-series database (TSDB), and evaluates alerting rules at a fixed interval (every 15 seconds in this setup).
Exporters: translate system or application metrics into the Prometheus format. Node Exporter for Linux, Nginx Exporter for Nginx, Blackbox Exporter for HTTP/SSL probes.
AlertManager: receives alerts from Prometheus, deduplicates them, groups them, and routes them to the right channels with silence and inhibition handling.
Grafana: visualization interface that queries Prometheus via PromQL to display interactive dashboards. It stores no metrics and collects none itself.

Installation: complete Docker Compose stack

Deploying the full stack with Docker Compose is the most reproducible method. A single file describes the entire monitoring infrastructure.

File structure

mkdir -p ~/monitoring/{prometheus,grafana,alertmanager}
mkdir -p ~/monitoring/prometheus/{rules,targets}
mkdir -p ~/monitoring/grafana/{provisioning/datasources,provisioning/dashboards,dashboards}
cd ~/monitoring

Complete Docker Compose

# ~/monitoring/docker-compose.yml
version: '3.8'

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
    driver: local
  grafana_data:
    driver: local
  alertmanager_data:
    driver: local

services:

  # ─── Prometheus ────────────────────────────────────────────────
  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
    restart: unless-stopped
    user: "65534:65534"           # nobody:nobody, no root
    ports:
      - "127.0.0.1:9090:9090"    # Listen on localhost only
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - ./prometheus/targets:/etc/prometheus/targets:ro   # Read by the file-based service discovery job
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=10GB'
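      - '--web.enable-lifecycle'           # Allows POST /-/reload (used for hot reloads)
      - '--web.enable-admin-api'           # Exposes TSDB delete/snapshot endpoints; remove if unused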
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    networks:
      - monitoring
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

  # ─── Node Exporter ─────────────────────────────────────────────
  node-exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node-exporter
    restart: unless-stopped
    pid: host                    # Access to host metrics
    ports:
      - "127.0.0.1:9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
      - '--collector.systemd'      # Needs access to the host's systemd D-Bus socket from inside the container
    networks:
      - monitoring

  # ─── AlertManager ──────────────────────────────────────────────
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    user: "65534:65534"
    ports:
      - "127.0.0.1:9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=http://localhost:9093'
      - '--cluster.listen-address='      # Disable clustering for a single instance
    networks:
      - monitoring
    depends_on:
      - prometheus

  # ─── Grafana ───────────────────────────────────────────────────
  grafana:
    image: grafana/grafana:11.1.0
    container_name: grafana
    restart: unless-stopped
    user: "472:472"
    ports:
      - "127.0.0.1:3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-changeme}
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_ANALYTICS_REPORTING_ENABLED: "false"
      GF_SERVER_ROOT_URL: https://grafana.example.com
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    networks:
      - monitoring
    depends_on:
      - prometheus
    healthcheck:
      test: ["CMD-SHELL", "wget --quiet --tries=1 --spider http://localhost:3000/api/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

  # ─── Blackbox Exporter ─────────────────────────────────────────
  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox-exporter
    restart: unless-stopped
    ports:
      - "127.0.0.1:9115:9115"
    volumes:
      - ./prometheus/blackbox.yml:/etc/blackbox_exporter/config.yml:ro
    networks:
      - monitoring

Network security
Note that all ports are bound to 127.0.0.1 only. Never expose Prometheus, AlertManager or Node Exporter directly on the Internet. Use an Nginx reverse proxy with TLS to access Grafana from outside.

Prometheus configuration

prometheus.yml: scrape_configs and service discovery

# ~/monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s          # Collect every 15 seconds
  evaluation_interval: 15s      # Evaluate rules every 15s
  scrape_timeout: 10s           # Timeout per scrape

  # Labels added to all metrics from this instance
  external_labels:
    datacenter: 'paris-1'
    environment: 'production'

# Loading the alerting rules
rule_files:
  - "/etc/prometheus/rules/*.yml"

# AlertManager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
      timeout: 10s

scrape_configs:
  # ── Prometheus itself ──────────────────────────────────────────
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # ── Node Exporter ──────────────────────────────────────────────
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          hostname: 'monitoring-server'
          role: 'monitoring'

  # ── Remote servers (static targets) ────────────────────────────
  - job_name: 'servers'
    scrape_interval: 30s        # Override per job
    static_configs:
      - targets:
          - '192.168.1.10:9100'
          - '192.168.1.11:9100'
          - '192.168.1.12:9100'
        labels:
          environment: 'production'
          role: 'web'
      - targets:
          - '192.168.1.20:9100'
          - '192.168.1.21:9100'
        labels:
          environment: 'production'
          role: 'database'

  # ── Blackbox Exporter (HTTP probes) ───────────────────────────
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - 'https://example.com'
          - 'https://api.example.com/health'
          - 'https://grafana.example.com'
    relabel_configs:
      # Copy the target URL into the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Use the URL as the "instance" label
      - source_labels: [__param_target]
        target_label: instance
      # Redirect the scrape to the Blackbox Exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # ── Blackbox Exporter (SSL expiry) ────────────────────────────
  - job_name: 'blackbox-ssl'
    metrics_path: /probe
    params:
      module: [ssl_expiry]
    static_configs:
      - targets:
          - 'example.com:443'
          - 'api.example.com:443'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # ── File-based service discovery (for dynamic infra) ──────────
  - job_name: 'dynamic-servers'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
        refresh_interval: 1m    # Reload the list every minute
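For the file-based discovery job, each JSON file is a list of target groups with "targets" and optional "labels". The Python sketch below generates such a file from an inventory; the inventory, label values, file name and helper names are assumptions for the example. It writes to a temporary file and renames it so that Prometheus never reads a half-written list.

```python
import json
import os
import tempfile
from typing import Dict, List


def build_target_groups(hosts_by_role: Dict[str, List[str]], port: int = 9100) -> List[dict]:
    """Build the list of target groups expected by file_sd_configs."""
    return [
        {
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": {"role": role, "environment": "production"},
        }
        for role, hosts in sorted(hosts_by_role.items())
    ]


def write_atomically(path: str, groups: List[dict]) -> None:
    """Write to a temporary file, then rename it over the destination."""
    directory = os.path.dirname(os.path.abspath(path))
    # The .tmp suffix keeps the partial file out of the *.json glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; Prometheus runs as another user
    os.replace(tmp_path, path)


# Hypothetical inventory; in practice it could come from a CMDB or a cloud API.
inventory = {"web": ["192.168.1.30", "192.168.1.31"], "database": ["192.168.1.40"]}
groups = build_target_groups(inventory)

with tempfile.TemporaryDirectory() as demo_dir:
    target_file = os.path.join(demo_dir, "servers.json")
    write_atomically(target_file, groups)
    with open(target_file) as handle:
        print(handle.read())
```

In a real deployment the destination would be a file matched by the glob of the file_sd job, and the script would run from cron or a deployment hook.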

Blackbox Exporter configuration

# ~/monitoring/prometheus/blackbox.yml
modules:
  # HTTP check expecting a 200, 201 or 204 response
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200, 201, 204]
      method: GET
      follow_redirects: true
      preferred_ip_protocol: "ip4"
      tls_config:
        insecure_skip_verify: false

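  # SSL expiry check only
  # Targets are given as host:port, so a TCP probe with TLS is used:
  # the http prober would prepend http:// and speak plain HTTP to port 443
  ssl_expiry:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
      tls_config:
        insecure_skip_verify: false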

  # ICMP ping
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"

  # TCP port check
  tcp_connect:
    prober: tcp
    timeout: 5s

PromQL: essential queries, annotated

PromQL operates on two kinds of selectors: instant vectors (an instantaneous value) and range vectors (a set of values over a window, written [5m]). The golden rule: always use rate() or increase() on counters (metrics suffixed with _total), never a raw value.

5 essential queries for monitoring a Linux server

# ─── 1. CPU usage as a percentage ─────────────────────────────────────────────
# Principle: 100% - % of time spent in "idle" mode
# rate() computes the per-second rate of change over 5 minutes (smooths spikes)
# avg by (instance) aggregates all CPU cores per server
100 - (
  avg by (instance) (
    rate(node_cpu_seconds_total{mode="idle"}[5m])
  ) * 100
)

# ─── 2. Memory used as a percentage ───────────────────────────────────────────
# MemAvailable includes reclaimable memory (caches) = truly free memory
# More reliable than (MemTotal - MemFree), which ignores Linux caches
(1 - (
  node_memory_MemAvailable_bytes /
  node_memory_MemTotal_bytes
)) * 100

# ─── 3. Disk space used as a percentage ───────────────────────────────────────
# Filtered on the root filesystem, adjust to your mount points
# node_filesystem_avail_bytes = space available to non-root users
(1 - (
  node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}
  /
  node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}
)) * 100

# ─── 4. Inbound network traffic (bits/s) ──────────────────────────────────────
# irate() uses the last two points to detect instantaneous spikes
# Multiply by 8 to convert bytes → bits
# Adjust "eth0" to your main network interface
irate(node_network_receive_bytes_total{device="eth0"}[5m]) * 8

# ─── 5. 95th percentile of HTTP request duration ──────────────────────────────
# histogram_quantile reconstructs percentiles from the buckets
# Requires an application or exporter that exposes a histogram; the metric name below is illustrative
# the "le" label means "less than or equal to" (the bucket's upper bound)
histogram_quantile(0.95,
  sum by (le, job) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)
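To see what histogram_quantile does with the buckets, here is a simplified Python sketch of the interpolation: find the bucket that contains the requested rank, then interpolate linearly between its bounds. It assumes cumulative buckets that include +Inf and a first bucket starting at 0; the bucket values are invented for the example, and edge cases handled by Prometheus are left out.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(buckets)
    if not ordered or not math.isinf(ordered[-1][0]):
        raise ValueError("buckets must include the +Inf bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket:
                # the best available answer is that bucket's upper bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical request durations in seconds: 100 requests in total.
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]

for q in (0.5, 0.9, 1.0):
    print(f"p{int(q * 100)}:", histogram_quantile(q, buckets))
```

The result is only as precise as the bucket layout: a quantile that falls in a wide bucket is an interpolated estimate, not an observed value.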

Advanced queries: topk, prediction, error rate

# Top 5 process groups by CPU consumption
# Requires process-exporter, which is not part of the stack deployed above
topk(5,
  sum by (groupname) (
    rate(namedprocess_namegroup_cpu_seconds_total[5m])
  )
)

# Prediction: will the disk be full within 48 hours?
# predict_linear extrapolates the trend of the last 6 hours, 48 hours ahead
# The expression returns a series only when the projected free space is below 0
predict_linear(
  node_filesystem_avail_bytes{mountpoint="/"}[6h],
  48 * 3600
) < 0

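# HTTP error rate (4xx + 5xx) as a percentage
# Assumes a request counter with a "status" label; the stub_status-based
# Nginx exporter does not expose status codes, so adapt the metric name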
(
  sum(rate(nginx_http_requests_total{status=~"[45].."}[5m]))
  /
  sum(rate(nginx_http_requests_total[5m]))
) * 100

# Availability over 24h (for an SLO dashboard)
avg_over_time(up{job="servers"}[24h]) * 100
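The disk prediction relies on a simple linear regression. The Python sketch below fits a least-squares line through the samples and evaluates it at a future time, which is the idea behind predict_linear; it anchors the projection on the last sample rather than on the query evaluation time, and the samples are invented for the example.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


def predict_linear(samples: List[Sample], horizon_seconds: float) -> float:
    """Least-squares projection of the value horizon_seconds after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + horizon_seconds) + intercept


# Hypothetical free space in GiB, sampled hourly for 6 hours, losing 1 GiB per hour.
samples = [(hour * 3600.0, 40.0 - hour) for hour in range(7)]

projected = predict_linear(samples, 48 * 3600)
print("projected free space in 48h (GiB):", round(projected, 3))
print("alert condition (< 0):", projected < 0)
```

With 34 GiB free and a loss of 1 GiB per hour, the projection goes negative well before 48 hours, so the condition holds.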

Prometheus alerting rules

# ~/monitoring/prometheus/rules/node_alerts.yml
groups:
  - name: instance_availability
    interval: 15s              # Override of the global evaluation_interval
    rules:
      # ── Instance unreachable ─────────────────────────────────────────────────
      - alert: InstanceDown
        expr: up == 0
        for: 2m                # Fires only if DOWN for 2 minutes
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} unreachable"
          description: "The job {{ $labels.job }} on {{ $labels.instance }} has not responded for 2 minutes. Check the state of the server and the service."
          runbook: "https://wiki.example.com/runbooks/instance-down"

  - name: system_resources
    rules:
      # ── High CPU ─────────────────────────────────────────────────────────────
      - alert: HighCpuUsage
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU on {{ $labels.instance }}"
          description: "CPU usage of {{ printf \"%.1f\" $value }}% on {{ $labels.instance }} for 5 minutes."

      - alert: HighCpuUsageWarning
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage of {{ printf \"%.1f\" $value }}% on {{ $labels.instance }} for 10 minutes."

      # ── Critical memory ──────────────────────────────────────────────────────
      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical memory on {{ $labels.instance }}"
          description: "{{ printf \"%.1f\" $value }}% of memory used on {{ $labels.instance }}."

      # ── Disk > 85% ────────────────────────────────────────────────────────────
      - alert: DiskSpaceWarning
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"} /
               node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "{{ printf \"%.1f\" $value }}% of the disk used on {{ $labels.instance }} (/ partition)."

      - alert: DiskSpaceCritical
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"} /
               node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}) * 100 > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk almost full on {{ $labels.instance }}"
          description: "{{ printf \"%.1f\" $value }}% of the disk used on {{ $labels.instance }}. Urgent action required."

      # ── Disk full within 48h ─────────────────────────────────────────────────
      - alert: DiskWillFillIn48h
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}[6h], 48 * 3600) < 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Disk predicted full within 48h on {{ $labels.instance }}"
          description: "Based on the trend of the last 6 hours, the disk on {{ $labels.instance }} will be full in less than 48 hours."

  - name: ssl_certificates
    rules:
      # ── SSL certificate expiring within 30 days ──────────────────────────────
      - alert: SslCertificateExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expiring soon for {{ $labels.instance }}"
          description: "The SSL certificate for {{ $labels.instance }} expires in {{ printf \"%.0f\" $value }} days."

      # ── SSL certificate expiring within 7 days ───────────────────────────────
      - alert: SslCertificateExpiringCritical
        expr: |
          (probe_ssl_earliest_cert_expiry - time()) / 86400 < 7
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Critical SSL certificate for {{ $labels.instance }}"
          description: "The SSL certificate for {{ $labels.instance }} expires in {{ printf \"%.0f\" $value }} days. Immediate action required."

  - name: http_availability
    rules:
      # ── HTTP endpoint unreachable ────────────────────────────────────────────
      - alert: HttpEndpointDown
        expr: probe_success == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "HTTP endpoint unreachable: {{ $labels.instance }}"
          description: "The URL {{ $labels.instance }} has not responded for 3 minutes."

AlertManager: routes, receivers, inhibitions

# ~/monitoring/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

  # Global SMTP configuration
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alertmanager@example.com'            # Placeholder: sender address
  smtp_auth_username: 'alertmanager@example.com'   # Placeholder: SMTP account
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

# ─── Routing tree ───────────────────────────────────────────────────────────────
route:
  # Group by alertname + datacenter + job to avoid duplicates
  group_by: ['alertname', 'datacenter', 'job']
  group_wait: 30s              # Wait 30s to group the initial alerts
  group_interval: 5m           # Wait before notifying about new alerts in an already-notified group
  repeat_interval: 4h          # Re-send the notification if the alert persists
  receiver: 'email-ops'        # Default receiver

  routes:
    # Critical alerts → Slack as priority + frequent repeat
    - match:
        severity: critical
      receiver: 'slack-critical'
      group_wait: 10s           # Less wait for critical alerts
      repeat_interval: 30m
      continue: true            # Continue to the other routes (email too)

    # Critical alerts → email as well
    - match:
        severity: critical
      receiver: 'email-ops'

    # Warning alerts → separate Slack channel
    - match:
        severity: warning
      receiver: 'slack-warning'
      repeat_interval: 6h

# ─── Receivers ────────────────────────────────────────────────────────────────
receivers:
  - name: 'email-ops'
    email_configs:
      - to: 'ops@example.com'    # Placeholder: recipient address
        # email_configs has no "subject" or "body" field:
        # the subject is a header and the plain-text body goes in "text"
        headers:
          Subject: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }} - {{ .CommonAnnotations.summary }}'
          X-Priority: '1'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Severity: {{ .Labels.severity }}
          Instance: {{ .Labels.instance }}
          Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
          {{ end }}
        send_resolved: true

  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/xxxxxxxxxxxxxxxxxxxxxxxx'
        channel: '#critical-alerts'
        username: 'AlertManager'
        icon_emoji: ':rotating_light:'
        title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          *Instance:* {{ .Labels.instance }}
          *Datacenter:* {{ .Labels.datacenter }}
          {{ end }}
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        send_resolved: true

  - name: 'slack-warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/xxxxxxxxxxxxxxxxxxxxxxxx'
        channel: '#monitoring-alerts'
        username: 'AlertManager'
        icon_emoji: ':warning:'
        title: '[WARNING] {{ .CommonLabels.alertname }}'
        # Double quotes so that \n is an actual line break (single quotes keep it literal)
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}\n{{ end }}"
        send_resolved: true

# ─── Inhibitions ──────────────────────────────────────────────────────────────
inhibit_rules:
  # If the instance is DOWN, suppress the CPU/RAM/Disk alerts for the same instance
  - source_match:
      alertname: 'InstanceDown'
    target_match_re:
      alertname: '(HighCpuUsage|HighMemoryUsage|DiskSpaceWarning|DiskSpaceCritical)'
    equal: ['instance']

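  # If a critical alert fires, suppress the warnings on the same instance.
  # The warning/critical pairs above use different alert names
  # (e.g. DiskSpaceWarning / DiskSpaceCritical), so "alertname" cannot be part
  # of "equal". Trade-off: any critical alert mutes all warnings for that instance.
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['instance']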
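The matching logic of an inhibition can be modelled in a few lines of Python. This sketch mirrors the first inhibit rule of the configuration above (InstanceDown muting resource alerts on the same instance); it is a simplified model with assumed names and sample alerts, not AlertManager's code.

```python
import re
from typing import Dict, List

Alert = Dict[str, str]  # an alert is represented by its label set

# Mirrors the first inhibit rule: InstanceDown mutes resource alerts per instance.
RULE = {
    "source_match": {"alertname": "InstanceDown"},
    "target_match_re": {
        "alertname": "(HighCpuUsage|HighMemoryUsage|DiskSpaceWarning|DiskSpaceCritical)"
    },
    "equal": ["instance"],
}


def is_inhibited(target: Alert, firing: List[Alert], rule: dict = RULE) -> bool:
    """Return True if a firing source alert mutes the target alert."""
    # Regex matchers are anchored, hence fullmatch.
    target_ok = all(
        re.fullmatch(pattern, target.get(label, "")) is not None
        for label, pattern in rule["target_match_re"].items()
    )
    if not target_ok:
        return False
    for source in firing:
        source_ok = all(source.get(label) == value for label, value in rule["source_match"].items())
        same_scope = all(source.get(label, "") == target.get(label, "") for label in rule["equal"])
        if source_ok and same_scope:
            return True
    return False


# Hypothetical situation: one server is down, another is merely busy.
down = {"alertname": "InstanceDown", "instance": "192.168.1.10:9100", "severity": "critical"}
cpu_on_down_host = {"alertname": "HighCpuUsage", "instance": "192.168.1.10:9100", "severity": "critical"}
cpu_elsewhere = {"alertname": "HighCpuUsage", "instance": "192.168.1.11:9100", "severity": "critical"}
firing = [down, cpu_on_down_host, cpu_elsewhere]

for alert in firing:
    print(alert["alertname"], alert["instance"], "inhibited:", is_inhibited(alert, firing))
```

Only the CPU alert on the host that is down gets muted: the "equal" list is what limits the inhibition to alerts sharing the same instance label.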

Grafana: dashboards, variables and provisioning as code

Automatically provisioned datasource

Grafana provisioning lets you deploy datasources and dashboards without going through the web interface. Configuration as code, versioned in Git.

# ~/monitoring/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    uid: prometheus-uid         # Fixed UID to reference in dashboards
    editable: false
    jsonData:
      httpMethod: POST
      prometheusType: Prometheus
      prometheusVersion: 2.53.0
      timeInterval: 15s
      queryTimeout: 60s

Dashboard provisioning

# ~/monitoring/grafana/provisioning/dashboards/default.yml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: 'Production'
    type: file
    disableDeletion: true       # Prevents deletion via the UI
    updateIntervalSeconds: 30   # Reloads the files every 30s
    allowUiUpdates: false       # UI changes do not persist
    options:
      path: /var/lib/grafana/dashboards

Dashboard as code: simplified Node Exporter

{
  "title": "Infrastructure Overview",
  "uid": "infra-overview",
  "tags": ["production", "node-exporter"],
  "refresh": "30s",
  "time": { "from": "now-3h", "to": "now" },
  "templating": {
    "list": [
      {
        "name": "instance",
        "type": "query",
        "label": "Server",
        "datasource": "Prometheus",
        "query": "label_values(node_uname_info, instance)",
        "refresh": 2,
        "multi": false,
        "includeAll": true,
        "allValue": ".*"
      }
    ]
  },
  "panels": [
    {
      "type": "stat",
      "title": "CPU Usage",
      "gridPos": { "x": 0, "y": 0, "w": 6, "h": 4 },
      "targets": [{
        "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode='idle', instance=~'$instance'}[5m])) * 100)",
        "legendFormat": "{{ instance }}"
      }],
      "options": { "reduceOptions": { "calcs": ["lastNotNull"] } },
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 75 },
              { "color": "red", "value": 90 }
            ]
          }
        }
      }
    }
  ]
}

Importing the Node Exporter Full dashboard (ID 1860)

Grafana provides a catalog of community dashboards. The 1860 dashboard (Node Exporter Full) covers all system metrics in a single import:

Go to Dashboards > New > Import
Enter the ID 1860 in the "Import via grafana.com" field
Select your Prometheus datasource from the dropdown menu
Click Import

Other useful dashboards: 13978 (Node Exporter Quickstart), 7587 (Docker monitoring), 9614 (Nginx), 9628 (PostgreSQL).

Dashboard variables

Variables turn a static dashboard into an interactive, multi-server tool. Create an instance variable via Dashboard Settings > Variables > New variable:

Type: Query
Query: label_values(node_uname_info, instance)
Refresh: On time range change
Multi-value: enabled to compare several servers

Then use $instance in all your queries: node_cpu_seconds_total{instance=~"$instance"}. The filter applies to all panels simultaneously.

Automatic annotations from AlertManager

Grafana annotations let you overlay alert events on graphs. Configure an automatic annotation in the dashboard settings:

Source: Prometheus
Query: ALERTS{alertstate="firing"}
Title: {{alertname}}
Tags: {{severity}}

Additional exporters

Nginx Exporter

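# Enable stub_status in Nginx
# /etc/nginx/conf.d/stub_status.conf
server {
    listen 127.0.0.1:8080;
    # The exporter container reaches the host through the Docker bridge, not
    # through loopback. 172.17.0.1 is the default docker0 address (an
    # assumption: check yours with `ip addr show docker0`).
    listen 172.17.0.1:8080;
    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        allow 172.16.0.0/12;  # default Docker bridge ranges
        deny all;
    }
}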

# Add to docker-compose.yml
  nginx-exporter:
    image: nginx/nginx-prometheus-exporter:1.1.0
    container_name: nginx-exporter
    restart: unless-stopped
    ports:
      - "127.0.0.1:9113:9113"
    command:
      # Exporter 1.x expects double-dash flags
      - --nginx.scrape-uri=http://host-gateway:8080/nginx_status
    networks:
      - monitoring
    extra_hosts:
      - "host-gateway:host-gateway"

PostgreSQL Exporter

# Add to docker-compose.yml
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:v0.15.0
    container_name: postgres-exporter
    restart: unless-stopped
    environment:
      DATA_SOURCE_NAME: "postgresql://exporter:password@postgres:5432/postgres?sslmode=disable"
    ports:
      - "127.0.0.1:9187:9187"
    networks:
      - monitoring

-- Create the PostgreSQL user for the exporter
CREATE USER exporter WITH PASSWORD 'password';
ALTER USER exporter SET SEARCH_PATH TO exporter,pg_catalog;
GRANT CONNECT ON DATABASE postgres TO exporter;
GRANT pg_monitor TO exporter;

Adding the exporters to Prometheus

# Add to prometheus.yml / scrape_configs
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']

  - job_name: 'postgresql'
    static_configs:
      - targets: ['postgres-exporter:9187']
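
Each of these targets serves plain text on /metrics, and that is all Prometheus reads when it scrapes. The Python sketch below parses a small hand-written sample of that format, which can help when checking by hand what an exporter returns. It is deliberately naive: it assumes no trailing timestamps, and the function name parse_samples and the sample values are illustrative.

```python
def parse_samples(exposition_text):
    """Return {series: value} for each sample line, skipping comments and blanks."""
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and empty lines carry no sample
        # The value is the last whitespace-separated field (no timestamp assumed).
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples

# Hypothetical excerpt of what the two exporters could return.
sample_text = """\
# HELP nginx_up Status of the last metric scrape
# TYPE nginx_up gauge
nginx_up 1
# TYPE nginx_connections_active gauge
nginx_connections_active 3
pg_up 1
"""
metrics = parse_samples(sample_text)
print(metrics)
```

If nginx_up or pg_up is 0, the exporter itself is running but cannot reach Nginx or PostgreSQL, which is a different failure from the target being down in Prometheus.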

Startup and operations

# Start the whole stack
cd ~/monitoring
docker compose up -d

# Check the state of all services
docker compose ps

# View logs in real time
docker compose logs -f prometheus

# Validate the Prometheus configuration before reload
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml

# Validate the alerting rules
docker compose exec prometheus promtool check rules /etc/prometheus/rules/node_alerts.yml

# Reload Prometheus without restarting (hot reload)
# Requires Prometheus to be started with --web.enable-lifecycle
curl -X POST http://localhost:9090/-/reload

# Check the active targets
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -E "health|job|instance"

# List active alerts in AlertManager
curl -s http://localhost:9093/api/v2/alerts | python3 -m json.tool

High availability: Thanos or Grafana Mimir

Prometheus alone has two limitations in critical production: a single instance (SPOF), and retention limited by local disk capacity (15-30 days recommended). Thanos and Grafana Mimir solve both problems.

Thanos architecture


Prometheus instance 1 ──── Thanos Sidecar ────┐
      (HA pair)                               ├──► Thanos Query ──► Grafana
Prometheus instance 2 ──── Thanos Sidecar ────┤      (dedup)
                                              │          ▲
                                              ▼          │
                                         S3 / MinIO ──► Thanos Store
                                 (long-term retention)  (queries S3)

Deploying the Thanos Sidecar

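# Docker Compose extension for Thanos
# Prerequisite (not shown here): each Prometheus instance sets a distinct
# "replica" value in external_labels, which --query.replica-label uses for dedup.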
  thanos-sidecar:
    image: thanosio/thanos:v0.35.0
    container_name: thanos-sidecar
    command:
      - sidecar
      - --prometheus.url=http://prometheus:9090
      - --tsdb.path=/prometheus
      - --grpc-address=0.0.0.0:10901
      - --http-address=0.0.0.0:10902
      - --objstore.config-file=/etc/thanos/s3.yml
    volumes:
      - prometheus_data:/prometheus
      - ./thanos/s3.yml:/etc/thanos/s3.yml:ro
    networks:
      - monitoring

  thanos-query:
    image: thanosio/thanos:v0.35.0
    container_name: thanos-query
    command:
      - query
      - --http-address=0.0.0.0:9091
      - --endpoint=thanos-sidecar:10901
      - --query.replica-label=replica
    ports:
      - "127.0.0.1:9091:9091"
    networks:
      - monitoring

# ~/monitoring/thanos/s3.yml
type: S3
config:
  bucket: monitoring-thanos
  endpoint: s3.eu-west-3.amazonaws.com
  region: eu-west-3
  access_key: AKIAXXXXXXXXXXXXXXXX
  secret_key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
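
The --query.replica-label=replica flag above tells Thanos Query to treat series that differ only by their replica label as the same series. The Python sketch below illustrates the idea with a simple "first value wins, other replicas fill the gaps" merge. It is not the algorithm Thanos actually implements, and the function name deduplicate and the sample data are assumptions for the example.

```python
from collections import defaultdict

def deduplicate(series_list, replica_label="replica"):
    """Merge series that are identical once the replica label is dropped."""
    merged = defaultdict(dict)
    for series in series_list:
        # Identity of a series = all its labels except the replica label.
        identity = tuple(sorted(
            (k, v) for k, v in series["labels"].items() if k != replica_label
        ))
        for timestamp, value in series["samples"]:
            # Keep the first value seen for a timestamp; later replicas only fill gaps.
            merged[identity].setdefault(timestamp, value)
    return {identity: sorted(points.items()) for identity, points in merged.items()}

# Hypothetical HA pair scraping the same target; replica A missed the scrape at t=30.
replica_a = {
    "labels": {"__name__": "up", "instance": "web1:9100", "replica": "A"},
    "samples": [(0, 1.0), (15, 1.0), (45, 1.0)],
}
replica_b = {
    "labels": {"__name__": "up", "instance": "web1:9100", "replica": "B"},
    "samples": [(0, 1.0), (15, 1.0), (30, 1.0), (45, 1.0)],
}
result = deduplicate([replica_a, replica_b])
print(result)
```

Grafana then sees a single, gap-free series, even though neither Prometheus instance is guaranteed to hold every sample.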

Grafana Mimir: the simplified alternative
Grafana Mimir covers the same needs as Thanos (HA, S3-compatible object storage, long-term retention) and can run in a monolithic mode, as a single binary, which is simpler to deploy. Prometheus sends it data through its standard remote_write protocol, so no sidecar is needed. A good fit if you are getting started with Prometheus HA.

Retention and sizing

Targets	Metrics/instance	Recommended RAM	Disk (30d)
5 servers	~1,000	512 MB	5 GB
20 servers	~1,000	2 GB	20 GB
100 servers	~1,000	8 GB	100 GB
1,000 servers	~1,000	32 GB	→ Thanos/Mimir
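
To adapt the table to your own numbers, you can apply the rule of thumb from the Prometheus documentation: disk space = retention in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The Python sketch below applies it with a 15-second scrape interval and 2 bytes per sample, both assumptions. The function name estimate_disk_bytes is illustrative.

```python
def estimate_disk_bytes(servers, series_per_server, scrape_interval_s,
                        retention_days, bytes_per_sample=2.0):
    """Rough TSDB size: retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = servers * series_per_server / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample

# Same rows as the table: ~1,000 series per server, 30 days of retention.
for servers in (5, 20, 100):
    size_gb = estimate_disk_bytes(servers, 1000, 15, 30) / 1e9
    print(f"{servers:>4} servers: ~{size_gb:.1f} GB of samples")
```

The results come out lower than the disk column in the table, which is consistent with the table leaving headroom for the WAL, compaction and growth in the number of series.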

Troubleshooting

Prometheus is not scraping a target

# List the targets and their state
curl -s http://localhost:9090/api/v1/targets | \
  python3 -c "import sys,json; [print(t['scrapeUrl'], t['health'], t.get('lastError','')) for t in json.load(sys.stdin)['data']['activeTargets']]"

# Manually test the target endpoint
curl -v http://192.168.1.10:9100/metrics | head -20

# Check the Prometheus logs
docker compose logs prometheus --tail=50 | grep -i error

# Test connectivity from the Prometheus container
docker compose exec prometheus wget -qO- http://node-exporter:9100/metrics | head -5
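
If you run this check often, the one-liner above can be turned into a small reusable function. The Python sketch below filters the JSON returned by /api/v1/targets and keeps only the failing targets. It runs on a hand-written sample rather than a live Prometheus, and the function name unhealthy_targets and the sample error message are assumptions.

```python
def unhealthy_targets(targets_response):
    """Return (job, scrape URL, last error) for every target that is not 'up'."""
    failing = []
    for target in targets_response["data"]["activeTargets"]:
        if target.get("health") != "up":
            failing.append((
                target.get("labels", {}).get("job", "unknown"),
                target.get("scrapeUrl", ""),
                target.get("lastError", ""),
            ))
    return failing

# Hypothetical payload, shaped like the /api/v1/targets response.
sample_targets = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node"}, "health": "up", "lastError": "",
             "scrapeUrl": "http://node-exporter:9100/metrics"},
            {"labels": {"job": "nginx"}, "health": "down",
             "lastError": "connection refused",
             "scrapeUrl": "http://nginx-exporter:9113/metrics"},
        ]
    },
}
failing = unhealthy_targets(sample_targets)
for job, url, error in failing:
    print(f"{job}: {url} -> {error}")
```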

Grafana is not loading data

# Check that Prometheus responds (quote the URL so the shell does not interpret "?")
curl -s 'http://localhost:9090/api/v1/query?query=up' | python3 -m json.tool

# Test the datasource from the Grafana UI
# Configuration > Data Sources > Prometheus > Save & Test
# (in recent Grafana versions: Connections > Data sources)

# Check the Grafana logs
docker compose logs grafana --tail=50 | grep -i error

AlertManager is not receiving alerts

# Check the Prometheus -> AlertManager connection
curl -s http://localhost:9090/api/v1/alertmanagers | python3 -m json.tool

# See the pending and firing alerts in Prometheus
curl -s http://localhost:9090/api/v1/alerts | python3 -m json.tool

# Check the AlertManager config
docker compose exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml

# Test sending a manual alert
curl -XPOST http://localhost:9093/api/v2/alerts -H "Content-Type: application/json" -d '[{
  "labels": {"alertname": "TestAlert", "severity": "warning"},
  "annotations": {"summary": "Test from curl"}
}]'
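
The curl command above relies on AlertManager's default expiry for the test alert. If you want the alert to resolve on its own after a known delay, you can add startsAt and endsAt to the payload. The Python sketch below only builds and prints that payload; the function name build_test_alert and the 5-minute duration are assumptions. You would still POST the result to http://localhost:9093/api/v2/alerts as in the curl example.

```python
import json
from datetime import datetime, timedelta, timezone

def build_test_alert(name="TestAlert", severity="warning", duration_minutes=5):
    """Build a one-alert payload for POST /api/v2/alerts with explicit start and end."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": name, "severity": severity},
        "annotations": {"summary": "Test from Python"},
        # Timestamps in RFC 3339 format, in UTC.
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=duration_minutes)).isoformat(),
    }]

payload = build_test_alert()
print(json.dumps(payload, indent=2))
```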

Operational stack
Your monitoring infrastructure is complete. Start by importing the 1860 dashboard into Grafana to immediately get a comprehensive view of your Linux servers. Then gradually add your alerting rules and your business exporters as needed.

Conclusion

You now have a production-grade monitoring stack that covers the entire observability cycle:

Prometheus collects and stores metrics from all your servers via a reliable pull model, with automatic detection of downed instances.
Node Exporter + Blackbox Exporter expose system metrics and monitor your HTTP endpoints and SSL certificates.
PromQL enables expressive queries to compute CPU, RAM, disk, error rate and latency percentiles.
Grafana visualizes everything in interactive dashboards with variables, annotations and native alerting. The 1860 dashboard covers most system-level needs right after import.
AlertManager routes alerts intelligently with grouping, inhibitions and silences to avoid alert storms.
Thanos or Mimir extend the stack for high availability and long-term retention as the infrastructure grows.

To go further, explore Loki (log centralization, same Grafana stack) and Tempo (distributed tracing) to complete the three pillars of observability: metrics, logs and traces. Pyroscope (continuous profiling) adds a fourth signal on top of them.
