DevOps 11/02/2026 12 min read

Linux Monitoring: The Essential Metrics to Watch in Production

A complete guide to the critical Linux metrics to monitor in production: CPU, memory, disk, network, processes and alerting. Commands, thresholds and best practices for sysadmins.

A Linux server running in production with no monitoring is like driving at night with no dashboard: you don't know how fast you're going, how much fuel is left, or whether the engine is overheating. By the time the problem becomes visible, it's often too late. The service is already down, users are complaining, and you're firefighting at 3 a.m.

Monitoring isn't a luxury reserved for large infrastructures. Even a 5-euro-a-month VPS deserves a minimum of oversight. The difference between a contained incident and a full-blown disaster often comes down to a single alert fired 15 minutes before the disk fills up or the OOM killer goes on a rampage.

In this article, we'll go through the fundamental metrics to monitor on a production Linux server. For each category, I'll give you the useful commands, the recommended alert thresholds and the pitfalls to avoid. The goal is pragmatic: you should walk away with an actionable checklist.

CPU and system load

CPU load is the first metric you look at when a server is dragging. But be careful not to confuse CPU utilization with load average. CPU utilization measures the percentage of time the processor spends actually working. Load average, on the other hand, is the average number of processes that are running or waiting to run over a given period (1, 5 and 15 minutes); on Linux it also counts processes blocked in uninterruptible sleep, typically waiting on disk I/O. A load average of 4.0 on a 4-core machine means every core is fully occupied. Beyond that, processes start queuing up.

The uptime command gives a quick overview, but for finer analysis, mpstat (from the sysstat package) breaks utilization down by core and by type (user, system, iowait, idle). The %iowait value is especially telling: a high figure means the CPU is waiting on the disk, which points to an I/O bottleneck rather than a lack of compute power.

# Quick load average
uptime
# 14:23:07 up 42 days, load average: 1.82, 2.15, 1.93

# Per-core detail, refreshed every 2 seconds
mpstat -P ALL 2

# Top 10 most CPU-hungry processes
ps aux --sort=-%cpu | head -11

# One-shot snapshot sorted by CPU (batch mode, single iteration)
top -bn1 -o %CPU | head -20
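
If sysstat isn't installed, the %iowait figure can be approximated straight from /proc/stat. The sketch below is a minimal illustration, not a replacement for mpstat: the function name iowait_pct and the sample counter lines are made up for the example. It takes two readings of the aggregate "cpu" line and prints the share of time spent in iowait between them.

```shell
# iowait_pct "CPU_LINE_BEFORE" "CPU_LINE_AFTER"
# Each argument is the aggregate "cpu ..." line of /proc/stat. Prints the share
# of CPU time spent in iowait between the two samples, as a percentage.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        # Sum user..steal (fields 2 to 9); guest time is already counted in user
        { total = 0; for (i = 2; i <= 9 && i <= NF; i++) total += $i }
        NR == 1 { t0 = total; w0 = $6 }            # field 6 is iowait
        NR == 2 { dt = total - t0
                  if (dt > 0) printf "%.1f\n", ($6 - w0) * 100 / dt
                  else print "0.0" }'
}

# Live usage: a=$(head -1 /proc/stat); sleep 2; b=$(head -1 /proc/stat); iowait_pct "$a" "$b"
# Sample counters (invented for the example):
iowait_pct "cpu 1000 0 500 8000 400 0 100 0 0 0" "cpu 1100 0 550 8600 600 0 150 0 0 0"
```

With these sample counters, 200 of the 1000 elapsed ticks were spent in iowait, so the function prints 20.0.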

Practical tip: Set up an alert when the load average exceeds 0.7 times the number of cores for more than 5 minutes. For example, on a 4-core server, alert from a sustained load average of 2.8. That leaves some margin before complete saturation.
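
Here is one way to sketch that rule in shell. The function name load_status is hypothetical, and I'm assuming the 5-minute load average is a good enough stand-in for "sustained for more than 5 minutes". The comparison goes through awk because the shell only does integer arithmetic.

```shell
# load_status LOAD CORES [FACTOR]
# Prints ALERT when LOAD exceeds FACTOR x CORES (factor defaults to 0.7), else OK.
load_status() {
    awk -v l="$1" -v c="$2" -v f="${3:-0.7}" 'BEGIN {
        t = c * f
        printf "%s load=%.2f threshold=%.2f\n", (l > t ? "ALERT" : "OK"), l, t
    }'
}

# Live values would be: load_status "$(cut -d' ' -f2 /proc/loadavg)" "$(nproc)"
load_status 2.15 4    # prints: OK load=2.15 threshold=2.80
load_status 3.10 4    # prints: ALERT load=3.10 threshold=2.80
```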

Memory and swap

Memory management under Linux is often misunderstood. Seeing 95% of RAM used isn't necessarily a problem: Linux aggressively uses available memory as a disk cache, which speeds up reads. What matters is available memory (the available column in free -h), not free memory (free). Available memory includes caches the kernel can release instantly if an application needs them.

Swap, on the other hand, is a warning sign. If the system is actively swapping (which you can check with vmstat, columns si and so), performance degrades dramatically because the disk is thousands of times slower than RAM. The vm.swappiness parameter controls how aggressively the system swaps: a value of 10 is appropriate for most production servers, versus the default of 60.

The worst-case scenario is the OOM killer. When the kernel runs out of available memory, it kills a process chosen by its OOM score, usually the most memory-hungry one. And it's not always the one you would have chosen. Watch the logs to catch these events before they become recurring.

# Memory and swap overview
free -h
#               total        used        free      shared  buff/cache   available
# Mem:           15Gi       8.2Gi       512Mi       256Mi       6.8Gi       6.5Gi
# Swap:         2.0Gi       128Mi       1.9Gi

# Real-time swap activity (si/so = swap in/out per second)
vmstat 2 5

# Check current swappiness
cat /proc/sys/vm/swappiness

# Lower swappiness (persistent via a drop-in file under /etc/sysctl.d)
echo "vm.swappiness=10" | sudo tee -a /etc/sysctl.d/99-tuning.conf
sudo sysctl -p /etc/sysctl.d/99-tuning.conf

# Detect OOM killer events in the logs
sudo dmesg | grep -iE "oom|out of memory"
sudo journalctl -k | grep -i "oom"

Practical tip: Alert when available memory drops below 15% of total RAM for more than 10 minutes. And keep an eye on swap usage: any sustained swap activity (si+so > 0 for several minutes) warrants investigation.
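
As a rough sketch of that check, the snippet below reads MemAvailable and MemTotal from /proc/meminfo and compares the ratio to 15%. The function names are mine, and the script only looks at a single sample: the "for more than 10 minutes" part is left to whatever schedules it. For swap, the pswpin and pswpout counters in /proc/vmstat are cumulative since boot, so two readings a few minutes apart tell you whether the system is still swapping.

```shell
# mem_available_pct [FILE]
# Prints MemAvailable as a whole percentage of MemTotal, read from a
# /proc/meminfo-formatted file (defaults to /proc/meminfo itself).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

# swap_counters [FILE]
# Prints the cumulative swap-in/swap-out page counters, one per line.
swap_counters() {
    awk '$1 == "pswpin" || $1 == "pswpout" { print $1 "=" $2 }' "${1:-/proc/vmstat}"
}

pct=$(mem_available_pct)
if [ -n "$pct" ] && [ "$pct" -lt 15 ]; then
    echo "WARNING: available memory at ${pct}%"
fi
```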

Disk space and I/O

A full disk is probably the number one cause of avoidable outages. A log file that grows without rotation, a database filling up its partition, a /tmp saturated with forgotten temporary files: the scenarios are classic but keep tripping up experienced administrators. The df -h command shows partition usage, while du -sh lets you hunt down the largest directories.

A lesser-known pitfall: inode exhaustion. Even with free disk space, if the number of inodes is depleted (millions of small files, for example), no new file can be created. Check with df -i. On the I/O performance side, iostat reveals disk throughput and latency. An average latency (await) above 20ms on an SSD is suspicious.

To identify which process is consuming the most disk I/O, iotop is indispensable. It works like top but for disk operations. It's particularly useful when the CPU's %iowait is high and you're looking for the culprit.

# Disk space per partition
df -h

# Available inodes (often forgotten!)
df -i

# Find the 10 largest directories in /var
sudo du -sh /var/*/ 2>/dev/null | sort -rh | head -10

# Disk I/O statistics (2s interval, 5 samples)
iostat -xz 2 5

# Identify the most I/O-hungry processes
sudo iotop -oP -d 2

# Find files modified in the last 24h (handy for tracking down logs)
find /var/log -type f -mtime -1 -exec ls -lh {} + | sort -k5 -rh | head -10

Practical tip: Set up two alert thresholds for the disk: a warning at 80% usage and a critical alert at 90%. Don't forget to also monitor inodes with df -i, especially on mail servers or systems with a lot of session files.
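
The same two-threshold logic works for both disk space and inodes. The sketch below assumes GNU df (for the --output option); the function names usage_level and check_usage are hypothetical, and the thresholds are plain arguments so you can plug in whichever warning and critical levels you settle on.

```shell
# usage_level PCT WARN CRIT -> prints OK, WARNING or CRITICAL
usage_level() {
    if   [ "$1" -ge "$3" ]; then echo CRITICAL
    elif [ "$1" -ge "$2" ]; then echo WARNING
    else echo OK
    fi
}

# check_usage KIND WARN CRIT
# Reads "NN% /mountpoint" lines on stdin and reports every mount at or above WARN.
check_usage() {
    while read -r usage mount; do
        pct=${usage%\%}
        case "$pct" in ''|*[!0-9]*) continue ;; esac   # skip the header and "-" entries
        level=$(usage_level "$pct" "$2" "$3")
        [ "$level" = OK ] || echo "$level: $1 usage on $mount at ${pct}%"
    done
}

# Typical wiring (GNU df):
#   df --output=pcent,target  | check_usage space 80 90
#   df --output=ipcent,target | check_usage inode 80 90
# Sample input, invented for the example:
printf '%s\n' 'IUse% Mounted on' '12% /' '97% /var/mail' '- /sys' | check_usage inode 80 90
```

On the sample input, only /var/mail is reported, as CRITICAL.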

Network

Network monitoring covers several aspects: bandwidth consumed, number of active connections, interface errors and connections in suspicious states. The ss command (the modern replacement for netstat) is your Swiss Army knife. It displays TCP/UDP sockets with their states, queues and associated processes.

Connections in the TIME_WAIT state deserve special attention. A high count (several thousand) often indicates a configuration problem: HTTP connections not being reused, lack of keep-alive, or simply very high traffic. For real-time bandwidth, iftop shows traffic per connection, while nethogs aggregates it per process, which is more convenient for identifying which service is consuming bandwidth.

Also keep an eye on errors and dropped packets on network interfaces. A non-zero error rate can indicate a hardware problem (faulty cable, network card nearing end of life) or saturation of the network buffer.

# Number of TCP connections per state
ss -s

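# Established connections, sorted by count per remote IP
# (with a state filter, ss omits the State column, so the peer address is field 4)
ss -Htn state established | awk '{print $4}' | sed -E 's/:[0-9]+$//' | sort | uniq -c | sort -rn | head -10

# Count TIME_WAIT connections (-H drops the header line so it isn't counted)
ss -Htn state time-wait | wc -l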

# Bandwidth per process
sudo nethogs eth0

# Interface statistics (errors, drops, overruns)
ip -s link show eth0

# Real-time network throughput
sar -n DEV 2 5

Practical tip: Alert if the number of TIME_WAIT connections exceeds 10,000 or if the dropped-packet rate (dropped) is above 0.1%. These symptoms often point to a sizing or network configuration problem that's only going to get worse.
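
To turn the dropped-packet threshold into something scriptable, here is a minimal sketch. The function names and the sample figures are invented for the example. Keep in mind that the kernel counters under /sys/class/net/<interface>/statistics are cumulative since boot, so a real check would compare two readings rather than the raw totals.

```shell
# drop_rate_pct PACKETS DROPPED -> dropped packets as a percentage of PACKETS
drop_rate_pct() {
    awk -v p="$1" -v d="$2" 'BEGIN {
        if (p > 0) printf "%.3f\n", d * 100 / p
        else print "0.000"
    }'
}

# exceeds THRESHOLD VALUE -> succeeds when VALUE is strictly above THRESHOLD
exceeds() {
    awk -v t="$1" -v v="$2" 'BEGIN { exit !(v > t) }'
}

# Live counters for an interface come from sysfs, for example:
#   rx=$(cat /sys/class/net/eth0/statistics/rx_packets)
#   dropped=$(cat /sys/class/net/eth0/statistics/rx_dropped)
rx=1250000
dropped=2100        # sample figures, not real measurements
rate=$(drop_rate_pct "$rx" "$dropped")
if exceeds 0.1 "$rate"; then
    echo "WARNING: rx drop rate ${rate}% on eth0"
fi
```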

Processes and services

Knowing which services are running and in what state is fundamental. With systemd, the systemctl command lists active, failed or masked units. A service in the failed state going unnoticed for days is a classic. Automate the check: a simple script that runs systemctl --failed and sends an alert if the output isn't empty is enough to avoid plenty of surprises.

Zombie processes are another indicator to watch. A zombie is a terminated process whose parent hasn't read its return code. A few zombies are harmless, but a growing number indicates a bug in the parent process. Finally, journalctl is indispensable for analyzing service logs. Use the per-unit and per-period filters so you don't drown in the noise.

# Failed services
systemctl --failed

# Status of a specific service with the last log lines
systemctl status nginx.service

# Logs of a service from the last hour
journalctl -u nginx.service --since "1 hour ago" --no-pager

# Count zombie processes
ps aux | awk '$8 ~ /^Z/ {count++} END {print "Zombies:", count+0}'

# Identify zombie processes and their parent
ps -eo pid,ppid,stat,cmd | awk '$3 ~ /Z/'

# Automatic check of critical services
for svc in nginx postgresql redis; do
    systemctl is-active --quiet "$svc" || echo "ALERT: $svc is down!"
done

Practical tip: Create a healthcheck script that checks your critical services every minute via cron. A simple systemctl is-active --quiet followed by a notification (email, Slack webhook) covers 80% of basic service monitoring needs.
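
The systemctl --failed check mentioned earlier can be sketched the same way. The --no-legend and --plain options strip the header, footer and bullet characters, which leaves the unit name in the first column. The function names, the sample units and the host name are hypothetical.

```shell
# failed_units: reads `systemctl --failed --no-legend --plain` output on stdin
# and prints one failed unit name per line (the first column).
failed_units() {
    awk 'NF { print $1 }'
}

# report_failed HOST: prints a single alert line, or nothing if no unit has failed
report_failed() {
    units=$(failed_units | paste -sd' ' -)
    [ -z "$units" ] || echo "ALERT: failed units on $1: $units"
}

# Typical wiring: systemctl --failed --no-legend --plain | report_failed "$(hostname)"
# Sample input, invented for the example:
sample='certbot.service  loaded failed failed Certbot
backup.timer     loaded failed failed Nightly backup'
printf '%s\n' "$sample" | report_failed srv-prod-01
```

Because the function prints nothing when everything is healthy, it pairs well with cron, which only sends mail when a job produces output.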

Uptime and availability

Internal monitoring is necessary but not sufficient. If your server is unreachable from the internet, your local scripts won't be able to warn you. That's why external monitoring is essential: a third-party service regularly checks that your endpoints respond correctly. Tools like UptimeRobot (free for 50 monitors), HetrixTools, or the open-source Uptime Kuma project let you monitor HTTP availability, SSL certificates and response times.

The healthcheck endpoint concept is a best practice. Rather than simply checking that a port responds, expose a /health route that tests critical dependencies (database, cache, file system). If the database is unreachable, the healthcheck returns a 503 code and your external monitoring detects it immediately.

Also measure the real uptime of your services with precise metrics. The /proc/uptime file gives the time elapsed since the last system reboot, but what your users care about is application availability, not the kernel's.

# System uptime
uptime
cat /proc/uptime  # first number = seconds since boot

# Reboot history
last reboot | head -5

# Local HTTP healthcheck test
curl -sf -o /dev/null -w "%{http_code} - %{time_total}s\n" http://localhost:8080/health

# Simple external monitoring script (run from another server)
#!/bin/bash
URL="https://mysite.com/health"
RESPONSE=$(curl -sf -o /dev/null -w "%{http_code}" --max-time 10 "$URL")
if [ "$RESPONSE" != "200" ]; then
    echo "ALERT: $URL returned $RESPONSE" | mail -s "Down: mysite.com" admin@example.com  # placeholder address
fi

# Check SSL certificate expiration
echo | openssl s_client -connect mysite.com:443 -servername mysite.com 2>/dev/null |
    openssl x509 -noout -dates

Practical tip: Alert on your SSL certificates at least 30 days before they expire. An expired certificate causes an outage that's immediately visible to all your users. Automate renewal with Certbot and add a backup alert in case the automation fails.
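
To make the 30-day rule checkable, the expiry date printed by openssl has to be turned into a number of days. The sketch below assumes GNU date (its -d option parses the date format openssl prints); the function names are mine and mysite.com stands in for your own domain.

```shell
# days_until "DATE" [NOW_EPOCH]
# Whole days between NOW_EPOCH (default: now) and DATE. Relies on GNU date -d.
days_until() {
    target=$(date -u -d "$1" +%s) || return 1
    now="${2:-$(date -u +%s)}"
    echo $(( (target - now) / 86400 ))
}

# cert_enddate HOST: prints the notAfter date of the certificate served on port 443
cert_enddate() {
    echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null |
        openssl x509 -noout -enddate | cut -d= -f2
}

# Example wiring:
#   left=$(days_until "$(cert_enddate mysite.com)")
#   [ "$left" -ge 30 ] || echo "WARNING: certificate expires in $left days"
# Fixed dates, so the example does not depend on the network or the clock:
days_until "Jun 30 12:00:00 2026 GMT" "$(date -u -d '2026-06-01 12:00:00' +%s)"
```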

Setting up effective alerts

Collecting metrics without configuring alerts is like installing smoke detectors without wiring them up. But the opposite excess is just as problematic: too many alerts lead to alert fatigue, where notifications get systematically ignored. The key is to define relevant thresholds and a progressive escalation system.

Adopt a three-tier classification. Info alerts are logged but don't trigger a notification (e.g. CPU usage at 60%). Warning alerts send a non-urgent notification, typically by email or Slack channel (e.g. disk at 80%). Critical alerts trigger an intrusive notification: SMS, PagerDuty call, mobile push (e.g. service down, disk at 95%). This gradation avoids crying wolf and ensures that critical alerts get the attention they deserve.

Alert templating is just as important as the thresholds. An alert should contain: the affected server, the metric in alarm, the current value, the threshold exceeded and ideally a link to the dashboard. A message like "CRITICAL: disk /var 94% on srv-prod-01" is actionable. A message like "Alert triggered" is not.
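
A small helper keeps that template consistent across checks. This is only a sketch: the function name format_alert and the dashboard URL are hypothetical.

```shell
# format_alert SEVERITY HOST METRIC VALUE THRESHOLD [DASHBOARD_URL]
# Builds a one-line alert carrying everything needed to act on it.
format_alert() {
    msg="$1: $3 $4 on $2 (threshold $5)"
    [ -z "${6:-}" ] || msg="$msg - $6"
    printf '%s\n' "$msg"
}

# The URL below is a made-up dashboard link
format_alert CRITICAL srv-prod-01 "disk /var" "94%" "90%" "https://grafana.example.com/d/disk"
# prints: CRITICAL: disk /var 94% on srv-prod-01 (threshold 90%) - https://grafana.example.com/d/disk
```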

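# Example of a simple, complete alert script
#!/bin/bash
# monitoring-alerts.sh - run via cron every 5 minutes

HOST=$(hostname)
ALERT_EMAIL="admin@example.com"   # placeholder address, replace with your own
SLACK_WEBHOOK="https://hooks.slack.com/services/XXX/YYY/ZZZ"

# Thresholds
DISK_WARN=80
DISK_CRIT=90
MEM_AVAIL_WARN=15   # warn when available memory falls to this % of RAM or below
LOAD_WARN=$(nproc | awk '{printf "%.1f", $1 * 0.7}')

# Warnings go to Slack, critical alerts go out by email
warn() {
    echo "WARNING: $1"
    curl -s -o /dev/null -X POST -H 'Content-type: application/json' \
        --data "{\"text\": \"WARNING: $1\"}" "$SLACK_WEBHOOK"
}
crit() {
    echo "CRITICAL: $1" | mail -s "[CRIT] $1" "$ALERT_EMAIL"
}

# Disk check (tmpfs and devtmpfs excluded, non-numeric percentages skipped)
df --output=pcent,target -x tmpfs -x devtmpfs | tail -n +2 | while read -r usage mount; do
    pct=${usage%\%}
    case "$pct" in ''|*[!0-9]*) continue ;; esac
    if [ "$pct" -ge "$DISK_CRIT" ]; then
        crit "Disk $mount at ${pct}% on $HOST"
    elif [ "$pct" -ge "$DISK_WARN" ]; then
        warn "Disk $mount at ${pct}% on $HOST"
    fi
done

# Load check (5-minute average against 0.7 x number of cores)
LOAD5=$(awk '{print $2}' /proc/loadavg)
if awk -v l="$LOAD5" -v t="$LOAD_WARN" 'BEGIN {exit !(l > t)}'; then
    warn "Load average $LOAD5 above $LOAD_WARN on $HOST"
fi

# Available memory check
MEM_AVAIL_PCT=$(free | awk '/Mem:/ {printf "%.0f", $7/$2*100}')
if [ "$MEM_AVAIL_PCT" -le "$MEM_AVAIL_WARN" ]; then
    warn "Available memory at ${MEM_AVAIL_PCT}% on $HOST"
fi

# Critical services check
for svc in nginx postgresql redis-server; do
    if ! systemctl is-active --quiet "$svc" 2>/dev/null; then
        crit "Service $svc is down on $HOST"
    fi
done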

Practical tip: Apply the "two eyes" rule for critical alerts: every alert must be explicitly acknowledged. If nobody acknowledges within 15 minutes, the alert is escalated to the next level. This ensures no critical alert falls through the cracks, even at night or on the weekend.
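
The escalation rule itself boils down to a timestamp comparison. Here is a minimal sketch, assuming you record when each alert fired and whether it has been acknowledged; the function name and the sample timestamps are invented, and a real setup would usually delegate this to the alerting tool.

```shell
# needs_escalation FIRED_EPOCH ACKED [NOW_EPOCH] [TIMEOUT_MIN]
# Succeeds when an alert is still unacknowledged after TIMEOUT_MIN minutes (default 15).
needs_escalation() {
    now="${3:-$(date +%s)}"
    [ "$2" != yes ] && [ $(( now - $1 )) -ge $(( ${4:-15} * 60 )) ]
}

fired=1700000000                       # sample timestamp for when the alert fired
sixteen_min_later=$(( fired + 16 * 60 ))
if needs_escalation "$fired" no "$sixteen_min_later"; then
    echo "ESCALATE: no acknowledgement after 15 minutes, notifying the next level"
fi
```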

Conclusion: building durable monitoring

Monitoring a production Linux server rests on six pillars: CPU and load, memory and swap, disk and I/O, network, processes and services, and external availability. Each category has its key metrics, its diagnostic commands and its recommended alert thresholds. The most important thing is to start simple: a bash script with basic checks and email notifications already covers the essentials.

To go further and build professional-grade monitoring, the Prometheus + Grafana stack is today's reference. Prometheus collects and stores metrics via node_exporter installed on each server, while Grafana provides visual dashboards and a flexible alerting system. Installing and configuring this stack will be the subject of an upcoming article.

In the meantime, start today: deploy this article's alert script on your servers, set up log rotation, check your inodes and test your healthchecks. Every monitored metric is an incident avoided.

Morgann Riu

Cybersecurity and Linux administration expert. I help companies secure and optimize their critical infrastructures.
