A Linux server running in production with no monitoring is like driving at night with no dashboard: you don't know how fast you're going, how much fuel is left, or whether the engine is overheating. By the time the problem becomes visible, it's often too late. The service is already down, users are complaining, and you're firefighting at 3 a.m.
Monitoring isn't a luxury reserved for large infrastructures. Even a 5-euro-a-month VPS deserves a minimum of oversight. The difference between a contained incident and a full-blown disaster often comes down to a single alert fired 15 minutes before the disk fills up or the OOM killer goes on a rampage.
In this article, we'll go through the fundamental metrics to monitor on a production Linux server. For each category, I give you the useful commands, the recommended alert thresholds and the pitfalls to avoid. The goal is pragmatic: you walk away with an actionable checklist.
CPU and system load
CPU load is the first metric you look at when a server is dragging. But be careful not to confuse CPU utilization with load average. CPU utilization measures the percentage of time the processor spends actually working. Load average, on the other hand, is the average number of processes that are running or waiting to run over a given period (1, 5 and 15 minutes); on Linux it also counts processes blocked in uninterruptible sleep, typically waiting on disk I/O. A load average of 4.0 on a 4-core machine means the cores are, on average, fully occupied. Beyond that, processes start queuing up.
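To make the distinction concrete, here is a minimal sketch, assuming a Linux /proc filesystem, that derives CPU utilization from two samples of the aggregate cpu line in /proc/stat and prints the load average next to it. The function name cpu_busy_pct is illustrative, not a standard tool.

```shell
#!/usr/bin/env bash
# cpu_busy_pct "<first cpu line>" "<second cpu line>"
# Each argument is the aggregate "cpu ..." line from /proc/stat.
# Fields after the label: user nice system idle iowait irq softirq steal ...
# Prints the share of non-idle time between the two samples as an integer percentage.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            idle = $5 + $6            # idle + iowait: time the CPU did no work
            if (NR == 1) { t0 = total; i0 = idle } else { t1 = total; i1 = idle }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print 0; exit }
            printf "%d\n", (dt - (i1 - i0)) * 100 / dt
        }'
}

# Example: sample twice, one second apart, then show the load average alongside.
if [ -r /proc/stat ] && [ -r /proc/loadavg ]; then
    first=$(grep '^cpu ' /proc/stat)
    sleep 1
    second=$(grep '^cpu ' /proc/stat)
    echo "CPU utilization: $(cpu_busy_pct "$first" "$second")%"
    echo "Load average (1/5/15 min): $(cut -d' ' -f1-3 /proc/loadavg)"
fi
```

A machine can show low CPU utilization and a high load average at the same time, which usually means processes are stuck waiting on I/O rather than on compute.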
The uptime command gives a quick overview, but for finer analysis, mpstat (from the sysstat package) breaks utilization down by core and by type (user, system, iowait, idle). The %iowait value is especially telling: a high figure means the CPU is waiting on the disk, which points to an I/O bottleneck rather than a lack of compute power.
# Quick load average
uptime
# 14:23:07 up 42 days, load average: 1.82, 2.15, 1.93
# Per-core detail, refreshed every 2 seconds
mpstat -P ALL 2
# Top 10 most CPU-hungry processes
ps aux --sort=-%cpu | head -11
# Real-time view sorted by CPU
top -bn1 -o %CPU | head -20
Practical tip: Set up an alert when the load average exceeds 0.7 times the number of cores for more than 5 minutes. For example, on a 4-core server, alert from a sustained load average of 2.8. That leaves some margin before complete saturation.
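The rule of thumb above can be sketched as follows. This is a minimal example, assuming the 5-minute load average is an acceptable stand-in for "sustained for more than 5 minutes"; the function names load_threshold and load_exceeds are illustrative.

```shell
#!/usr/bin/env bash
# load_threshold CORES [FACTOR]: print CORES * FACTOR with one decimal (FACTOR defaults to 0.7).
load_threshold() {
    awk -v cores="$1" -v factor="${2:-0.7}" 'BEGIN { printf "%.1f\n", cores * factor }'
}

# load_exceeds VALUE LIMIT: succeed (exit 0) when VALUE is strictly above LIMIT.
load_exceeds() {
    awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value + 0 > limit + 0) }'
}

# Example on the local machine, using the 5-minute average from /proc/loadavg.
if command -v nproc >/dev/null 2>&1 && [ -r /proc/loadavg ]; then
    limit=$(load_threshold "$(nproc)")
    read -r _ load5 _ < /proc/loadavg
    if load_exceeds "$load5" "$limit"; then
        echo "WARNING: 5-min load average $load5 exceeds $limit"
    else
        echo "OK: 5-min load average $load5 (threshold $limit)"
    fi
fi
```

The comparison goes through awk because the shell's built-in test operators only handle integers.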
Memory and swap
Memory management under Linux is often misunderstood. Seeing 95% of RAM used isn't necessarily a problem: Linux aggressively uses available memory as a disk cache, which speeds up reads. What matters is available memory (the available column in free -h), not free memory (free). Available memory includes caches the kernel can release instantly if an application needs them.
Swap, on the other hand, is a warning sign. If the system is actively swapping (which you can check with vmstat, columns si and so), performance degrades dramatically because the disk is thousands of times slower than RAM. The vm.swappiness parameter controls how aggressively the system swaps: a low value such as 10 is a common choice on production servers, versus the default of 60.
The worst-case scenario is the OOM killer. When the kernel runs out of available memory, it kills the process with the highest OOM score, which is usually, but not always, the most memory-hungry one. And it's not always the one you would have chosen. Watch the logs to catch these events before they become recurring.
# Memory and swap overview
free -h
#         total   used   free  shared  buff/cache  available
# Mem:     15Gi  8.2Gi  512Mi   256Mi       6.8Gi      6.5Gi
# Swap:   2.0Gi  128Mi  1.9Gi
# Real-time swap activity (si/so = swap in/out per second)
vmstat 2 5
# Check current swappiness
cat /proc/sys/vm/swappiness
# Lower swappiness (persistent via a drop-in file in /etc/sysctl.d)
echo "vm.swappiness=10" | sudo tee -a /etc/sysctl.d/99-tuning.conf
sudo sysctl -p /etc/sysctl.d/99-tuning.conf
# Detect OOM killer events in the logs
sudo dmesg | grep -iE "oom|out of memory"
sudo journalctl -k | grep -i "oom"
Practical tip: Alert when available memory drops below 15% of total RAM for more than 10 minutes. And keep an eye on swap usage: any sustained swap activity (si+so > 0 for several minutes) warrants investigation.
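As a rough illustration of these two checks, the sketch below reads MemAvailable from /proc/meminfo and the cumulative swap counters from /proc/vmstat. The function names are hypothetical, and the optional file argument exists only to make the functions easy to try against sample data.

```shell
#!/usr/bin/env bash
# mem_available_pct [FILE]: print MemAvailable as an integer percentage of MemTotal.
# FILE defaults to /proc/meminfo.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total; else exit 1 }' "${1:-/proc/meminfo}"
}

# swap_pages [FILE]: print pages swapped in plus out since boot (pswpin + pswpout).
# Two readings taken a few minutes apart show whether swapping is sustained.
swap_pages() {
    awk '/^pswpin / || /^pswpout / { sum += $2 } END { print sum + 0 }' "${1:-/proc/vmstat}"
}

# Example: apply the 15% threshold to the local machine.
if [ -r /proc/meminfo ]; then
    pct=$(mem_available_pct)
    if [ "$pct" -lt 15 ]; then
        echo "WARNING: available memory at ${pct}%"
    else
        echo "OK: available memory at ${pct}%"
    fi
fi
```

The "for more than 10 minutes" condition is left to the scheduler: a cron job can store the previous result and alert only when two or three consecutive runs are below the threshold.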
Disk space and I/O
A full disk is probably the number one cause of avoidable outages. A log file that grows without rotation, a database filling up its partition, a /tmp saturated with forgotten temporary files: the scenarios are classic but keep tripping up experienced administrators. The df -h command shows partition usage, while du -sh lets you hunt down the largest directories.
A lesser-known pitfall: inode exhaustion. Even with free disk space, if the number of inodes is depleted (millions of small files, for example), no new file can be created. Check with df -i. On the I/O performance side, iostat reveals disk throughput and latency. An average latency (await, reported as r_await and w_await in recent sysstat versions) above 20ms on an SSD is suspicious.
To identify which process is consuming the most disk I/O, iotop is indispensable. It works like top but for disk operations. It's particularly useful when the CPU's %iowait is high and you're looking for the culprit.
# Disk space per partition
df -h
# Available inodes (often forgotten!)
df -i
# Find the 10 largest directories in /var
sudo du -sh /var/*/ 2>/dev/null | sort -rh | head -10
# Disk I/O statistics (2s interval, 5 samples)
iostat -xz 2 5
# Identify the most I/O-hungry processes
sudo iotop -oP -d 2
# Find files modified in the last 24h (handy for tracking down logs)
find /var/log -type f -mtime -1 -exec ls -lh {} \; | sort -k5 -rh | head -10
Practical tip: Set up two alert thresholds for the disk: a warning at 80% usage and a critical alert at 90%. Don't forget to also monitor inodes with df -i, especially on mail servers or systems with a lot of session files.
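The two-threshold idea, applied to both space and inodes, could look like the sketch below. It assumes GNU df (for the --output option); classify_usage and check_usage are illustrative names.

```shell
#!/usr/bin/env bash
# classify_usage PCT [WARN] [CRIT]: print OK, WARNING or CRITICAL for an integer percentage.
classify_usage() {
    local pct=$1 warn=${2:-80} crit=${3:-90}
    if [ "$pct" -ge "$crit" ]; then
        echo "CRITICAL"
    elif [ "$pct" -ge "$warn" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

# check_usage COLUMN: COLUMN is "pcent" (space) or "ipcent" (inodes), both GNU df field names.
# Prints one line per mount point that is not OK.
check_usage() {
    df --output="$1",target 2>/dev/null | tail -n +2 | while read -r usage mount; do
        pct=${usage%\%}                               # strip the trailing % sign
        case $pct in ''|*[!0-9]*) continue ;; esac    # skip "-" (no data for this filesystem)
        level=$(classify_usage "$pct")
        [ "$level" = "OK" ] || echo "$level: $1 on $mount at ${pct}%"
    done
}

check_usage pcent
check_usage ipcent
```

Read-only pseudo filesystems such as squashfs mounts always report 100%, so in practice you may want to exclude them with df's -x option.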
Network
Network monitoring covers several aspects: bandwidth consumed, number of active connections, interface errors and connections in suspicious states. The ss command (the modern replacement for netstat) is your Swiss Army knife. It displays TCP/UDP sockets with their states, queues and associated processes.
Connections in the TIME_WAIT state deserve special attention. A high count (several thousand) often indicates a configuration problem: HTTP connections not being reused, lack of keep-alive, or simply very high traffic. For real-time bandwidth, iftop shows traffic per connection, while nethogs aggregates it per process, which is more convenient for identifying which service is consuming bandwidth.
Also keep an eye on errors and dropped packets on network interfaces. A non-zero error rate can indicate a hardware problem (faulty cable, network card nearing end of life) or saturation of the network buffer.
# Number of TCP connections per state
ss -s
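# Established connections, sorted by count per remote IP
# (-H drops the header; with a state filter the State column is omitted, so the peer address is field 4)
ss -Htn state established | awk '{print $4}' | sed 's/:[^:]*$//' | sort | uniq -c | sort -rn | head -10
# Count TIME_WAIT connections (-H keeps the header line out of the count)
ss -Htn state time-wait | wc -l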
# Bandwidth per process
sudo nethogs eth0
# Interface statistics (errors, drops, overruns)
ip -s link show eth0
# Real-time network throughput
sar -n DEV 2 5
Practical tip: Alert if the number of TIME_WAIT connections exceeds 10,000 or if the dropped-packet rate (dropped) is above 0.1%. These symptoms often point to a sizing or network configuration problem that's only going to get worse.
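A possible way to turn these two thresholds into checks is sketched below. It assumes the kernel's per-interface counters under /sys/class/net and the ss tool from iproute2; the function names are hypothetical. Note that the counters are cumulative, so the rate is an average over the interface's lifetime, not over the last few minutes.

```shell
#!/usr/bin/env bash
# drop_rate_pct DROPPED PACKETS: print dropped packets as a percentage of packets, two decimals.
drop_rate_pct() {
    awk -v d="$1" -v p="$2" 'BEGIN { if (p > 0) printf "%.2f\n", d * 100 / p; else print "0.00" }'
}

# above LIMIT VALUE: succeed when VALUE is strictly greater than LIMIT (works for decimals).
above() {
    awk -v limit="$1" -v value="$2" 'BEGIN { exit !(value + 0 > limit + 0) }'
}

# check_interface IFACE: compare the receive drop rate with the 0.1% threshold.
check_interface() {
    local stats="/sys/class/net/$1/statistics" rate
    [ -r "$stats/rx_dropped" ] || { echo "no statistics for $1" >&2; return 1; }
    rate=$(drop_rate_pct "$(cat "$stats/rx_dropped")" "$(cat "$stats/rx_packets")")
    if above 0.1 "$rate"; then echo "WARNING: $1 rx drop rate ${rate}%"; fi
}

# check_time_wait: compare the number of TIME_WAIT sockets with the 10,000 threshold.
check_time_wait() {
    local count
    count=$(ss -Htn state time-wait | wc -l)
    if [ "$count" -gt 10000 ]; then echo "WARNING: $count connections in TIME_WAIT"; fi
}
```

Typical usage would be `check_interface eth0; check_time_wait` from a cron job, replacing eth0 with the actual interface name.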
Processes and services
Knowing which services are running and in what state is fundamental. With systemd, the systemctl command lists active, failed or masked units. A service in the failed state going unnoticed for days is a classic. Automate the check: a simple script that runs systemctl --failed and sends an alert if the output isn't empty is enough to avoid plenty of surprises.
Zombie processes are another indicator to watch. A zombie is a terminated process whose parent hasn't read its return code. A few zombies are harmless, but a growing number indicates a bug in the parent process. Finally, journalctl is indispensable for analyzing service logs. Use the per-unit and per-period filters so you don't drown in the noise.
# Failed services
systemctl --failed
# Status of a specific service with the last log lines
systemctl status nginx.service
# Logs of a service since the last hour
journalctl -u nginx.service --since "1 hour ago" --no-pager
# Count zombie processes
ps aux | awk '$8 ~ /^Z/ {count++} END {print "Zombies:", count+0}'
# Identify zombie processes and their parent
ps -eo pid,ppid,stat,cmd | awk '$3 ~ /Z/'
# Automatic check of critical services
for svc in nginx postgresql redis; do
    systemctl is-active --quiet "$svc" || echo "ALERT: $svc is down!"
done
Practical tip: Create a healthcheck script that checks your critical services every minute via cron. A simple systemctl is-active --quiet followed by a notification (email, Slack webhook) covers 80% of basic service monitoring needs.
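Such a healthcheck script might look like the following sketch. The service list, the SLACK_WEBHOOK_URL variable and the function names are placeholders to adapt; the only assumption about Slack is that incoming webhooks accept a JSON body with a text field.

```shell
#!/usr/bin/env bash
# Hypothetical settings: SERVICES and SLACK_WEBHOOK_URL are placeholders to adapt.
SERVICES="${SERVICES:-nginx postgresql redis}"
SLACK_WEBHOOK_URL="${SLACK_WEBHOOK_URL:-}"

# notify MESSAGE: post to a Slack incoming webhook when a URL is configured, else print to stdout.
# The message is assumed to contain no double quotes or backslashes (no JSON escaping is done).
notify() {
    if [ -n "$SLACK_WEBHOOK_URL" ]; then
        curl -sf -X POST -H 'Content-Type: application/json' \
             --data "{\"text\": \"$1\"}" "$SLACK_WEBHOOK_URL" >/dev/null
    else
        echo "$1"
    fi
}

# check_services NAME...: notify once per unit that is not active.
check_services() {
    local svc
    for svc in "$@"; do
        systemctl is-active --quiet "$svc" || notify "ALERT: $svc is not active on $HOSTNAME"
    done
}

# check_failed_units: notify when "systemctl --failed" lists anything.
check_failed_units() {
    local failed
    failed=$(systemctl --failed --no-legend --plain | awk '{ printf "%s%s", sep, $1; sep = " " }')
    [ -z "$failed" ] || notify "ALERT: failed units on $HOSTNAME: $failed"
}

# Run the checks only when executed as a script (for example from cron), not when sourced.
if [ "${BASH_SOURCE[0]}" = "$0" ] && command -v systemctl >/dev/null 2>&1; then
    # shellcheck disable=SC2086  # word splitting of SERVICES is intended
    check_services $SERVICES
    check_failed_units
fi
```

A crontab entry such as `* * * * * /usr/local/bin/healthcheck.sh` (the path is hypothetical) would run it every minute. Without a webhook URL the alerts go to stdout, which cron mails to the job owner when MAILTO is set.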
Uptime and availability
Internal monitoring is necessary but not sufficient. If your server is unreachable from the internet, your local scripts won't be able to warn you. That's why external monitoring is essential: a third-party service regularly checks that your endpoints respond correctly. Tools like UptimeRobot (free for 50 monitors), Hetrixtools, or the open source Uptime Kuma project let you monitor HTTP availability, SSL certificates and response times.
The healthcheck endpoint concept is a best practice. Rather than simply checking that a port responds, expose a /health route that tests critical dependencies (database, cache, file system). If the database is unreachable, the healthcheck returns a 503 code and your external monitoring detects it immediately.
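The decision logic behind such a route is simple enough to sketch in shell, here in the form of a CGI-style response. This is only an illustration: the function name health_response and the three dependency checks are assumptions, and wiring the script to a /health URL depends on your web server or application framework.

```shell
#!/usr/bin/env bash
# health_response CHECK...: each CHECK is a shell command string; any failure yields 503.
# Prints a CGI-style response: Status header, blank line, then one line per check.
health_response() {
    local status="200 OK" body="" check
    for check in "$@"; do
        if bash -c "$check" >/dev/null 2>&1; then
            body+="ok   $check"$'\n'
        else
            body+="FAIL $check"$'\n'
            status="503 Service Unavailable"
        fi
    done
    printf 'Status: %s\nContent-Type: text/plain\n\n%s' "$status" "$body"
}

# Hypothetical dependency checks; replace with whatever the application relies on.
# pg_isready ships with PostgreSQL, redis-cli with Redis.
health_response \
    "pg_isready -q" \
    "redis-cli ping" \
    "test -w /var/tmp"
```

Listing each check in the body makes a 503 self-explanatory: whoever receives the alert sees at once which dependency failed.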
Also measure the real uptime of your services with precise metrics. The /proc/uptime file gives the time elapsed since the last system reboot, but what your users care about is application availability, not the kernel's.
# System uptime
uptime
cat /proc/uptime # first number = seconds since boot
# Reboot history
last reboot | head -5
# Local HTTP healthcheck test
curl -sf -o /dev/null -w "%{http_code} - %{time_total}s\n" http://localhost:8080/health
# Simple external monitoring script (run from another server)
#!/bin/bash
URL="https://mysite.com/health"
RESPONSE=$(curl -sf -o /dev/null -w "%{http_code}" --max-time 10 "$URL")
if [ "$RESPONSE" != "200" ]; then
    echo "ALERT: $URL returned $RESPONSE" | mail -s "Down: mysite.com" [email protected]
fi
# Check SSL certificate expiration
echo | openssl s_client -connect mysite.com:443 -servername mysite.com 2>/dev/null |
openssl x509 -noout -dates
Practical tip: Set your SSL certificate alert to fire at least 30 days before expiration. An expired certificate causes an outage that's immediately visible to all your users. Automate renewal with Certbot and add a backup alert in case the automation fails.
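The 30-day backup alert could be sketched as follows, assuming openssl and GNU date are available. The function names are illustrative, and the host passed to check_cert is whatever domain you serve.

```shell
#!/usr/bin/env bash
# days_left NOT_AFTER_EPOCH [NOW_EPOCH]: whole days between now and the expiry timestamp.
days_left() {
    local not_after=$1 now=${2:-$(date +%s)}
    echo $(( (not_after - now) / 86400 ))
}

# cert_not_after_epoch HOST [PORT]: fetch the certificate and print its notAfter date as epoch seconds.
# Relies on GNU date (-d) to parse the "notAfter=..." string printed by openssl.
cert_not_after_epoch() {
    local host=$1 port=${2:-443} end
    end=$(echo | openssl s_client -connect "$host:$port" -servername "$host" 2>/dev/null \
          | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
    [ -n "$end" ] && date -d "$end" +%s
}

# check_cert HOST [MIN_DAYS]: warn when fewer than MIN_DAYS (default 30) remain.
check_cert() {
    local host=$1 min=${2:-30} expiry remaining
    expiry=$(cert_not_after_epoch "$host") || { echo "CRITICAL: cannot read certificate for $host"; return 2; }
    remaining=$(days_left "$expiry")
    if [ "$remaining" -lt "$min" ]; then
        echo "WARNING: certificate for $host expires in $remaining days"
        return 1
    fi
    echo "OK: certificate for $host valid for $remaining more days"
}
```

Calling `check_cert mysite.com 30` once a day from cron is enough, since certificate lifetimes are measured in weeks.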
Setting up effective alerts
Collecting metrics without configuring alerts is like installing smoke detectors without wiring them up. But the opposite excess is just as problematic: too many alerts leads to alert fatigue, where notifications get systematically ignored. The key is to define relevant thresholds and a progressive escalation system.
Adopt a three-tier classification. Info alerts are logged but don't trigger a notification (e.g. CPU usage at 60%). Warning alerts send a non-urgent notification, typically by email or Slack channel (e.g. disk at 80%). Critical alerts trigger an intrusive notification: SMS, PagerDuty call, mobile push (e.g. service down, disk at 95%). This gradation avoids crying wolf and ensures that critical alerts get the attention they deserve.
Alert templating is just as important as the thresholds. An alert should contain: the affected server, the metric in alarm, the current value, the threshold exceeded and ideally a link to the dashboard. A message like "CRITICAL: disk /var 94% on srv-prod-01" is actionable. A message like "Alert triggered" is not.
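A small sketch of both ideas, the message template and the three-tier routing, is shown below. The function names and the channel labels (log, email, page) are illustrative; the actual delivery commands depend on your tooling.

```shell
#!/usr/bin/env bash
# format_alert SEVERITY HOST METRIC VALUE THRESHOLD [DASHBOARD_URL]
# Builds a message that names the server, the metric, its value and the threshold exceeded.
format_alert() {
    local msg="$1: $3 $4 on $2 (threshold $5)"
    [ -n "${6:-}" ] && msg+=" - $6"
    echo "$msg"
}

# route_for SEVERITY: print the delivery channel for each tier.
route_for() {
    case $1 in
        INFO)     echo "log" ;;      # recorded only, no notification
        WARNING)  echo "email" ;;    # non-urgent notification
        CRITICAL) echo "page" ;;     # intrusive notification (SMS, pager, push)
        *)        echo "log" ;;
    esac
}

format_alert CRITICAL srv-prod-01 "disk /var" "94%" "90%"
```

Keeping the formatting in one function means every check in a monitoring script produces messages with the same fields in the same order.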
# Example of a simple, complete alert script
#!/bin/bash
# monitoring-alerts.sh - run via cron every 5 minutes
HOSTNAME=$(hostname)
ALERT_EMAIL="[email protected]"
SLACK_WEBHOOK="https://hooks.slack.com/services/XXX/YYY/ZZZ"
# Thresholds
DISK_WARN=80
DISK_CRIT=95
MEM_WARN=85
LOAD_WARN=$(nproc | awk '{printf "%.1f", $1 * 0.7}')
# Disk check
# Pseudo filesystems are excluded: squashfs mounts, for instance, always report 100%
df --output=pcent,target -x tmpfs -x devtmpfs -x squashfs | tail -n +2 | while read -r usage mount; do
    pct=${usage%\%}   # strip the trailing % sign
    if [ "$pct" -ge "$DISK_CRIT" ]; then
        echo "CRITICAL: Disk $mount at ${pct}% on $HOSTNAME" |
            mail -s "[CRIT] Disk $mount - $HOSTNAME" "$ALERT_EMAIL"
    elif [ "$pct" -ge "$DISK_WARN" ]; then
        echo "WARNING: Disk $mount at ${pct}% on $HOSTNAME"
    fi
done
# Available memory check
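MEM_AVAIL_PCT=$(free | awk '/Mem:/ {printf "%.0f", $7/$2*100}')
# MEM_WARN is a "used" percentage, so compare available memory with its complement
if [ "$MEM_AVAIL_PCT" -le $((100 - MEM_WARN)) ]; then
    echo "WARNING: Available memory at ${MEM_AVAIL_PCT}% on $HOSTNAME"
fi
# Load check (5-minute average against LOAD_WARN)
LOAD5=$(awk '{print $2}' /proc/loadavg)
if awk -v cur="$LOAD5" -v max="$LOAD_WARN" 'BEGIN { exit !(cur + 0 > max + 0) }'; then
    echo "WARNING: Load average ${LOAD5} (threshold ${LOAD_WARN}) on $HOSTNAME"
fi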
# Critical services check
for svc in nginx postgresql redis-server; do
    if ! systemctl is-active --quiet "$svc" 2>/dev/null; then
        echo "CRITICAL: $svc is down on $HOSTNAME" |
            mail -s "[CRIT] Service $svc down - $HOSTNAME" "$ALERT_EMAIL"
    fi
done
Practical tip: Apply the "two eyes" rule for critical alerts: every alert must be explicitly acknowledged. If nobody acknowledges within 15 minutes, the alert is escalated to the next level. This ensures no critical alert falls through the cracks, even at night or on the weekend.
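The acknowledge-or-escalate rule can be illustrated with a small sketch. The state layout is purely hypothetical (one file per open alert, plus a ".ack" file once someone acknowledges it), and stat -c is the GNU form; dedicated tools such as PagerDuty implement the same logic for you.

```shell
#!/usr/bin/env bash
# needs_escalation FIRED_EPOCH ACKED NOW_EPOCH [TIMEOUT_SECONDS]
# Succeeds when the alert is still unacknowledged (ACKED != "yes") after the timeout (default 900 s).
needs_escalation() {
    local fired=$1 acked=$2 now=$3 timeout=${4:-900}
    [ "$acked" != "yes" ] && [ $(( now - fired )) -ge "$timeout" ]
}

# Hypothetical state layout: one file per open alert under STATE_DIR, whose modification
# time is the moment the alert fired; acknowledging creates a sibling "<name>.ack" file.
STATE_DIR="${STATE_DIR:-/var/tmp/alerts}"

# escalate_pending: print one line per alert that should go to the next level.
escalate_pending() {
    local f acked now
    now=$(date +%s)
    for f in "$STATE_DIR"/*; do
        [ -f "$f" ] || continue
        case $f in *.ack) continue ;; esac
        acked=no
        [ -e "$f.ack" ] && acked=yes
        if needs_escalation "$(stat -c %Y "$f")" "$acked" "$now"; then
            echo "ESCALATE: $(basename "$f") unacknowledged for 15+ minutes"
        fi
    done
}
```

Running `escalate_pending` every few minutes and routing its output to the next on-call contact gives a basic escalation chain.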
Conclusion: building durable monitoring
Monitoring a production Linux server rests on six pillars: CPU and load, memory and swap, disk and I/O, network, processes and services, and external availability. Each category has its key metrics, its diagnostic commands and its recommended alert thresholds. The most important thing is to start simple: a bash script with basic checks and email notifications already covers the essentials.
To go further and build professional-grade monitoring, the Prometheus + Grafana stack is today's reference. Prometheus collects and stores metrics via node_exporter installed on each server, while Grafana provides visual dashboards and a flexible alerting system. Installing and configuring this stack will be the subject of an upcoming article.
In the meantime, start today: deploy this article's alert script on your servers, set up log rotation, check your inodes and test your healthchecks. Every monitored metric is an incident avoided.