Stop Using Load Average: PSI is What You Actually Want
Load average is one of the most-cited and least-useful metrics on a Linux box. It conflates runnable and uninterruptible tasks, smooths over 1/5/15-minute windows, and tells you nothing about what the system is waiting on. A load of 8 on a 16-core box could mean perfectly healthy CPU usage, or it could mean half your processes are blocked on a saturated disk. Same number, completely different problem.
PSI — Pressure Stall Information — is the answer. It's been in mainline since kernel 4.20, lives under /proc/pressure/, and almost nobody uses it.
What PSI exposes
Three files:
cat /proc/pressure/cpu
cat /proc/pressure/memory
cat /proc/pressure/io
Each one reports two metrics — some (at least one task stalled) and full (all non-idle tasks stalled) — over 10s, 60s, and 300s windows, plus a cumulative microsecond counter:
some avg10=15.30 avg60=12.84 avg300=8.71 total=8932145672
full avg10=4.20 avg60=3.15 avg300=2.04 total=2145987632
The avg* values are percentages of wall-clock time stalled. avg10=15.30 on the some line of /proc/pressure/io means at least one task was stalled on I/O for 15.3% of the last 10 seconds. That's an actionable number, not a context-free 8.5 from uptime.
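The total counter also makes it possible to measure pressure over any window, not just the three built-in averages: read it twice and divide the difference by the elapsed time. A minimal sketch, assuming the file layout shown above; psi_total and psi_window_pct are hypothetical helper names, not part of any standard tool.

```shell
# Hypothetical helpers for working with the cumulative "total" counter.

# psi_total FILE KIND: print the cumulative stall time in microseconds
# from the "some" or "full" line of a PSI file.
psi_total() {
    awk -v kind="$2" '$1 == kind { sub(/^total=/, "", $5); print $5 }' "$1"
}

# psi_window_pct START_US END_US SECONDS: percentage of a window of
# SECONDS wall-clock seconds that was spent stalled.
psi_window_pct() {
    awk -v a="$1" -v b="$2" -v s="$3" \
        'BEGIN { printf "%.2f\n", (b - a) / (s * 1000000) * 100 }'
}

# Usage on a live host (2-second window):
#   before=$(psi_total /proc/pressure/io some)
#   sleep 2
#   after=$(psi_total /proc/pressure/io some)
#   psi_window_pct "$before" "$after" 2
```

A counter that advances by 500000 microseconds over 5 seconds works out to 10.00, i.e. 10% of the window stalled.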
Diagnosis recipes
"Server feels slow but CPU is at 40%"
cat /proc/pressure/io
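If some avg10 is above roughly 5% while CPU is mostly idle, storage is the likely bottleneck, not CPU. The 5% figure is a rule of thumb, not a kernel-defined limit. Confirm with iostat -xz 1 and look at %util and the await columns.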
"OOM killer keeps firing but free memory looks fine"
cat /proc/pressure/memory
If some avg60 is above roughly 10%, the kernel is probably thrashing: pages are reclaimed and then faulted back in almost immediately. Free-memory counters mislead here because page cache counts as "available" even while it is being actively churned. PSI memory pressure tells you when reclaim itself is the bottleneck.
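The two threshold checks above are easy to script. A rough sketch, assuming the thresholds from the recipes are reasonable starting points for your workload; psi_avg and psi_exceeds are hypothetical helper names.

```shell
# Hypothetical helpers for threshold checks against PSI averages.

# psi_avg FILE KIND WINDOW: print one average,
# e.g. psi_avg /proc/pressure/io some avg10
psi_avg() {
    awk -v kind="$2" -v win="$3" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == win) print kv[2]
            }
        }' "$1"
}

# psi_exceeds FILE KIND WINDOW LIMIT: succeed (exit 0) if the average
# is strictly above LIMIT, fail otherwise or if the value is missing.
psi_exceeds() {
    val=$(psi_avg "$1" "$2" "$3")
    [ -n "$val" ] && awk -v v="$val" -v t="$4" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Usage on a live host:
#   psi_exceeds /proc/pressure/io some avg10 5 && echo "check the disks"
#   psi_exceeds /proc/pressure/memory some avg60 10 && echo "likely thrashing"
```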
"Database queries are slow at peak hours, can't pinpoint why"
Sample all three during the slow window:
while sleep 5; do
    echo "=== $(date +%T) ==="
    for f in cpu memory io; do
        printf '%-7s ' "$f"
        awk '/^some/ {print $2, $3, $4}' /proc/pressure/$f
    done
done
Whichever stays elevated is your real bottleneck. Three minutes of sampling beats an hour of guessing.
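If you want to line the samples up against a slow-query log afterwards, one CSV row per sample is easier to work with than the block output. A small sketch under the same assumptions as the loop above; psi_row is a hypothetical helper name, and missing files are reported as NA.

```shell
# Hypothetical helper: one CSV fragment per sample.

# psi_row [DIR]: print the "some avg10" values for cpu, memory and io
# as "cpu,memory,io". DIR defaults to /proc/pressure.
psi_row() {
    dir=${1:-/proc/pressure}
    row=""
    for f in cpu memory io; do
        v=$(awk '$1 == "some" { sub(/^avg10=/, "", $2); print $2 }' "$dir/$f" 2>/dev/null)
        row="${row:+$row,}${v:-NA}"
    done
    printf '%s\n' "$row"
}

# Usage on a live host: timestamped rows appended to a file
#   while sleep 5; do printf '%s,%s\n' "$(date +%T)" "$(psi_row)"; done >> psi.csv
```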
Per-cgroup PSI: the killer feature
Under cgroup v2, every cgroup gets its own pressure files:
cat /sys/fs/cgroup/system.slice/mariadb.service/cpu.pressure
cat /sys/fs/cgroup/system.slice/mariadb.service/io.pressure
cat /sys/fs/cgroup/system.slice/mariadb.service/memory.pressure
On a multi-tenant box this attributes pressure to specific services. Sort everything by I/O pressure in one shot:
for svc in /sys/fs/cgroup/system.slice/*.service; do
    val=$(awk '/^some/ {gsub("avg10=",""); print $2}' "$svc/io.pressure" 2>/dev/null)
    [ -n "$val" ] && printf '%-40s %s\n' "$(basename "$svc")" "$val"
done | sort -k2 -n -r | head
Try doing that with iotop across 50 services.
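The same idea extends to nested slices and container cgroups by walking the whole tree instead of just system.slice. A sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup; rank_pressure is a hypothetical helper name.

```shell
# Hypothetical helper: rank every cgroup under a root by pressure.

# rank_pressure ROOT NAME: list all cgroups under ROOT by the
# "some avg10" value of NAME (cpu.pressure, memory.pressure or
# io.pressure), highest first, as "value<TAB>cgroup directory".
rank_pressure() {
    find "$1" -name "$2" -type f 2>/dev/null |
    while IFS= read -r f; do
        val=$(awk '$1 == "some" { sub(/^avg10=/, "", $2); print $2 }' "$f" 2>/dev/null)
        [ -n "$val" ] && printf '%s\t%s\n' "$val" "$(dirname "$f")"
    done | LC_ALL=C sort -k1,1 -n -r
}

# Usage on a live host:
#   rank_pressure /sys/fs/cgroup io.pressure | head
```

Sorting on the first column under LC_ALL=C keeps the decimal values ordered numerically regardless of the host locale.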
Wire it into your monitoring
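Prometheus: node_exporter has shipped a pressure collector since v0.18.0, enabled by default on Linux. It exposes the some counters as node_pressure_{cpu,io,memory}_waiting_seconds_total and the full counters as node_pressure_{io,memory}_stalled_seconds_total. Already there if you're on any reasonably recent build.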
Alert on sustained pressure, not spikes:
- alert: SustainedIOPressure
  expr: rate(node_pressure_io_waiting_seconds_total[5m]) > 0.20
  for: 10m
  annotations:
    summary: "I/O pressure >20% for 10m on {{ $labels.instance }}"
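Because the counter is in seconds, its rate is the fraction of wall-clock time with at least one task stalled, so 0.20 means 20%. Combined with for: 10m, the rule fires only on real, sustained contention. No noise from one-off bursts.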
systemd-oomd uses memory PSI (and swap usage) to kill whole cgroups under sustained pressure before the kernel's OOM killer steps in. That is usually a much better choice than the kernel's per-process scoring, which is driven largely by memory footprint:
systemctl enable --now systemd-oomd
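Configure thresholds in /etc/systemd/oomd.conf. Note that oomd only acts on cgroups that opt in through ManagedOOMMemoryPressure=kill (or ManagedOOMSwap=kill) in their unit settings. On a host with critical services and noisy neighbors, this makes it much less likely that the wrong service gets killed when memory tightens.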
Gotchas
⚠️ Per-cgroup PSI requires cgroup v2. If you're still on hybrid or v1 hierarchies, only the global /proc/pressure/* files work. Migrate with systemd.unified_cgroup_hierarchy=1 on the kernel cmdline. Most modern distros are already there by default.
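A quick way to see which situation a host is in before relying on the per-cgroup files. This is a sketch, assuming the common convention that /sys/fs/cgroup is a cgroup2fs mount on pure v2 and a tmpfs on v1 or hybrid setups; cgroup_mode_for and psi_available are hypothetical helper names.

```shell
# Hypothetical helpers, not part of any standard tool.

# cgroup_mode_for FSTYPE: map the filesystem type of /sys/fs/cgroup
# to the cgroup mode it usually indicates.
cgroup_mode_for() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1 or hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# psi_available [DIR]: succeed if all three system-wide PSI files
# exist and are readable. DIR defaults to /proc/pressure.
psi_available() {
    for f in cpu memory io; do
        [ -r "${1:-/proc/pressure}/$f" ] || return 1
    done
}

# Usage on a live host (stat -f -c %T is GNU coreutils syntax):
#   cgroup_mode_for "$(stat -f -c %T /sys/fs/cgroup)"
#   psi_available || echo "PSI files missing"
```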
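⚠️ I/O pressure includes waits on remote storage. Unlike load average, PSI does not count every task in uninterruptible sleep, only tasks stalled waiting for I/O or memory. It does not care where the backing device lives, though, so a process blocked on a slow NFS mount can register as I/O pressure on the local host. That's technically correct but worth knowing when interpreting numbers on hosts with network filesystems or remote block storage.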
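⚠️ Don't rely on the full line in /proc/pressure/cpu. At the system level CPU full is undefined: if every non-idle task were stalled waiting for a CPU, something else would have to be running on it. Kernels before 5.13 omit the line and newer ones report it as zeros. Use some for CPU diagnosis. The full line is meaningful for memory and I/O, and for CPU only at the cgroup level.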
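⚠️ /proc/pressure/* is system-wide, even inside a container. If you're running Docker/Podman/Kubernetes and /proc is mounted normally, those files show pressure for the whole host, not for the container. The container's own numbers are in its cgroup's cpu.pressure, memory.pressure, and io.pressure files, typically at the top of /sys/fs/cgroup inside the container when cgroup v2 and a cgroup namespace are in use. Useful for in-container observability without host access.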
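A process can locate the pressure files of its own cgroup through /proc/self/cgroup, which works the same way on the host and inside a container. A sketch, assuming cgroup v2 (a single 0:: entry) and a cgroupfs mount whose root matches the paths in that file; own_cgroup_dir is a hypothetical helper name.

```shell
# Hypothetical helper: find the calling process's cgroup directory.

# own_cgroup_dir [CGROUP_FILE] [MOUNT]: print the directory holding the
# current cgroup's *.pressure files. Defaults: /proc/self/cgroup and
# /sys/fs/cgroup. Only the cgroup v2 entry ("0::<path>") is considered.
own_cgroup_dir() {
    rel=$(awk '/^0::/ { sub(/^0::/, ""); print; exit }' "${1:-/proc/self/cgroup}")
    # Strip a trailing slash so the namespace root "/" maps to MOUNT itself
    printf '%s%s\n' "${2:-/sys/fs/cgroup}" "${rel%/}"
}

# Usage on a live host or inside a container:
#   cat "$(own_cgroup_dir)/memory.pressure"
```

Inside a container with its own cgroup namespace the entry is usually 0::/, so the result is simply the mount point.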
Replace your "load average is high" pages with "I/O pressure sustained above 20% for 10 minutes." The signal-to-noise ratio improves dramatically, and the alert text actually tells the on-call engineer where to look first.