Performance Problems Are Diagnostic Problems First
“The site is slow” is the worst diagnostic input you’ll receive. Slow at the application layer? Database? Network? Disk? CPU? Each leaves different fingerprints in vmstat, iostat, perf, and friends. The ten entries below cover the most common performance-diagnosis problems: the patterns to recognize and the tools to use.
#141 High load average, low %CPU
Description: Load is 25 on a 16-core box but top shows mostly idle CPU.
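Diagnosis: Linux load average counts tasks in uninterruptible sleep (D state, usually waiting on I/O) as well as runnable tasks, so load can be high while the CPUs sit idle. Check %wa in top for iowait, then run iostat -x 1 to see which device is busy.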
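To see how much of the load is D-state rather than CPU demand, you can tally task states straight from /proc. The sketch below is a minimal illustration, not a replacement for ps or top: it assumes a Linux /proc layout, and the helper names task_state and count_by_state are made up for this example.

```python
import glob


def task_state(stat_line: str) -> str:
    """Return the one-letter state from a /proc/<pid>/stat line.

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so split on the LAST ')' instead of on whitespace.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_by_state() -> dict:
    """Tally the state of every thread visible under /proc (Linux only)."""
    counts = {}
    for path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
        try:
            with open(path) as f:
                state = task_state(f.read())
        except (OSError, IndexError):
            continue  # the task exited while we were scanning
        counts[state] = counts.get(state, 0) + 1
    return counts


if __name__ == "__main__":
    counts = count_by_state()
    print("runnable (R):       ", counts.get("R", 0))
    print("uninterruptible (D):", counts.get("D", 0))
```

A D count that stays well above zero across several runs points at I/O (or a stuck mount) rather than at CPU.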
#142 p99 latency tail (mostly fast, occasionally horrible)
Diagnosis: Average latency lies. Look at p99/p99.9. Common culprits: GC pauses, lock contention, swap activity, scheduling jitter.
Tools: perf top; histograms via Prometheus; bcc/eBPF tools (biolatency, tcpconnlat).
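A small worked example shows why the average hides the tail. The latencies below are invented for illustration (980 fast requests, 20 slow ones), and percentile is a hypothetical helper using the nearest-rank definition; real monitoring systems often interpolate from histogram buckets instead.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds: mostly fast, occasionally horrible.
latencies = [10.0] * 980 + [2000.0] * 20

mean = sum(latencies) / len(latencies)
print(f"mean = {mean:.1f} ms")
print(f"p50  = {percentile(latencies, 50):.1f} ms")
print(f"p99  = {percentile(latencies, 99):.1f} ms")
```

With this data the mean lands near 50 ms, which describes neither the 10 ms majority nor the 2-second tail that the p99 exposes.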
#143 high context switch rate
Description: vmstat 1 shows cs column > 100k/sec.
Diagnosis: Process spawning, lock contention, or interrupt storms. pidstat -wt 1 shows voluntary and involuntary context switches per thread, so you can see which threads are responsible.
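The system-wide rate that vmstat reports in its cs column can be approximated from the cumulative ctxt counter in /proc/stat. This is a rough sketch assuming a Linux host; read_ctxt and switches_per_second are names invented for the example.

```python
import time


def read_ctxt(stat_text: str) -> int:
    """Extract the cumulative context-switch count from /proc/stat text."""
    for line in stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no ctxt line found")


def switches_per_second(before: int, after: int, seconds: float) -> float:
    """Turn two cumulative counter readings into a per-second rate."""
    return (after - before) / seconds


if __name__ == "__main__":
    try:
        with open("/proc/stat") as f:
            first = read_ctxt(f.read())
        time.sleep(1.0)
        with open("/proc/stat") as f:
            second = read_ctxt(f.read())
        rate = switches_per_second(first, second, 1.0)
        print(f"{rate:.0f} context switches/sec")
    except OSError as exc:
        print(f"/proc/stat not readable (Linux only): {exc}")
```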
#144 disk I/O saturated (%util at 100)
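Diagnosis: iostat -x 1 per device; iotop per process; biolatency for the latency distribution. High await means requests spend a long time queued or in service; check aqu-sz for the queue depth. On SSDs and NVMe devices, which service many requests in parallel, %util at 100 does not by itself prove saturation.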
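For intuition about what iostat is reporting, the numbers can be approximated from two snapshots of /proc/diskstats. The sketch below assumes the standard field order (reads completed, time reading, writes completed, time writing, time doing I/O); the two snapshot lines are invented sample data, and the combined await here is a simplification of iostat's separate r_await and w_await.

```python
from typing import NamedTuple


class DiskSample(NamedTuple):
    ios: int      # reads + writes completed
    wait_ms: int  # total ms requests spent queued and in service
    busy_ms: int  # ms during which the device had I/O in flight


def parse_diskstats(text: str, device: str) -> DiskSample:
    """Pick one device out of /proc/diskstats-style text."""
    for line in text.splitlines():
        f = line.split()
        if len(f) >= 14 and f[2] == device:
            return DiskSample(
                ios=int(f[3]) + int(f[7]),
                wait_ms=int(f[6]) + int(f[10]),
                busy_ms=int(f[12]),
            )
    raise KeyError(device)


def util_and_await(a: DiskSample, b: DiskSample, interval_s: float):
    """Return (%util, average wait in ms per I/O) between two samples."""
    d_ios = b.ios - a.ios
    util = 100.0 * (b.busy_ms - a.busy_ms) / (interval_s * 1000.0)
    await_ms = (b.wait_ms - a.wait_ms) / d_ios if d_ios else 0.0
    return util, await_ms


# Invented snapshots taken one second apart (not real measurements).
BEFORE = "   8       0 sda 1000 0 80000 4000 2000 0 160000 6000 0 5000 10000"
AFTER = "   8       0 sda 1100 0 88000 5500 2400 0 192000 14500 3 6000 20000"

util, await_ms = util_and_await(
    parse_diskstats(BEFORE, "sda"), parse_diskstats(AFTER, "sda"), 1.0
)
print(f"util = {util:.0f}%  await = {await_ms:.1f} ms")
```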
#145 network bandwidth at line rate
Diagnosis: nload, iftop, or sar -n DEV 1; check both directions. A high rate of small packets is a packets-per-second problem (CPU and interrupt load), which is a different problem from a bulk transfer filling the link.
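To tell a full pipe from a packet storm, compare throughput with average packet size in each direction. This sketch parses /proc/net/dev-style text; the interface name eth0 and both snapshots are invented for the example, and parse_net_dev and rates are hypothetical helpers.

```python
def parse_net_dev(text: str, iface: str) -> dict:
    """Pull byte and packet counters for one interface from /proc/net/dev text."""
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            f = rest.split()
            return {
                "rx_bytes": int(f[0]), "rx_packets": int(f[1]),
                "tx_bytes": int(f[8]), "tx_packets": int(f[9]),
            }
    raise KeyError(iface)


def rates(before: dict, after: dict, interval_s: float) -> dict:
    """Per-direction throughput, packet rate, and average packet size."""
    out = {}
    for d in ("rx", "tx"):
        d_bytes = after[f"{d}_bytes"] - before[f"{d}_bytes"]
        d_pkts = after[f"{d}_packets"] - before[f"{d}_packets"]
        out[d] = {
            "mbit_per_s": d_bytes * 8 / interval_s / 1e6,
            "packets_per_s": d_pkts / interval_s,
            "avg_packet_bytes": d_bytes / d_pkts if d_pkts else 0.0,
        }
    return out


# Invented snapshots taken one second apart (not real measurements).
BEFORE = "  eth0: 1000000 1000 0 0 0 0 0 0 2000000 2000 0 0 0 0 0 0"
AFTER = "  eth0: 126000000 101000 0 0 0 0 0 0 2064000 3000 0 0 0 0 0 0"

result = rates(parse_net_dev(BEFORE, "eth0"), parse_net_dev(AFTER, "eth0"), 1.0)
for direction, r in result.items():
    print(direction, r)
```

In this made-up sample the receive side is moving bulk data at roughly gigabit line rate with large packets, while the transmit side is sending small packets at negligible bandwidth.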
#146 memory pressure without OOM
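Diagnosis: In vmstat 1, sustained non-zero si/so means the box is actively swapping. In free -h, watch the available column: low available despite a large buff/cache means that cache cannot be reclaimed cheaply. On kernels with PSI (4.20 and later), cat /proc/pressure/memory shows how much time tasks spent stalled waiting for memory.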
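The PSI file is easy to parse if you want to alert on it. The sketch below assumes the usual two-line "some" / "full" format; the sample text is invented, and parse_pressure is a name made up for this example.

```python
def parse_pressure(text: str) -> dict:
    """Parse /proc/pressure/* text into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in fields)}
    return result


# Invented sample in the format of /proc/pressure/memory (not a real reading).
SAMPLE = (
    "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n"
    "full avg10=0.25 avg60=0.10 avg300=0.00 total=654\n"
)

psi = parse_pressure(SAMPLE)
print("some avg10:", psi["some"]["avg10"])
print("full avg10:", psi["full"]["avg10"])
```

The "some" line is the share of time at least one task was stalled on memory; "full" is the share of time all non-idle tasks were stalled at once, which is the more alarming of the two.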
#147 slow boot
Solution: systemd-analyze blame shows slowest services; systemd-analyze critical-chain shows bottleneck path; mask services that aren’t needed.
#148 application appears hung
Solution: strace -f -p PID shows which syscall it is blocked in; cat /proc/PID/wchan names the kernel function it is sleeping in; cat /proc/PID/stack (root only) prints the kernel stack; for JVM apps, take a thread dump (for example with jstack PID).
#149 erratic kernel performance after upgrade
Diagnosis: Mitigations from CPU vulnerability patches (Spectre/Meltdown/etc.) cost performance. Check cat /sys/devices/system/cpu/vulnerabilities/*.
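To compare mitigation status before and after an upgrade, it helps to capture the sysfs entries in a diff-friendly form. A minimal sketch, assuming the kernel's usual status prefixes ("Not affected", "Mitigation: ...", "Vulnerable"); classify and read_vulnerabilities are hypothetical helper names.

```python
import glob
import os


def classify(status: str) -> str:
    """Bucket a sysfs vulnerability status string by its leading keyword."""
    s = status.strip()
    if s.startswith("Not affected"):
        return "not affected"
    if s.startswith("Mitigation"):
        return "mitigated"
    if s.startswith("Vulnerable"):
        return "vulnerable"
    return "unknown"


def read_vulnerabilities(base="/sys/devices/system/cpu/vulnerabilities") -> dict:
    """Return {name: status} for every entry under the sysfs directory."""
    report = {}
    for path in sorted(glob.glob(os.path.join(base, "*"))):
        try:
            with open(path) as f:
                report[os.path.basename(path)] = f.read().strip()
        except OSError:
            continue  # unreadable entry; skip it
    return report


if __name__ == "__main__":
    for name, status in read_vulnerabilities().items():
        print(f"{name:28} {classify(status):13} {status}")
```

Saving this output on the old and new kernel and diffing the two shows which mitigations the upgrade switched on.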
#150 noisy neighbor (cloud)
Diagnosis: Non-zero steal (st in top, %steal in mpstat and sar) means the hypervisor is giving your CPU time to another VM. If it is sustained, file a ticket with the cloud provider; a stop+start sometimes fixes it by landing the instance on a different hypervisor.
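Steal can also be computed from the aggregate cpu line in /proc/stat, which is handy for logging it over time. This sketch assumes the standard column order (user, nice, system, idle, iowait, irq, softirq, steal); the two sample lines are invented, and the helper names are made up for the example.

```python
def cpu_times(stat_text: str) -> list:
    """Return the counters from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(x) for x in line.split()[1:]]
    raise ValueError("no aggregate cpu line found")


def steal_percent(before: list, after: list) -> float:
    """Share of CPU time stolen by the hypervisor between two samples.

    Only the first eight columns are summed; guest time is already
    included in user time, so adding it again would double count.
    """
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total else 0.0


# Invented samples (not real measurements).
BEFORE = "cpu  100 0 50 800 10 0 5 0 0 0\n"
AFTER = "cpu  140 0 60 840 10 0 5 10 0 0\n"

steal = steal_percent(cpu_times(BEFORE), cpu_times(AFTER))
print(f"steal = {steal:.1f}%")
```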
Conclusion
- Always look at p99 / p99.9 latency, not averages.
- The 60-second triage: uptime, dmesg | tail, vmstat 1 5, iostat -x 1 5, free -h.
- Install sysstat (sar) on every server. Historical data wins arguments.
- perf top and bcc/eBPF tools beat guessing. Learn at least biolatency and execsnoop.
- Cloud noisy-neighbor: watch steal (st in top). If sustained, file a ticket.