Linux Performance & Observability: load, latency tails, perf top, eBPF

Performance Problems Are Diagnostic Problems First

“The site is slow” is the worst diagnostic input you’ll receive. Is it slow at the application layer, the database, the network, the disk, or the CPU? Each leaves a different fingerprint in vmstat, iostat, perf, and friends. The ten entries below cover the most common performance problems: the pattern to recognize and the tools to reach for.

#141 – high load average, low %CPU

Description: Load is 25 on a 16-core box but top shows mostly idle CPU.

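Diagnosis: Linux load average counts tasks in uninterruptible sleep (D state, usually waiting on I/O) as well as runnable tasks, so load can exceed the core count while the CPUs sit idle. Check wa (iowait) in top, then run iostat -x 1 to find which device is responsible.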
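To see how many tasks are actually sitting in D state, the state letter can be read straight from /proc. The following is a minimal sketch, not a replacement for ps or top: the helper names are made up for illustration, and it only inspects the main thread of each process, whereas the kernel counts every thread.

```python
import glob
from collections import Counter


def state_from_stat(stat_text):
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split on the LAST ')' before reading the state field.
    return stat_text.rsplit(")", 1)[1].split()[0]


def count_states(stat_texts):
    """Count how many processes are in each state (R, S, D, Z, ...)."""
    return Counter(state_from_stat(text) for text in stat_texts)


def load_contributors(counts):
    """Runnable (R) plus uninterruptible (D) tasks feed the load average."""
    return counts.get("R", 0) + counts.get("D", 0)


def read_proc_stats():
    """Collect /proc/<pid>/stat contents, skipping processes that exit mid-scan."""
    texts = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as fh:
                texts.append(fh.read())
        except OSError:
            continue
    return texts


if __name__ == "__main__":
    counts = count_states(read_proc_stats())
    print("states:", dict(counts))
    print("tasks contributing to load right now:", load_contributors(counts))
```

A large D count next to a small R count points at I/O wait rather than CPU demand.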

#142 – p99 latency tail (mostly fast, occasionally horrible)

Description: Average latency looks fine but a small fraction of requests take seconds.

Diagnosis: Average latency lies. Look at p99/p99.9. Common culprits: GC pauses, lock contention, swap activity, scheduling jitter.

Tools: perf top; histograms via Prometheus; bcc/eBPF tools (biolatency, tcpconnlat).
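A small worked example shows why the average hides the tail. This sketch uses the nearest-rank percentile definition (monitoring systems often interpolate instead), and the latency values are invented purely for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Return the mean next to the percentiles that actually describe the tail."""
    return {
        "mean": sum(samples) / len(samples),
        "p50": percentile(samples, 50),
        "p99": percentile(samples, 99),
    }


if __name__ == "__main__":
    # Illustrative data: 98 fast requests and 2 multi-second stalls.
    latencies_ms = [10] * 98 + [3000] * 2
    print(summarize(latencies_ms))
```

With these invented numbers the mean sits well under 100 ms while p99 is a full three seconds, which is the gap the entry above describes.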

#143 – high context switch rate

Description: vmstat 1 shows cs column > 100k/sec.

Diagnosis: Process spawning, lock contention, or interrupt storms. pidstat -wt 1 shows top context-switching threads.
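The cs column in vmstat is derived from the system-wide ctxt counter in /proc/stat. As a rough sketch (function names are illustrative), the same rate can be computed from two snapshots of that file:

```python
import time


def read_ctxt(proc_stat_text):
    """Extract the cumulative context-switch count from /proc/stat contents."""
    for line in proc_stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no ctxt line found")


def switches_per_second(before_text, after_text, interval_s):
    """Context switches per second between two /proc/stat snapshots."""
    return (read_ctxt(after_text) - read_ctxt(before_text)) / interval_s


def snapshot(path="/proc/stat"):
    with open(path) as fh:
        return fh.read()


if __name__ == "__main__":
    interval = 1.0
    first = snapshot()
    time.sleep(interval)
    second = snapshot()
    print(f"{switches_per_second(first, second, interval):.0f} context switches/sec")
```

This gives only the system-wide number; pidstat -wt is still the tool for attributing it to threads.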

#144 – disk I/O saturated (%util at 100)

Description: Every disk-bound operation queues; iostat shows the device pegged at 100% util.

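Diagnosis: iostat -x 1 per-device; iotop per-process; biolatency for the latency distribution. High await together with a deep queue (aqu-sz, or avgqu-sz in older sysstat) means requests are waiting behind one another. On SSDs and NVMe devices that serve requests in parallel, 100% util alone does not prove saturation.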
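For intuition about what %util measures: it is the share of wall-clock time during which the device had at least one request in flight, taken from the "time spent doing I/Os" counter in /proc/diskstats. The sketch below recomputes it from two snapshots; the function names and the sample device name are illustrative.

```python
import time


def io_ticks_ms(diskstats_text, device):
    """Milliseconds the device has spent with I/O in flight (13th field of its line)."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(device)


def util_percent(before_text, after_text, device, interval_s):
    """Approximate iostat's %util between two /proc/diskstats snapshots."""
    busy_ms = io_ticks_ms(after_text, device) - io_ticks_ms(before_text, device)
    return min(100.0, 100.0 * busy_ms / (interval_s * 1000.0))


def snapshot(path="/proc/diskstats"):
    with open(path) as fh:
        return fh.read()


if __name__ == "__main__":
    device = "sda"  # assumption: replace with a device that exists on your host
    interval = 1.0
    first = snapshot()
    time.sleep(interval)
    second = snapshot()
    print(f"{device}: {util_percent(first, second, device, interval):.1f}% util")
```

Because the counter only records "busy or not", a device that handles many requests in parallel can read 100% while still having headroom.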

#145 – network bandwidth at line rate

Description: The interface is sustaining its full advertised speed and traffic is bottlenecked there.

Diagnosis: nload, iftop, or sar -n DEV 1; check both directions; small packets at high rate = different problem than bulk transfer.
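The distinction between bulk transfer and a small-packet flood can be read from the byte and packet counters in /proc/net/dev. The sketch below computes utilization and average packet size per direction from two snapshots; the link speed is passed in as an assumption because the sketch makes no attempt to detect it, and the function names are illustrative.

```python
def iface_counters(net_dev_text, iface):
    """Pull rx/tx byte and packet counters for one interface from /proc/net/dev."""
    for line in net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            f = rest.split()
            # Receive columns come first (bytes, packets, ...), transmit
            # columns start at the ninth field.
            return {
                "rx_bytes": int(f[0]), "rx_packets": int(f[1]),
                "tx_bytes": int(f[8]), "tx_packets": int(f[9]),
            }
    raise KeyError(iface)


def link_usage(before_text, after_text, iface, interval_s, link_bps):
    """Per-direction throughput, share of link speed, and average packet size."""
    b = iface_counters(before_text, iface)
    a = iface_counters(after_text, iface)
    usage = {}
    for direction in ("rx", "tx"):
        nbytes = a[f"{direction}_bytes"] - b[f"{direction}_bytes"]
        npackets = a[f"{direction}_packets"] - b[f"{direction}_packets"]
        bps = nbytes * 8 / interval_s
        usage[direction] = {
            "bps": bps,
            "utilization": bps / link_bps,
            "avg_packet_bytes": nbytes / npackets if npackets else 0.0,
        }
    return usage
```

Utilization near 1.0 with large average packets is a bandwidth ceiling; low utilization with tiny packets at a very high packet rate is a packets-per-second problem and needs different fixes.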

#146 – memory pressure without OOM

Description: System feels slow but no process has been killed; cache is being reclaimed aggressively.

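Diagnosis: vmstat 1: sustained non-zero si/so means the system is actively swapping, and heavy traffic in both columns means thrashing; free -h: available is low even though buff/cache is large; cat /proc/pressure/memory on kernels with PSI support (4.20 and later).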
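The PSI file is easy to consume programmatically: each line starts with "some" or "full" followed by key=value pairs. A minimal parsing sketch follows; the 10% threshold in the helper is an arbitrary illustration, not a recommended value.

```python
def parse_pressure(text):
    """Parse /proc/pressure/* contents into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def under_pressure(parsed, threshold_pct=10.0):
    """True if tasks stalled on memory for more than threshold_pct of the last 10 s.

    threshold_pct is a hypothetical cutoff chosen for this example.
    """
    return parsed.get("some", {}).get("avg10", 0.0) > threshold_pct


if __name__ == "__main__":
    with open("/proc/pressure/memory") as fh:
        parsed = parse_pressure(fh.read())
    print(parsed, "under pressure:", under_pressure(parsed))
```

The avg10, avg60, and avg300 values are percentages of time stalled over the last 10, 60, and 300 seconds; total is cumulative stall time in microseconds.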

#147 – slow boot

Description: Systemd takes minutes to reach multi-user.target instead of seconds.

Solution: systemd-analyze blame shows slowest services; systemd-analyze critical-chain shows bottleneck path; mask services that aren’t needed.

#148 – application appears hung

Description: Process is alive but unresponsive to clients and signals.

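Solution: strace -f -p PID shows which syscall it is waiting in; cat /proc/PID/wchan shows the kernel function it is sleeping in; cat /proc/PID/stack (as root) shows the kernel stack trace; take a thread dump for JVM apps. A process stuck in D state will not act on signals, including SIGKILL, until the blocking I/O completes.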

#149 – erratic kernel performance after upgrade

Description: The same workload runs noticeably slower after a kernel update.

Diagnosis: Mitigations from CPU vulnerability patches (Spectre/Meltdown/etc.) cost performance. Check cat /sys/devices/system/cpu/vulnerabilities/*.
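To compare mitigation state before and after an upgrade, it helps to capture that directory as structured data and diff it. The sketch below reads one status string per file; the function names are illustrative, and the prefix check assumes the usual "Mitigation: ..." wording the kernel uses in these files.

```python
from pathlib import Path

VULN_DIR = "/sys/devices/system/cpu/vulnerabilities"


def mitigation_report(directory=VULN_DIR):
    """Map each vulnerability name to the kernel's one-line status string."""
    report = {}
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file():
            report[entry.name] = entry.read_text().strip()
    return report


def active_mitigations(report):
    """Keep only the entries where a mitigation is switched on."""
    return {
        name: status
        for name, status in report.items()
        if status.startswith("Mitigation")
    }


def changed(old_report, new_report):
    """Entries whose status differs between two reports (e.g. old vs new kernel)."""
    names = set(old_report) | set(new_report)
    return {
        name: (old_report.get(name), new_report.get(name))
        for name in sorted(names)
        if old_report.get(name) != new_report.get(name)
    }


if __name__ == "__main__":
    for name, status in mitigation_report().items():
        print(f"{name}: {status}")
```

Saving the report before the upgrade and running changed() afterwards narrows down which mitigation appeared with the new kernel.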

#150 – noisy neighbor (cloud)

Description: VM performance is inconsistent because another tenant is competing for hypervisor resources.

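Diagnosis: Sustained non-zero steal time (st in top, %steal in mpstat or sar) means the hypervisor was running other work while your vCPU was ready to run. If it persists, file a ticket with the cloud provider; a stop+start sometimes resolves it because the VM lands on a different host.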
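Steal time comes from the eighth value on the aggregate cpu line of /proc/stat. As a rough sketch (function names are illustrative, and it assumes a kernel that reports at least eight fields on that line), the percentage over an interval can be computed from two snapshots:

```python
import time


def cpu_times(proc_stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in proc_stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(value) for value in line.split()[1:]]
    raise ValueError("no aggregate cpu line found")


def steal_percent(before_text, after_text):
    """Share of CPU time stolen by the hypervisor between two snapshots."""
    before, after = cpu_times(before_text), cpu_times(after_text)
    deltas = [b - a for a, b in zip(before, after)]
    # Order: user nice system idle iowait irq softirq steal guest guest_nice.
    # Guest time is already included in user/nice, so sum only the first eight.
    total = sum(deltas[:8])
    return 100.0 * deltas[7] / total if total else 0.0


def snapshot(path="/proc/stat"):
    with open(path) as fh:
        return fh.read()


if __name__ == "__main__":
    first = snapshot()
    time.sleep(5)
    second = snapshot()
    print(f"steal: {steal_percent(first, second):.1f}%")
```

Sampling this periodically and keeping the history gives the evidence a provider ticket usually needs.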

Conclusion

Always look at p99 / p99.9 latency, not averages.
The 60-second triage: uptime, dmesg | tail, vmstat 1 5, iostat -x 1 5, free -h.
Install sysstat (sar) on every server. Historical data wins arguments.
perf top and bcc/eBPF tools beat guessing — learn at least biolatency and execsnoop.
Cloud noisy neighbor: watch steal time (st in top). If sustained, file a ticket.
