The First Five Commands When Something’s Slow
A server is slow. The phone is ringing. You SSH in. The first 60 seconds are diagnostic gold — before you start fixing anything, you need to know whether you’re looking at CPU saturation, memory pressure, disk I/O queueing, or network blowout. This article is the working playbook: which commands to run, in what order, and what to look at in their output.
The 60-Second Diagnostic Sequence
uptime # load averages first
dmesg -T | tail -20 # any kernel-level alarm bells?
vmstat 1 5 # CPU + memory + I/O at a glance
free -h # memory specifically
df -h ; df -i # disk usage AND inodes
That’s the standard 60-second triage. Most incidents reveal themselves here. Then drill in:
uptime — Load Average
$ uptime
14:23:01 up 12 days, 3:42, 4 users, load average: 1.42, 1.10, 0.95
Three numbers: load average over 1 minute, 5 minutes, 15 minutes. Load is roughly “runnable processes + uninterruptible-IO processes.” Compare to your CPU count:
- Load < CPU count — idle headroom
- Load = CPU count — fully utilized
- Load > CPU count — queuing, work piling up
Get CPU count with nproc. If your 5-minute average is 24 on a 16-CPU box, you’re saturating.
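That comparison is easy to script. A minimal sketch, parsing the sample uptime line from above (on a live box, read /proc/loadavg or pipe real uptime output instead):

```shell
# Extract the 1-minute load average and compare it to the CPU count.
# The sample line is the one shown above, fixed here for illustration.
sample='14:23:01 up 12 days, 3:42, 4 users, load average: 1.42, 1.10, 0.95'
load=$(echo "$sample" | awk -F'load average: ' '{print $2}' | cut -d, -f1)
cpus=$(nproc 2>/dev/null || echo 1)   # nproc ships with coreutils
if awk -v l="$load" -v c="$cpus" 'BEGIN { exit !(l > c) }'; then
  verdict="queuing: load $load > $cpus CPUs"
else
  verdict="headroom: load $load <= $cpus CPUs"
fi
echo "$verdict"
```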
top, htop, btop
Real-time process viewers. top ships everywhere; htop has the better UI; btop is the prettiest but must be installed separately.
What to watch in top:
- %CPU per process — one process pinning a core?
- %MEM — memory hog?
- %wa in the header — iowait: CPU idle waiting for disk. Anything >5% sustained means disk pressure.
- %si — softirq (software interrupt) time. High = network or driver pressure.
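When you just need a snapshot rather than an interactive view, ps can answer the first two questions non-interactively; a sketch (procps/Linux ps syntax assumed):

```shell
# Top 5 CPU and memory consumers, non-interactively (procps/Linux ps).
top_cpu=$(ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -6)
top_mem=$(ps -eo pid,pcpu,pmem,comm --sort=-pmem | head -6)
printf 'By CPU:\n%s\n\nBy memory:\n%s\n' "$top_cpu" "$top_mem"
```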
vmstat — The Single Best Triage Tool
$ vmstat 1 5
procs -----memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 542384 18272 4189832 0 0 24 87 120 290 4 1 95 0 0
3 0 0 540212 18272 4189832 0 0 0 124 890 2104 18 3 79 0 0
The columns that matter:
- r — runnable processes. If r > CPU count consistently, you’re CPU-bound.
- b — blocked on I/O. High = disk or network bottleneck.
- swpd — swap used. Non-zero & growing = memory pressure.
- si/so — swap in/out per second. Sustained non-zero values mean the system is thrashing.
- bi/bo — blocks read/written per second. Indicates disk pressure.
- us / sy / id / wa — user / system / idle / iowait CPU percentages; they sum to 100, averaged across all CPUs.
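One caveat worth scripting around: the first vmstat data row is an average since boot, not a live reading. A sketch that averages only the later samples, run here against the fixed sample output shown above (pipe real vmstat 1 5 output in its place):

```shell
# Average us/sy/wa over samples 2..N. NR>3 skips the two header lines
# plus the first (since-boot) data row.
vmstat_out='procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 542384  18272 4189832   0    0    24    87  120  290  4  1 95  0  0
 3  0      0 540212  18272 4189832   0    0     0   124  890 2104 18  3 79  0  0'
result=$(echo "$vmstat_out" | awk 'NR>3 { us+=$13; sy+=$14; wa+=$16; n++ }
  END { printf "avg us=%.0f sy=%.0f wa=%.0f", us/n, sy/n, wa/n }')
echo "$result"
```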
Memory — free
$ free -h
total used free shared buff/cache available
Mem: 16Gi 7.0Gi 2.1Gi 512Mi 6.9Gi 8.5Gi
Swap: 4.0Gi 0B 4.0Gi
Look at available, not free. Linux uses unused memory for disk cache (buff/cache); that’s good, not a leak. available is what an application could realistically allocate without forcing swap.
If available is small AND swap usage is climbing AND si/so in vmstat is non-zero, you’re running out of RAM and the kernel is desperately swapping.
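That combined check is scriptable too. A minimal sketch against a fixed free -b sample (the byte counts are invented for illustration; pipe real free -b output instead):

```shell
# "available" low relative to total AND swap in use => real memory pressure.
free_out='              total        used        free      shared  buff/cache   available
Mem:    17179869184  7516192768  2254857830   536870912  7408618586  9126805504
Swap:    4294967296           0  4294967296'
verdict=$(echo "$free_out" | awk '
  /^Mem:/  { total=$2; avail=$7 }
  /^Swap:/ { swap_used=$3 }
  END { if (avail < total*0.10 && swap_used > 0) print "memory pressure";
        else print "ok" }')
echo "$verdict"
```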
Disk I/O — iostat
iostat -x 1 # per-device extended stats, refreshed every 1s
iostat -xz 1 # skip idle devices
Watch:
- %util — how busy the device is. 100% = device queue never empty.
- r/s + w/s — read and write IOPS
- rkB/s + wkB/s — bandwidth
- await — average request latency in ms. >20ms on SSD = pressure; >100ms = serious problem
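Column layout shifts between sysstat versions (newer iostat splits await into r_await/w_await), so it is safer to locate a column by header name than by position. A sketch over a simplified, invented sample:

```shell
# Print devices whose %util exceeds 90, finding the column from the header
# so the script survives sysstat layout changes. Sample is illustrative;
# pipe real `iostat -xz 1 2` output instead.
iostat_out='Device            r/s     w/s     rkB/s     wkB/s   await  %util
sda              12.0    88.0     480.0    3520.0   35.20   96.40
nvme0n1           4.0     2.0      64.0      32.0    0.40    1.10'
busy=$(echo "$iostat_out" | awk '
  NR==1 { for (i=1; i<=NF; i++) if ($i=="%util") col=i; next }
  $col+0 > 90 { print $1 }')
echo "$busy"
```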
dmesg — Kernel Ring Buffer
dmesg -T # human-readable timestamps
dmesg -T | tail -50 # most recent
dmesg -T --level=err,warn # only errors and warnings
The kernel logs hardware errors, OOM-killer activity, network failures, and driver complaints here. ALWAYS check after an incident — OOM-killed processes show up in dmesg as oom-kill: ... Killed process.
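A quick post-incident check for OOM kills might look like this (dmesg can require root; journalctl -k is the systemd alternative):

```shell
# Count OOM-killer traces in the kernel ring buffer. Prints 0 if there are
# none, or if dmesg is not readable without root.
oom_count=$({ dmesg -T 2>/dev/null || true; } | grep -ciE 'out of memory|oom-kill')
echo "OOM events in ring buffer: $oom_count"
```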
journalctl — The systemd Log Tool
journalctl -xe # latest, with explanations
journalctl -u nginx # for one service
journalctl -u nginx -f # follow live
journalctl --since "2 hours ago"
journalctl --since today --priority=err
journalctl -k # kernel only (same as dmesg)
journalctl --disk-usage # how much space the journal uses
journalctl -u service-name -f is the systemd equivalent of tail -f /var/log/.... Modern Linux puts everything here.
sar — Historical Data
sar -u # CPU history (last day)
sar -r # memory history
sar -d # disk history
sar -n DEV # network history
sar -u 1 5 # CPU live, 1s, 5 samples
sar -f /var/log/sysstat/sa15 # specific day's archive
Install sysstat on production servers. It records system metrics every 10 minutes and keeps several weeks of history by default (exact retention is distribution-dependent; see the HISTORY setting in its config). When someone asks “was the system slow at 3 AM yesterday,” sar tells you.
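A quick way to verify a box has it before you need the history (package and service names below are the common ones, but vary by distribution):

```shell
# Is sysstat collecting on this host?
if command -v sar >/dev/null 2>&1; then
  status="sar present: archives are typically under /var/log/sysstat or /var/log/sa"
else
  status="sar missing: install the sysstat package and enable its timer/cron job"
fi
echo "$status"
```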
Networking Quick Checks
ss -s # socket statistics summary
ss -tunlp # all listening sockets with PIDs
ss -tan state established # established TCP connections
netstat -i # per-interface counters (drops, errors)
ip -s link # newer alternative to netstat -i
Interface drops or errors growing = link problems. ethtool eth0 gives the hardware perspective.
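If you want the drop counters without parsing ip -s link output, sysfs exposes them directly (Linux-only; a sketch):

```shell
# Snapshot per-interface drop counters from sysfs. Run twice and diff:
# a *growing* counter is the problem, not a static one.
report=$(
  for dev in /sys/class/net/*; do
    [ -r "$dev/statistics/rx_dropped" ] || continue
    printf '%s rx_dropped=%s tx_dropped=%s\n' "$(basename "$dev")" \
      "$(cat "$dev/statistics/rx_dropped")" \
      "$(cat "$dev/statistics/tx_dropped")"
  done
)
[ -n "$report" ] || report="no interfaces visible under /sys/class/net"
echo "$report"
```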
Per-Process Deep Dive
strace -p 1234 # see what syscalls a process is making
strace -c -p 1234 # syscall summary (counts + time)
lsof -p 1234 # all files this PID has open
lsof -i :443 # who has port 443 open
pmap -x 1234 # memory map of a process
strace is the truth-teller when a process is “hanging.” You’ll see exactly which syscall it’s blocked in — usually read on a socket, poll on file descriptors, or futex on a lock.
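When lsof is not installed, /proc gives you much of the same answer (Linux-only; a sketch that inspects the current shell as a stand-in for a real PID):

```shell
# Open file descriptors of a PID, straight from /proc.
pid=$$                                      # demo target; substitute any PID
fd_count=$(ls /proc/"$pid"/fd 2>/dev/null | wc -l)
echo "pid $pid has $fd_count open file descriptors"
ls -l /proc/"$pid"/fd 2>/dev/null | tail -n +2 | head -5   # where each fd points
```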
Common Pitfalls
- Looking at free instead of available. Linux file caching makes free look small even on idle systems.
- One vmstat sample. The first sample is averaged since boot — meaningless. Always grab at least 5 samples and look at the second through fifth.
- High load average isn’t always CPU. Linux load includes uninterruptible I/O waits. A box with 50 load and 0% CPU is disk-blocked, not CPU-saturated.
- Forgetting iowait. %wa in top > 5% is the diagnostic for disk pressure that doesn’t show up as “CPU busy.”
- No historical data. Without sar, you can’t answer “was it slow yesterday at 3 AM?” Install sysstat on every production box.
Conclusion
Five habits:
- The 60-second sequence: uptime, dmesg -T, vmstat 1 5, free -h, df -h.
- htop in a tmux pane during any non-trivial debugging.
- journalctl -u SERVICE -f as the modern tail -f.
- Install sysstat for historical metrics.
- When a process is hanging, strace -p PID tells you what it’s actually waiting on.