Linux System Monitoring: vmstat, top, iostat, sar, journalctl

The First Five Commands When Something’s Slow

A server is slow. The phone is ringing. You SSH in. The first 60 seconds are diagnostic gold — before you start fixing anything, you need to know whether you’re looking at CPU saturation, memory pressure, disk I/O queueing, or network saturation. This article is the working playbook: which commands to run, in what order, and what to look at in their output.

The 60-Second Diagnostic Sequence

uptime                                   # load averages first
dmesg -T | tail -20                      # any kernel-level alarm bells?
vmstat 1 5                               # CPU + memory + I/O at a glance
free -h                                  # memory specifically
df -h ; df -i                            # disk usage AND inodes

That’s the standard 60-second triage. Most incidents reveal themselves here. Then drill in:

uptime — Load Average

$ uptime
 14:23:01 up 12 days,  3:42,  4 users,  load average: 1.42, 1.10, 0.95

Three numbers: load average over 1 minute, 5 minutes, 15 minutes. Load is roughly “runnable processes + uninterruptible-IO processes.” Compare to your CPU count:

  • Load < CPU count — idle headroom
  • Load = CPU count — fully utilized
  • Load > CPU count — queuing, work piling up

Get CPU count with nproc. If your 5-minute average is 24 on a 16-CPU box, you’re saturating.
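
Two quick checks to pair with uptime: nproc for the CPU count, and /proc/loadavg for the raw numbers the load average comes from.

nproc                                    # how many CPUs the kernel sees
cat /proc/loadavg                        # 1/5/15-min load, runnable/total tasks, last PID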

top, htop, btop

Real-time process viewers. top ships everywhere; htop has the better UI; btop is the prettiest but usually needs installing.

What to watch in top:

  • %CPU per process — one process pinning a core?
  • %MEM — memory hog?
  • %wa in the header — iowait: CPU idle waiting for disk. Anything >5% sustained means disk pressure.
  • %si — software interrupts. High = network or driver pressure.
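
A few top invocations worth keeping at hand (these flags are from procps-ng top, so a minimal or busybox top may not have all of them; PID 1234 is a placeholder):

top -o %MEM                              # sort by memory instead of CPU
top -H -p 1234                           # show the threads of a single PID
top -b -n 1 > snapshot.txt               # batch mode: one snapshot to a file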

vmstat — The Single Best Triage Tool

$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 542384  18272 4189832    0    0    24    87  120  290  4  1 95  0  0
 3  0      0 540212  18272 4189832    0    0     0   124  890 2104 18  3 79  0  0

The columns that matter:

  • r — runnable processes. If r > CPU count consistently, you’re CPU-bound.
  • b — blocked on I/O. High = disk or network bottleneck.
  • swpd — swap used. Non-zero & growing = memory pressure.
  • si/so — swap in/out per second. Sustained non-zero values mean the system is thrashing.
  • bi/bo — blocks read/written per second. Indicates disk pressure.
  • us / sy / id / wa / st — user / system / idle / iowait / steal CPU percentages, averaged across all CPUs; together they add up to roughly 100.
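
If the default output is hard to read at a glance, procps-ng vmstat can widen the columns and report sizes in megabytes; this is purely cosmetic, the numbers are the same:

vmstat -w -SM 1 5                        # wide columns, sizes in MB
vmstat -a 1 5                            # active/inactive memory instead of buff/cache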

Memory — free

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           16Gi       7.0Gi       2.1Gi       512Mi       6.9Gi       8.5Gi
Swap:         4.0Gi          0B       4.0Gi

Look at available, not free. Linux uses unused memory for disk cache (buff/cache); that’s good, not a leak. available is what an application could realistically allocate without forcing swap.

If available is small AND swap usage is climbing AND si/so in vmstat is non-zero, you’re running out of RAM and the kernel is desperately swapping.
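
Two quick ways to keep an eye on this while you investigate (watch is from procps; /proc/meminfo is where free gets its numbers):

watch -n1 free -h                        # refresh every second
grep -E 'MemAvailable|SwapFree' /proc/meminfo   # the raw counters behind free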

Disk I/O — iostat

iostat -x 1                              # per-device extended stats, refreshed every 1s
iostat -xz 1                             # skip idle devices

Watch:

  • %util — how busy the device is. 100% = device queue never empty.
  • r/s + w/s — read and write IOPS
  • rkB/s + wkB/s — bandwidth
  • await — average request latency in ms. >20ms on SSD = pressure; >100ms = serious problem
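
One habit worth adding: as with vmstat, the first iostat report is an average since boot, so read from the second sample onward.

iostat -xz 2 30                          # 2-second samples; skip the first (since-boot) report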

dmesg — Kernel Ring Buffer

dmesg -T                                 # human-readable timestamps
dmesg -T | tail -50                      # most recent
dmesg -T --level=err,warn                # only errors and warnings

The kernel logs hardware errors, OOM-killer activity, network failures, and driver complaints here. ALWAYS check after an incident — OOM-killed processes show up in dmesg as oom-kill: ... Killed process.
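
A quick way to confirm or rule out OOM-killer involvement:

dmesg -T | grep -iE 'oom|killed process' # did the OOM killer fire?
journalctl -k --since "2 hours ago" | grep -i oom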

journalctl — The systemd Log Tool

journalctl -xe                           # latest, with explanations
journalctl -u nginx                      # for one service
journalctl -u nginx -f                   # follow live
journalctl --since "2 hours ago"
journalctl --since today --priority=err
journalctl -k                            # kernel only (same as dmesg)
journalctl --disk-usage                  # how much space the journal uses

journalctl -u service-name -f is the systemd equivalent of tail -f /var/log/.... Modern Linux puts everything here.
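
The filters stack, which is what makes journalctl useful during an incident. For example (the service name and time window are placeholders):

journalctl -u nginx --since "09:00" --until "09:45" -p warning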

sar — Historical Data

sar -u                                   # CPU history (last day)
sar -r                                   # memory history
sar -d                                   # disk history
sar -n DEV                               # network history
sar -u 1 5                               # CPU live, 1s, 5 samples
sar -f /var/log/sysstat/sa15             # specific day's archive

Install sysstat on production servers. It records system metrics every 10 minutes by default and keeps several weeks of history (the retention period is configurable). When someone asks “was the system slow at 3 AM yesterday,” sar tells you.
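
Getting collection running varies slightly by distro; on a Debian/Ubuntu-style box it is roughly as follows (older releases also want ENABLED="true" in /etc/default/sysstat):

sudo apt install sysstat                 # provides sar, iostat, mpstat, pidstat
sudo systemctl enable --now sysstat      # start the periodic collector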

Networking Quick Checks

ss -s                                    # socket statistics summary
ss -tunlp                                # all listening sockets with PIDs
ss -tan state established                # established TCP connections
netstat -i                               # per-interface counters (drops, errors)
ip -s link                               # newer alternative to netstat -i

Interface drops or errors growing = link problems. ethtool eth0 gives the hardware perspective.
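
Two ways to see whether an interface is actually dropping packets (eth0 is a placeholder; substitute your interface name):

ip -s link show eth0                     # RX/TX bytes, errors, drops for one interface
ethtool -S eth0 | grep -iE 'err|drop'    # NIC/driver-level counters, if the driver exposes them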

Per-Process Deep Dive

strace -p 1234                           # see what syscalls a process is making
strace -c -p 1234                        # syscall summary (counts + time)
lsof -p 1234                             # all files this PID has open
lsof -i :443                             # who has port 443 open
pmap -x 1234                             # memory map of a process

strace is the truth-teller when a process is “hanging.” You’ll see exactly which syscall it’s blocked in — usually read on a socket, poll on file descriptors, or futex on a lock.
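
Two strace variants that make “what is it stuck on” quicker to answer (PID 1234 is a placeholder, as above):

strace -f -tt -T -p 1234                 # follow threads, wall-clock timestamps, time per syscall
timeout 10 strace -c -p 1234             # profile syscalls for 10 seconds, then print the summary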

Common Pitfalls

  • Looking at free instead of available. Linux file caching makes free look small even on idle systems.
  • One vmstat sample. The first sample is averaged since boot — meaningless. Always grab at least 5 samples and look at the second through fifth.
  • High load average isn’t always CPU. Linux load includes uninterruptible I/O waits. A box with 50 load and 0% CPU is disk-blocked, not CPU-saturated.
  • Forgetting iowait. %wa in top > 5% is the diagnostic for disk pressure that doesn’t show up as “CPU busy.”
  • No historical data. Without sar, you can’t answer “was it slow yesterday at 3 AM?” Install sysstat on every production box.

Conclusion

Five habits:

  1. The 60-second sequence: uptime, dmesg -T, vmstat 1 5, free -h, df -h.
  2. htop in a tmux pane during any non-trivial debugging.
  3. journalctl -u SERVICE -f as the modern tail -f.
  4. Install sysstat for historical metrics.
  5. When a process is hanging, strace -p PID tells you what it’s actually waiting on.
