
Linux Process & Memory Errors: OOM, ulimit, fork failures

Part of pathway: Linux Troubleshooting: 150 Common Errors

Process and Memory Errors — The OOM Killer and Friends

When a Linux system is under memory pressure, the kernel’s OOM killer activates, scoring every process and terminating the highest scorer. Memory and process-limit errors — “Cannot allocate memory”, “Resource temporarily unavailable”, “Too many open files” — are the second-most-common class of Linux production issue after disk problems. This article walks through the ten you’ll see most often.

#011 Out of memory: Killed process X

Description: The kernel’s OOM killer terminated a process to reclaim memory.

Root cause: Total RAM + swap was insufficient for the working set; kernel picked the process with highest oom_score and sent SIGKILL.

Solution: dmesg | grep -i oom shows what was killed; cat /proc/<PID>/oom_score shows the scoring; protect a critical process with echo -1000 > /proc/<PID>/oom_score_adj (-1000 exempts it from the OOM killer entirely); add RAM or swap; investigate the leak with pmap -x <PID>.
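The triage sequence above can be sketched as a small script. This is illustrative only: dmesg may be restricted to root on hardened systems, and writing oom_score_adj always requires root.

```shell
#!/bin/sh
# Sketch: inspect OOM-killer activity and per-process scoring.
# Assumes a Linux /proc filesystem.

# Recent OOM-killer events, if any (may need root if dmesg is restricted)
dmesg 2>/dev/null | grep -i 'out of memory' | tail -n 5

# Score for the current shell (0-1000; higher = killed first)
echo "our oom_score:     $(cat /proc/self/oom_score)"
echo "our oom_score_adj: $(cat /proc/self/oom_score_adj)"

# To exempt a critical daemon from the OOM killer (requires root):
#   echo -1000 > /proc/<PID>/oom_score_adj
```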

#012 Cannot allocate memory (ENOMEM)

Description: malloc() or fork() fails with ENOMEM even though free memory appears to be available.

Root cause: Either RAM is genuinely exhausted, OR vm.overcommit_memory=2 is restricting allocations, OR per-process limits are hit.

Solution: free -h; check cat /proc/sys/vm/overcommit_memory; ulimit -a for the process; sysctl -w vm.overcommit_memory=1 for permissive overcommit.
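A quick way to distinguish “RAM is gone” from “overcommit accounting is blocking you” is to compare the commit counters, as in this sketch:

```shell
#!/bin/sh
# Sketch: why is ENOMEM occurring despite apparent free memory?

# Headline numbers
grep -E '^(MemTotal|MemAvailable)' /proc/meminfo

# 0 = heuristic overcommit (default), 1 = always allow, 2 = strict accounting
echo "vm.overcommit_memory: $(cat /proc/sys/vm/overcommit_memory)"

# Under mode 2 the commit limit matters more than free RAM:
# Committed_AS approaching CommitLimit means allocations will fail.
grep -E '^(CommitLimit|Committed_AS)' /proc/meminfo
```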

#013 Too many open files (EMFILE)

Description: A process cannot open additional file descriptors.

Root cause: The process hit its ulimit -n soft limit on file descriptors.

Solution: cat /proc/<PID>/limits; raise with ulimit -Sn 65536; permanent: edit /etc/security/limits.conf or systemd unit LimitNOFILE=65536; check global ceiling: sysctl fs.file-max.
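The checks and fixes above, in one place — the `myuser` entries and the systemd drop-in are illustrative templates, not literal paths on your system:

```shell
#!/bin/sh
# Sketch: inspect and raise the per-process file-descriptor limit.

# Current soft/hard limits for this shell
grep 'open files' /proc/self/limits

# Raise the soft limit for this session (cannot exceed the hard limit):
#   ulimit -Sn 65536
# Persistent, per-user (/etc/security/limits.conf):
#   myuser  soft  nofile  65536
#   myuser  hard  nofile  65536
# Persistent, per-service (systemd drop-in):
#   [Service]
#   LimitNOFILE=65536

# System-wide ceiling on file descriptors
echo "fs.file-max: $(cat /proc/sys/fs/file-max)"
```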

#014 Resource temporarily unavailable (EAGAIN on fork)

Description: fork() returns EAGAIN; new processes cannot be created.

Root cause: The user hit the nproc limit (ulimit -u) on simultaneous processes.

Solution: ps -u username | wc -l to count; raise nproc in /etc/security/limits.conf; check for fork-bomb behavior in misbehaving shell loops.
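A sketch of the count-versus-limit comparison (uses procps-style ps options, which may differ on minimal/busybox systems):

```shell
#!/bin/sh
# Sketch: diagnose fork() EAGAIN caused by the per-user process limit.

user=$(id -un)

# How many processes the user is running vs. the nproc soft limit
count=$(ps -u "$user" --no-headers | wc -l)
echo "processes for $user: $count"
echo "nproc soft limit:    $(ulimit -u)"

# If the count is near the limit, look for runaway parent/child chains:
#   ps -u "$user" -o pid,ppid,cmd --sort=ppid
```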

#015 Segmentation fault (SIGSEGV)

Description: A process accessed memory it didn’t own and was killed.

Solution: Enable core dumps: ulimit -c unlimited; analyze with gdb /path/to/binary core; for repeating production crashes, run under strace or valgrind.
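Before the gdb step works, the host has to actually write core files somewhere. This sketch checks the kernel side; the gdb commands in the comments assume you have the crashing binary and its core file at hand:

```shell
#!/bin/sh
# Sketch: prepare a host for segfault post-mortems.

# Where the kernel writes core files. A leading '|' means cores are
# piped to a handler such as systemd-coredump or apport instead.
echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"

# Allow unlimited-size core dumps in this shell and its children:
#   ulimit -c unlimited
# Then reproduce the crash and open the dump:
#   gdb /path/to/binary core
#   (gdb) bt        # backtrace at the faulting instruction
```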

#016 Killed (out of swap)

Description: Process killed without explicit OOM message; system was thrashing.

Root cause: Swap was exhausted; the kernel killed processes to recover.

Solution: vmstat 1 — high si/so columns mean active swapping. Add swap (fallocate -l 4G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile); add RAM; tune vm.swappiness.
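A sketch of the before/after check; the swapfile creation steps are shown as comments because they require root and write to disk:

```shell
#!/bin/sh
# Sketch: check swap pressure before and after adding a swapfile.

# Current swap totals straight from the kernel
grep -E '^Swap(Total|Free)' /proc/meminfo

# Creating a 4 GiB swapfile (requires root):
#   fallocate -l 4G /swapfile
#   chmod 600 /swapfile
#   mkswap /swapfile
#   swapon /swapfile
# Make it permanent in /etc/fstab:
#   /swapfile none swap sw 0 0

# Tendency to swap anonymous pages (lower = less eager to swap)
echo "vm.swappiness: $(cat /proc/sys/vm/swappiness)"
```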

#017 Process hung in D state (uninterruptible sleep)

Description: A process shows D in ps output and won’t respond to signals (not even SIGKILL).

Root cause: Stuck in a kernel I/O syscall, usually waiting on disk or NFS.

Solution: cat /proc/<PID>/wchan — what kernel function it’s waiting on; cat /proc/<PID>/stack for the stack trace; usually fixing the underlying I/O (NFS server, dead disk) is the only option.
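A one-liner to find every D-state process and the kernel function it is blocked in, sketched here with procps-style ps options (reading /proc/<PID>/stack still requires root):

```shell
#!/bin/sh
# Sketch: list processes stuck in uninterruptible sleep (state D)
# together with the kernel wait channel they are blocked in.

ps -eo pid,stat,wchan:30,comm --no-headers | awk '$2 ~ /^D/'

# For a specific stuck PID:
#   cat /proc/<PID>/wchan     # wait channel name
#   cat /proc/<PID>/stack     # full kernel stack trace (root only)
```

An empty result is good news: nothing is currently wedged in the kernel.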

#018 Too many open files in system (ENFILE)

Description: System-wide file descriptor table exhausted.

Solution: cat /proc/sys/fs/file-nr; raise fs.file-max; find the leaking process: lsof | awk '{print $2}' | sort | uniq -c | sort -rn | head.
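If lsof is not installed, the same “who is leaking descriptors” question can be answered from /proc directly, as in this sketch (run as root to see every process’s fd directory):

```shell
#!/bin/sh
# Sketch: system-wide FD usage and the top consumers, without lsof.

# Three fields: allocated, free, maximum
echo "fs.file-nr: $(cat /proc/sys/fs/file-nr)"

# Count open descriptors per process by listing /proc/<pid>/fd
for d in /proc/[0-9]*/fd; do
  n=$(ls "$d" 2>/dev/null | wc -l)
  echo "$n ${d%/fd}"
done | sort -rn | head -n 5
```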

#019 Stack overflow / pthread_create failed

Description: A multi-threaded process fails to spawn additional threads; pthread_create() returns EAGAIN.

Root cause: The per-process task limit (ulimit -u), the virtual memory limit (ulimit -v), or the kernel’s kernel.threads-max ceiling was hit; each thread also reserves stack space.

Solution: Check /proc/sys/kernel/threads-max and the process’s limits in /proc/<PID>/limits; reduce the per-thread stack size if address space is the constraint.
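The three ceilings can be read in one pass, sketched below; the stack-size arithmetic in the comment is the usual back-of-envelope estimate, not an exact figure:

```shell
#!/bin/sh
# Sketch: the ceilings that make pthread_create() fail with EAGAIN.

echo "kernel.threads-max: $(cat /proc/sys/kernel/threads-max)"
echo "ulimit -u (nproc):  $(ulimit -u)"
echo "ulimit -v (vmem):   $(ulimit -v)"
echo "ulimit -s (stack):  $(ulimit -s) KiB default per thread"

# Each thread reserves a stack; with an 8 MiB default, ~1000 threads
# commit ~8 GiB of address space. Shrink it in the program with
# pthread_attr_setstacksize(), or per-shell with: ulimit -s 2048
```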

#020 Process never starts (silent fail in cron)

Description: A scheduled job appears not to run.

Solution: Check journalctl -u cron; verify $PATH in cron environment; redirect both stdout and stderr to a log: * * * * * cmd >>/var/log/myjob.log 2>&1.
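Putting the PATH fix and the logging fix together, a cron entry that cannot fail silently looks like this sketch — the file name /etc/cron.d/myjob, the command /usr/local/bin/myjob, and the log path are all hypothetical placeholders:

```shell
# Sketch of /etc/cron.d/myjob: cron's default environment is minimal,
# so set PATH explicitly and capture both output streams.
#
#   PATH=/usr/local/bin:/usr/bin:/bin
#   * * * * *  root  /usr/local/bin/myjob >>/var/log/myjob.log 2>&1

# Confirm the daemon is actually firing:
#   journalctl -u cron --since "1 hour ago"    # Debian/Ubuntu
#   journalctl -u crond --since "1 hour ago"   # RHEL/Fedora
```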

Conclusion

Five habits:

  1. Always dmesg | grep -i oom after an unexplained process death.
  2. Set explicit LimitNOFILE in systemd unit files for any service that might open many sockets.
  3. Monitor vmstat 1 for swap-in/out spikes — that’s the leading indicator before OOM.
  4. Set oom_score_adj to negative values for critical daemons.
  5. Enable core dumps in production for post-mortem analysis of segfaults.
