Linux Storage & RAID Errors: mdadm degraded, LVM, SMART, multipath

Part of pathway: Linux Troubleshooting: 150 Common Errors

Storage and RAID Errors

RAID and LVM errors are the highest-stakes Linux issues — data integrity is on the line. Most of them follow a pattern: a disk returns errors, the array degrades, the rebuild begins, and either succeeds (good) or fails halfway with another disk dropping (bad). The ten errors below are what the on-call sees when storage hardware misbehaves.

#081 RAID array degraded

Solution: cat /proc/mdstat shows status; mdadm --detail /dev/md0 for full diagnosis. Mark the bad disk failed (if the kernel hasn't already), remove it, and add the replacement: mdadm /dev/md0 --fail /dev/sdc --remove /dev/sdc && mdadm /dev/md0 --add /dev/sde.
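A typical replacement sequence, assuming /dev/md0 with failed member /dev/sdc and replacement /dev/sde (device names here are placeholders):

```shell
# Check array health; an underscore in the [UU_] status line means a member is missing
cat /proc/mdstat
mdadm --detail /dev/md0

# Mark the bad disk failed (if the kernel hasn't already) and remove it
mdadm /dev/md0 --fail /dev/sdc
mdadm /dev/md0 --remove /dev/sdc

# Add the replacement; the rebuild starts automatically
mdadm /dev/md0 --add /dev/sde

# Watch rebuild progress
watch -n 5 cat /proc/mdstat
```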

#082 RAID rebuild failed (second disk dropped during recovery)

Description: Worst-case scenario; array is now in inconsistent state.

Solution: Stop writes immediately. Image the failing disk with ddrescue if possible. Restore from backup. Don’t guess — this is when you call the storage vendor.
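A ddrescue imaging run might look like this, assuming the failing disk is /dev/sdc and /mnt/rescue has enough free space (both are placeholder paths):

```shell
# First pass: grab the easy sectors fast, skip errors (-n = no scraping)
ddrescue -n /dev/sdc /mnt/rescue/sdc.img /mnt/rescue/sdc.map

# Second pass: go back for the bad areas, up to 3 retries, direct disc access
ddrescue -d -r3 /dev/sdc /mnt/rescue/sdc.img /mnt/rescue/sdc.map
```

The map file records which sectors have been read, so runs are resumable — never delete it mid-recovery.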

#083 LVM: Volume group not found

Solution: vgscan to rescan devices (pvscan --cache on lvmetad-based systems); vgchange -ay vg_name to activate; check pvs — is a physical volume missing or showing as an unknown device?
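A minimal diagnostic sequence, with vg_name as a placeholder for your volume group:

```shell
# Rescan block devices for LVM metadata
vgscan

# List physical volumes; a missing PV shows up as "unknown device"
pvs -o +uuid

# Once all PVs are visible, activate the volume group
vgchange -ay vg_name
```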

#084 LVM thin pool 100% full

Description: Filesystem and OS appear fine but writes fail.

Solution: lvs shows pool usage; immediate fix: extend the pool with lvextend -L+50G vg/pool; long-term: enable autoextend in /etc/lvm/lvm.conf.
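A sketch of the immediate fix plus the autoextend settings, assuming a pool named vg/pool (threshold/percent values are illustrative):

```shell
# Data% and Meta% columns show thin pool usage
lvs -a vg

# Immediate relief: grow the pool, and its metadata LV if Meta% is also high
lvextend -L +50G vg/pool
lvextend --poolmetadatasize +1G vg/pool

# Long-term: in the activation section of /etc/lvm/lvm.conf, e.g.:
#   thin_pool_autoextend_threshold = 80   # extend when 80% full
#   thin_pool_autoextend_percent = 20     # grow by 20% each time
```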

#085 SMART: Pre-fail attribute

Description: smartctl -a /dev/sdX shows Reallocated_Sector_Ct or Current_Pending_Sector incrementing.

Solution: Replace the disk preemptively. SMART pre-fail is the warning shot before catastrophic failure.
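The relevant checks (replace /dev/sdX with the actual device):

```shell
# Full report; watch Reallocated_Sector_Ct (ID 5) and Current_Pending_Sector (ID 197)
smartctl -a /dev/sdX

# Overall health verdict only
smartctl -H /dev/sdX

# Kick off an extended self-test; check progress later with smartctl -a
smartctl -t long /dev/sdX
```

A rising raw value on either attribute — even a small one — is the signal to schedule replacement, regardless of what the overall health verdict says.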

#086 multipath device: failed all paths

Solution: multipath -ll; check fabric (FC switches, iSCSI portals); verify LUN is still mapped on the array; iscsiadm -m session -R to rescan.
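A diagnostic pass for an iSCSI-backed multipath LUN might look like this:

```shell
# Path topology: which paths are active, which are failed/faulty
multipath -ll

# List iSCSI sessions, then rescan them for the LUN
iscsiadm -m session
iscsiadm -m session -R

# After fabric or LUN mapping is restored, reload the multipath maps
multipath -r
```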

#087 LVM snapshot exceeded its CoW pool

Solution: Snapshot is unusable when full. Either extend it (lvextend) before changes accumulate, or accept the loss and lvremove.
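In commands, assuming a snapshot vg/snap (names are placeholders):

```shell
# Data% climbing toward 100 means CoW space is running out
lvs vg/snap

# Grow the snapshot before it fills
lvextend -L +10G /dev/vg/snap

# If it already filled, the snapshot is invalid; discard it
lvremove /dev/vg/snap
```

lvm.conf also has snapshot_autoextend_threshold / snapshot_autoextend_percent settings, analogous to the thin-pool ones, if you want this handled automatically.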

#088 mdadm: insufficient devices to start array

Description: Two disks lost from a RAID 5; with single parity, the remaining members cannot reconstruct the data.

Solution: Force assemble at risk: mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdd1. Last resort — data integrity not guaranteed.
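Before forcing, compare the members' metadata so you know how far out of sync they are (device names as in the example above):

```shell
# Event counts and update times per member; small gaps mean less potential damage
mdadm --examine /dev/sd[abd]1 | egrep 'Event|Update'

# Force-assemble with the freshest members — data integrity not guaranteed
mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdd1

# Mount read-only first and verify the data before allowing any writes
mount -o ro /dev/md0 /mnt/recovery
```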

#089 iSCSI session lost

Solution: journalctl -u iscsid; iscsiadm -m session to list; iscsiadm -m node -T target -p portal --login to reconnect.
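Spelled out, with the target IQN and portal address as placeholders:

```shell
# What does the initiator daemon say about the drop?
journalctl -u iscsid --since "-1h"

# Current sessions; empty output means nothing is logged in
iscsiadm -m session

# Re-login to the target
iscsiadm -m node -T iqn.2024-01.com.example:lun1 -p 192.0.2.10:3260 --login
```

If sessions drop repeatedly, look at node.session.timeo.replacement_timeout and the network path before blaming the target.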

#090 fstrim / discard not supported

Description: SSDs need TRIM to maintain performance; fstrim fails on certain filesystems/drivers.

Solution: Verify with lsblk -d -o NAME,DISC-GRAN,DISC-MAX — non-zero values mean the device advertises discard support. For LVM, issue_discards = 1 in the devices section of /etc/lvm/lvm.conf makes LVM pass discards down when logical volumes are removed or shrunk.
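The verification and trim steps in sequence:

```shell
# Non-zero DISC-GRAN/DISC-MAX means the device advertises discard support
lsblk -d -o NAME,DISC-GRAN,DISC-MAX

# One-shot trim of a mounted filesystem, verbose output
fstrim -v /

# Preferred over the 'discard' mount option: periodic trim via systemd timer
systemctl enable --now fstrim.timer
```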

Conclusion

  1. Monitor mdadm --detail output via Prometheus/check_mk; a degraded array is silent without monitoring.
  2. Replace SMART pre-fail disks BEFORE they fail completely.
  3. Test backup restores quarterly. The first time you find your backup is broken should not be during an outage.
  4. Use ddrescue not dd when imaging dying disks.
  5. RAID is not backup. Snapshots aren’t backups either. Real backups go off-host.
