Lustre Troubleshooting Guide
This guide covers common Lustre issues, error codes, debugging tools, and recovery procedures for Lustre 2.17.0 and 2.15.x (as of January 2026). It draws on the Lustre Operations Manual (updated August 2025), wiki tips, and LUG 2025 discussions (e.g., DNE3 for metadata scaling). Lustre reports errors as standard Linux errno values, returned as negative numbers; 0 indicates success, and Lustre defines no custom positive error codes. The meaning of a given code depends on the subsystem that returns it: Client (file I/O, RPCs), MDS (metadata, permissions), OST (data storage), MGS (configuration), or LNet (networking). Errors appear in the system log (/var/log/messages) prefixed with "LustreError" and include the function, module, PID, nodes, and RPC details. Use lctl debug_daemon to capture kernel debug logs. For recent bugs, search JIRA (e.g., LU-19758 for ext4 issues in 2025).
Common Issues and Error Codes
Every code in the table below maps to an entry in Linux errno.h; the descriptions give the Lustre-specific meaning. The source code propagates errors by returning -ERRNO from each subsystem (e.g., ptlrpc for RPCs, obdclass for devices, ldlm for locks).
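
Because every code is a negated errno value, a return code seen in a log can be decoded with nothing more than the local errno table. The following is a minimal sketch using only the Python standard library; the helper name `describe_rc` is hypothetical, and the symbolic names it prints are those of the machine running the script, which is assumed to be Linux.

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Turn a Lustre-style return code (0 or a negative errno) into readable text."""
    if rc == 0:
        return "0 (success)"
    if rc > 0:
        # Lustre defines no positive error codes, so flag these as unexpected.
        return f"{rc} (unexpected positive value)"
    number = -rc
    # errno.errorcode maps numbers to symbolic names on the local platform.
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{rc} ({name}): {os.strerror(number)}"


if __name__ == "__main__":
    for rc in (0, -errno.ENOENT, -errno.ENOSPC, -errno.ETIMEDOUT):
        print(describe_rc(rc))
```

The decoded name identifies only the errno; the Lustre-specific cause still has to be looked up by subsystem.
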
| Issue | Subsystem/Component | Error Code | Description/Troubleshooting |
|---|---|---|---|
| Operation not permitted | General (Client/MDS) | -1 (EPERM) | Caller lacks the required privilege; check ACLs, nodemap, or SELinux policies. |
| OST object missing/damaged | OST/Client | -2 (ENOENT) | No such file/entry; check /lost+found on OST. Salvage with LFSCK; unlink corrupted objects. |
| Operation interrupted | General | -4 (EINTR) | Interrupted by signal (e.g., CTRL-C); retry operation. |
| I/O failure | OST/Client | -5 (EIO) | Read/write failed; inspect hardware, run e2fsck on ldiskfs. Common in storage errors. |
| No such device | OST/MDS/Client | -19 (ENODEV) | Server stopped/failed over; verify with lctl device_list. |
| Invalid argument | General (API/PTLRPC) | -22 (EINVAL) | Bad parameter (e.g., lfs setstripe); validate inputs. |
| Out of space/inodes | OST/MDS | -28 (ENOSPC) | No space/inodes; use lfs df -h/-i. Migrate with lfs migrate; check open files on MDS. |
| OST read-only | OST | -30 (EROFS) | Filesystem read-only due to error; fix hardware, restart services. |
| Identifier removed | MDS/Client | -43 (EIDRM) | UID/GID mismatch; update /etc/passwd, /etc/group on MDS. |
| Not connected | LNet/Client | -107 (ENOTCONN) | Client-server disconnect; check lctl ping, peers. |
| Shutdown in progress | LNet/Client | -108 (ESHUTDOWN) | I/O during shutdown; apps may not handle; retry post-recovery. |
| Connection timed out | LNet/Client/Server | -110 (ETIMEDOUT) | RPC/network timeout; tune at_max/at_min, check network congestion. |
| Quota exceeded | Quota (MDT/OST) | -122 (EDQUOT) | Disk quota hit; use lfs quota to check/adjust limits. |
| No quota found | Quota | -3 (ESRCH) | Quotas not enabled; configure on filesystem. |
| Out of memory | General (OST/Client) | -12 (ENOMEM) | Kernel OOM; check logs, increase RAM or tune threads. |
| Bad address | API/Client | -14 (EFAULT) | Memory access issue; ensure buffers allocated properly. |
| Not supported | API/Server | -95 (EOPNOTSUPP) | Feature not supported (e.g., old server); check Lustre version. Userspace reports the same value as ENOTSUP; the kernel-internal ENOTSUPP is 524, not 95. |
| No key | Client (File/directory encryption) | -126 (ENOKEY) | File/directory encryption key missing; use fscrypt unlock. |
| Operation not supported | MDT (ACL) | -95 (EOPNOTSUPP) | ACL not enabled; configure on MDS, or unknown RPC opcode. |
| Back in time error | MDT/OST | -5 (EIO) or -30 (EROFS) | Transaction loss, filesystem read-only; run e2fsck to repair. |
| Slow page write | Client | -110 (ETIMEDOUT) | Memory allocation delays; tune VM parameters. |
| Watchdog timeout | Any | -110 (ETIMEDOUT) | Slow ops (e.g., RAID rebuild); capture stack with lctl dk. |
| Timeouts on setup | Client/MGS/LNet | -110 (ETIMEDOUT) | Firewall/DNS issues; check port 988, hosts.deny. |
| No matching NID | LNet | -22 (EINVAL) | Config mismatch; verify networks/routes with lnetctl. |
| Mount failure | Client | -107 (ENOTCONN) or -110 (ETIMEDOUT) | Firewall/hosts.deny; check syslogs, lctl which_nid. |
| Dead routers | LNet | -107 (ENOTCONN) | Enable router_checker; set auto_down=1. |
| Asymmetric routes | LNet | -107 (ENOTCONN) | Unknown routers; check drops with lctl get_param stats. |
| Changelog overload | MDT | -28 (ENOSPC) | Purge records; lfs changelog_clear. |
| Client eviction | Client (LDLM) | -107 (ENOTCONN) | DLM/ping failures; check connectivity. |
| Server crash | OSS/MDS | -5 (EIO) | Journal replay on recovery; check for LBUG in logs. |
| Memory/lock contention | Client/Server (LDLM) | -12 (ENOMEM) | High locks; clear LRU with ldlm.namespaces.*.lru_size=clear. |
| File fragmentation | Any | -5 (EIO) | Aged FS; use filefrag -v. |
| Striping imbalance | OSTs | -28 (ENOSPC) | Free-space difference between OSTs above 17% (the qos_threshold_rr default); use weighted allocation. |
| Inactive/Degraded OST | OST | -5 (EIO) or -30 (EROFS) | Deactivate; migrate data with lfs_migrate. |
| MMP conflict | Any | -22 (EINVAL) | Multiple mounts; delay mount >=10s. |
| Root squash fail | MDT | -1 (EPERM) | Untrusted clients; add to nodemap. |
| Namespace failure | MDT | -2 (ENOENT) | Subdir/fileset missing; check DNE config. |
| SELinux denial | Client | -13 (EACCES) | Policy mismatch; use l_getsepol. |
| Kerberos failure | Any (SEC) | -13 (EACCES) or -126 (ENOKEY) | Clock sync/FQDN issues; check GSS. |
| ZFS snapshot fail | OST | -110 (ETIMEDOUT) | Barrier/timeout; check inconsistencies. |
| HSM agent unresponsive | MDT (HSM) | -110 (ETIMEDOUT) | Copytools block; timeout 3600s; set mdt.*.hsm_control=disabled. |
| PCC exhaustion | Client (PCC) | -28 (ENOSPC) | Fallback to normal I/O; detach with lfs pcc detach. |
| Nodemap inconsistency | MDT | -1 (EPERM) | Unmapped NIDs squashed; modify with lctl nodemap_modify. |
| SSK key issues | Client (SEC) | -111 (ECONNREFUSED) or -22 (EINVAL) | Invalid keys/HMAC; verify with lgss_sk -r; check nodemap. |
| Performance anomalies | OST | -5 (EIO) | Faulty hardware; check variance with obdfilter-survey. |
| Data loss risk (I/O Kit) | OST/MDT | -5 (EIO) | Avoid on production; overwrites devices. |
| Script leaks | I/O Kit | -12 (ENOMEM) | Manual cleanup needed. |
| Module load fail | LNet | -2 (ENOENT) | Modprobe obdecho explicitly. |
| cYAML allocation fail | LNet | -12 (ENOMEM) | No memory for blocks/buffers. |
| Invalid network | LNet | -22 (EINVAL) | Non-existent net; check lnetctl net show. |
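
When a log contains many LustreError lines, tallying them by error code shows which rows of the table above matter most. The sketch below is one way to do that, assuming the messages report the code as `rc = -N`; lines that use another layout are skipped. The function name `tally_lustre_errors` is hypothetical.

```python
import errno
import re
from collections import Counter

# Assumption: the return code is written as "rc = -N" somewhere in the line.
RC_PATTERN = re.compile(r"\brc\s*=\s*(-\d+)")


def tally_lustre_errors(lines):
    """Count LustreError lines per errno name (local platform naming)."""
    counts = Counter()
    for line in lines:
        if "LustreError" not in line:
            continue  # informational "Lustre:" lines and unrelated messages
        match = RC_PATTERN.search(line)
        if match is None:
            continue  # error line without a recognizable return code
        number = -int(match.group(1))
        counts[errno.errorcode.get(number, f"errno {number}")] += 1
    return counts
```

A typical use would be `tally_lustre_errors(open("/var/log/messages", errors="replace"))` followed by `counts.most_common()`, run on the node whose log is being inspected.
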
Debugging Tools
| Tool | Use | Key Commands |
|---|---|---|
| lctl | Admin params, debug, health | lctl get_param, debug_kernel, lfsck_start |
| lnetctl | LNet config/health | net show --verbose, import FILE.yaml |
| lfs | User/client ops, quotas, migration | df -h, df -i, migrate, quota |
| debugfs | Disk inspection | debugfs -c -R "stat ..." |
| strace | Syscall tracing | strace -f program |
| e2fsck/tune2fs | ldiskfs check, repair, tuning | e2fsck -f, tune2fs -O mmp |
| dumpe2fs | Emergency ldiskfs debug, recovery | dumpe2fs -h |
| llstat/llobdstat | Stats | llstat -i 1 ost |
| collectl/lltop | Monitoring | Config via collectl |
| perf | Profiling | perf top |
| lmt | Top-like monitor view | GitHub install |
| filefrag | Fragmentation | filefrag -v |
| Wireshark | Packet capture | - |
| kdump | Crash analysis | - |
| lgss_sk/keyctl | SSK configuration and debug | lgss_sk -r |
| fscrypt | File/directory encryption | fscrypt status |
| l_getsepol | SELinux | - |
| llsom_sync | Sync lazy file size to MDTs | - |
| lljobstat | JobID monitor (2.15+) | lljobstat -c 10 |
| ost-survey etc. | Benchmarking | ./ost-survey.sh |
| stats-collect | Profiling | gather_stats_everywhere.sh |
| llog_reader | Config log dump/debug | llog_reader /mnt/mgs/... |
| llverdev | Block device data integrity test | llverdev -v |
| lshowmount | Show mounted clients on server | lshowmount -v |
| lst | LNet test | lst new_session |
| cYAML utils | LNet YAML | lustre_yaml_show |
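
Several of the tools above print plain text that is easy to post-process. As an illustration, `lctl get_param` prints single-value parameters as `name=value` lines, which the sketch below turns into a dictionary. The helper names `parse_get_param` and `get_params` are hypothetical, and multi-line parameters (such as stats files) are deliberately reduced to an empty value rather than parsed.

```python
import subprocess


def parse_get_param(text: str) -> dict:
    """Parse single-line `name=value` records as printed by `lctl get_param`."""
    params = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        # Skip blank lines and the body lines of multi-line parameters.
        if not sep or not name or " " in name:
            continue
        params[name] = value  # empty string for a multi-line parameter header
    return params


def get_params(pattern: str) -> dict:
    """Run `lctl get_param PATTERN` on a Lustre node and parse its output."""
    result = subprocess.run(
        ["lctl", "get_param", pattern],
        capture_output=True, text=True, check=True,
    )
    return parse_get_param(result.stdout)
```

For example, `get_params("llite.*.max_read_ahead_mb")` would return one entry per mounted filesystem on a client; it raises `CalledProcessError` if `lctl` reports a failure.
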
Client Troubleshooting
- Mount Failure: Check firewall/port 988; `lctl which_nid`; mount with options.
- Timeouts/Hangs: Set `*.at_max=1500`; monitor `osc.*.timeouts`.
- Eviction: `lctl set_param <target>.evict_client=<uuid>`; reconnect flushes caches.
- Striping Issues: `lfs getstripe`; `lfs find --component-flags=^init`; `lfs_migrate`.
- File/directory encryption: `fscrypt unlock`; add key with `fscrypt setup`.
- HSM Restore: `lfs hsm_restore`; check `mdt.*.hsm.actions`.
- PCC: `lfs pcc detach`; auto-attach with `-k`.
- Nodemap: Remount trusted; `lctl nodemap_modify`.
- SSK: Verify keys with `lgss_sk -r`; check nodemap.
- Quota: `lfs quota`; migrate or setquota.
- Read Cache: Disable with `llite.*.max_read_ahead_mb=0`.
- Lock Pressure: Clear LRU with `ldlm.namespaces.*.lru_size=clear`; set `lru_max_age=900`.
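
For the mount-failure case, the firewall check on port 988 can be scripted from the client. The sketch below attempts a plain TCP connection and reports whether it succeeded; the helper name `port_reachable` is hypothetical. It applies only to TCP-based LNet networks, and a successful connection shows that the port is not filtered, not that LNet itself is healthy.

```python
import socket

LUSTRE_PORT = 988  # port named in the mount-failure check above


def port_reachable(host: str, port: int = LUSTRE_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A call such as `port_reachable("mgs.example.com")` (a placeholder hostname) returning False points at a firewall, hosts.deny, or DNS problem rather than at Lustre itself.
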
Server Troubleshooting
- OST Missing: Deactivate with `lctl --device <num> deactivate`; salvage with LFSCK; reactivate.
- Read-Only OST: Fix storage; `e2fsck -f`; remount.
- Degraded RAID: Set `obdfilter.OST_name.degraded=1/0`.
- Inactive OST/MDT: Deactivate via MGS with `lctl conf_param ost_name.osc.active=0`; migrate first.
- Full OST: Disable creates with `osp.<fs>-OST<idx>.max_create_count=0`; `lfs_migrate`.
- MDT Read-Only: Set `mdt.fs-MDT0000.readonly=1/0`.
- Changelog Overload: `lfs changelog_clear`; limit `mdd.*.changelog_mask`.
- LNet Misconfig: `lnetctl lnet unconfigure`; delete net/peer; import YAML.
- Router Health: Enable `auto_down=1`, `check_routers_before_use=1`; set intervals.
- Backup/Restore: Unmount; `dd` or `tar --xattrs`; `mkfs.lustre`; abort_recov for OI scrub.
- Quota Config: Set global then pool limits with `lfs setquota -u USER -B BIG_LIMIT`.
- HSM Control: Set `mdt.*.hsm_control=disabled`; purge if needed.
- DNE3 Issues (2025+): For small files/metadata overload, enable DNE3 auto-balancing; check MDT0 load.
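
Before disabling creates on a full OST or migrating data, it helps to see how unevenly the OSTs are filled. The sketch below reads plain `lfs df` output (1K blocks, not `-h`) and computes the used fraction per OST. It assumes each OST line ends in a `[OST:n]` tag after the mount point; the function names are hypothetical, and the spread it reports is a simple max-minus-min figure, not the calculation Lustre's own allocator performs.

```python
import re

OST_TAG = re.compile(r"\[OST:(\d+)\]$")


def ost_fill_levels(lfs_df_output: str) -> dict:
    """Map OST index -> used fraction, from plain `lfs df` output (1K blocks)."""
    levels = {}
    for line in lfs_df_output.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue
        tag = OST_TAG.search(fields[-1])
        if tag is None:
            continue  # header, MDT, summary, or inactive-device line
        used, available = int(fields[2]), int(fields[3])
        if used + available == 0:
            continue
        levels[int(tag.group(1))] = used / (used + available)
    return levels


def fill_spread(levels: dict) -> float:
    """Difference between the fullest and emptiest OST, as a fraction."""
    if not levels:
        return 0.0
    return max(levels.values()) - min(levels.values())
```

OSTs near the top of `ost_fill_levels` are the candidates for `max_create_count=0` followed by `lfs_migrate`; a large `fill_spread` suggests rebalancing before any single OST runs out of space.
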
For more detail, see the "Error Numbers" section of the Lustre Operations Manual.