Stopping a Lustre File System Tutorial
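Stopping a Lustre file system cleanly prevents data corruption, allows a clean recovery on the next start, and releases resources properly. This guide unmounts clients first (so dirty data is flushed), then the OSS/OSTs, then the MDS/MDT, with the MGS last. Note that the stopping procedure in the Lustre Operations Manual unmounts the MDT and MGT before the OSTs, so follow your site's documented order if it differs. This guide targets Lustre 2.17.0 (January 2026) and is based on the Lustre Operations Manual (updated 2025). Use it for maintenance, upgrades, or a full shutdown. In production, drive the shutdown through your HA tooling, such as Pacemaker.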
Prerequisites
- All nodes accessible; no active I/O if possible.
- Backup data/metadata before shutdown.
- Run as root; use lctl for diagnostics.
- If HA (e.g., Pacemaker), stop cluster resources first.
Correct Shutdown Order
| Step | Component | Reason |
|---|---|---|
| 1 | Clients | Flush dirty data/locks to servers; prevent eviction. |
| 2 | OSS/OSTs | Commit OST transactions; unmount after clients. |
| 3 | MDS/MDT/MGS | Final commit; MGS last if separate. |
| 4 | Unload Modules | Free resources; optional if rebooting. |
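The ordering in the table can be sketched as a small driver script. This is a minimal sketch, not a production tool: the node names (node1 to node3), passwordless root ssh, and the DRY_RUN switch are all assumptions made for illustration. With DRY_RUN=1 (the default here) it only prints the commands it would run.

```shell
# Hypothetical node names taken from the 3-node example; replace with your own.
CLIENTS="node3"
OSS_NODES="node2"
MDS_NODES="node1"

# DRY_RUN=1 (the default here) only prints what would be run.
DRY_RUN="${DRY_RUN:-1}"

# run_on HOST CMD...: run CMD on HOST over ssh, or print it in dry-run mode.
run_on() {
    target_host="$1"
    shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "[$target_host] $*"
    else
        ssh "root@$target_host" "$*"
    fi
}

# Unmount every Lustre mount on each node, one tier at a time,
# stopping at the first failure so later tiers are left untouched.
stop_lustre() {
    for tier in "$CLIENTS" "$OSS_NODES" "$MDS_NODES"; do
        for host in $tier; do
            run_on "$host" umount -a -t lustre || return 1
        done
    done
}
```

Calling stop_lustre in dry-run mode prints one line per node in shutdown order; setting DRY_RUN=0 would execute the same commands over ssh. Stopping at the first failure is a deliberate choice here, so that servers are never unmounted while a client is still attached.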
Step-by-Step Teardown
Assumes a simple setup (e.g., from 3-node tutorial: Node3 client, Node2 OSS, Node1 MDS).
1. Unmount Clients
# On each client (e.g., Node3)
umount /mnt/testfs
# If busy: Force unmount
umount -f /mnt/testfs
# Verify no mounts
lshowmount -v # From MDS
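On a client with several Lustre mounts, the mount points can be discovered instead of typed by hand. The sketch below assumes that client entries in /proc/mounts have the file system type lustre and a device of the form MGS-NID:/fsname; the function names are hypothetical helpers, not part of Lustre.

```shell
# List Lustre client mount points from a mounts table (default: /proc/mounts).
# Client entries have fstype "lustre" and a device containing ":/",
# which separates them from server targets mounted from block devices.
lustre_client_mounts() {
    awk '$3 == "lustre" && $1 ~ /:\// { print $2 }' "${1:-/proc/mounts}"
}

# Unmount each client mount point; if one is busy, show which processes
# hold it open and stop. Mount points containing spaces are not handled.
unmount_clients() {
    for mnt in $(lustre_client_mounts "$@"); do
        umount "$mnt" || { fuser -vm "$mnt"; return 1; }
    done
}
```

Running lustre_client_mounts on its own is a harmless way to preview what unmount_clients would act on.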
2. Unmount OSS/OSTs
# On OSS (e.g., Node2)
umount /mnt/testfs-ost1
umount /mnt/testfs-ost2
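# If an unmount is stuck, force it (lustre_rmmod cannot help: it only unloads modules and fails while a target is still mounted)
umount -f /mnt/testfs-ost1
# Verify on the OSS (lfs df needs a mounted client, so it cannot be used at this point)
mount -t lustre # Should list no OSTs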
3. Unmount MDS/MDT/MGS
# On MDS (e.g., Node1)
umount /mnt/testfs-mdt0
# Force if needed
umount -f /mnt/testfs-mdt0
4. Unload Modules and Shutdown
# On all nodes
lustre_rmmod # Unloads the Lustre and LNet modules; fails while anything Lustre is still mounted
# Shutdown nodes
shutdown -h now
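Because modules cannot be unloaded while a Lustre mount is still present, it can help to check first. The following is a sketch under the assumption that both client mounts and server targets appear in /proc/mounts with the type lustre; the function names are hypothetical.

```shell
# Print "device mountpoint" for every Lustre entry in a mounts table
# (default: /proc/mounts).
remaining_lustre_mounts() {
    awk '$3 == "lustre" { print $1, $2 }' "${1:-/proc/mounts}"
}

# Succeed only when nothing Lustre is mounted, so modules are free to unload.
safe_to_unload() {
    [ -z "$(remaining_lustre_mounts "$@")" ]
}

# Unload the modules only after the check passes; otherwise report what is left.
unload_lustre() {
    if safe_to_unload "$@"; then
        lustre_rmmod
    else
        echo "still mounted:" >&2
        remaining_lustre_mounts "$@" >&2
        return 1
    fi
}
```

Run unload_lustre on each node before shutdown; a non-zero exit status means at least one mount was left behind and is listed on standard error.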
Best Practices for Friendly Shutdown
- Quiesce I/O: Stop applications; use lljobstat to monitor server activity.
- Flush Caches: sync; echo 3 > /proc/sys/vm/drop_caches on clients.
- Check Recovery: lctl get_param '*.*.recovery_status' on servers; wait for completion.
- HA Integration: Use Pacemaker to stop resources in order.
- Avoid Abrupt Stops: Don't kill services; use umount.
- Post-Shutdown: Run e2fsck on ldiskfs devices if you suspect problems.
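The recovery check can be turned into a simple wait loop on a server. This sketch assumes that each recovery_status entry contains a line beginning with "status:" and that a target still in recovery reports RECOVERING; the helper names and the 10-second default interval are illustrative choices.

```shell
# Reads `lctl get_param` recovery_status output on stdin.
# Succeeds when no target reports "status: RECOVERING".
recovery_idle() {
    ! grep -Eq '^[[:space:]]*status:[[:space:]]*RECOVERING'
}

# Poll every $1 seconds (default 10) until no target is recovering.
wait_for_recovery() {
    while ! lctl get_param '*.*.recovery_status' | recovery_idle; do
        sleep "${1:-10}"
    done
}
```

Calling wait_for_recovery before unmounting a server target avoids interrupting a recovery that is still in progress. The loop has no upper bound, so a real script would likely add a timeout.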
Common Issues
| Issue | Fix |
|---|---|
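| Device Busy | Find the processes holding the mount (fuser -vm /mnt/testfs), stop them, then retry umount; use umount -f as a last resort. |
| Recovery Stuck | Abort recovery on the affected target (lctl --device <device> abort_recovery) or mount it with -o abort_recov; clients that have not finished recovery are evicted. |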
| Modules Stuck | Use lustre_rmmod or reboot. |
To restart, bring the components up in the opposite order: MGS and MDT first, then the OSS/OSTs, then the clients.