Lustre High Availability
Lustre achieves high availability (HA) through mechanisms like failover, multi-rail networking, file-level redundancy (FLR), and integration with clustering tools such as Pacemaker. High availability ensures that the filesystem remains accessible and operational even if components fail, minimizing downtime in critical environments like high-performance computing (HPC) clusters. This guide covers core concepts, setup procedures, best practices, commands, and updates for Lustre 2.17.0 (released in December 2025, current as of January 2026), based on the Lustre Operations Manual (updated 2025). For beginners: Lustre HA is essential in large-scale systems where a single failure could halt operations for thousands of nodes. Always plan HA during initial deployment to avoid retrofitting challenges.
Warnings:
- HA configurations involve shared storage and failover; improper setup can lead to data corruption or loss. Always test in a non-production environment.
- Fencing (STONITH) is critical to prevent split-brain scenarios where both nodes think they own the resource, potentially causing filesystem damage.
- Lustre does not provide built-in data replication across sites; rely on underlying storage RAID for disk failures.
- Changes to configurations (e.g., adding failovers) require unmounting and remounting targets, causing brief downtime.
- Ensure all nodes have synchronized clocks (NTP) to avoid recovery issues from timestamp mismatches.
Additional Best Practices:
- Monitor system health regularly using tools like lnetctl, lfs df, and Prometheus exporters for Lustre metrics.
- Implement regular backups of metadata (MDT) using tools like lfs snapshot or external backups.
- Use virtual IPs for service nodes to simplify client configurations during failovers.
- Test failover scenarios periodically to ensure quick recovery times (aim for under 1 minute).
- Document your HA setup, including node NIDs, storage mappings, and recovery procedures.
Core Concepts
Understanding these concepts is key for beginners: Lustre separates metadata (file names, permissions) from data (file contents), allowing independent scaling and failover. HA focuses on redundancy at each layer to eliminate single points of failure.
| Concept | Description |
|---|---|
| Failover | Automatic switch to a backup component (e.g., MDT or OST) when the primary fails. This eliminates single points of failure (SPOF). Supports active/passive (one active, one idle) and active/active (both serving load) modes. Requires shared storage like SAN or RAID arrays. For beginners: Think of it like a backup generator that kicks in during a power outage. |
| Failover Configurations | - Active/Passive: One node active, the other on standby. Simple but underutilizes hardware. - Active/Active: Both nodes serve traffic, with resources split (e.g., half OSTs each). Maximizes utilization but requires careful load balancing. - Failout Mode: Returns errors (EIO) instead of blocking I/O during failures—use sparingly as it can disrupt applications. - Failback: Automatically or manually returns control to the primary after recovery. |
| Failover for Components | - MDT Failover: Multiple Metadata Servers (MDS) share MDT storage. Typically active/passive, but active/active possible with different MDTs per MDS. - OST Failover: Multiple Object Storage Servers (OSS) share OST storage. Often active/active with OSTs distributed. - Failover NIDs: Use --servicenode for primary + backups or --failnode for backups only. Ensures clients know where to reconnect. |
| Multi-Rail | Uses multiple network interfaces per node for redundancy and bandwidth. Supports hardware (e.g., InfiniBand RDMA) and software (LNet since 2.10). Provides fault tolerance via routing and health checks. Beginners: Like having multiple lanes on a highway—if one clogs, traffic reroutes. |
| FLR (File Level Redundancy) | Introduced in 2.11, mirrors files across OSTs for data protection and faster reads. Primary mirror updates first; others resync later. Great for critical data. Warning: Resync can consume bandwidth—schedule during low-usage periods. |
| MMP (Multiple-Mount Protection) | Prevents corruption from simultaneous mounts on shared storage. Uses sequence numbers and delays mounts by 10 seconds. Enable on all HA targets. Essential for shared block devices. |
| LNet HA | Supports diverse networks (TCP, InfiniBand). Features routing, dynamic discovery (2.10+), asymmetrical routes (2.13), and health monitoring (2.12+). Ensures network resilience. |
| LNet Health | Scores interfaces (0-1000). Drops on failures (e.g., resends); recovers via pings. Types include local/remote resends. Tune for sensitive environments. |
| Integration with Pacemaker | Use with Corosync for cluster management. Handles detection, fencing (STONITH via IPMI/BMC), and failover. Requires PowerMan for power control. Beginners: Pacemaker is like an orchestra conductor ensuring all parts play in harmony. |
| Imperative Recovery (IR) | Clients drive recovery after failures. MGS notifies status. Tune with ir_factor and recovery timeouts for faster resumption. |
| Version-Based Recovery (VBR) | Uses inode versions to bridge replay gaps, reducing evictions. Improves stability in flaky networks. |
| Commit on Share (COS) | Commits transactions to disk to avoid chain evictions. Enable via mdt.*.commit_on_sharing. |
| Client Eviction | Removes slow clients to protect servers. Invalidates locks; triggered by failed pings. |
| Metadata Replay | Replays missed requests post-failover using XIDs and transaction numbers. |
| Reply Reconstruction | Rebuilds lost replies from XID, transaction, and results. |
| DNE (Distributed Namespace Environment) | Spreads metadata across MDTs. DNE1 (remote dirs), DNE2 (striped dirs, 2.8+). Scales metadata performance. |
| PFL (Progressive File Layout) | Composite layouts with delayed instantiation (2.10). Adapts striping as files grow. |
| SEL (Self-Extending Layout) | Extends PFL (2.13). Auto-swaps components on space issues. Policies: extend, spillover, etc. |
| DoM (Data on MDT) | Stores small files on MDT (2.12). Use -L mdt; limited by dom_stripesize (1MB default). |
| LSoM (Lazy Size on MDT) | Lazily updates file sizes on MDT (2.13). Resync via open/close or background. |
| OST Pools | Groups OSTs (e.g., by type or location). Used in FLR for domains. Create with lctl pool_new/add. |
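As a concrete (and deliberately hedged) illustration of the Pacemaker integration described above, the commands below sketch an active/passive OST resource with IPMI fencing using `pcs`. The node names, device path, IPMI credentials, and the `ocf:lustre:Lustre` resource agent parameters are assumptions — verify against the resource agents shipped with your Lustre packages before use.

```shell
# Hypothetical Pacemaker setup for one failover OST pair (node names,
# device, and IPMI details are placeholders -- adjust for your site).

# Fencing first: STONITH via IPMI prevents split-brain (see warnings above).
pcs stonith create fence-oss1 fence_ipmilan \
    ip=10.0.0.101 username=admin password=secret \
    pcmk_host_list=oss1

# The Lustre resource agent mounts the target on whichever node
# currently owns the resource.
pcs resource create ost0 ocf:lustre:Lustre \
    target=/dev/mapper/ost0 mountpoint=/mnt/ost0

# Prefer oss1; Pacemaker moves the mount to oss2 if oss1 is fenced.
pcs constraint location ost0 prefers oss1=100
```

Keep fencing resources and Lustre resources in the same cluster configuration so a node is always fenced before its targets are remounted elsewhere.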
Setup
Setting up HA requires careful planning. Start with hardware: Ensure shared storage is reliable. Install Lustre servers first (see installation guides). Use root access for commands.
Failover Setup
Hardware requirements: Shared storage via SAN or RAID. Use RAID 1/10 for MDT (metadata-critical), RAID 5/6 for OST (data). Avoid running clients on MDS/OSS nodes to prevent resource contention. HA software like Pacemaker handles node failures via fencing.
# Format with failover
# Explanation: Specifies service NIDs for primary and failover nodes during formatting.
mkfs.lustre --servicenode=nid1,nid2 --ost ... /dev/sdX
# Update existing target
# Explanation: Adds or modifies failover NIDs on an unmounted device.
tunefs.lustre --servicenode=nid1,nid2 /dev/sdX
# Mount a client with multiple MGS NIDs
# Explanation: Listing all MGS NIDs (colon-separated) lets clients fail over to the backup automatically.
mount -t lustre nid1@tcp:nid2@tcp:/fs /mnt/lustre
# Enable failout mode (rare)
# Explanation: Returns errors instead of blocking—use only if applications handle EIO gracefully.
mkfs.lustre --param=failover.mode=failout --ost ... /dev/sdX
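To make the client-side failover mount concrete, here is a small illustrative helper (NIDs and fsname are placeholders) that assembles the mount source string from a list of MGS NIDs, matching the colon-separated syntax used above:

```shell
#!/bin/sh
# Build a Lustre client mount source from failover MGS NIDs.
# Placeholders throughout: substitute your own NIDs and fsname.
build_mount_source() {
    fsname=$1; shift
    src=""
    for nid in "$@"; do
        # Failover NIDs are joined with ':' in the mount device string.
        [ -n "$src" ] && src="$src:"
        src="$src$nid"
    done
    printf '%s:/%s\n' "$src" "$fsname"
}

# Example: two MGS nodes, primary listed first.
build_mount_source testfs 10.0.0.1@tcp 10.0.0.2@tcp
# -> 10.0.0.1@tcp:10.0.0.2@tcp:/testfs
# The full client mount command would then be:
#   mount -t lustre 10.0.0.1@tcp:10.0.0.2@tcp:/testfs /mnt/lustre
```

Listing the primary NID first matters: clients try NIDs in order, so put the node that normally serves the target at the front.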
Multi-Rail Setup
Configure multiple interfaces for redundancy. Beginners: LNet is Lustre's networking layer; multi-rail bonds them logically.
# LNet Configuration (Lustre 2.7+)
# Explanation: Loads LNet and adds networks with multiple interfaces.
modprobe lnet
lnetctl lnet configure
lnetctl net add --net tcp0 --if eth0,eth1
lnetctl peer add --prim_nid nid --nid nid1,nid2
lnetctl route add --net tcp2 --gateway nid
# YAML Import/Export
# Explanation: Use YAML files for consistent configs across nodes.
lnetctl import config.yaml
lnetctl export config.yaml
# Dynamic Discovery (2.11+)
# Explanation: Automatically discovers peer interfaces.
lnetctl set discovery 1
lnetctl discover <peer_nid>
# Asymmetrical Routes (2.13)
# Explanation: Drops asymmetric routes for security.
lnetctl set drop_asym_route 1
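After configuring multi-rail, verify that every local network interface (NI) is actually up. The snippet below filters `lnetctl net show` output down to "nid status" pairs; the heredoc stands in for live output (its shape is abbreviated and illustrative), so on a real node pipe the command itself into the awk filter instead.

```shell
# Quick multi-rail sanity check (illustrative sample output).
sample_output() {
cat <<'EOF'
net:
    - net type: tcp
      local NI(s):
        - nid: 10.0.0.1@tcp
          status: up
        - nid: 10.0.1.1@tcp
          status: down
EOF
}

# Print "nid status" pairs so a down interface stands out.
sample_output | awk '
    /nid:/    { nid=$3 }
    /status:/ { print nid, $2 }'
# On a live system:  lnetctl net show | awk '...'
```

A down NI here usually means a cabling, driver, or `lnetctl net add` interface-name problem; fix it before trusting the redundancy.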
FLR Setup
File mirroring for redundancy. Place mirrors in separate failure domains (e.g., different OSSs).
# Create Mirrored File
# Explanation: Creates a file with three mirrors total: two on pool flash (4M stripe size, 2 stripes each), plus one spanning all OSTs (-c -1) in pool archive.
lfs mirror create -N 2 -S 4M -c 2 -p flash -N -c -1 -p archive /mnt/lustre/file1
# Mirror Operations
# Explanation: Extend adds mirrors; split removes; resync updates stale ones; verify checks integrity.
lfs mirror extend -N 2 /mnt/lustre/file1
lfs mirror split --mirror-id 1 -d /mnt/lustre/file1
lfs mirror resync /mnt/lustre/file1
lfs mirror verify -v --only 1 /mnt/lustre/file1
lfs find dir --mirror-count +1
# OST Pools
# Explanation: Groups OSTs for targeted placement.
lctl pool_new testfs.pool1
lctl pool_add testfs.pool1 OST[0-10]
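To confirm a file actually has the mirrors you requested, inspect its composite layout. `lfs getstripe -v` prints one `lcme_mirror_id` field per component; the heredoc below mimics that output shape (abbreviated and illustrative) for a two-mirror file, and the awk one-liner counts distinct mirror IDs:

```shell
# Count mirrors in a composite layout (illustrative sample output).
sample_layout() {
cat <<'EOF'
  lcme_id:             65537
  lcme_mirror_id:      1
  lcme_flags:          init
  lcme_id:             131074
  lcme_mirror_id:      2
  lcme_flags:          init
EOF
}

# Distinct mirror IDs = number of mirrors.
sample_layout | awk '/lcme_mirror_id/ { if (!seen[$2]++) n++ } END { print n }'
# On a live system:  lfs getstripe -v /mnt/lustre/file1 | awk '...'
```

Checking `lcme_flags` in the same output is also useful: a `stale` flag on a component means that mirror needs `lfs mirror resync` before it can serve reads.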
MMP Setup
Protects against double-mounts. Always enable on shared devices.
# Enable
# Explanation: Adds MMP feature to the filesystem.
tune2fs -O mmp /dev/sdX
# Disable
# Explanation: Removes MMP (not recommended for HA).
tune2fs -O ^mmp /dev/sdX
# Check
# Explanation: Verifies MMP status.
e2mmpstatus /dev/sdX
LNet Health Setup
Tune for proactive failure detection.
lnetctl set health_sensitivity 100
lnetctl set recovery_interval 1
lnetctl set transaction_timeout 30
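The health bookkeeping behind these knobs is simple arithmetic: each interface starts at the maximum health value (1000), each failure subtracts health_sensitivity, and successful recovery pings restore the score over time. A quick sketch of the drop side (the exact recovery increment is version-dependent, so it is omitted):

```shell
# With health_sensitivity=100, three consecutive failures on an interface:
max_health=1000
sensitivity=100
failures=3
echo $((max_health - sensitivity * failures))
# -> 700; LNet prefers healthier interfaces for traffic until
#    recovery pings restore this one's score.
```

Raising health_sensitivity makes LNet abandon a flaky interface faster; lowering it tolerates transient errors. Tune it against how noisy your fabric actually is.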
Best Practices
These expand on core recommendations for reliable operations.
| Area | Recommendations |
|---|---|
| Storage | Use RAID 6 for OSTs with hot spares; monitor via mdadm. Back up MDT metadata weekly. Overprovision space by 20% for overhead. |
| HA | Deploy UPS for power stability. Use Pacemaker with fencing. Disable writeback cache on HBAs. Allocate ample RAM for journals (MDS: 128GB+, OSS: 64GB+). |
| Network | Dedicate NICs to Lustre; bond for redundancy. Use NTP. Isolate from client access if possible. |
| Multi-Rail | Prefer RDMA. Order ip2nets rules carefully. Enable auto_down and router checks. Test with lnetctl ping. |
| FLR | Mirrors in different domains (OSTs, OSSs, racks). Use 'prefer' for faster media. Schedule resyncs off-peak. |
| MMP | Always enable on HA targets to prevent corruption. |
| LNet | Use IPs over hostnames. Minimize lustre.conf comments. Validate configs with lnetctl show. |
| General | Overprovision MDT inodes/space. Monitor with lfs df -i. Integrate with monitoring tools like Nagios. |
Commands
Failover & Configuration
# Mark as degraded (testing)
lctl set_param obdfilter.*.degraded=1
LNet Management (lnetctl)
| Function | Command |
|---|---|
| Networks | lnetctl net add/del/show --net tcp0 --if eth0 |
| Peers | lnetctl peer add/del/show --prim_nid nid --nid nid1,nid2 |
| Routes | lnetctl route add/del/show --net tcp2 --gateway nid |
| Routing | lnetctl set routing 1 |
| Health | lnetctl show health |
| Discovery | lnetctl set discovery 1 |
| Asym Routes | lnetctl set drop_asym_route 1 |
| YAML | lnetctl import/export file.yaml |
| Stats | lnetctl show_stats |
FLR Operations
lfs mirror create -N 2 file
lfs mirror extend -N --pool hdd file
lfs mirror split --mirror-id 1 file
lfs mirror resync file
lfs mirror verify file
lfs find dir --mirror-count +1
Recent Updates (Lustre 2.8–2.17+)
Lustre evolves rapidly; check release notes for your version.
| Version | Key HA Features |
|---|---|
| 2.8 | Striped Directories (DNE2), Multiple reply data per client. |
| 2.9 | llapi_path2fid, llapi_ladvise, recovery_time_soft/hard. |
| 2.10 | Software multi-rail, YAML config, dynamic discovery, PFL, FLR intro. |
| 2.11 | FLR operations, lfs mirror, llsom_sync, multi-rail routing, default quotas. |
| 2.12 | LNet health monitoring, DoM, dom_stripesize, tab completion for lctl. |
| 2.13 | Asymmetrical routes, SEL, LSoM, pool quotas. |
| 2.14 | Jobstats, default dir striping, pool quotas interoperability, client encryption. |
| 2.15 | Nodemap project IDs, lljobstat. |
| 2.16 | Adaptive timeouts, del_ost, session-based JobID, lod.*.max_mdt_stripecount. |
| 2.17 | Nodemap ID offset ranges. |
Limitations & Dependencies
- No Built-in Data Redundancy: Relies on backing storage (RAID). OST failover does not protect disk failures—use hardware RAID.
- Third-Party HA: Lustre provides filesystem-level failover; external tools (Pacemaker) handle node/system-level. Ensure compatibility with your distro (e.g., RHEL, SLES).
Additional Tips
For troubleshooting: Enable debug logging with lctl set_param debug=+ha. Join Lustre mailing lists for community support. Consider training or certification for complex setups. In cloud environments, use managed services for shared storage if available. Always benchmark post-setup with tools like IOR or mdtest to verify performance under HA.
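As a starting point for the post-setup benchmarking mentioned above, typical IOR and mdtest invocations look like the following. Process counts, transfer and block sizes, and paths are placeholders to tune for your cluster size:

```shell
# IOR: write then read 1 GiB per process in 1 MiB transfers, one file
# per process (-F), to exercise OST striping and bandwidth under HA.
mpirun -np 16 ior -w -r -t 1m -b 1g -F -o /mnt/lustre/ior_test

# mdtest: create/stat/remove many small files per process to stress
# the MDT(s) and any DNE striped-directory configuration.
mpirun -np 16 mdtest -n 1000 -d /mnt/lustre/mdtest_dir
```

Run the same benchmarks again after a forced failover (e.g., power off one OSS and let Pacemaker move its OSTs) so you know the degraded-mode performance your users will actually see.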