Lustre Performance Benchmarking

Lustre is a high-performance, parallel distributed file system commonly used in high-performance computing (HPC) environments for large-scale data storage and I/O operations. It consists of Metadata Servers (MDS), Object Storage Servers (OSS), and clients that access the filesystem. Benchmarking Lustre helps evaluate its performance in terms of I/O bandwidth (data transfer rates), IOPS (Input/Output Operations Per Second), and metadata operations (like file creation and deletion). This guide is designed for users with varying experience levels, including beginners. It covers key tools, detailed usage examples, tuning tips, and metrics for Lustre versions 2.17.0 and 2.15.x (as of January 2026). Always start by benchmarking the raw hardware (e.g., disks and network) to establish baselines—aim for 85-90% efficiency when running through Lustre. Refer to recent updates from the Lustre Operations Manual and discussions from LUG (Lustre User Group) 2025 on advancements like DNE3 (Distributed Namespace Enhancement 3) for automated metadata scaling.
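To make the 85-90% target concrete: efficiency is simply Lustre throughput divided by raw-hardware throughput. A minimal sketch (the helper name and the numbers are illustrative, not from a real run):

```shell
# Hypothetical helper: percentage of the raw-hardware baseline achieved
# through Lustre. Both arguments are throughputs in the same unit (e.g., MB/s).
efficiency() {
  awk -v lustre="$1" -v raw="$2" 'BEGIN { printf "%.1f\n", (lustre / raw) * 100 }'
}

# Example: raw survey measured 10000 MB/s; IOR through Lustre measured 8500 MB/s.
efficiency 8500 10000   # prints 85.0 -- inside the 85-90% target band
```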

Introduction for Beginners

If you're new to Lustre benchmarking, understand that these tests simulate real-world workloads to identify bottlenecks in storage, network, or configuration. Key concepts:

- Bandwidth: sustained data transfer rate, reported in MB/s or GiB/s.
- IOPS: I/O operations per second, which matters most for small, random accesses.
- Metadata rate: file creates, stats, and removes per second.
- Latency: time to complete a single operation; sustained high latency usually signals overload.

Warning: Benchmarking can generate heavy load, potentially causing data loss if not careful. Always back up important data, run tests on non-production mounts, and monitor system health (e.g., CPU, memory, disk space) during runs.

Benchmarking Tools

These tools are essential for testing different aspects of Lustre performance. For beginners, start with simple single-node tests before scaling to multiple clients.

IOR
  Purpose: I/O bandwidth testing (sequential/random reads/writes); simulates large-scale data transfers.
  Availability: clone from github.com/hpc/ior; build with MPI (e.g., ./configure --with-lustre; make).
  Beginner notes: easy to start with default options; focus on bandwidth metrics. Requires MPI for parallel runs.

mdtest
  Purpose: metadata operations (create/stat/remove); tests how quickly the filesystem handles file management.
  Availability: included with IOR; build together.
  Beginner notes: ideal for metadata-heavy workloads like simulations with many small files. Watch the ops/sec rates.

fio
  Purpose: flexible I/O patterns (bandwidth, IOPS, latency); highly customizable for mixed workloads.
  Availability: install via package manager (e.g., dnf install fio on RHEL).
  Beginner notes: beginner-friendly via job files; experiment with block sizes to mimic your application's I/O.

obdfilter-survey
  Purpose: OST performance survey (disk/network); checks individual Object Storage Targets (OSTs).
  Availability: test script shipped with Lustre.
  Beginner notes: use it to isolate slow OSTs; output shows per-thread performance.

sgpdd-survey
  Purpose: raw hardware I/O; benchmarks the underlying disks without Lustre overhead.
  Availability: Lustre utility script in /usr/lib64/lustre/tests/.
  Beginner notes: run this first to set expectations; compare to Lustre results for efficiency.

IO500
  Purpose: runs IOR, mdtest, and find with standard parameters to show the best/worst performance envelope of a system; provides a comprehensive score for ranking storage systems.
  Availability: clone from github.com/IO500/io500; build with MPI (e.g., ./prepare; make).
  Beginner notes: great for standardized comparisons; submit results to the IO500 list for global rankings.

llstat / llobdstat
  Purpose: simple local real-time stats monitoring utilities; track I/O and metadata stats.
  Availability: included with every Lustre installation; use lctl get_param.
  Beginner notes: like 'top' for Lustre; monitor during tests to spot live issues.

jobstats
  Purpose: per-job load monitoring for each client process/job; helps identify resource-hungry jobs.
  Availability: enabled via the MGS; query via lctl.
  Beginner notes: enable globally; useful in shared clusters to enforce fair usage.

lljobstat
  Purpose: local tool to monitor job stats on a server (2.15+); a top-like utility that monitors RPCs sent to the server to isolate high load.
  Availability: built-in for recent versions.
  Beginner notes: run on servers to debug client-induced overloads.

Running Benchmarks

Prerequisites: Mount the Lustre filesystem (e.g., mount -t lustre mgsnode:/fsname /lustre), ensure NTP is synced across nodes for accurate timings, and temporarily disable firewalls/SELinux for tests (restore them afterward). Run benchmarks from multiple clients against multiple servers for aggregate results that simulate real HPC workloads. Clear caches and stats before each run (lctl set_param osc.*.stats=clear mdc.*.stats=clear) to ensure consistent measurements. For beginners, start with small-scale tests (e.g., 1-4 processes) and increase gradually.
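The prerequisite steps can be collected into a short pre-run script for each client. This is a sketch: the mount point and MGS name are the examples used in this guide, and dropping the Linux page cache is an extra step beyond the Lustre stats clear (run as root):

```shell
# Pre-run hygiene on a client node (sketch; adjust names for your site).
mountpoint -q /lustre || mount -t lustre mgsnode:/fsname /lustre   # mount if needed
lctl set_param osc.*.stats=clear mdc.*.stats=clear                 # reset client-side Lustre stats
sync; echo 3 > /proc/sys/vm/drop_caches                            # drop the Linux page cache
```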

Best Practices: Run each test 3-5 times and average results to account for variability. Use dedicated test directories to avoid interfering with other data. Document your environment (Lustre version, hardware specs) for reproducibility.
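The run-several-times advice is easy to automate. Here is a sketch that averages per-run bandwidth figures with awk; the numbers are stand-ins for values you would pull from IOR output, not real measurements:

```shell
# Average repeated benchmark results (illustrative numbers only).
runs="1032.5 998.7 1015.2"   # e.g., MB/s from three IOR write runs
echo "$runs" | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; printf "%.1f MB/s\n", s / NF }'
# prints 1015.5 MB/s
```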

Warnings: High-thread counts can overwhelm systems—monitor temperatures and logs (e.g., dmesg). Avoid running during peak hours in shared environments. If tests fail, check for errors in /var/log/messages or Lustre debug logs (lctl debug_daemon start).

IOR Example (Bandwidth)

IOR is great for testing large, contiguous I/O. For beginners: the -a POSIX API works well with Lustre; -F creates one file per process to avoid lock contention; -C reorders tasks so each process reads data written by a different node, which defeats client-side caching and keeps read numbers honest.

# Install IOR (example on RHEL; ensure MPI is in PATH)
dnf install openmpi-devel git
git clone https://github.com/hpc/ior.git
cd ior
./bootstrap
./configure --with-lustre
make install

# Run a sequential write/read test (adjust -np for the process count; -t 1m: 1 MiB transfer size; -b 1g: 1 GiB block per process; -F: one file per process; -C: reorder tasks so each process reads another node's data, defeating client caches; -e: fsync after writes; -k: keep the file after the test; -vv: verbose; -o: test file path)
mpirun -np 16 ior -a POSIX -t 1m -b 1g -F -C -e -k -vv -o /lustre/testfile

Interpret: Output shows write/read rates in MB/s or GiB/s. Compare to theoretical limits (e.g., network speed * number of clients). If low, check stripe count (lfs getstripe /lustre) or network congestion.
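Because stripe count is the first thing to check when IOR numbers are low, it helps to set the test directory's layout before the run. A sketch (the directory name is hypothetical; -c -1 stripes across all OSTs):

```shell
mkdir -p /lustre/ior_testdir
lfs setstripe -c -1 -S 1M /lustre/ior_testdir   # -c -1: use all OSTs; -S 1M: 1 MiB stripe size
lfs getstripe /lustre/ior_testdir               # confirm the layout before benchmarking
```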

mdtest Example (Metadata)

mdtest focuses on non-data operations. Beginners: Use -u for unique directories per process to reduce contention; -i 3 repeats the test for averaging.

# Run with IOR build (mdtest binary should be in same path)
mpirun -np 32 mdtest -d /lustre/testdir -i 3 -b 10 -z 1 -n 10000 -u -C -T -r

Parameters: -d test directory, -i iterations, -b branching factor of the directory tree, -z tree depth, -n files per process, -u unique working directory per task, -C create phase, -T stat phase, -r remove phase (listing phase flags runs only those phases). Results: ops/sec for each operation—higher is better. If slow, consider DNE for metadata distribution.
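If mdtest shows a single MDT saturating, DNE can spread the test tree across several MDTs. A sketch, assuming the filesystem has at least two MDTs (the directory name is hypothetical):

```shell
lfs mkdir -c 2 /lustre/testdir_striped    # striped directory across 2 MDTs (DNE2)
lfs getdirstripe /lustre/testdir_striped  # verify the metadata layout
```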

fio Example (IOPS)

fio allows custom workloads. For beginners: The job file defines parameters; direct=1 bypasses OS cache for true hardware tests.

# Create fio job file (random read/write; bs=4k for small blocks typical of IOPS tests)
cat <<EOF > lustre.fio
[global]
ioengine=psync  # POSIX sync I/O
direct=1  # Bypass page cache
bs=4k  # Block size
size=1g  # File size per job
numjobs=16  # Threads per client
directory=/lustre
runtime=60  # Run for 60 seconds
group_reporting=1  # Aggregate results

[randrw]
rw=randrw  # Random read/write
rwmixread=50  # 50% reads
EOF

fio lustre.fio

Results: Shows IOPS, latency (in us/ms), bandwidth. Tune bs for your workload (e.g., 4k for databases, 1m for streaming). If latency is high, check queue depths or add more servers.
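For scripted comparisons across runs, fio can emit machine-readable output. This reruns the same job file with JSON output; jq is an extra dependency, and the field paths follow fio's JSON schema:

```shell
fio --output-format=json --output=result.json lustre.fio
jq '.jobs[0].read.iops, .jobs[0].write.iops' result.json   # pull aggregate read/write IOPS
```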

obdfilter-survey Example (OST Survey)

This surveys OST performance. Beginners: Run with low threads first; case=netdisk tests both network and disk.

# Run a network+disk test. Parameters are passed as environment variables:
# rslt=/tmp/survey names the output files, thrlo=1 thrhi=4 sets the thread range
# per target, size=2048 is the total data written per target in MB, and
# case=netdisk exercises the network and disk paths together.
rslt=/tmp/survey thrlo=1 thrhi=4 size=2048 case=netdisk sh /usr/lib64/lustre/tests/obdfilter-survey

Output: Aggregate bandwidth, per-OST min/max rates. Use to detect faulty disks (low min) or network issues (inconsistent rates).

Monitoring During Tests

Monitoring helps correlate benchmarks with system behavior. For beginners: Start with simple commands; enable jobstats only if needed, as it adds slight overhead.

# Jobstats (enable first by setting a job identifier source on the MGS, e.g., lctl set_param -P jobid_var=procname_uid)
lctl get_param mdt.*.job_stats obdfilter.*.job_stats  # per-JobID ops and bytes read/written

# Real-time stats (e.g., OST I/O every 5 seconds)
llstat -i 5 ost.OSS.ost_io

# Clear stats for fresh start
lctl set_param osc.*.stats=clear mdc.*.stats=clear

Tuning Tips

Tuning optimizes Lustre for your hardware and workload. Beginners: Change one parameter at a time and re-benchmark.

Warnings: Over-tuning can lead to instability—e.g., too many threads may cause OOM (Out of Memory) errors. Persist changes with -P flag carefully, as they affect all clients.
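As a starting point, a few commonly adjusted client-side parameters are sketched below. The values are illustrative, not recommendations; test without -P first, since -P persists changes cluster-wide:

```shell
# Common client-side tunables (illustrative values; re-benchmark after each change).
lctl set_param osc.*.max_rpcs_in_flight=16    # more concurrent RPCs per OST connection
lctl set_param osc.*.max_dirty_mb=512         # more dirty data buffered per OSC
lctl set_param llite.*.max_read_ahead_mb=256  # larger client read-ahead window
```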

Key Metrics

These are typical aggregate values for large-scale systems; scale down for smaller setups.

Bandwidth
  Typical aggregate: up to 50 TB/s
  Source: IOR/fio; llobdstat
  Beginner interpretation: high values are good for big data transfers; if low, check striping.

IOPS
  Typical aggregate: up to 225M
  Source: fio/mdtest
  Beginner interpretation: essential for random access; aim for your application's needs (e.g., 10k+ for databases).

Metadata ops
  Typical aggregate: 1M creates/s, 2M stats/s
  Source: mdtest; rpc_stats
  Beginner interpretation: slow metadata can bottleneck workflows; use DNE to distribute it.

RPC latency
  Typical aggregate: monitor timeouts and queue depth
  Source: llstat
  Beginner interpretation: high latency (>1 ms) suggests overload; reduce it with more servers.

Additional Resources and Troubleshooting

For more details:

- The Lustre Operations Manual (benchmarking and tuning chapters).
- LUG (Lustre User Group) proceedings, including the 2025 discussions referenced above.
- The IO500 list for published, comparable results from other systems.
- The IOR/mdtest documentation in the github.com/hpc/ior repository.

If issues arise, enable debug logging (lctl set_param debug=+io), reproduce the problem, and analyze with lctl debug_file. Always update to the latest patch level for bug fixes.