UnifyFS More reliable server bootstrapping at scale

Open sandrain opened this issue 4 years ago • 2 comments

System information

Summit

Describe the problem you're observing

The unifyfsd bootstrapping (i.e., establishing connections between peers) often fails when a large number of compute nodes are used, e.g., 500+ servers.

unifyfsd relies on PMI2 or PMIx to obtain job allocation information and group services such as fence/barrier. However, both seem unreliable when a job is launched on a large number of nodes (at least on Summit, in my experience). As a last resort, unifyfsd can bootstrap with its peers through a shared file system, without help from external libraries. However, the shared file system (e.g., GPFS) also misbehaves at times (e.g., operations suddenly slow down). I think we need to investigate whether we can make the bootstrapping process more reliable.
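To make the shared file system fallback concrete, here is a minimal sketch of a file-based rendezvous that tolerates slow file system operations by polling with exponential backoff under an overall deadline. This is not the unifyfsd implementation: the function names `publish_address` and `wait_for_peers`, the `<rank>.addr` file layout, and all timeout values are hypothetical, chosen only to illustrate the idea.

```python
import os
import random
import tempfile
import time


def publish_address(rendezvous_dir, rank, address):
    """Publish this server's address as <rank>.addr (hypothetical layout)."""
    os.makedirs(rendezvous_dir, exist_ok=True)
    final_path = os.path.join(rendezvous_dir, f"{rank}.addr")
    # Write to a temporary file and rename it into place, so that a
    # reader never observes a partially written address file.
    fd, tmp_path = tempfile.mkstemp(dir=rendezvous_dir, prefix=f".{rank}.")
    with os.fdopen(fd, "w") as f:
        f.write(address)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, final_path)


def wait_for_peers(rendezvous_dir, num_servers, timeout_s=60.0,
                   base_delay_s=0.05, max_delay_s=5.0):
    """Poll until every rank has published, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    delay = base_delay_s
    peers = {}
    while True:
        # Only look for ranks that have not been seen yet.
        for rank in range(num_servers):
            if rank in peers:
                continue
            path = os.path.join(rendezvous_dir, f"{rank}.addr")
            try:
                with open(path) as f:
                    peers[rank] = f.read()
            except FileNotFoundError:
                pass
        if len(peers) == num_servers:
            return peers
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            missing = sorted(set(range(num_servers)) - set(peers))
            raise TimeoutError(f"missing peer ranks: {missing}")
        # Back off with jitter so hundreds of servers do not poll the
        # shared file system in lockstep.
        time.sleep(min(remaining, delay * random.uniform(0.5, 1.0)))
        delay = min(delay * 2, max_delay_s)
```

On timeout, the error names the ranks that never published, which would at least show whether a failure at 500+ servers comes from a few stragglers or from a general slowdown. Whether rename-based publishing behaves well on GPFS under load is an assumption that would need testing.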

Describe how to reproduce the problem

Run unifyfsd with a large number of servers, e.g., 500+ servers.

Include any warnings or errors or relevant debugging data

N/A

Nov 09 '20 18:11 sandrain

Just another data point here: recent runs on Summit with PMIx still start failing to bootstrap at around 512 nodes/servers. It does not fail every time, but it is still problematic for production use with large-scale jobs.

Mar 26 '21 18:03 MichaelBrim

Adding another data point. On Frontier, with 628 nodes and 8 ranks/node, bootstrap seems to time out consistently at the unifyfs_invoke_broadcast_bootstrap_complete() call.

Apr 03 '24 16:04 wangvsa
