
More reliable server bootstrapping at scale

Open sandrain opened this issue 4 years ago • 2 comments

System information

Summit

Describe the problem you're observing

unifyfsd bootstrapping (i.e., establishing connections between peer servers) often fails when a job uses a large number of compute nodes, e.g., 500+ servers.

unifyfsd relies on PMI2 or PMIx to acquire job allocation information and for group services such as fence/barrier. However, both options seem unreliable when a job is launched on a large number of nodes (at least on Summit, in my experience). As a last resort, unifyfsd can bootstrap with its peers through a shared file system, without help from external libraries. That path is not fully dependable either: the shared file system (e.g., GPFS) sometimes misbehaves, for example with sudden slowdowns of operations. I think we need to investigate whether we can make the bootstrapping process more reliable.
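To make the discussion concrete, here is a minimal sketch of how a shared-file-system rendezvous could tolerate slow file system operations: each server publishes its address with an atomic rename, then polls for its peers with exponential backoff until a deadline. This is not the unifyfsd implementation. The function names, the `<rank>.addr` file naming, and the timeout values are all hypothetical, and Python is used only to keep the example short.

```python
import os
import time


def publish_address(rendezvous_dir, rank, address):
    """Publish this server's address as <rank>.addr (hypothetical naming)."""
    tmp_path = os.path.join(rendezvous_dir, f".{rank}.addr.tmp")
    final_path = os.path.join(rendezvous_dir, f"{rank}.addr")
    with open(tmp_path, "w") as f:
        f.write(address)
        f.flush()
        os.fsync(f.fileno())
    # Rename is atomic, so a peer never reads a partially written file.
    os.replace(tmp_path, final_path)


def wait_for_peers(rendezvous_dir, num_servers, timeout_s=60.0,
                   initial_delay_s=0.05, max_delay_s=2.0):
    """Poll until every rank has published, backing off between scans."""
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    addresses = {}
    while True:
        for rank in range(num_servers):
            if rank in addresses:
                continue  # already collected, do not re-read
            path = os.path.join(rendezvous_dir, f"{rank}.addr")
            try:
                with open(path) as f:
                    addresses[rank] = f.read()
            except FileNotFoundError:
                pass  # peer has not published yet
        if len(addresses) == num_servers:
            return addresses
        if time.monotonic() >= deadline:
            missing = sorted(set(range(num_servers)) - set(addresses))
            # Report which ranks are missing instead of failing silently.
            raise TimeoutError(f"missing ranks: {missing}")
        time.sleep(delay)
        delay = min(delay * 2, max_delay_s)  # exponential backoff, capped
```

Reporting the missing ranks on timeout would at least show whether a failure at scale comes from a few slow servers or from the file system as a whole.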

Describe how to reproduce the problem

Run unifyfsd with a large number of servers, e.g., 500+.

Include any warnings, errors, or relevant debugging data

N/A

sandrain avatar Nov 09 '20 18:11 sandrain

Just another data point: recent runs on Summit with PMIx still start failing to bootstrap at around 512 nodes/servers. It does not fail every time, but it is still problematic for production use with large-scale jobs.

MichaelBrim avatar Mar 26 '21 18:03 MichaelBrim

Adding another data point. On Frontier, with 628 nodes and 8 ranks/node, bootstrap seems to time out consistently at the unifyfs_invoke_broadcast_bootstrap_complete() call.

wangvsa avatar Apr 03 '24 16:04 wangvsa