
hang in init with high ppn

Open raffenet opened this issue 1 year ago • 3 comments

MPI_Init hangs on Aurora. The hang is reliably reproducible with nodes=700, ppn=96. The backtrace suggests it is stuck in PMIx_Fence, possibly during shm file handle sharing. Adding back the full PMIx_Fence barrier during init works around the problem.

raffenet avatar Oct 16 '24 19:10 raffenet

The top suggestion is to try the latest OpenPMIx release (v5) and see whether the issue still reproduces.

hzhou avatar Oct 17 '24 14:10 hzhou

Note: while the PMIx_Fence issue remains unresolved, a workaround is to do a PMI_Barrier at init, which prevents the PMIx_Fence leak.

hzhou avatar Nov 13 '24 21:11 hzhou

We should retry this with the new code that relies on shm_open rather than PMIx.
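
For readers unfamiliar with the shm_open approach, here is a minimal sketch of how processes on one node can attach to a shared segment through an agreed POSIX name, with no handle exchanged through the process manager. This is not MPICH's actual code: the helpers `seg_map` and `seg_unmap` and the segment name used with them are hypothetical, and real code would also need to decide which rank creates the segment and when the others may open it.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a named POSIX shared memory segment.
 * With create != 0 the segment is created (if missing) and sized;
 * otherwise an existing segment is opened. Returns NULL on failure. */
static void *seg_map(const char *name, size_t size, int create)
{
    assert(name != NULL && size > 0);

    int flags = create ? (O_CREAT | O_RDWR) : O_RDWR;
    int fd = shm_open(name, flags, 0600);
    if (fd < 0)
        return NULL;

    /* A freshly created segment has length 0, so the creator sizes it. */
    if (create && ftruncate(fd, (off_t) size) != 0) {
        close(fd);
        shm_unlink(name);
        return NULL;
    }

    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (p == MAP_FAILED) {
        if (create)
            shm_unlink(name);
        return NULL;
    }
    return p;
}

/* Hypothetical helper: unmap the segment; one process (here, whoever
 * passes unlink_name != 0) also removes the name. Returns 0 on success. */
static int seg_unmap(void *p, size_t size, const char *name, int unlink_name)
{
    int rc = munmap(p, size);
    if (unlink_name && shm_unlink(name) != 0)
        rc = -1;
    return rc;
}
```

In this sketch one process would call `seg_map(name, size, 1)` and the remaining local processes `seg_map(name, size, 0)` with the same name. The non-creating processes still need some way to know the segment exists before opening it, which the sketch leaves out.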

raffenet avatar Aug 27 '25 19:08 raffenet