mpich
hang in init with high ppn
MPI_Init hangs on Aurora. The hang reproduces reliably with nodes=700, ppn=96. The backtrace suggests it is stuck in PMIx_Fence, possibly while sharing the shm file handle. Adding back the full PMIx_Fence barrier during init works around the problem.
The top suggestion is to try the latest OpenPMIx release (5) to see whether the issue still reproduces.
Note: until the PMIx_Fence issue is resolved, a workaround is to call PMI_Barrier at init, which prevents the PMIx_Fence leak.
We should retry this with the new code that relies on shm_open rather than PMIx.
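For anyone unfamiliar with the shm_open approach mentioned above, here is a minimal sketch of the general POSIX pattern: one process creates and sizes a named segment, and the others attach to it by name, so no file handle has to be exchanged through PMIx. This is not MPICH's actual code. The helper names `shm_segment_map` and `shm_segment_release` are hypothetical, and how the segment name gets agreed on between ranks is left out. On older glibc versions this may need `-lrt` at link time.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not an MPICH function.
 * create != 0: create a new named segment (fails if it already exists)
 *              and size it to `size` bytes.
 * create == 0: attach to a segment some other process already created.
 * Returns the mapped address, or NULL on any failure. */
void *shm_segment_map(const char *name, size_t size, int create)
{
    int flags = create ? (O_CREAT | O_EXCL | O_RDWR) : O_RDWR;
    int fd = shm_open(name, flags, 0600);
    if (fd < 0)
        return NULL;

    /* Only the creator sizes the segment. */
    if (create && ftruncate(fd, (off_t) size) != 0) {
        close(fd);
        shm_unlink(name);
        return NULL;
    }

    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (addr == MAP_FAILED) {
        if (create)
            shm_unlink(name);
        return NULL;
    }
    return addr;
}

/* Hypothetical helper: unmap the segment, and optionally remove its name.
 * Typically only one process (e.g. the creator) passes unlink_name != 0. */
int shm_segment_release(const char *name, void *addr, size_t size,
                        int unlink_name)
{
    int rc = munmap(addr, size);
    if (unlink_name && shm_unlink(name) != 0)
        rc = -1;
    return rc;
}
```

In this pattern the attaching processes still need some synchronization to know the segment exists before they open it, so a barrier or retry loop would be required in a real implementation.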