mpirun hangs intermittently
I'm seeing mpirun hanging during startup on our system. Running mpirun in a loop eventually hangs, typically after a few dozen iterations:
for i in $(seq 1 100); do echo $i && mpirun -n 1 hostname; done
The system has dual-socket 64-core AMD EPYC Rome nodes connected through InfiniBand (ConnectX-6). I built Open MPI main with GCC 10.3.0 at the following git revisions:
Open MPI: v2.x-dev-9896-g3bda0109c4
PRRTE: psrvr-v2.0.0rc1-4370-gdf7d17d0a3
PMIx: v1.1.3-3554-g6c9d3dde
My configure line is:
../configure --prefix=$HOME/opt-hawk/openmpi-main-ucx/ --with-ucx=/opt/hlrs/non-spack/mpi/openmpi/ucx/1.12.0/ --disable-man-pages --with-xpmem=$HOME/opt-hawk/xpmem --enable-debug
It appears that the more processes I spawn, the higher the chance that the hang actually occurs. I should also note that if I allocate a single node from PBS the hang does not seem to occur, but if I allocate 8 nodes I can fairly reliably trigger it even when spawning a single process. I'm not sure where to look here or which knobs to turn to get meaningful debug output. Any suggestions are more than welcome :)
Is this system running Slurm?
I'll see if I can reproduce locally.
Never mind, I see the PBS comment. I am seeing this type of behavior on a Slurm system.
Argh, I wasn't on latest main. Updated, problem persists:
Open MPI: v2.x-dev-9961-gc6dca98c71
PRRTE and PMIx are the same as above.
It may be worth trying the latest PRRTE/PMIx main; a fix may have come in since the last submodule update.
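Roughly something like this from the top of your ompi clone (a sketch; it assumes the stock 3rd-party submodule layout, and the upstream default branch name may be master or main these days):
(cd 3rd-party/openpmix && git checkout master && git pull)   # or main, whichever upstream uses
(cd 3rd-party/prrte && git checkout master && git pull)
./autogen.pl   # then re-run your configure line and make install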
Mhh, I'm seeing build-time issues with the current PMIx:
CC prm_tm.lo
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c: In function ‘tm_notify’:
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:54:46: error: unused parameter ‘status’ [-Werror=unused-parameter]
54 | static pmix_status_t tm_notify(pmix_status_t status, const pmix_proc_t *source,
| ~~~~~~~~~~~~~~^~~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:54:73: error: unused parameter ‘source’ [-Werror=unused-parameter]
54 | static pmix_status_t tm_notify(pmix_status_t status, const pmix_proc_t *source,
| ~~~~~~~~~~~~~~~~~~~^~~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:55:50: error: unused parameter ‘range’ [-Werror=unused-parameter]
55 | pmix_data_range_t range, const pmix_info_t info[], size_t ninfo,
| ~~~~~~~~~~~~~~~~~~^~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:55:75: error: unused parameter ‘info’ [-Werror=unused-parameter]
55 | pmix_data_range_t range, const pmix_info_t info[], size_t ninfo,
| ~~~~~~~~~~~~~~~~~~^~~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:55:90: error: unused parameter ‘ninfo’ [-Werror=unused-parameter]
55 | pmix_data_range_t range, const pmix_info_t info[], size_t ninfo,
| ~~~~~~~^~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:56:49: error: unused parameter ‘cbfunc’ [-Werror=unused-parameter]
56 | pmix_op_cbfunc_t cbfunc, void *cbdata)
| ~~~~~~~~~~~~~~~~~^~~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:56:63: error: unused parameter ‘cbdata’ [-Werror=unused-parameter]
56 | pmix_op_cbfunc_t cbfunc, void *cbdata)
|
OK, so I got the latest PMIx and PRRTE to build with CFLAGS=-Wno-unused-parameter. I can still reproduce the problem when starting a single process in a PBS job with 8 nodes; I cannot reproduce it in a job with only one or two nodes allocated.
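For reference, the rebuild was essentially the same configure line as above with the flag added (a sketch; the paths are specific to my install):
../configure CFLAGS=-Wno-unused-parameter --prefix=$HOME/opt-hawk/openmpi-main-ucx/ --with-ucx=/opt/hlrs/non-spack/mpi/openmpi/ucx/1.12.0/ --disable-man-pages --with-xpmem=$HOME/opt-hawk/xpmem --enable-debug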
Funny regression: I can run mpirun -n 1 -N 1 ... but setting -n 1 -N 2 leads to an error:
--------------------------------------------------------------------------
Your job has requested more processes than the ppr for
this topology can support:
App: hostname
Number of procs: 1
Procs mapped: 1
Total number of procs: 2
PPR: 2:node
Please revise the conflict and try again.
--------------------------------------------------------------------------
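To spell out the two invocations (a sketch; hostname as the app, per the output above):
mpirun -n 1 -N 1 hostname   # runs fine
mpirun -n 1 -N 2 hostname   # aborts with the ppr error above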
Seeing the same occasional hang with v5.0.x.
I attached to a prterun that hung but couldn't see anything useful, other than that every thread is waiting for something to happen...
(gdb) thread apply all bt
Thread 4 (Thread 0x1481724a1700 (LWP 141944)):
#0 0x00001481741a929f in select () from /lib64/libc.so.6
#1 0x0000148175063395 in listen_thread (obj=<optimized out>) at ../../../../../../../3rd-party/prrte/src/mca/oob/tcp/oob_tcp_listener.c:602
#2 0x000014817448214a in start_thread () from /lib64/libpthread.so.0
#3 0x00001481741b1dc3 in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x1481726a2700 (LWP 141943)):
#0 0x00001481741a929f in select () from /lib64/libc.so.6
#1 0x0000148174af0522 in listen_thread (obj=<optimized out>) at ../../../../../../3rd-party/openpmix/src/mca/ptl/base/ptl_base_listener.c:167
#2 0x000014817448214a in start_thread () from /lib64/libpthread.so.0
#3 0x00001481741b1dc3 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x1481729b5700 (LWP 141936)):
#0 0x00001481741b20f7 in epoll_wait () from /lib64/libc.so.6
#1 0x0000148174f63993 in epoll_dispatch () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-v5.0.x-ucx/lib/libevent_core-2.1.so.7
#2 0x0000148174f58e48 in event_base_loop () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-v5.0.x-ucx/lib/libevent_core-2.1.so.7
#3 0x00001481749d3f21 in progress_engine (obj=<optimized out>) at ../../../../3rd-party/openpmix/src/runtime/pmix_progress_threads.c:228
#4 0x000014817448214a in start_thread () from /lib64/libpthread.so.0
#5 0x00001481741b1dc3 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x1481735ced80 (LWP 141929)):
#0 0x00001481741b20f7 in epoll_wait () from /lib64/libc.so.6
#1 0x0000148174f63993 in epoll_dispatch () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-v5.0.x-ucx/lib/libevent_core-2.1.so.7
#2 0x0000148174f58e48 in event_base_loop () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-v5.0.x-ucx/lib/libevent_core-2.1.so.7
#3 0x00000000004055f7 in main (argc=<optimized out>, argv=<optimized out>) at ../../../../../../3rd-party/prrte/src/tools/prte/prte.c:732
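For completeness, the attach was along these lines (sketched from memory; the PID is that of the hung prterun on the node where mpirun was started):
pgrep prterun            # find the PID of the hung prterun
gdb -p <PID>             # attach to it
(gdb) thread apply all bt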
I also played around with some of the verbosity MCA parameters (roughly the ones sketched below) but didn't see anything useful. Any ideas on how to debug this further?
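The parameters I tried were along these lines (names from memory, so treat this as a sketch rather than a definitive list):
mpirun --prtemca plm_base_verbose 5 --prtemca odls_base_verbose 5 -n 1 hostname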
@devreal can you reproduce when calling prterun instead of mpirun? and prun? I wonder if it is a cleanup issue where something in /tmp is not getting cleaned up, or isn't getting cleaned up fast enough.
One thing you could try is, after every run, manually removing (in your script) /tmp/prte.$HOSTNAME.* to see if that clears it up.
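Something along these lines in the loop (a sketch; adjust the glob if the session directories are named differently on your system):
for i in $(seq 1 100); do
  echo $i && mpirun -n 1 hostname
  rm -rf /tmp/prte.${HOSTNAME}.*   # clear any leftover PRRTE session directories
done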
can you reproduce when calling prterun instead of mpirun?
Yes, it hangs too.
and prun?
I get the following error:
prun failed to initialize, likely due to no DVM being available
Not sure what to do about that.
One thing you could try is, after every run, manually removing (in your script) /tmp/prte.$HOSTNAME.* to see if that clears it up.
I don't see any prte files in /tmp, neither before nor after a run that hangs.
Is there a way to get extra debug output that might help dig into where the launch gets stuck?
With prun you have to daemonize prte first (prte --daemonize; see the sketch below). However, since you hit it with both mpirun and prterun, I'd say the odds of it not reproducing with prun are remote.
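Roughly (a sketch; check prte --help for the exact option spelling):
prte --daemonize     # start a persistent DVM in the background
prun -n 1 hostname   # launch against the running DVM
pterm                # tear the DVM down when finished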
For what it's worth, I can't reproduce it when using the latest and greatest prte + pmix.
I believe this is fixed by https://github.com/openpmix/prrte/pull/1401 and https://github.com/openpmix/prrte/pull/1403, which should now be in your main branch. Those changes are also in the PMIx v4.2 and PRRTE v3.0 branches, so they should come into OMPI v5 once updated.
@rhc54 provided a fix in https://github.com/openpmix/prrte/pull/1436 and ported it back to the PRTE 3.0 branch in https://github.com/openpmix/prrte/pull/1437. @awlauria can we bump the PRTE pointers for both main and 5.0.x?
This has been fixed in prrte. Thanks @rhc54, closing.