
mpirun hangs intermittently

devreal opened this issue 3 years ago · 14 comments

I'm seeing mpirun hanging during startup on our system. Running mpirun in a loop eventually hangs, typically after a few dozen iterations:

for i in $(seq 1 100 ); do echo $i &&  mpirun -n 1 hostname ; done

The system has dual-socket 64-core AMD Epyc Rome nodes connected through InfiniBand (ConnectX-6). I built Open MPI main with GCC 10.3.0 at the following git revisions:

Open MPI: v2.x-dev-9896-g3bda0109c4
PRRTE: psrvr-v2.0.0rc1-4370-gdf7d17d0a3
PMIx: v1.1.3-3554-g6c9d3dde

My configure line is:

../configure --prefix=$HOME/opt-hawk/openmpi-main-ucx/ --with-ucx=/opt/hlrs/non-spack/mpi/openmpi/ucx/1.12.0/ --disable-man-pages --with-xpmem=$HOME/opt-hawk/xpmem --enable-debug

It appears that the more processes I spawn, the higher the chance that the hang actually occurs. I should also note that if I allocate a single node from PBS the hang does not seem to occur, but if I allocate 8 nodes I can fairly reliably trigger the hang even when spawning a single process. I'm not sure where to look here or which knobs to turn in order to get meaningful debug output. Any suggestions are more than welcome :)
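For reference, this is roughly how I reproduce it from an interactive PBS job (the select syntax below is an assumption for this system):

qsub -I -l select=8 -l walltime=00:20:00
# inside the interactive job:
for i in $(seq 1 100); do echo $i && mpirun -n 1 hostname; done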

devreal avatar Jul 19 '22 15:07 devreal

Is this system running slurm?

hppritcha avatar Jul 19 '22 16:07 hppritcha

I'll see if I can reproduce locally.

awlauria avatar Jul 19 '22 16:07 awlauria

Never mind, I see the PBS comment. I am seeing this type of behavior on a Slurm system.

hppritcha avatar Jul 19 '22 16:07 hppritcha

Argh, I wasn't on latest main. Updated, problem persists:

Open MPI: v2.x-dev-9961-gc6dca98c71
PRRTE and PMIx are the same as above.

devreal avatar Jul 19 '22 16:07 devreal

It may be worth trying the latest PRRTE/PMIx main; a fix may have come in since the last submodule update.
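If it helps, one way to pull in the latest submodule heads for a test build (a sketch; assumes you build the 3rd-party/openpmix and 3rd-party/prrte submodules that ship in the ompi tree):

cd ompi
git submodule update --init --recursive
(cd 3rd-party/openpmix && git checkout master && git pull)
(cd 3rd-party/prrte && git checkout master && git pull)
# then re-run autogen.pl / configure / make as usual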

awlauria avatar Jul 19 '22 17:07 awlauria

Mhh, I'm seeing build-time issues with the current PMIx:

  CC       prm_tm.lo
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c: In function ‘tm_notify’:
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:54:46: error: unused parameter ‘status’ [-Werror=unused-parameter]
   54 | static pmix_status_t tm_notify(pmix_status_t status, const pmix_proc_t *source,
      |                                ~~~~~~~~~~~~~~^~~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:54:73: error: unused parameter ‘source’ [-Werror=unused-parameter]
   54 | static pmix_status_t tm_notify(pmix_status_t status, const pmix_proc_t *source,
      |                                                      ~~~~~~~~~~~~~~~~~~~^~~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:55:50: error: unused parameter ‘range’ [-Werror=unused-parameter]
   55 |                                pmix_data_range_t range, const pmix_info_t info[], size_t ninfo,
      |                                ~~~~~~~~~~~~~~~~~~^~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:55:75: error: unused parameter ‘info’ [-Werror=unused-parameter]
   55 |                                pmix_data_range_t range, const pmix_info_t info[], size_t ninfo,
      |                                                         ~~~~~~~~~~~~~~~~~~^~~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:55:90: error: unused parameter ‘ninfo’ [-Werror=unused-parameter]
   55 |                                pmix_data_range_t range, const pmix_info_t info[], size_t ninfo,
      |                                                                                   ~~~~~~~^~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:56:49: error: unused parameter ‘cbfunc’ [-Werror=unused-parameter]
   56 |                                pmix_op_cbfunc_t cbfunc, void *cbdata)
      |                                ~~~~~~~~~~~~~~~~~^~~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:56:63: error: unused parameter ‘cbdata’ [-Werror=unused-parameter]
   56 |                                pmix_op_cbfunc_t cbfunc, void *cbdata)
      |                                 

devreal avatar Jul 19 '22 18:07 devreal

OK, so I got the latest PMIx and PRRTE to build with CFLAGS=-Wno-unused-parameter. I can still reproduce the problem when starting a single process in a PBS job with 8 nodes. I cannot reproduce it in a job with only one or two nodes allocated.
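For completeness, the workaround build was roughly the configure line from above plus the extra CFLAGS (the exact invocation is a sketch):

../configure --prefix=$HOME/opt-hawk/openmpi-main-ucx/ \
    --with-ucx=/opt/hlrs/non-spack/mpi/openmpi/ucx/1.12.0/ \
    --with-xpmem=$HOME/opt-hawk/xpmem \
    --disable-man-pages --enable-debug \
    CFLAGS=-Wno-unused-parameter
make -j && make install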

Funny regression: I can run mpirun -n 1 -N 1 ... but setting -n 1 -N 2 leads to an error:

--------------------------------------------------------------------------
Your job has requested more processes than the ppr for
this topology can support:

  App: hostname
  Number of procs:  1
  Procs mapped:  1
  Total number of procs:  2
  PPR: 2:node

Please revise the conflict and try again.
--------------------------------------------------------------------------

devreal avatar Jul 19 '22 20:07 devreal

Seeing the same occasional hang with v5.0.x.

devreal avatar Jul 20 '22 02:07 devreal

I attached to a prterun that hung but couldn't see anything useful, other than that every thread is waiting for something to happen...

(gdb) thread apply all bt

Thread 4 (Thread 0x1481724a1700 (LWP 141944)):
#0  0x00001481741a929f in select () from /lib64/libc.so.6
#1  0x0000148175063395 in listen_thread (obj=<optimized out>) at ../../../../../../../3rd-party/prrte/src/mca/oob/tcp/oob_tcp_listener.c:602
#2  0x000014817448214a in start_thread () from /lib64/libpthread.so.0
#3  0x00001481741b1dc3 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x1481726a2700 (LWP 141943)):
#0  0x00001481741a929f in select () from /lib64/libc.so.6
#1  0x0000148174af0522 in listen_thread (obj=<optimized out>) at ../../../../../../3rd-party/openpmix/src/mca/ptl/base/ptl_base_listener.c:167
#2  0x000014817448214a in start_thread () from /lib64/libpthread.so.0
#3  0x00001481741b1dc3 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x1481729b5700 (LWP 141936)):
#0  0x00001481741b20f7 in epoll_wait () from /lib64/libc.so.6
#1  0x0000148174f63993 in epoll_dispatch () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-v5.0.x-ucx/lib/libevent_core-2.1.so.7
#2  0x0000148174f58e48 in event_base_loop () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-v5.0.x-ucx/lib/libevent_core-2.1.so.7
#3  0x00001481749d3f21 in progress_engine (obj=<optimized out>) at ../../../../3rd-party/openpmix/src/runtime/pmix_progress_threads.c:228
#4  0x000014817448214a in start_thread () from /lib64/libpthread.so.0
#5  0x00001481741b1dc3 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x1481735ced80 (LWP 141929)):
#0  0x00001481741b20f7 in epoll_wait () from /lib64/libc.so.6
#1  0x0000148174f63993 in epoll_dispatch () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-v5.0.x-ucx/lib/libevent_core-2.1.so.7
#2  0x0000148174f58e48 in event_base_loop () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-v5.0.x-ucx/lib/libevent_core-2.1.so.7
#3  0x00000000004055f7 in main (argc=<optimized out>, argv=<optimized out>) at ../../../../../../3rd-party/prrte/src/tools/prte/prte.c:732

I also played around with some of the verbosity MCA parameters (roughly as sketched below) but didn't see anything useful. Any ideas on how to debug this further?
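The kind of invocation I tried looked roughly like this (a sketch; the specific components and verbosity levels are assumptions):

mpirun --prtemca plm_base_verbose 5 \
       --prtemca state_base_verbose 5 \
       --pmixmca ptl_base_verbose 10 \
       -n 1 hostname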

devreal avatar Jul 21 '22 17:07 devreal

@devreal can you reproduce when calling prterun instead of mpirun? And prun? I wonder if it is a cleanup issue where something in /tmp is not getting cleaned up, or isn't getting cleaned up fast enough.

One thing you could try is manually removing /tmp/prte.$HOSTNAME.* after every run (in your script) to see if that clears it up.
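A sketch of that experiment, using the glob from above (whether the remote nodes would also need cleaning via ssh/pdsh is left out):

for i in $(seq 1 100); do
  echo $i
  mpirun -n 1 hostname
  rm -rf /tmp/prte.$(hostname).*   # clear any leftover session dirs between runs
done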

awlauria avatar Jul 22 '22 13:07 awlauria

can you reproduce when calling prterun instead of mpirun?

Yes, it hangs too.

and prun?

I get the following error:

prun failed to initialize, likely due to no DVM being available

Not sure what to do about that.

One thing you could try is after every run manually removing (in your scrpt) /tmp/prte.$HOSTNAME.* to see if that clears it up.

I don't see any prte files in /tmp, neither before nor after a run that hangs.

Is there a way to get extra debug output that might help dig into where the launch gets stuck?

devreal avatar Jul 22 '22 16:07 devreal

With prun you have to daemonize prte first: prte --daemonize. However, since you hit it with mpirun and prterun, I'd say the odds of it not reproducing with prun are remote.
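A minimal sketch of that workflow (assumes prte, prun, and pterm from this build are on PATH; the loop mirrors the reproducer above):

prte --daemonize                 # start a persistent DVM in the background
for i in $(seq 1 100); do echo $i && prun -n 1 hostname; done
pterm                            # tear the DVM down when finished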

For what it's worth, I can't reproduce it when using the latest and greatest prte + pmix.

awlauria avatar Jul 27 '22 02:07 awlauria

I believe this is fixed by https://github.com/openpmix/prrte/pull/1401 and https://github.com/openpmix/prrte/pull/1403, which should now be in your main branch. Those changes are also in the PMIx v4.2 and PRRTE v3.0 branches, so they should come into OMPI v5 once updated.

rhc54 avatar Aug 09 '22 01:08 rhc54

@rhc54 provided a fix in https://github.com/openpmix/prrte/pull/1436 and ported it back to the PRTE 3.0 branch in https://github.com/openpmix/prrte/pull/1437. @awlauria can we bump the PRTE pointers for both main and 5.0.x?

devreal avatar Aug 16 '22 14:08 devreal

This has been fixed in prrte. Thanks @rhc54, closing.

devreal avatar Nov 02 '22 13:11 devreal