
mpirun hangs intermittently

devreal opened this issue 3 years ago · 14 comments

I'm seeing mpirun hanging during startup on our system. Running mpirun in a loop eventually hangs, typically after a few dozen iterations:

for i in $(seq 1 100 ); do echo $i &&  mpirun -n 1 hostname ; done

The system has dual-socket 64-core AMD Epyc Rome nodes connected through InfiniBand (ConnectX-6). I built Open MPI main with GCC 10.3.0 at the following git revisions:

Open MPI: v2.x-dev-9896-g3bda0109c4
PRRTE: psrvr-v2.0.0rc1-4370-gdf7d17d0a3
PMIx: v1.1.3-3554-g6c9d3dde

My configure line is:

../configure --prefix=$HOME/opt-hawk/openmpi-main-ucx/ --with-ucx=/opt/hlrs/non-spack/mpi/openmpi/ucx/1.12.0/ --disable-man-pages --with-xpmem=$HOME/opt-hawk/xpmem --enable-debug

It appears that the more processes I spawn, the higher the chance that the hang actually occurs. I should also note that if I allocate a single node from PBS the hang does not seem to occur, but if I allocate 8 nodes I can fairly reliably trigger the hang even when spawning a single process. I'm not sure where to look here or which knobs to turn in order to get meaningful debug output. Any suggestions are more than welcome :)
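For reference, this is roughly how I reproduce it from an interactive PBS job (the select syntax below is an assumption for this system):

qsub -I -l select=8 -l walltime=00:20:00
# inside the interactive job:
for i in $(seq 1 100); do echo $i && mpirun -n 1 hostname; done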

devreal avatar Jul 19 '22 15:07 devreal

Is this system running slurm?

hppritcha avatar Jul 19 '22 16:07 hppritcha

I'll see if I can reproduce locally.

awlauria avatar Jul 19 '22 16:07 awlauria

Never mind, I see the PBS comment. I am seeing this type of behavior on a Slurm system.

hppritcha avatar Jul 19 '22 16:07 hppritcha

Argh, I wasn't on latest main. Updated, problem persists:

Open MPI: v2.x-dev-9961-gc6dca98c71
PRRTE and PMIx are the same as above.

devreal avatar Jul 19 '22 16:07 devreal

It may be worth trying the latest PRRTE/PMIx main; a fix may have come in since the last submodule update.
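If it helps, one way to pull in the latest submodule heads for a test build (a sketch; assumes you build the 3rd-party/openpmix and 3rd-party/prrte submodules that ship in the ompi tree):

cd ompi
git submodule update --init --recursive
(cd 3rd-party/openpmix && git checkout master && git pull)
(cd 3rd-party/prrte && git checkout master && git pull)
# then re-run autogen.pl / configure / make as usual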

awlauria avatar Jul 19 '22 17:07 awlauria

Mhh, I'm seeing build-time issues with the current PMIx:

  CC       prm_tm.lo
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c: In function ‘tm_notify’:
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:54:46: error: unused parameter ‘status’ [-Werror=unused-parameter]
   54 | static pmix_status_t tm_notify(pmix_status_t status, const pmix_proc_t *source,
      |                                ~~~~~~~~~~~~~~^~~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:54:73: error: unused parameter ‘source’ [-Werror=unused-parameter]
   54 | static pmix_status_t tm_notify(pmix_status_t status, const pmix_proc_t *source,
      |                                                      ~~~~~~~~~~~~~~~~~~~^~~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:55:50: error: unused parameter ‘range’ [-Werror=unused-parameter]
   55 |                                pmix_data_range_t range, const pmix_info_t info[], size_t ninfo,
      |                                ~~~~~~~~~~~~~~~~~~^~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:55:75: error: unused parameter ‘info’ [-Werror=unused-parameter]
   55 |                                pmix_data_range_t range, const pmix_info_t info[], size_t ninfo,
      |                                                         ~~~~~~~~~~~~~~~~~~^~~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:55:90: error: unused parameter ‘ninfo’ [-Werror=unused-parameter]
   55 |                                pmix_data_range_t range, const pmix_info_t info[], size_t ninfo,
      |                                                                                   ~~~~~~~^~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:56:49: error: unused parameter ‘cbfunc’ [-Werror=unused-parameter]
   56 |                                pmix_op_cbfunc_t cbfunc, void *cbdata)
      |                                ~~~~~~~~~~~~~~~~~^~~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:56:63: error: unused parameter ‘cbdata’ [-Werror=unused-parameter]
   56 |                                pmix_op_cbfunc_t cbfunc, void *cbdata)
      |                                 

devreal avatar Jul 19 '22 18:07 devreal

OK, so I got the latest PMIx and PRRTE to build with CFLAGS=-Wno-unused-parameter. I can still reproduce the problem when starting a single process in a PBS job with 8 nodes. I cannot reproduce it in a job with only one or two nodes allocated.
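For completeness, the workaround build was roughly the configure line from above plus the extra CFLAGS (the exact invocation is a sketch):

../configure --prefix=$HOME/opt-hawk/openmpi-main-ucx/ \
    --with-ucx=/opt/hlrs/non-spack/mpi/openmpi/ucx/1.12.0/ \
    --with-xpmem=$HOME/opt-hawk/xpmem \
    --disable-man-pages --enable-debug \
    CFLAGS=-Wno-unused-parameter
make -j && make install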

Funny regression: I can run mpirun -n 1 -N 1 ... but setting -n 1 -N 2 leads to an error:

--------------------------------------------------------------------------
Your job has requested more processes than the ppr for
this topology can support:

  App: hostname
  Number of procs:  1
  Procs mapped:  1
  Total number of procs:  2
  PPR: 2:node

Please revise the conflict and try again.
--------------------------------------------------------------------------

devreal avatar Jul 19 '22 20:07 devreal

Seeing the same occasional hang with v5.0.x.

devreal avatar Jul 20 '22 02:07 devreal

I attached to a prterun that hung but couldn't see anything useful, other than that every thread is waiting for something to happen...

(gdb) thread apply all bt

Thread 4 (Thread 0x1481724a1700 (LWP 141944)):
#0  0x00001481741a929f in select () from /lib64/libc.so.6
#1  0x0000148175063395 in listen_thread (obj=<optimized out>) at ../../../../../../../3rd-party/prrte/src/mca/oob/tcp/oob_tcp_listener.c:602
#2  0x000014817448214a in start_thread () from /lib64/libpthread.so.0
#3  0x00001481741b1dc3 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x1481726a2700 (LWP 141943)):
#0  0x00001481741a929f in select () from /lib64/libc.so.6
#1  0x0000148174af0522 in listen_thread (obj=<optimized out>) at ../../../../../../3rd-party/openpmix/src/mca/ptl/base/ptl_base_listener.c:167
#2  0x000014817448214a in start_thread () from /lib64/libpthread.so.0
#3  0x00001481741b1dc3 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x1481729b5700 (LWP 141936)):
#0  0x00001481741b20f7 in epoll_wait () from /lib64/libc.so.6
#1  0x0000148174f63993 in epoll_dispatch () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-v5.0.x-ucx/lib/libevent_core-2.1.so.7
#2  0x0000148174f58e48 in event_base_loop () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-v5.0.x-ucx/lib/libevent_core-2.1.so.7
#3  0x00001481749d3f21 in progress_engine (obj=<optimized out>) at ../../../../3rd-party/openpmix/src/runtime/pmix_progress_threads.c:228
#4  0x000014817448214a in start_thread () from /lib64/libpthread.so.0
#5  0x00001481741b1dc3 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x1481735ced80 (LWP 141929)):
#0  0x00001481741b20f7 in epoll_wait () from /lib64/libc.so.6
#1  0x0000148174f63993 in epoll_dispatch () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-v5.0.x-ucx/lib/libevent_core-2.1.so.7
#2  0x0000148174f58e48 in event_base_loop () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-v5.0.x-ucx/lib/libevent_core-2.1.so.7
#3  0x00000000004055f7 in main (argc=<optimized out>, argv=<optimized out>) at ../../../../../../3rd-party/prrte/src/tools/prte/prte.c:732

I also played around with some of the verbosity MCA parameters (roughly as sketched below) but didn't see anything useful. Any ideas on how to debug this further?
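The kind of invocation I tried looked roughly like this (a sketch; the specific components and verbosity levels are assumptions):

mpirun --prtemca plm_base_verbose 5 \
       --prtemca state_base_verbose 5 \
       --pmixmca ptl_base_verbose 10 \
       -n 1 hostname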

devreal avatar Jul 21 '22 17:07 devreal

@devreal can you reproduce when calling prterun instead of mpirun? And prun? I wonder if it is a cleanup issue where something in /tmp is not getting cleaned up, or isn't getting cleaned up fast enough.

One thing you could try is manually removing /tmp/prte.$HOSTNAME.* after every run (in your script) to see if that clears it up.
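A sketch of that experiment, using the glob from above (whether the remote nodes would also need cleaning via ssh/pdsh is left out):

for i in $(seq 1 100); do
  echo $i
  mpirun -n 1 hostname
  rm -rf /tmp/prte.$(hostname).*   # clear any leftover session dirs between runs
done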

awlauria avatar Jul 22 '22 13:07 awlauria

can you reproduce when calling prterun instead of mpirun?

Yes, it hangs too.

and prun?

I get the following error:

prun failed to initialize, likely due to no DVM being available

Not sure what to do about that.

One thing you could try is after every run manually removing (in your scrpt) /tmp/prte.$HOSTNAME.* to see if that clears it up.

I don't see any prte files in /tmp, neither before nor after a run that hangs.

Is there a way to get extra debug output that might help dig into where the launch gets stuck?

devreal avatar Jul 22 '22 16:07 devreal

With prun you have to daemonize prte first: prte --daemonize. However, since you hit it with mpirun and prterun, I'd say the odds of it not reproducing with prun are remote.
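A minimal sketch of that workflow (assumes prte, prun, and pterm from this build are on PATH; the loop mirrors the reproducer above):

prte --daemonize                 # start a persistent DVM in the background
for i in $(seq 1 100); do echo $i && prun -n 1 hostname; done
pterm                            # tear the DVM down when finished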

For what it's worth, I can't reproduce it when using the latest and greatest prte + pmix.

awlauria avatar Jul 27 '22 02:07 awlauria

I believe this is fixed by https://github.com/openpmix/prrte/pull/1401 and https://github.com/openpmix/prrte/pull/1403, which should now be in your main branch. Those changes are also in the PMIx v4.2 and PRRTE v3.0 branches, so they should come into OMPI v5 once updated.

rhc54 avatar Aug 09 '22 01:08 rhc54

@rhc54 provided a fix in https://github.com/openpmix/prrte/pull/1436 and ported it back to the PRTE 3.0 branch in https://github.com/openpmix/prrte/pull/1437. @awlauria can we bump the PRTE pointers for both main and 5.0.x?

devreal avatar Aug 16 '22 14:08 devreal

This has been fixed in prrte. Thanks @rhc54, closing.

devreal avatar Nov 02 '22 13:11 devreal