
occasional deadlock on mpi_barrier; what to do?

gregfi opened this issue 6 months ago

My application sporadically deadlocks on mpi_barrier calls. It seems to happen when the network is under very heavy load and/or the machines are being oversubscribed. (I don't have any control over that.) The application is running Open MPI 4.1.4 on SuSE Linux 12. My admins attached a debugger and printed a backtrace for each running process; the result is:

PID 62880:

Using host libthread_db library "/lib64/libthread_db.so.1".
0x00002aead70f655d in poll () from /lib64/libc.so.6 
#0  0x00002aead70f655d in poll () from /lib64/libc.so.6
#1  0x00002aeae04d504e in poll_dispatch (base=0x2fd2ba0, tv=0x12) at ../../../../.././openmpi-4.1.4/opal/mca/event/libevent2022/libevent/poll.c:165
#2  0x00002aeae04c9881 in opal_libevent2022_event_base_loop (base=0x2fd2ba0, flags=18) at ../../../../.././openmpi-4.1.4/opal/mca/event/libevent2022/libevent/event.c:1630
#3  0x00002aeae047254e in opal_progress () from /tools/openmpi/4.1.4/lib/libopen-pal.so.40
#4  0x00002aeaf0818d74 in mca_pml_ob1_send () from /tools/openmpi/4.1.4/lib/openmpi/mca_pml_ob1.so
#5  0x00002aead6b51f9d in ompi_coll_base_barrier_intra_recursivedoubling () from /tools/openmpi/4.1.4/lib/libmpi.so.40
#6  0x00002aead6b03e11 in PMPI_Barrier () from /tools/openmpi/4.1.4/lib/libmpi.so.40
#7  0x00002aead6877f43 in pmpi_barrier__ () from /tools/openmpi/4.1.4/lib/libmpi_mpifh.so.40
#8  0x00002aead6400be2 in mpi_barrier_f08_ () from /tools/openmpi/4.1.4/lib/libmpi_usempif08.so.40
#9  0x00000000005bb177 in (same location in application code)

All other PIDs:

Using host libthread_db library "/lib64/libthread_db.so.1".
0x00002afe6e6c355d in poll () from /lib64/libc.so.6 
#0  0x00002afe6e6c355d in poll () from /lib64/libc.so.6
#1  0x00002afe77aa204e in poll_dispatch (base=0x119bbe0, tv=0x9) at ../../../../.././openmpi-4.1.4/opal/mca/event/libevent2022/libevent/poll.c:165
#2  0x00002afe77a96881 in opal_libevent2022_event_base_loop (base=0x119bbe0, flags=9) at ../../../../.././openmpi-4.1.4/opal/mca/event/libevent2022/libevent/event.c:1630
#3  0x00002afe77a3f54e in opal_progress () from /tools/openmpi/4.1.4/lib/libopen-pal.so.40
#4  0x00002afe6e0ba42b in ompi_request_default_wait () from /tools/openmpi/4.1.4/lib/libmpi.so.40
#5  0x00002afe6e11ef0e in ompi_coll_base_barrier_intra_recursivedoubling () from /tools/openmpi/4.1.4/lib/libmpi.so.40
#6  0x00002afe6e0d0e11 in PMPI_Barrier () from /tools/openmpi/4.1.4/lib/libmpi.so.40
#7  0x00002afe6de44f43 in pmpi_barrier__ () from /tools/openmpi/4.1.4/lib/libmpi_mpifh.so.40
#8  0x00002afe6d9cdbe2 in mpi_barrier_f08_ () from /tools/openmpi/4.1.4/lib/libmpi_usempif08.so.40
#9  0x00000000005bb177 in (same location in application code)
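
For reference, these traces were gathered by attaching gdb to each rank in batch mode, roughly like the sketch below (my_app is a placeholder for the actual binary name):

# my_app is a placeholder for the real application binary
for pid in $(pgrep -f my_app); do
    echo "PID $pid:"
    gdb -batch -p "$pid" -ex "bt"   # attach, print one backtrace, detach
done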

What I see: PID 62880 is blocked in mca_pml_ob1_send, while all the other ranks are blocked in ompi_request_default_wait.

This problem only occurs sporadically. I chewed through ~10,000 core-hours this weekend trying to reproduce the issue and failed, likely because the system was less loaded over the weekend. The jobs are run with -map-by socket --bind-to socket --rank-by core --mca btl_tcp_if_include 10.216.0.0/16 in order to force all traffic over a single interface.
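
For concreteness, the full launch line looks roughly like the following; the rank count and binary name are placeholders, the mapping/binding and MCA flags are the real ones:

# -np 64 and ./my_app below are placeholders
mpirun -np 64 \
    -map-by socket --bind-to socket --rank-by core \
    --mca btl_tcp_if_include 10.216.0.0/16 \
    ./my_app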

Also, puzzlingly, I see the following printed to stderr:

[hpap14n4:08897] mca_base_component_repository_open: unable to open mca_fs_lustre: liblustreapi.so: cannot open shared object file: No such file or directory (ignored)
[hpap14n4:08911] mca_base_component_repository_open: unable to open mca_fs_lustre: liblustreapi.so: cannot open shared object file: No such file or directory (ignored)
[hpap14n4:08913] mca_base_component_repository_open: unable to open mca_fs_lustre: liblustreapi.so: cannot open shared object file: No such file or directory (ignored)
[hpap14n4:08929] mca_base_component_repository_open: unable to open mca_fs_lustre: liblustreapi.so: cannot open shared object file: No such file or directory (ignored)

This is odd because liblustreapi.so should resolve reliably to /usr/lib64/liblustreapi.so, which is installed locally on each machine (so no funny business with network mappings).
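
A quick sanity check on one of those nodes (hpap14n4 is just a host name taken from the messages above) would be something like:

ssh hpap14n4 'ldconfig -p | grep liblustreapi'   # is the library in the loader cache?
ssh hpap14n4 'ls -l /usr/lib64/liblustreapi.so'  # is the file physically present?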

Does anyone have any guesses as to what might be going on, or how I might mitigate these kinds of failures?

gregfi, Aug 12 '24 16:08