ompi
SIGBUS in UCX call inside opal_common_ucx_del_procs_nofence
I'm consistently seeing a SIGBUS (signal 7) occurring in a call to opal_common_ucx_del_procs_nofence during shutdown on Hawk. I'm running Open MPI main (v2.x-dev-9510-g7bb5fe1a2f) with UCX 1.12.0. UCX is a system installation; Open MPI is my own installation. The benchmark is osu_scatter. The issue seems to depend, at least in part, on process placement: it does not occur with 128 processes on 4 nodes at 32 processes per node, but it does occur with 129 processes on 4 nodes, or with 128 processes at 33 processes per node (leaving one node underutilized). It occurs more consistently with more than 256 processes. The signal typically appears on multiple processes, but not on all 512 processes.
My command line:
mpirun -n $((128)) --mca coll ^hcoll -N 33 ./build/mpi/collective/osu_scatter
Disabling btl/ucx does not help. With coll/hcoll enabled, or with pml/ucx disabled, everything works, so it seems to be a problem in pml/ucx.
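For easier reproduction, here is a sketch of how the variants above map onto the standard --mca component-selection syntax (process counts as in my runs; the ^ prefix excludes the named component):

```shell
# Baseline that crashes with SIGBUS: hcoll disabled, pml/ucx in use
mpirun -n 128 -N 33 --mca coll ^hcoll ./build/mpi/collective/osu_scatter

# Disabling pml/ucx (falling back to another pml, e.g. ob1): works
mpirun -n 128 -N 33 --mca coll ^hcoll --mca pml ^ucx ./build/mpi/collective/osu_scatter

# Leaving coll/hcoll enabled (default component selection): also works
mpirun -n 128 -N 33 ./build/mpi/collective/osu_scatter
```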
The backtrace I see:
==== backtrace (tid: 313993) ====
0 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
1 0x000000000001c3a7 uct_mm_ep_has_tx_resources() /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/ucx/1.12.0/ucx-1.12.0/src/uct/sm/mm/base/mm_ep.c:406
2 0x000000000001c3a7 uct_mm_ep_flush() /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/ucx/1.12.0/ucx-1.12.0/src/uct/sm/mm/base/mm_ep.c:517
3 0x000000000007128f uct_ep_flush() /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/ucx/1.12.0/ucx-1.12.0/src/uct/api/uct.h:3047
4 0x00000000000722fb ucp_ep_flush_internal() /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/ucx/1.12.0/ucx-1.12.0/src/ucp/rma/flush.c:341
5 0x00000000000722fb ucp_ep_flush_internal() /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/ucx/1.12.0/ucx-1.12.0/src/ucp/rma/flush.c:343
6 0x0000000000030d2d ucp_ep_close_nbx() /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/ucx/1.12.0/ucx-1.12.0/src/ucp/core/ucp_ep.c:1488
7 0x00000000000308cd ucp_ep_close_nb() /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/ucx/1.12.0/ucx-1.12.0/src/ucp/core/ucp_ep.c:1417
8 0x00000000000d8be5 opal_common_ucx_del_procs_nofence() opal/mca/common/ucx/common_ucx.c:448
9 0x00000000000d8d79 opal_common_ucx_del_procs() opal/mca/common/ucx/common_ucx.c:476
10 0x000000000028cd41 mca_pml_ucx_del_procs() ompi/mca/pml/ucx/pml_ucx.c:554
11 0x00000000000b973d ompi_mpi_instance_cleanup_pml() ompi/instance/instance.c:157
12 0x0000000000037537 opal_finalize_cleanup_domain() opal/runtime/opal_finalize.c:135
13 0x0000000000037537 opal_list_remove_item() opal/class/opal_list.h:478
14 0x0000000000037537 opal_finalize_cleanup_domain() opal/runtime/opal_finalize.c:136
15 0x00000000000378ff opal_finalize() opal/runtime/opal_finalize.c:171
16 0x00000000000b12e3 ompi_rte_finalize() ompi/runtime/ompi_rte.c:1028
17 0x00000000000bce60 ompi_mpi_instance_finalize_common() ompi/instance/instance.c:893
18 0x00000000000bce60 ompi_mpi_instance_finalize() ompi/instance/instance.c:947
19 0x00000000000ac19c ompi_mpi_finalize() ompi/runtime/ompi_mpi_finalize.c:294
20 0x0000000000402646 main() mpi/collective/osu_bcast.c:119
21 0x0000000000023493 __libc_start_main() ???:0
22 0x00000000004028ae _start() ???:0
=================================
@devreal Are you using Lustre?
There is a Lustre filesystem available, but I am not using it in my runs.
Can you please make sure the binary files (the Open MPI and UCX libraries) are not on a Lustre filesystem? There is a known issue where this can cause SIGBUS.
AFAICS, all software installation directories and my home directory (from where I built Open MPI and where the installation is located) are on NFS. I couldn't find any linked library on a Lustre FS.
Can you please try UCX_TLS=^xpmem and UCX_TLS=^sm?
I tried both. The good news: the SIGBUS disappears. The bad news: runs tend to hang at the end with both ^xpmem and ^sm (I assume in MPI_Finalize, but I'm not sure yet). The hang is less likely with ^sm than with ^xpmem, though.
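For completeness, I passed the UCX_TLS overrides to the ranks via mpirun's -x export flag; a sketch of the two runs (the ^ prefix tells UCX to exclude the listed transport):

```shell
# Exclude the xpmem transport: SIGBUS gone, but runs often hang at the end
mpirun -x UCX_TLS=^xpmem -n 128 -N 33 --mca coll ^hcoll ./build/mpi/collective/osu_scatter

# Exclude shared-memory transports entirely: hang is less frequent, but still occurs
mpirun -x UCX_TLS=^sm -n 128 -N 33 --mca coll ^hcoll ./build/mpi/collective/osu_scatter
```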