
btl/uct not compatible with AM RDMA?

Open devreal opened this issue 3 years ago • 1 comments

I am trying to run the example in https://github.com/open-mpi/ompi/issues/10328 with osc/rdma and btl/uct. My understanding is that this combination should work, but it doesn't. I run the example as follows:

mpirun -n 2 --mca btl self,uct --mca btl_base_verbose 100 --mca btl_uct_memory_domains all ./test_win_dynamic

and get the following error:

[hawk-login03][[7730,1],0][../../../../opal/mca/btl/base/btl_base_am_rdma.c:958:am_rdma_process_rdma] BTL is not compatible with active-message RDMA

The GDB back trace is:

Thread 1 "test_win_dynami" received signal SIGABRT, Aborted.
0x00007fffecb8d37f in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fffecb8d37f in raise () from /lib64/libc.so.6
#1  0x00007fffecb77db5 in abort () from /lib64/libc.so.6
#2  0x00007fffea4faf88 in am_rdma_process_rdma (btl=<optimized out>, desc=<optimized out>) at ../../../../opal/mca/btl/base/btl_base_am_rdma.c:983
#3  0x00007fffea4fd314 in mca_btl_uct_am_handler (arg=0x5f9190, data=<optimized out>, length=<optimized out>, flags=<optimized out>)
    at ../../../../../opal/mca/btl/uct/btl_uct_component.c:329
#4  0x00007fffeb170e46 in uct_iface_invoke_am (flags=0, length=<optimized out>, data=0x7fffd66900dc, id=<optimized out>, iface=0x5f9890)
    at /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/ucx/1.11.2/ucx-1.11.2/src/uct/base/uct_iface.h:769
#5  uct_mm_iface_invoke_am (flags=0, length=<optimized out>, data=0x7fffd66900dc, am_id=<optimized out>, iface=0x5f9890)
    at /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/ucx/1.11.2/ucx-1.11.2/src/uct/sm/mm/base/mm_iface.h:262
#6  uct_mm_iface_process_recv (iface=0x5f9890) at /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/ucx/1.11.2/ucx-1.11.2/src/uct/sm/mm/base/mm_iface.c:251
#7  uct_mm_iface_poll_fifo (iface=0x5f9890) at /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/ucx/1.11.2/ucx-1.11.2/src/uct/sm/mm/base/mm_iface.c:299
#8  uct_mm_iface_progress (tl_iface=0x5f9890) at /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/ucx/1.11.2/ucx-1.11.2/src/uct/sm/mm/base/mm_iface.c:352
#9  0x00007fffea4fc45a in ucs_callbackq_dispatch (cbq=<optimized out>) at /opt/hlrs/non-spack/mpi/openmpi/ucx/1.11.2/include/ucs/datastruct/callbackq.h:211
#10 uct_worker_progress (worker=<optimized out>) at /opt/hlrs/non-spack/mpi/openmpi/ucx/1.11.2/include/uct/api/uct.h:2592
#11 mca_btl_uct_context_progress (context=0x5f9190) at ../../../../../opal/mca/btl/uct/btl_uct_device_context.h:165
#12 mca_btl_uct_tl_progress (tl=0x5f8970, starting_index=<optimized out>) at ../../../../../opal/mca/btl/uct/btl_uct_component.c:577
#13 0x00007fffea4fc79a in mca_btl_uct_tl_progress (starting_index=<optimized out>, tl=<optimized out>) at ../../../../../opal/mca/btl/uct/btl_uct_component.c:571
#14 mca_btl_uct_component_progress () at ../../../../../opal/mca/btl/uct/btl_uct_component.c:631
#15 0x00007fffea4b3413 in opal_progress () at ../../opal/runtime/opal_progress.c:224
#16 0x00007fffec85fe3f in hcoll_ml_progress_impl () from /opt/mellanox/hcoll/lib/libhcoll.so.1
#17 0x00007fffec847e1c in hmca_coll_ml_barrier_intra () from /opt/mellanox/hcoll/lib/libhcoll.so.1
#18 0x00007fffed5eac12 in mca_coll_hcoll_barrier (comm=0x8a1760, module=0xbc6820) at ../../../../../ompi/mca/coll/hcoll/coll_hcoll_ops.c:29
#19 0x00007fffed6934a8 in ompi_osc_rdma_free (win=0xaa13c0) at ../../../../../ompi/mca/osc/rdma/osc_rdma_module.c:68
#20 0x00007fffed56fe37 in ompi_win_free (win=0xaa13c0) at ../../ompi/win/win.c:382
#21 0x00007fffed5b7898 in PMPI_Win_free (win=0x7fffffffaf70) at ../../../../ompi/mpi/c/win_free.c:51
#22 0x00000000004011e6 in main ()

The code fails because the mca_btl_base_receive_descriptor_t passed to the handler does not have an endpoint set (https://github.com/open-mpi/ompi/blob/main/opal/mca/btl/base/btl_base_am_rdma.c#L958). I'm not sure whether that is a problem with btl/uct or a wrong assumption in osc/base. Either way, the application should not abort without warning: the error should be handled gracefully, or avoided if possible.
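
To illustrate what "handled gracefully" could look like, here is a minimal sketch, not the actual Open MPI code. All types and names (sketch_endpoint_t, sketch_recv_descriptor_t, sketch_process_rdma, the SKETCH_* return codes) are hypothetical stand-ins; the only thing taken from the report above is the idea that the receive descriptor may arrive without an endpoint. Instead of calling abort(), the handler reports the problem and returns an error code that the caller could propagate:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real Open MPI types; only the fields
 * needed for this illustration are modeled. */
typedef struct sketch_endpoint {
    int peer_rank;
} sketch_endpoint_t;

typedef struct sketch_recv_descriptor {
    sketch_endpoint_t *endpoint; /* NULL if the BTL did not fill it in */
    const void *payload;
    size_t length;
} sketch_recv_descriptor_t;

/* Hypothetical return codes, not the real OPAL constants. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_NOT_SUPPORTED = -1 };

/* Process an incoming active-message RDMA request. If the descriptor
 * carries no endpoint, report it and return an error rather than abort. */
static int sketch_process_rdma(const sketch_recv_descriptor_t *desc)
{
    if (NULL == desc || NULL == desc->endpoint) {
        fprintf(stderr, "BTL is not compatible with active-message RDMA\n");
        return SKETCH_ERR_NOT_SUPPORTED;
    }

    /* A real handler would service the put/get/atomic request here,
     * replying to desc->endpoint. */
    return SKETCH_SUCCESS;
}
```

Whether an error return is actually usable at this point in the real code path (inside a progress callback) is an open question; the sketch only shows the shape of the check.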

devreal avatar Jun 28 '22 17:06 devreal

Removing me from the ticket; it's unlikely I'll ever have time to work on this issue.

bwbarrett avatar Jun 28 '22 17:06 bwbarrett