mpich

Collective test failures with UCX-ROCm build

Open yfguo opened this issue 3 months ago • 3 comments

I got a few test failures when testing the UCX build on AMD machines. Noting them down here for later investigation.

Configuration: CC=amdclang CXX=amdclang++ ./configure --prefix=$PWD/_inst --with-device=ch4:ucx --with-ucx=$HOME/jlse-proj/soft/ucx-rocm-1.15.0 --with-hip=$ROCM_PATH --with-rccl=$HOME/jlse-proj/soft/rccl --enable-g=all

Failure:

    <testcase classname="coll" name="01444 - ./coll/reduce 5 -memtype=all " time="6.60977506637573">
      <failure type="TestFailed"
               message="not ok 1444 - ./coll/reduce 5"><![CDATA[not ok 1444 - ./coll/reduce 5
  ---
  Directory: ./coll
  File: reduce
  Num-procs: 5
  Timeout: 180
  Date: "Tue Sep 16 03:57:30 2025"
  ...
## Test output (expected 'No Errors'):
## [1757995046.794816] [amdgpu05:194643:0]           ib_md.c:309  UCX  ERROR ibv_reg_mr(address=0x7ef5f2306000, length=16384, access=0xf) failed: Invalid argument
## [1757995046.794851] [amdgpu05:194643:0]          ucp_mm.c:62   UCX  ERROR failed to register address 0x7ef5f2306000 (rocm) length 16384 on md[4]=mlx5_0: Input/output error (md supports: host|rocm)
## [1757995046.794855] [amdgpu05:194643:0]     ucp_request.c:555  UCX  ERROR failed to register user buffer datatype 0x8 address 0x7ef5f2306000 len 16384: Input/output error
## [amdgpu05:194643:0:194643]        rndv.c:536  Assertion `status == UCS_OK' failed
## 
## /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/rndv/rndv.c: [ ucp_rndv_progress_rma_zcopy_common() ]
##       ...
##       533 
##       534     if (req->send.rndv.mdesc == NULL) {
##       535         status = ucp_send_request_add_reg_lane(req, lane);
## ==>   536         ucs_assert_always(status == UCS_OK);
##       537     }
##       538 
##       539     rsc_index = ucp_ep_get_rsc_index(ep, lane);
## 
## ==== backtrace (tid: 194643) ====
##  0 0x000000000007d3f8 ucp_rndv_progress_rma_zcopy_common()  /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/rndv/rndv.c:536
##  1 0x000000000007d3f8 ucp_rndv_progress_rma_get_zcopy()  /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/rndv/rndv.c:2221
##  2 0x0000000000078a79 ucp_request_try_send()  /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/core/ucp_request.inl:349
##  3 0x0000000000078a79 ucp_request_send()  /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/core/ucp_request.inl:372
##  4 0x0000000000078a79 ucp_rndv_req_send_rma_get()  /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/rndv/rndv.c:954
##  5 0x0000000000078a79 ucp_rndv_receive()  /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/rndv/rndv.c:1684
##  6 0x0000000000091ef0 ucp_tag_recv_common()  /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/tag/tag_recv.c:169
##  7 0x0000000000091ef0 ucp_tag_recv_common()  /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/tag/tag_recv.c:172
##  8 0x0000000000091ef0 ucp_tag_recv_nbx()  /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/tag/tag_recv.c:243
##  9 0x0000000000439a24 MPID_Irecv()  /home/yguo/jlse-proj/mpich-rccl/./src/mpid/ch4/netmod/include/../ucx/ucx_recv.h:215
## 10 0x00000000004391e1 MPIC_Recv()  /home/yguo/jlse-proj/mpich-rccl/src/mpi/coll/helper_fns.c:206
## 11 0x000000000039add1 MPIR_Reduce_intra_reduce_scatter_gather()  /home/yguo/jlse-proj/mpich-rccl/src/mpi/coll/reduce/reduce_intra_reduce_scatter_gather.c:314
## 12 0x000000000041e6c2 MPIR_Reduce_allcomm_auto()  /home/yguo/jlse-proj/mpich-rccl/src/mpi/coll/mpir_coll.c:4320
## 13 0x000000000041e7da MPIR_Reduce_impl()  /home/yguo/jlse-proj/mpich-rccl/src/mpi/coll/mpir_coll.c:0
## 14 0x000000000043470a MPIDI_NM_mpi_reduce()  /home/yguo/jlse-proj/mpich-rccl/./src/mpid/ch4/netmod/include/../ucx/ucx_coll.h:244
## 15 0x000000000043470a MPIDI_Reduce_intra_composition_gamma()  /home/yguo/jlse-proj/mpich-rccl/./src/mpid/ch4/src/ch4_coll_impl.h:1134
## 16 0x000000000041f176 MPIR_Reduce()  /home/yguo/jlse-proj/mpich-rccl/./src/mpid/ch4/src/ch4_coll.h:1273
## 17 0x000000000023c00a PMPI_Reduce()  /home/yguo/jlse-proj/mpich-rccl/src/binding/c/coll/reduce.c:155
## 18 0x0000000000227e1c test_reduce()  /home/yguo/jlse-proj/mpich-rccl/test/mpi/coll/reduce.c:86
## 19 0x0000000000227e1c coll_reduce()  /home/yguo/jlse-proj/mpich-rccl/test/mpi/coll/reduce.c:124
## 20 0x000000000020cd86 main()  /home/yguo/jlse-proj/mpich-rccl/test/mpi/util/run_mpitests.c:64
## 21 0x0000000000040e6c __libc_start_call_main()  ???:0
## 22 0x0000000000040f35 __libc_start_main_alias_2()  ???:0
## 23 0x000000000020c901 _start()  /home/abuild/rpmbuild/BUILD/glibc-2.38/csu/../sysdeps/x86_64/start.S:115
## =================================
## 
## ===================================================================================
## =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
## =   PID 194643 RUNNING AT amdgpu05
## =   EXIT CODE: 6
## =   CLEANING UP REMAINING PROCESSES
## =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
## ===================================================================================
## YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
## This typically refers to a problem with your application.
## Please see the FAQ page for debugging suggestions
    ]]></failure>
    </testcase>

yfguo Sep 16 '25 15:09

ibv_reg_mr failed. For some reason UCX thinks the address is host, not rocm.

yfguo Sep 16 '25 20:09

> ibv_reg_mr failed. For some reason UCX thinks the address is host, not rocm.

You can try UCX_MEMTYPE_CACHE=no to disable the memory type cache in UCX and see if it makes a difference.

raffenet Sep 18 '25 15:09
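
For anyone reproducing this, a minimal sketch of rerunning the failing test with the memory type cache disabled might look like the following. It assumes you run it from the MPICH test/mpi build directory and that ./coll/reduce accepts the -memtype=all argument shown in the test name; neither is confirmed in this thread. UCX_LOG_LEVEL=debug is optional and only added here to get more verbose UCX output.

```shell
#!/bin/sh
# Assumptions (not from the thread): current directory is the MPICH
# test/mpi build directory, and ./coll/reduce accepts -memtype=all.
export UCX_MEMTYPE_CACHE=no   # disable the UCX memory type cache, as suggested above
export UCX_LOG_LEVEL=debug    # optional: more verbose UCX logging

if command -v mpiexec >/dev/null 2>&1 && [ -x ./coll/reduce ]; then
    # Same process count as the failing test case (5 ranks).
    mpiexec -n 5 ./coll/reduce -memtype=all
else
    echo "mpiexec or ./coll/reduce not found; adjust paths for your build" >&2
fi
```

If the ibv_reg_mr error still shows up with the cache disabled, a stale memory type cache entry is unlikely to be the cause.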

I did a quick test with UCX_MEMTYPE_CACHE=no and still see the same issue. The problem only happens in the rndv path of the IB communication. Still digging.

yfguo Sep 18 '25 18:09
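
One possible way to narrow down the rndv observation is to rerun the same test while steering UCX away from the get_zcopy rendezvous path seen in the backtrace. The sketch below is only a set of experiments, not a fix. It assumes the same working directory and test arguments as the failing run, and it assumes that UCX_RNDV_THRESH and UCX_RNDV_SCHEME behave in this UCX 1.15.0 ROCm build as they do in stock UCX. The helper name run_reduce is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run the failing test if the binaries are available.
# Assumes the MPICH test/mpi build directory and the -memtype=all argument.
run_reduce() {
    if command -v mpiexec >/dev/null 2>&1 && [ -x ./coll/reduce ]; then
        mpiexec -n 5 ./coll/reduce -memtype=all
    else
        echo "mpiexec or ./coll/reduce not found; adjust paths for your build" >&2
    fi
}

# Experiment 1: raise the rendezvous threshold so messages should stay on
# the eager path. If the failure disappears, that points at rndv registration.
( export UCX_RNDV_THRESH=inf; run_reduce )

# Experiment 2: keep rendezvous enabled but request put_zcopy instead of the
# get_zcopy scheme that appears in the backtrace.
( export UCX_RNDV_SCHEME=put_zcopy; run_reduce )
```

Each experiment runs in a subshell so the exported setting does not leak into the next run.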