Collective test failures with UCX-ROCm build
I got a few test failures when testing with the UCX build on AMD machines. Noting them down here for later investigation.
Configuration: CC=amdclang CXX=amdclang++ ./configure --prefix=$PWD/_inst --with-device=ch4:ucx --with-ucx=$HOME/jlse-proj/soft/ucx-rocm-1.15.0 --with-hip=$ROCM_PATH --with-rccl=$HOME/jlse-proj/soft/rccl --enable-g=all
Failure:
<testcase classname="coll" name="01444 - ./coll/reduce 5 -memtype=all " time="6.60977506637573">
<failure type="TestFailed"
message="not ok 1444 - ./coll/reduce 5"><![CDATA[not ok 1444 - ./coll/reduce 5
---
Directory: ./coll
File: reduce
Num-procs: 5
Timeout: 180
Date: "Tue Sep 16 03:57:30 2025"
...
## Test output (expected 'No Errors'):
## [1757995046.794816] [amdgpu05:194643:0] ib_md.c:309 UCX ERROR ibv_reg_mr(address=0x7ef5f2306000, length=16384, access=0xf) failed: Invalid argument
## [1757995046.794851] [amdgpu05:194643:0] ucp_mm.c:62 UCX ERROR failed to register address 0x7ef5f2306000 (rocm) length 16384 on md[4]=mlx5_0: Input/output error (md supports: host|rocm)
## [1757995046.794855] [amdgpu05:194643:0] ucp_request.c:555 UCX ERROR failed to register user buffer datatype 0x8 address 0x7ef5f2306000 len 16384: Input/output error
## [amdgpu05:194643:0:194643] rndv.c:536 Assertion `status == UCS_OK' failed
##
## /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/rndv/rndv.c: [ ucp_rndv_progress_rma_zcopy_common() ]
## ...
## 533
## 534 if (req->send.rndv.mdesc == NULL) {
## 535 status = ucp_send_request_add_reg_lane(req, lane);
## ==> 536 ucs_assert_always(status == UCS_OK);
## 537 }
## 538
## 539 rsc_index = ucp_ep_get_rsc_index(ep, lane);
##
## ==== backtrace (tid: 194643) ====
## 0 0x000000000007d3f8 ucp_rndv_progress_rma_zcopy_common() /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/rndv/rndv.c:536
## 1 0x000000000007d3f8 ucp_rndv_progress_rma_get_zcopy() /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/rndv/rndv.c:2221
## 2 0x0000000000078a79 ucp_request_try_send() /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/core/ucp_request.inl:349
## 3 0x0000000000078a79 ucp_request_send() /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/core/ucp_request.inl:372
## 4 0x0000000000078a79 ucp_rndv_req_send_rma_get() /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/rndv/rndv.c:954
## 5 0x0000000000078a79 ucp_rndv_receive() /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/rndv/rndv.c:1684
## 6 0x0000000000091ef0 ucp_tag_recv_common() /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/tag/tag_recv.c:169
## 7 0x0000000000091ef0 ucp_tag_recv_common() /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/tag/tag_recv.c:172
## 8 0x0000000000091ef0 ucp_tag_recv_nbx() /home/yguo/jlse-proj/ucx-1.15.0/src/ucp/tag/tag_recv.c:243
## 9 0x0000000000439a24 MPID_Irecv() /home/yguo/jlse-proj/mpich-rccl/./src/mpid/ch4/netmod/include/../ucx/ucx_recv.h:215
## 10 0x00000000004391e1 MPIC_Recv() /home/yguo/jlse-proj/mpich-rccl/src/mpi/coll/helper_fns.c:206
## 11 0x000000000039add1 MPIR_Reduce_intra_reduce_scatter_gather() /home/yguo/jlse-proj/mpich-rccl/src/mpi/coll/reduce/reduce_intra_reduce_scatter_gather.c:314
## 12 0x000000000041e6c2 MPIR_Reduce_allcomm_auto() /home/yguo/jlse-proj/mpich-rccl/src/mpi/coll/mpir_coll.c:4320
## 13 0x000000000041e7da MPIR_Reduce_impl() /home/yguo/jlse-proj/mpich-rccl/src/mpi/coll/mpir_coll.c:0
## 14 0x000000000043470a MPIDI_NM_mpi_reduce() /home/yguo/jlse-proj/mpich-rccl/./src/mpid/ch4/netmod/include/../ucx/ucx_coll.h:244
## 15 0x000000000043470a MPIDI_Reduce_intra_composition_gamma() /home/yguo/jlse-proj/mpich-rccl/./src/mpid/ch4/src/ch4_coll_impl.h:1134
## 16 0x000000000041f176 MPIR_Reduce() /home/yguo/jlse-proj/mpich-rccl/./src/mpid/ch4/src/ch4_coll.h:1273
## 17 0x000000000023c00a PMPI_Reduce() /home/yguo/jlse-proj/mpich-rccl/src/binding/c/coll/reduce.c:155
## 18 0x0000000000227e1c test_reduce() /home/yguo/jlse-proj/mpich-rccl/test/mpi/coll/reduce.c:86
## 19 0x0000000000227e1c coll_reduce() /home/yguo/jlse-proj/mpich-rccl/test/mpi/coll/reduce.c:124
## 20 0x000000000020cd86 main() /home/yguo/jlse-proj/mpich-rccl/test/mpi/util/run_mpitests.c:64
## 21 0x0000000000040e6c __libc_start_call_main() ???:0
## 22 0x0000000000040f35 __libc_start_main_alias_2() ???:0
## 23 0x000000000020c901 _start() /home/abuild/rpmbuild/BUILD/glibc-2.38/csu/../sysdeps/x86_64/start.S:115
## =================================
##
## ===================================================================================
## = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
## = PID 194643 RUNNING AT amdgpu05
## = EXIT CODE: 6
## = CLEANING UP REMAINING PROCESSES
## = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
## ===================================================================================
## YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
## This typically refers to a problem with your application.
## Please see the FAQ page for debugging suggestions
]]></failure>
</testcase>
ibv_reg_mr failed. For some reason UCX thinks the address is host memory, not rocm.
You can try UCX_MEMTYPE_CACHE=no to disable the memory type cache in UCX and see if it makes a difference.
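For reference, a minimal sketch of that rerun, using the test name, process count, and `-memtype=all` argument from the failure report above. It assumes Hydra's `mpiexec` (so `-genv` is available) and that it is run from the `test/mpi` build directory. The `RUN` switch and the `reduce_cmd` helper are only illustrative names, not part of the test suite.

```shell
#!/bin/sh
# Rerun the failing reduce test with the UCX memory type cache disabled.
# Assumptions: Hydra mpiexec on PATH, current directory is the MPICH test/mpi build dir.
NP=5
TEST=./coll/reduce

# Build the command line; -genv exports the variable to every rank.
reduce_cmd() {
    echo "mpiexec -n $NP -genv UCX_MEMTYPE_CACHE $1 $TEST -memtype=all"
}

CMD=$(reduce_cmd no)
echo "$CMD"

# Only launch when explicitly asked (RUN=1), so the script can be dry-run first.
if [ "${RUN:-0}" = 1 ]; then
    $CMD
fi
```

Running it as `RUN=1 sh rerun.sh` launches the test; without `RUN=1` it only prints the command.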
I ran a quick test with UCX_MEMTYPE_CACHE=no and still hit the same issue. The problem only happens in the rndv path of the IB communication. Still digging.
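One possible way to double-check the rndv observation is to move the rendezvous threshold with `UCX_RNDV_THRESH` and compare runs. This is only a sketch under assumptions: Hydra's `mpiexec`, the same test and arguments as in the report, and the expectation (not verified here) that `inf` keeps messages on the eager path while a small value such as `1024` pushes the 16384-byte transfers into rndv. The `rndv_cmd` helper and `RUN` switch are illustrative names.

```shell
#!/bin/sh
# Compare the reduce test with rendezvous effectively disabled vs. forced early.
# Assumptions: Hydra mpiexec on PATH, current directory is the MPICH test/mpi build dir.
NP=5
TEST=./coll/reduce

# Build a command line with the given rendezvous threshold.
rndv_cmd() {
    echo "mpiexec -n $NP -genv UCX_RNDV_THRESH $1 $TEST -memtype=all"
}

EAGER_CMD=$(rndv_cmd inf)    # threshold never reached: eager protocol expected
RNDV_CMD=$(rndv_cmd 1024)    # low threshold: rndv expected for larger messages
echo "$EAGER_CMD"
echo "$RNDV_CMD"

# Only launch when explicitly asked (RUN=1); the second run is the one expected to fail.
if [ "${RUN:-0}" = 1 ]; then
    $EAGER_CMD
    $RNDV_CMD
fi
```

If the first run passes and the second hits the same `ibv_reg_mr` error, that would be consistent with the failure being specific to the rndv zcopy registration.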