ucx
ucp_mem_map returns success even if NIC registration fails
Describe the bug
ucp_mem_map returns success as long as at least one memory domain succeeds in registering the given region. This is misleading for the caller: for instance, if the NIC (ib memory domain) fails to register CUDA memory but the cuda memory domain succeeds with its registration, ucp_mem_map still returns success.
More specifically, I ran into this issue on a system with a small BAR1 size (256 MB). I pass a CUDA buffer to ucp_mem_map, which returns success even though ibv_reg_mr fails because of the small BAR1 size. Later I use the same buffer for UCP active message communication (and set UCP_AM_SEND_FLAG_RNDV to enforce the rendezvous protocol). I expect this to succeed and the communication to be zero-copy, but it fails with the following trace from the receiver:
ib_md.c:349 UCX ERROR ibv_reg_mr(address=0x7fd6aa000000, length=268435456, access=0xf) failed: Bad address
ucp_mm.c:158 UCX ERROR failed to register address 0x7fd6aa000000 mem_type bit 0x2 length 268435456 on md[6]=mlx5_0: Input/output error (md reg_mem_types 0x3)
ucp_request.c:498 UCX ERROR failed to register user buffer datatype 0x8 address 0x7fd6aa000000 len 1048576: Input/output error
rndv.c:490 Assertion `status == UCS_OK' failed
==== backtrace (tid:1535612) ====
0 libucs.so.0(ucs_handle_error+0x2ec) [0x7fd742a54e1c]
1 libucs.so.0(ucs_fatal_error_message+0xc9) [0x7fd742a51889]
2 libucs.so.0(ucs_fatal_error_format+0x114) [0x7fd742a519a4]
3 libucp.so.0(+0x6f330) [0x7fd742aff330]
4 libucp.so.0(ucp_rndv_progress_rma_get_zcopy+0x9b) [0x7fd742b06f9b]
5 libucp.so.0(+0x70306) [0x7fd742b00306]
6 libucp.so.0(ucp_rndv_receive+0x188) [0x7fd742b03118]
7 libucp.so.0(ucp_am_recv_data_nbx+0xc75) [0x7fd742ab7f15]
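To make the complaint concrete, here is a toy model of the two registration policies. This is not UCX code: every type and function below (toy_md_t, toy_map_any, toy_map_required) is hypothetical and exists only for illustration. It assumes two memory domains, where the cuda one registers the buffer and the NIC one does not, and compares a "succeed if any domain registers" policy with one that lets the caller name the domains it actually needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Toy model only: none of these names are UCX APIs. */
typedef enum { TOY_OK = 0, TOY_ERR_IO = -1 } toy_status_t;

typedef struct {
    const char  *name;       /* e.g. "cuda" or "mlx5_0" */
    toy_status_t reg_result; /* simulated outcome of registering the buffer */
} toy_md_t;

/* Lenient policy: succeed if at least one memory domain registers the
 * buffer. reg_map gets one bit set per domain that succeeded. */
static toy_status_t toy_map_any(const toy_md_t *mds, size_t n,
                                unsigned *reg_map)
{
    *reg_map = 0;
    for (size_t i = 0; i < n; ++i) {
        if (mds[i].reg_result == TOY_OK) {
            *reg_map |= 1u << i;
        }
    }
    return (*reg_map != 0) ? TOY_OK : TOY_ERR_IO;
}

/* Strict policy: additionally require every domain whose bit is set in
 * `required` to have registered the buffer. */
static toy_status_t toy_map_required(const toy_md_t *mds, size_t n,
                                     unsigned required, unsigned *reg_map)
{
    toy_status_t status = toy_map_any(mds, n, reg_map);
    if (status != TOY_OK) {
        return status;
    }
    return ((*reg_map & required) == required) ? TOY_OK : TOY_ERR_IO;
}

/* Demo: the cuda domain registers the buffer, the NIC domain does not. */
void toy_demo(void)
{
    const toy_md_t mds[] = {
        { "cuda",   TOY_OK     },
        { "mlx5_0", TOY_ERR_IO },
    };
    unsigned reg_map;

    toy_status_t lenient = toy_map_any(mds, 2, &reg_map);
    toy_status_t strict  = toy_map_required(mds, 2, 1u << 1, &reg_map);

    assert(lenient == TOY_OK);     /* the caller sees success */
    assert(strict == TOY_ERR_IO);  /* the NIC failure is reported up front */
    printf("lenient=%d strict=%d reg_map=0x%x\n", lenient, strict, reg_map);
}
```

Under the lenient policy the caller learns about the NIC failure only later, when the zero-copy path tries to register the buffer again. With a strict variant the failure would surface at map time. Whether ucp_mem_map should behave like this, or expose the set of registered domains in some other way, is a separate design question.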
@Akshay-Venkatesh @yosefe @bureddy
@Akshay-Venkatesh @yosefe @bureddy bringing this issue up again, especially the crashing part of it. I understand that fixing ucp_mem_map will require further thought, but how easy or difficult would it be to fix rndv so that it does not crash when a device buffer cannot be registered?