ucp_mem_map returns success even if NIC registration fails

Open SeyedMir opened this issue 2 years ago • 2 comments

Describe the bug

ucp_mem_map returns success as long as at least one memory domain succeeds in registering the given region. This is misleading for the caller: for instance, if the NIC (ib memory domain) fails to register cuda memory but the cuda memory domain succeeds with its own registration, ucp_mem_map still returns success.
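To make the distinction concrete, here is a minimal, self-contained model in C. It is not UCX code and does not call any UCX API: `md_t`, `mem_map_any`, and `mem_map_required` are hypothetical names made up for this sketch, and the bitmap of registered domains is an assumption about how one might report the outcome.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory domain; NOT a UCX type. */
typedef struct {
    const char *name;   /* e.g. "cuda" or "mlx5_0" */
    bool        reg_ok; /* whether registration would succeed on this domain */
} md_t;

/*
 * Models the behaviour described in this issue: the call succeeds if at
 * least one domain registers the region. Bit i of *reg_map is set when
 * mds[i] registered. Assumes n does not exceed the bit width of unsigned.
 */
static bool mem_map_any(const md_t *mds, size_t n, unsigned *reg_map)
{
    *reg_map = 0;
    for (size_t i = 0; i < n; ++i) {
        if (mds[i].reg_ok) {
            *reg_map |= 1u << i;
        }
    }
    return *reg_map != 0;
}

/*
 * Stricter variant (an assumption, not an existing UCX option): succeed
 * only if every domain named in required_map registered the region.
 */
static bool mem_map_required(const md_t *mds, size_t n,
                             unsigned required_map, unsigned *reg_map)
{
    mem_map_any(mds, n, reg_map);
    return (*reg_map & required_map) == required_map;
}
```

With two domains where "cuda" registers and "mlx5_0" does not, `mem_map_any` reports success, while `mem_map_required` with the mlx5_0 bit set in `required_map` reports failure. The second result is the one I would have expected as a caller who intends to do zero-copy through the NIC.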

More specifically, I ran into this issue on a system with a small BAR1 size of 256 MB. I pass a cuda buffer to ucp_mem_map, which returns success even though ibv_reg_mr fails because of the small BAR1 size. Later on, I use the same buffer for UCP active message communication (and set UCP_AM_SEND_FLAG_RNDV to enforce the rendezvous protocol). I expect this to succeed and the communication to be zero-copy, but it fails with the following trace from the receiver:

ib_md.c:349  UCX  ERROR ibv_reg_mr(address=0x7fd6aa000000, length=268435456, access=0xf) failed: Bad address
 ucp_mm.c:158  UCX  ERROR failed to register address 0x7fd6aa000000 mem_type bit 0x2 length 268435456 on md[6]=mlx5_0: Input/output error (md reg_mem_types 0x3)
ucp_request.c:498  UCX  ERROR failed to register user buffer datatype 0x8 address 0x7fd6aa000000 len 1048576: Input/output error
rndv.c:490  Assertion `status == UCS_OK' failed
==== backtrace (tid:1535612) ====
 0  libucs.so.0(ucs_handle_error+0x2ec) [0x7fd742a54e1c]
 1  libucs.so.0(ucs_fatal_error_message+0xc9) [0x7fd742a51889]
 2  libucs.so.0(ucs_fatal_error_format+0x114) [0x7fd742a519a4]
 3  libucp.so.0(+0x6f330) [0x7fd742aff330]
 4  libucp.so.0(ucp_rndv_progress_rma_get_zcopy+0x9b) [0x7fd742b06f9b]
 5  libucp.so.0(+0x70306) [0x7fd742b00306]
 6  libucp.so.0(ucp_rndv_receive+0x188) [0x7fd742b03118]
 7  libucp.so.0(ucp_am_recv_data_nbx+0xc75) [0x7fd742ab7f15]

SeyedMir avatar Feb 28 '22 22:02 SeyedMir

@Akshay-Venkatesh @yosefe @bureddy

SeyedMir avatar Feb 28 '22 22:02 SeyedMir

@Akshay-Venkatesh @yosefe @bureddy bringing this issue up again, especially the crashing part of it. I understand that fixing ucp_mem_map will require further thought, but how easy or difficult would it be to fix rndv so that it does not crash when a device buffer cannot be registered?

SeyedMir avatar May 11 '22 20:05 SeyedMir