ucx
UCT/API: add dmabuf to md_mem_query attributes
What
Add dmabuf fd field in md_mem attributes
Why?
Needed by UCT/IB to register device memory exposed as a dmabuf
cc @yosefe
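For illustration, here is a minimal sketch of how a caller might request a dmabuf fd through a field-mask style memory query. Every name in it (`mock_md_t`, `mem_attr_t`, `MEM_ATTR_FIELD_DMABUF_FD`, `mock_md_mem_query`, `DMABUF_FD_INVALID`) is a hypothetical stand-in chosen for this example, not one of the UCT definitions touched by this PR. The mock memory domain is assumed to know about a single device region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is NOT the UCT API. */
enum {
    MEM_ATTR_FIELD_MEM_TYPE  = 1u << 0, /* caller wants mem_type filled */
    MEM_ATTR_FIELD_DMABUF_FD = 1u << 1  /* caller wants dmabuf_fd filled */
};

#define DMABUF_FD_INVALID (-1)

typedef struct {
    uint64_t field_mask; /* in:  which fields the caller requests */
    int      mem_type;   /* out: 0 = host, 1 = device (illustrative) */
    int      dmabuf_fd;  /* out: dmabuf fd, or DMABUF_FD_INVALID */
} mem_attr_t;

/* Stand-in for a memory domain that tracks one device region. */
typedef struct {
    const void *dev_base;      /* start of the device region */
    size_t      dev_length;    /* size of the device region in bytes */
    int         dev_dmabuf_fd; /* fd assumed to export that region */
} mock_md_t;

/* Fill only the fields selected in attr->field_mask; leave the rest as-is. */
static int mock_md_mem_query(const mock_md_t *md, const void *address,
                             size_t length, mem_attr_t *attr)
{
    uintptr_t base, addr;
    int is_dev;

    assert(md != NULL && attr != NULL);

    base   = (uintptr_t)md->dev_base;
    addr   = (uintptr_t)address;
    is_dev = (addr >= base) && (addr + length <= base + md->dev_length);

    if (attr->field_mask & MEM_ATTR_FIELD_MEM_TYPE) {
        attr->mem_type = is_dev ? 1 : 0;
    }
    if (attr->field_mask & MEM_ATTR_FIELD_DMABUF_FD) {
        /* Memory outside the device region has no dmabuf export here. */
        attr->dmabuf_fd = is_dev ? md->dev_dmabuf_fd : DMABUF_FD_INVALID;
    }
    return 0;
}
```

A caller would set `field_mask` to `MEM_ATTR_FIELD_DMABUF_FD`, call the query on a buffer, and treat `DMABUF_FD_INVALID` as "register this memory the usual way". Gating each output on the mask means callers built before the field existed never read or depend on it.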
@shamisp Can you check if any changes are still needed?
The failures seem relevant:
[swx-rdmz-ucx-gpu-01:12226:0:12226] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x9)
==== backtrace (tid: 12226) ====
0 /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(ucs_handle_error+0x12c) [0x7f3192e6cbac]
1 /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(+0x34ebe) [0x7f3192e6cebe]
2 /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(+0x3513b) [0x7f3192e6d13b]
3 /usr/lib64/libpthread.so.0(+0xf630) [0x7f318f2da630]
4 /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(ucp_mem_type_pack+0xb0) [0x7f3192b49fb0]
5 /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(ucp_dt_pack+0xd4) [0x7f3192b4a244]
6 /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(+0xc82fb) [0x7f3192baa2fb]
7 /__w/1/s/build-test/src/ucs/.libs/ucx/libuct_ib.so.0(uct_rc_mlx5_ep_am_bcopy+0x2aa) [0x7f318e240eaa]
8 /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(+0xc93a6) [0x7f3192bab3a6]
9 /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(ucp_wireup_replay_pending_requests+0x4d) [0x7f3192befaad]
10 /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(+0x108a57) [0x7f3192beaa57]
11 /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(+0x2612b) [0x7f3192e5e12b]
12 /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(ucp_worker_progress+0x72) [0x7f3192b459e2]
13 /__w/1/s/build-test/examples/.libs/lt-ucp_client_server() [0x4026d0]
14 /__w/1/s/build-test/examples/.libs/lt-ucp_client_server() [0x4042df]
15 /__w/1/s/build-test/examples/.libs/lt-ucp_client_server() [0x401ca1]
16 /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f318ed17555]
17 /__w/1/s/build-test/examples/.libs/lt-ucp_client_server() [0x401dd4]
=================================
@yosefe This error doesn't seem UCX related https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=46759&view=logs&j=3af09b09-f681-502e-a77b-ab1dc5457b44&t=38d5da4a-ed53-5453-fe91-8988b34f7242&l=2076. Can you help resolve when possible?
@yosefe Another one that seems unrelated https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=46770&view=logs&j=74876da5-c0e8-5509-4929-d550f147815a&t=8a45b11e-6db7-52a5-8389-552756497e36&l=8766
@yosefe https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=47043&view=logs&j=cc064a77-22b5-56bf-ecc0-70b5fe764261&t=b5a848c0-0a48-4385-bfb2-52d808f7151a&l=18 seems unrelated. Is it possible to restart the failing tests?
@yosefe Do these errors look related? https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=47626&view=logs&j=022a669b-8da8-5bd8-ffff-0833ae356793&t=c9b897af-c0e4-5d2e-5d41-4380dc6de0ef&l=28548
@Akshay-Venkatesh since other PRs seem to pass the tests, the failure looks relevant. Maybe some kind of memory corruption?
@yosefe I reran the failing tests for 1000 iterations and I don't see the same failure. Would it be possible to rerun the tests? cc @bureddy
/azp run
Azure Pipelines successfully started running 3 pipeline(s).
The same errors are repeating, but I am unable to reproduce them with the same build configuration (tested on Rome+A100). I'm not sure how to go about debugging the issue.
@Akshay-Venkatesh the issue can be triggered by running multiple tests one after another in CI. Maybe try reverting some of the changes in this PR, for debugging purposes, to see which part is causing the failure?