
UCT/API: add dmabuf to md_mem_query attributes

Open Akshay-Venkatesh opened this issue 3 years ago • 13 comments

What

Add a dmabuf fd field to the md_mem_query attributes.

Why?

Needed by UCT/IB to register device memory that is exposed as a dmabuf.
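To illustrate the idea, here is a minimal, self-contained sketch of a field-mask style attribute query that can return a dmabuf fd, and of a consumer that picks a registration path based on it. Every name in it (`sketch_md_t`, `sketch_mem_attr_t`, `sketch_md_mem_query`, `sketch_pick_registration`, and the `SKETCH_*` constants) is hypothetical and only mimics the shape of such an API; none of it is taken from the UCX headers or from this PR, and the memory domain is a mock that simply hands back a preset fd.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical stand-ins, NOT the actual UCX API. */
#define SKETCH_MEM_ATTR_FIELD_MEM_TYPE  (1ull << 0)
#define SKETCH_MEM_ATTR_FIELD_DMABUF_FD (1ull << 1)
#define SKETCH_DMABUF_FD_INVALID        (-1)

typedef enum {
    SKETCH_REG_DEFAULT, /* regular virtual-address based registration */
    SKETCH_REG_DMABUF   /* registration through a dmabuf fd */
} sketch_reg_path_t;

/* Mock memory domain: holds the fd a device driver would export. */
typedef struct {
    int exported_fd;
} sketch_md_t;

typedef struct {
    uint64_t field_mask; /* in:  which fields the caller requests */
    int      mem_type;   /* out: valid only if MEM_TYPE was requested */
    int      dmabuf_fd;  /* out: valid only if DMABUF_FD was requested */
} sketch_mem_attr_t;

/* Fill only the fields selected in attr->field_mask.
 * Returns 0 on success, -1 on invalid arguments. */
int sketch_md_mem_query(const sketch_md_t *md, const void *address,
                        size_t length, sketch_mem_attr_t *attr)
{
    assert((md != NULL) && (attr != NULL));

    if ((address == NULL) || (length == 0)) {
        return -1;
    }

    if (attr->field_mask & SKETCH_MEM_ATTR_FIELD_MEM_TYPE) {
        attr->mem_type = 0; /* mock: always report "host" */
    }

    if (attr->field_mask & SKETCH_MEM_ATTR_FIELD_DMABUF_FD) {
        attr->dmabuf_fd = md->exported_fd;
    }

    return 0;
}

/* A consumer (for example an IB transport) would take the dmabuf path
 * only when the fd was requested and a valid one was returned. */
sketch_reg_path_t sketch_pick_registration(const sketch_mem_attr_t *attr)
{
    if ((attr->field_mask & SKETCH_MEM_ATTR_FIELD_DMABUF_FD) &&
        (attr->dmabuf_fd != SKETCH_DMABUF_FD_INVALID)) {
        return SKETCH_REG_DMABUF;
    }

    return SKETCH_REG_DEFAULT;
}
```

A caller would set `field_mask` to `SKETCH_MEM_ATTR_FIELD_DMABUF_FD`, call `sketch_md_mem_query()`, and fall back to the default registration path whenever the returned fd is `SKETCH_DMABUF_FD_INVALID`. Making the fd an opt-in field means callers that do not need it are unaffected.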

Akshay-Venkatesh avatar Jan 13 '22 20:01 Akshay-Venkatesh

cc @yosefe

Akshay-Venkatesh avatar Jan 13 '22 20:01 Akshay-Venkatesh

@shamisp Can you check if any changes are still needed?

Akshay-Venkatesh avatar May 11 '22 19:05 Akshay-Venkatesh

The failures seem relevant:

[swx-rdmz-ucx-gpu-01:12226:0:12226] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x9)
==== backtrace (tid:  12226) ====
 0  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(ucs_handle_error+0x12c) [0x7f3192e6cbac]
 1  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(+0x34ebe) [0x7f3192e6cebe]
 2  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(+0x3513b) [0x7f3192e6d13b]
 3  /usr/lib64/libpthread.so.0(+0xf630) [0x7f318f2da630]
 4  /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(ucp_mem_type_pack+0xb0) [0x7f3192b49fb0]
 5  /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(ucp_dt_pack+0xd4) [0x7f3192b4a244]
 6  /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(+0xc82fb) [0x7f3192baa2fb]
 7  /__w/1/s/build-test/src/ucs/.libs/ucx/libuct_ib.so.0(uct_rc_mlx5_ep_am_bcopy+0x2aa) [0x7f318e240eaa]
 8  /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(+0xc93a6) [0x7f3192bab3a6]
 9  /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(ucp_wireup_replay_pending_requests+0x4d) [0x7f3192befaad]
10  /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(+0x108a57) [0x7f3192beaa57]
11  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(+0x2612b) [0x7f3192e5e12b]
12  /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(ucp_worker_progress+0x72) [0x7f3192b459e2]
13  /__w/1/s/build-test/examples/.libs/lt-ucp_client_server() [0x4026d0]
14  /__w/1/s/build-test/examples/.libs/lt-ucp_client_server() [0x4042df]
15  /__w/1/s/build-test/examples/.libs/lt-ucp_client_server() [0x401ca1]
16  /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f318ed17555]
17  /__w/1/s/build-test/examples/.libs/lt-ucp_client_server() [0x401dd4]
=================================

yosefe avatar May 18 '22 08:05 yosefe

@yosefe This error doesn't seem to be UCX-related: https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=46759&view=logs&j=3af09b09-f681-502e-a77b-ab1dc5457b44&t=38d5da4a-ed53-5453-fe91-8988b34f7242&l=2076 Can you help resolve it when possible?

Akshay-Venkatesh avatar Jun 27 '22 18:06 Akshay-Venkatesh

@yosefe Another one that seems unrelated: https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=46770&view=logs&j=74876da5-c0e8-5509-4929-d550f147815a&t=8a45b11e-6db7-52a5-8389-552756497e36&l=8766

Akshay-Venkatesh avatar Jun 27 '22 21:06 Akshay-Venkatesh

@yosefe https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=47043&view=logs&j=cc064a77-22b5-56bf-ecc0-70b5fe764261&t=b5a848c0-0a48-4385-bfb2-52d808f7151a&l=18 seems unrelated. Is it possible to restart the failing tests?

Akshay-Venkatesh avatar Jul 05 '22 21:07 Akshay-Venkatesh

@yosefe Do these errors look related? https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=47626&view=logs&j=022a669b-8da8-5bd8-ffff-0833ae356793&t=c9b897af-c0e4-5d2e-5d41-4380dc6de0ef&l=28548

Akshay-Venkatesh avatar Jul 20 '22 04:07 Akshay-Venkatesh

@Akshay-Venkatesh Since other PRs seem to pass the tests, the failure seems relevant. Maybe some kind of memory corruption?

yosefe avatar Jul 20 '22 07:07 yosefe

@Akshay-Venkatesh Since other PRs seem to pass the tests, the failure seems relevant. Maybe some kind of memory corruption?

@yosefe I reran the failing tests for 1000 iterations and did not see the same failure. Would it be possible to rerun the tests? cc @bureddy

Akshay-Venkatesh avatar Jul 25 '22 18:07 Akshay-Venkatesh

/azp run

yosefe avatar Jul 25 '22 18:07 yosefe

Azure Pipelines successfully started running 3 pipeline(s).

azure-pipelines[bot] avatar Jul 25 '22 18:07 azure-pipelines[bot]

The same errors are repeating, but I am unable to reproduce them with the same build configuration (tested on Rome+A100). I'm not sure how to go about debugging the issue.

Akshay-Venkatesh avatar Jul 25 '22 21:07 Akshay-Venkatesh

@Akshay-Venkatesh The issue can be triggered by running multiple tests one after another in CI. Maybe try reverting some of the changes in this PR, for debugging purposes, to see which part is causing the failure?

yosefe avatar Jul 26 '22 08:07 yosefe