
UCP/WIREUP: specify a reason for wireup failure - part 2

Open amastbaum opened this issue 1 year ago • 2 comments

What

During the wireup process, provide a reason when a resource's lane is unreachable. This is the second part, following https://github.com/openucx/ucx/pull/9995, and covers all transports other than IB.

Why ?

Users need more information about why a device is not reachable after an unsuccessful connection establishment. This information should be passed from UCT to UCP.

How ?

Pass a string to the select/search_lane functions so that, if the wireup process fails, the reason can be printed by an upper layer.
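
As a rough illustration of the idea (not the code in this PR), the sketch below threads a caller-owned string through a lane-selection loop so the lower layer can report why each transport was rejected. Every name here (`demo_tl_t`, `demo_tl_is_reachable`, `demo_select_lane`) and the "same host only" rule are hypothetical stand-ins, not UCX APIs.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical transport descriptor, for illustration only (not a UCX type). */
typedef struct {
    const char *name;
    int         same_host_only; /* made-up reachability rule */
} demo_tl_t;

/* Lower layer: returns 1 if reachable, otherwise 0 and fills in `reason`. */
static int demo_tl_is_reachable(const demo_tl_t *tl, int remote_is_local,
                                char *reason, size_t max)
{
    if (tl->same_host_only && !remote_is_local) {
        snprintf(reason, max, "%s: remote peer is on a different host",
                 tl->name);
        return 0;
    }
    return 1;
}

/* Upper layer: returns the index of the first reachable transport, or -1.
 * On failure, `reasons` holds the per-transport reasons joined by "; ". */
static int demo_select_lane(const demo_tl_t *tls, int num_tls,
                            int remote_is_local, char *reasons, size_t max)
{
    char reason[128];
    size_t used = 0;
    int i, n;

    if (max > 0) {
        reasons[0] = '\0';
    }

    for (i = 0; i < num_tls; ++i) {
        if (demo_tl_is_reachable(&tls[i], remote_is_local, reason,
                                 sizeof(reason))) {
            return i;
        }

        /* Append only while there is room; snprintf truncates safely. */
        if (used < max) {
            n = snprintf(reasons + used, max - used, "%s%s",
                         (used > 0) ? "; " : "", reason);
            if (n > 0) {
                used += (size_t)n;
            }
        }
    }

    return -1;
}
```

A caller would pass a fixed-size buffer and print it only when `demo_select_lane` returns -1. That keeps the success path free of logging and leaves the choice of log level to the upper layer.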

amastbaum avatar Aug 11 '24 11:08 amastbaum

@shasson5 can you pls review?

yosefe avatar Aug 19 '24 07:08 yosefe

failure seems relevant:

2024-08-26T07:53:11.2247900Z [ RUN      ] tcp/test_uct_sockaddr_err_handle_non_exist_ip.conn_to_non_exist_ip/2 </lo>
2024-08-26T07:53:11.2250959Z [     INFO ] Testing tcp on 0.0.0.0:49036 interface lo
2024-08-26T07:55:18.5225372Z [       OK ] tcp/test_uct_sockaddr_err_handle_non_exist_ip.conn_to_non_exist_ip/2 (127298 ms)
2024-08-26T07:55:18.5256477Z [----------] 1 test from tcp/test_uct_sockaddr_err_handle_non_exist_ip (127298 ms total)
2024-08-26T07:55:18.5257776Z 
2024-08-26T07:55:18.5258713Z [----------] 1 test from cuda_ipc/test_uct_ep
2024-08-26T07:55:18.5261351Z [ RUN      ] cuda_ipc/test_uct_ep.is_connected/0 <cuda_ipc/cuda>
2024-08-26T07:55:18.5262355Z /__w/1/s/contrib/../test/gtest/uct/test_uct_ep.cc:220: Failure
2024-08-26T07:55:18.5262982Z Value of: is_connected_to_sender(*m_receiver)
2024-08-26T07:55:18.5263335Z   Actual: false
2024-08-26T07:55:18.5263591Z Expected: true
2024-08-26T07:55:18.5263890Z [  FAILED  ] cuda_ipc/test_uct_ep.is_connected/0, where GetParam() = cuda_ipc/cuda (0 ms)
2024-08-26T07:55:18.5264403Z [----------] 1 test from cuda_ipc/test_uct_ep (0 ms total)

yosefe avatar Aug 26 '24 10:08 yosefe

coverity failure seems relevant

yosefe avatar Aug 29 '24 08:08 yosefe

@yosefe it needs to be approved again. Thanks

amastbaum avatar Sep 01 '24 07:09 amastbaum

@amastbaum please squash

gleon99 avatar Sep 01 '24 08:09 gleon99