ucx
ucx copied to clipboard
UCP: Ignore path idx for non-bw lanes
What
Ignore path indexes when adding new lane and there is existing non-bw lane with the same tl resources
Why ?
The problem shown on the picture below may appear on heterogeneous configurations, like:
- only one of nodes has bonding device/interface configured
- different number of paths used by devices on different peers
- etc
@brminich how was this issue found and can be reproduced? Is this ep reconfiguration issue? does it mean AM lane will use 2nd path now?
@brminich how was this issue found and can be reproduced? Is this ep reconfiguration issue? does it mean AM lane will use 2nd path now?
I got an assert with OSU on venus when one of the nodes was in some incorrect state. Unfortunately I lost that config to continue debugging, but I managed to reproduce it on another cluster with ucp_client_server
(but I had to provide different params to the client and server):
server:
UCX_NET_DEVICES=mlx5_0:1,mlx5_4:1 UCX_LOG_LEVEL=info UCX_IB_NUM_PATHS=2 ./examples/ucp_client_server -c tag
client:
UCX_NET_DEVICES=mlx5_0:1,mlx5_4:1 UCX_LOG_LEVEL=info UCX_IB_NUM_PATHS=1 UCX_IB_PCI_BW=mlx5_0:1000MB/s ./examples/ucp_client_server -c tag -a x.x.x.x
However did not manage to get the same assert in gtests. The assert is:
wireup.c:340 Assertion `ep_addr_index < address->num_ep_addrs' failed: ep_addr_index=1 num_ep_addrs=1
wireup.c: [ ucp_wireup_match_p2p_lanes() ]
...
335 address_index = addr_indices[lane];
336 address = &remote_address->address_list[address_index];
337 ep_addr_index = ep_addr_indexes[address_index]++;
==> 338 ucs_assertv(ep_addr_index < address->num_ep_addrs,
339 "ep_addr_index=%u num_ep_addrs=%u",
340 ep_addr_index, address->num_ep_addrs);
341 remote_lane = address->ep_addrs[ep_addr_index].lane;
==== backtrace (tid: 104090) ====
0 0x000000000010fdf9 ucp_wireup_match_p2p_lanes() /labhome/mikhailb/wgit/ucx-tmp/src/ucp/wireup/wireup.c:338
1 0x00000000001146cb ucp_wireup_process_request() /labhome/mikhailb/wgit/ucx-tmp/src/ucp/wireup/wireup.c:589
2 0x00000000001159b9 ucp_wireup_msg_handler() /labhome/mikhailb/wgit/ucx-tmp/src/ucp/wireup/wireup.c:836
3 0x00000000000548e8 uct_iface_invoke_am() /labhome/mikhailb/wgit/ucx-tmp/src/uct/base/uct_iface.h:883
4 0x00000000000548e8 uct_rc_mlx5_iface_common_am_handler() /labhome/mikhailb/wgit/ucx-tmp/src/uct/ib/rc/accel/rc_mlx5.inl:419
5 0x00000000000548e8 uct_rc_mlx5_iface_common_poll_rx() /labhome/mikhailb/wgit/ucx-tmp/src/uct/ib/rc/accel/rc_mlx5.inl:1457
6 0x00000000000548e8 uct_rc_mlx5_iface_progress() /labhome/mikhailb/wgit/ucx-tmp/src/uct/ib/rc/accel/rc_mlx5_iface.c:182
7 0x00000000000548e8 uct_rc_mlx5_iface_progress_cyclic() /labhome/mikhailb/wgit/ucx-tmp/src/uct/ib/rc/accel/rc_mlx5_iface.c:192
8 0x00000000000642e2 ucs_callbackq_dispatch() /labhome/mikhailb/wgit/ucx-tmp/src/ucs/datastruct/callbackq.h:211
9 0x00000000000642e2 uct_worker_progress() /labhome/mikhailb/wgit/ucx-tmp/src/uct/api/uct.h:2647
10 0x00000000000642e2 ucp_worker_progress() /labhome/mikhailb/wgit/ucx-tmp/src/ucp/core/ucp_worker.c:2804
11 0x00000000004023e0 request_wait() /labhome/mikhailb/wgit/ucx-tmp/examples/ucp_client_server.c:334
12 0x00000000004023e0 request_finalize() /labhome/mikhailb/wgit/ucx-tmp/examples/ucp_client_server.c:350
13 0x00000000004038ef send_recv_tag() /labhome/mikhailb/wgit/ucx-tmp/examples/ucp_client_server.c:467
14 0x00000000004041d3 client_server_communication() /labhome/mikhailb/wgit/ucx-tmp/examples/ucp_client_server.c:736
15 0x00000000004041d3 run_server() /labhome/mikhailb/wgit/ucx-tmp/examples/ucp_client_server.c:975
16 0x0000000000401b67 main() /labhome/mikhailb/wgit/ucx-tmp/examples/ucp_client_server.c:1093
17 0x00000000000223d5 __libc_start_main() ???:0
18 0x0000000000401bb4 _start() ???:0
=================================
@brminich the following still fails with this PR, is it expected?
$ taskset -c 18,19 make -C build-devel/test/gtest test GTEST_FILTER=rcx/test_ucp_am_nbx_seg_size.single/7 GTEST_REPEAT=1000
...
Repeating all tests (iteration 29) . . .
Note: Google Test filter = rcx/test_ucp_am_nbx_seg_size.single/7
Note: Randomizing tests' orders with a seed of 19869 .
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from rcx/test_ucp_am_nbx_seg_size
[ RUN ] rcx/test_ucp_am_nbx_seg_size.single/7 <rc_x/prereg/proto/reply>
[ INFO ] seg size 1024 data size 512
[1661087849.623034] [swx-rain04:148009:0] wireup.c:404 UCX ERROR ep 0x7f9aa897f000: no remote ep address for lane[2]->remote_lane[2]
[1661087859.654742] [swx-rain04:148009:0] ucp_test.cc:301 UCX ERROR request 0x4746750 completed with error Destination is unreachable
/.autodirect/rdmzsysgwork/yosefe/ucx/test/gtest/ucp/test_ucp_am.cc:489: Failure
Value of: m_am_received
Actual: false
Expected: true
[1661087859.654859] [swx-rain04:148009:0] flush.c:28 UCX ERROR req 0x4746640: error during flush: Endpoint timeout, flush comp 0x47466d8 count reduced to 2
[1661087859.654867] [swx-rain04:148009:0] flush.c:28 UCX ERROR req 0x4746640: error during flush: Endpoint timeout, flush comp 0x47466d8 count reduced to 1
[1661087859.654872] [swx-rain04:148009:0] flush.c:28 UCX ERROR req 0x4746640: error during flush: Endpoint timeout, flush comp 0x47466d8 count reduced to 0
@brminich the following still fails with this PR, is it expected?
No, will check it
@brminich the following still fails with this PR, is it expected?
$ taskset -c 18,19 make -C build-devel/test/gtest test GTEST_FILTER=rcx/test_ucp_am_nbx_seg_size.single/7 GTEST_REPEAT=1000 ...
I can reproduce it without changes in this PR and RoCE LAG BW increase with the following change:
diff --git a/test/gtest/ucp/test_ucp_am.cc b/test/gtest/ucp/test_ucp_am.cc
index f0661aaa3..afbc582a1 100644
--- a/test/gtest/ucp/test_ucp_am.cc
+++ b/test/gtest/ucp/test_ucp_am.cc
@@ -1138,8 +1138,9 @@ public:
void init()
{
- m_size = ucs_max(UCS_KBYTE,
- ucs::rand() % (64 * UCS_KBYTE));
+ //m_size = ucs_max(UCS_KBYTE,
+ // ucs::rand() % (64 * UCS_KBYTE));
+ m_size = 411;
std::string str_size = ucs::to_string(m_size);
test_ucp_am_nbx::init();
root cause seems to be clear, working on a fix
@brminich the following still fails with this PR, is it expected?
@yosefe fixed in #8472
@yosefe, force-pushed just a single fix for am lane reusage by BW lanes
@yosefe, updated in place, because PR is small enough