ucx icon indicating copy to clipboard operation
ucx copied to clipboard

UCP: Ignore path idx for non-bw lanes

Open brminich opened this issue 2 years ago • 2 comments

What

Ignore path indexes when adding new lane and there is existing non-bw lane with the same tl resources

Why ?

The problem shown on the picture below may appear on heterogeneous configurations, like:

  • only one of nodes has bonding device/interface configured
  • different number of paths used by devices on different peers
  • etc

path_idx_pr

brminich avatar Jul 15 '22 10:07 brminich

@brminich how was this issue found and can be reproduced? Is this ep reconfiguration issue? does it mean AM lane will use 2nd path now?

yosefe avatar Jul 15 '22 17:07 yosefe

@brminich how was this issue found and can be reproduced? Is this ep reconfiguration issue? does it mean AM lane will use 2nd path now?

I got an assert with OSU on venus when one of the nodes was in some incorrect state. Unfortunately I lost that config to continue debugging, but I managed to reproduce it on another cluster with ucp_client_server (but I had to provide different params to the client and server):

server: UCX_NET_DEVICES=mlx5_0:1,mlx5_4:1 UCX_LOG_LEVEL=info UCX_IB_NUM_PATHS=2 ./examples/ucp_client_server -c tag

client:

 UCX_NET_DEVICES=mlx5_0:1,mlx5_4:1 UCX_LOG_LEVEL=info UCX_IB_NUM_PATHS=1 UCX_IB_PCI_BW=mlx5_0:1000MB/s  ./examples/ucp_client_server -c tag -a x.x.x.x

However did not manage to get the same assert in gtests. The assert is:

wireup.c:340  Assertion `ep_addr_index < address->num_ep_addrs' failed: ep_addr_index=1 num_ep_addrs=1

wireup.c: [ ucp_wireup_match_p2p_lanes() ]
      ...
      335         address_index      = addr_indices[lane];
      336         address            = &remote_address->address_list[address_index];
      337         ep_addr_index      = ep_addr_indexes[address_index]++;
==>   338         ucs_assertv(ep_addr_index < address->num_ep_addrs,
      339                     "ep_addr_index=%u num_ep_addrs=%u",
      340                     ep_addr_index, address->num_ep_addrs);
      341         remote_lane        = address->ep_addrs[ep_addr_index].lane;

==== backtrace (tid: 104090) ====
 0 0x000000000010fdf9 ucp_wireup_match_p2p_lanes()  /labhome/mikhailb/wgit/ucx-tmp/src/ucp/wireup/wireup.c:338
 1 0x00000000001146cb ucp_wireup_process_request()  /labhome/mikhailb/wgit/ucx-tmp/src/ucp/wireup/wireup.c:589
 2 0x00000000001159b9 ucp_wireup_msg_handler()  /labhome/mikhailb/wgit/ucx-tmp/src/ucp/wireup/wireup.c:836
 3 0x00000000000548e8 uct_iface_invoke_am()  /labhome/mikhailb/wgit/ucx-tmp/src/uct/base/uct_iface.h:883
 4 0x00000000000548e8 uct_rc_mlx5_iface_common_am_handler()  /labhome/mikhailb/wgit/ucx-tmp/src/uct/ib/rc/accel/rc_mlx5.inl:419
 5 0x00000000000548e8 uct_rc_mlx5_iface_common_poll_rx()  /labhome/mikhailb/wgit/ucx-tmp/src/uct/ib/rc/accel/rc_mlx5.inl:1457
 6 0x00000000000548e8 uct_rc_mlx5_iface_progress()  /labhome/mikhailb/wgit/ucx-tmp/src/uct/ib/rc/accel/rc_mlx5_iface.c:182
 7 0x00000000000548e8 uct_rc_mlx5_iface_progress_cyclic()  /labhome/mikhailb/wgit/ucx-tmp/src/uct/ib/rc/accel/rc_mlx5_iface.c:192
 8 0x00000000000642e2 ucs_callbackq_dispatch()  /labhome/mikhailb/wgit/ucx-tmp/src/ucs/datastruct/callbackq.h:211
 9 0x00000000000642e2 uct_worker_progress()  /labhome/mikhailb/wgit/ucx-tmp/src/uct/api/uct.h:2647
10 0x00000000000642e2 ucp_worker_progress()  /labhome/mikhailb/wgit/ucx-tmp/src/ucp/core/ucp_worker.c:2804
11 0x00000000004023e0 request_wait()  /labhome/mikhailb/wgit/ucx-tmp/examples/ucp_client_server.c:334
12 0x00000000004023e0 request_finalize()  /labhome/mikhailb/wgit/ucx-tmp/examples/ucp_client_server.c:350
13 0x00000000004038ef send_recv_tag()  /labhome/mikhailb/wgit/ucx-tmp/examples/ucp_client_server.c:467
14 0x00000000004041d3 client_server_communication()  /labhome/mikhailb/wgit/ucx-tmp/examples/ucp_client_server.c:736
15 0x00000000004041d3 run_server()  /labhome/mikhailb/wgit/ucx-tmp/examples/ucp_client_server.c:975
16 0x0000000000401b67 main()  /labhome/mikhailb/wgit/ucx-tmp/examples/ucp_client_server.c:1093
17 0x00000000000223d5 __libc_start_main()  ???:0
18 0x0000000000401bb4 _start()  ???:0
=================================

brminich avatar Jul 16 '22 13:07 brminich

@brminich the following still fails with this PR, is it expected?

$ taskset -c 18,19 make -C build-devel/test/gtest test GTEST_FILTER=rcx/test_ucp_am_nbx_seg_size.single/7  GTEST_REPEAT=1000
...
Repeating all tests (iteration 29) . . .

Note: Google Test filter = rcx/test_ucp_am_nbx_seg_size.single/7
Note: Randomizing tests' orders with a seed of 19869 .
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from rcx/test_ucp_am_nbx_seg_size
[ RUN      ] rcx/test_ucp_am_nbx_seg_size.single/7 <rc_x/prereg/proto/reply>
[     INFO ] seg size 1024 data size 512
[1661087849.623034] [swx-rain04:148009:0]          wireup.c:404  UCX  ERROR   ep 0x7f9aa897f000: no remote ep address for lane[2]->remote_lane[2]
[1661087859.654742] [swx-rain04:148009:0]       ucp_test.cc:301  UCX  ERROR request 0x4746750 completed with error Destination is unreachable
/.autodirect/rdmzsysgwork/yosefe/ucx/test/gtest/ucp/test_ucp_am.cc:489: Failure
Value of: m_am_received
  Actual: false
Expected: true
[1661087859.654859] [swx-rain04:148009:0]           flush.c:28   UCX  ERROR req 0x4746640: error during flush: Endpoint timeout, flush comp 0x47466d8 count reduced to 2
[1661087859.654867] [swx-rain04:148009:0]           flush.c:28   UCX  ERROR req 0x4746640: error during flush: Endpoint timeout, flush comp 0x47466d8 count reduced to 1
[1661087859.654872] [swx-rain04:148009:0]           flush.c:28   UCX  ERROR req 0x4746640: error during flush: Endpoint timeout, flush comp 0x47466d8 count reduced to 0

yosefe avatar Aug 21 '22 13:08 yosefe

@brminich the following still fails with this PR, is it expected?

No, will check it

brminich avatar Aug 22 '22 06:08 brminich

@brminich the following still fails with this PR, is it expected?

$ taskset -c 18,19 make -C build-devel/test/gtest test GTEST_FILTER=rcx/test_ucp_am_nbx_seg_size.single/7  GTEST_REPEAT=1000
...

I can reproduce it without changes in this PR and RoCE LAG BW increase with the following change:

diff --git a/test/gtest/ucp/test_ucp_am.cc b/test/gtest/ucp/test_ucp_am.cc
index f0661aaa3..afbc582a1 100644
--- a/test/gtest/ucp/test_ucp_am.cc
+++ b/test/gtest/ucp/test_ucp_am.cc
@@ -1138,8 +1138,9 @@ public:

     void init()
     {
-        m_size               = ucs_max(UCS_KBYTE,
-                                       ucs::rand() % (64 * UCS_KBYTE));
+        //m_size               = ucs_max(UCS_KBYTE,
+          //                             ucs::rand() % (64 * UCS_KBYTE));
+        m_size = 411;
         std::string str_size = ucs::to_string(m_size);

         test_ucp_am_nbx::init();

root cause seems to be clear, working on a fix

brminich avatar Aug 23 '22 10:08 brminich

@brminich the following still fails with this PR, is it expected?

@yosefe fixed in #8472

brminich avatar Aug 23 '22 13:08 brminich

@yosefe, force-pushed just a single fix for am lane reusage by BW lanes

brminich avatar Sep 06 '22 06:09 brminich

@yosefe, updated in place, because PR is small enough

brminich avatar Sep 06 '22 07:09 brminich