UCP/WIREUP: Increase UCP_MAX_LANES to 64

Open ivankochin opened this issue 2 years ago • 12 comments

What

Increase the maximum number of lanes per endpoint (EP) to 64.

Why ?

To allow collecting information about all lanes on systems with many transports/devices.

ivankochin avatar Apr 11 '24 09:04 ivankochin

Would it make sense to have this parameter as a configure option?

edgargabriel avatar Apr 14 '24 19:04 edgargabriel

I think not. We should be able to find a way to set the upper limit to 64 without extra overhead when the actual number of lanes is small.
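One way to read "64 without extra overhead" is a per-endpoint lane bitmap that fits in a single 64-bit word, so storage stays constant and loops stop at the highest active lane. The sketch below is only an illustration of that idea; `lane_map_t`, `MAX_LANES` and the helper names are hypothetical and are not UCX identifiers.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LANES 64 /* hypothetical upper limit, not the UCX macro */

typedef uint64_t lane_map_t; /* one bit per lane: 64 lanes fit in one word */

/* Mark a lane as active. */
static lane_map_t lane_map_add(lane_map_t map, unsigned lane)
{
    assert(lane < MAX_LANES);
    return map | ((lane_map_t)1 << lane);
}

/* Count active lanes; the loop runs once per set bit. */
static unsigned lane_map_count(lane_map_t map)
{
    unsigned count = 0;

    while (map != 0) {
        map &= map - 1; /* clear the lowest set bit */
        ++count;
    }
    return count;
}

/* Sum a per-lane value over active lanes only. The loop ends as soon as no
 * higher lane is set, so a small, densely packed lane set costs a few
 * iterations even though the limit is 64. */
static uint64_t lane_map_sum(lane_map_t map, const uint64_t *values)
{
    uint64_t sum = 0;
    unsigned lane;

    for (lane = 0; map != 0; ++lane, map >>= 1) {
        if (map & 1) {
            sum += values[lane];
        }
    }
    return sum;
}
```

With lanes packed into the low bits, an endpoint with four lanes iterates four times in `lane_map_sum`, regardless of the compile-time limit.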

yosefe avatar Apr 15 '24 07:04 yosefe

test failure seems relevant (with ASAN):

2024-05-15T11:43:42.5740013Z [ RUN      ] rc/multi_rail_max.max_lanes/7 <rc/proto_v1>
2024-05-15T11:43:43.2815747Z [     INFO ] lane[0] : sender 115 receiver 17
2024-05-15T11:43:43.2817074Z [     INFO ] lane[1] : sender 0 receiver 4608
2024-05-15T11:43:43.2817776Z [     INFO ] lane[2] : sender 0 receiver 4608
2024-05-15T11:43:43.2818286Z [     INFO ] lane[3] : sender 0 receiver 4608
2024-05-15T11:43:43.2818622Z [     INFO ] lane[4] : sender 0 receiver 4608
2024-05-15T11:43:43.2818900Z [     INFO ] lane[5] : sender 0 receiver 4608
2024-05-15T11:43:43.2819178Z [     INFO ] lane[6] : sender 0 receiver 4608
2024-05-15T11:43:43.2819441Z [     INFO ] lane[7] : sender 0 receiver 4608
2024-05-15T11:43:43.2819712Z [     INFO ] lane[8] : sender 0 receiver 4608
2024-05-15T11:43:43.2819985Z [     INFO ] lane[9] : sender 0 receiver 4608
2024-05-15T11:43:43.2820261Z [     INFO ] lane[10] : sender 0 receiver 4608
2024-05-15T11:43:43.2820527Z [     INFO ] lane[11] : sender 0 receiver 4608
2024-05-15T11:43:43.2820803Z [     INFO ] lane[12] : sender 0 receiver 4608
2024-05-15T11:43:43.2821089Z [     INFO ] lane[13] : sender 0 receiver 4608
2024-05-15T11:43:43.2821364Z [     INFO ] lane[14] : sender 0 receiver 4608
2024-05-15T11:43:43.2821619Z [     INFO ] lane[15] : sender 0 receiver 4608
2024-05-15T11:43:43.2821891Z [     INFO ] lane[16] : sender 0 receiver 4608
2024-05-15T11:43:43.2822163Z [     INFO ] lane[17] : sender 0 receiver 4608
2024-05-15T11:43:43.2822434Z [     INFO ] lane[18] : sender 0 receiver 4608
2024-05-15T11:43:43.2822707Z [     INFO ] lane[19] : sender 0 receiver 4608
2024-05-15T11:43:43.2822967Z [     INFO ] lane[20] : sender 0 receiver 4608
2024-05-15T11:43:43.2823339Z [     INFO ] lane[21] : sender 0 receiver 4608
2024-05-15T11:43:43.2823692Z [     INFO ] lane[22] : sender 0 receiver 4608
2024-05-15T11:43:43.2823979Z [     INFO ] lane[23] : sender 0 receiver 4608
2024-05-15T11:43:43.2824239Z [     INFO ] lane[24] : sender 0 receiver 4608
2024-05-15T11:43:43.2824508Z [     INFO ] lane[25] : sender 0 receiver 4608
2024-05-15T11:43:43.2825385Z [     INFO ] lane[26] : sender 0 receiver 4608
2024-05-15T11:43:43.2825933Z [     INFO ] lane[27] : sender 0 receiver 4608
2024-05-15T11:43:43.2826546Z [     INFO ] lane[28] : sender 0 receiver 4608
2024-05-15T11:43:43.2827195Z [     INFO ] lane[29] : sender 0 receiver 4608
2024-05-15T11:43:43.2827726Z [     INFO ] lane[30] : sender 0 receiver 4608
2024-05-15T11:43:43.2828259Z [     INFO ] lane[31] : sender 0 receiver 4608
2024-05-15T11:43:43.2828766Z [     INFO ] lane[32] : sender 0 receiver 4608
2024-05-15T11:43:43.2829287Z [     INFO ] lane[33] : sender 0 receiver 4608
2024-05-15T11:43:43.2829817Z [     INFO ] lane[34] : sender 0 receiver 4608
2024-05-15T11:43:43.2830376Z [     INFO ] lane[35] : sender 0 receiver 4608
2024-05-15T11:43:43.2830911Z [     INFO ] lane[36] : sender 0 receiver 4608
2024-05-15T11:43:43.2831429Z [     INFO ] lane[37] : sender 0 receiver 4608
2024-05-15T11:43:43.2831966Z [     INFO ] lane[38] : sender 0 receiver 4608
2024-05-15T11:43:43.2832500Z [     INFO ] lane[39] : sender 0 receiver 4608
2024-05-15T11:43:43.2833047Z [     INFO ] lane[40] : sender 0 receiver 4608
2024-05-15T11:43:43.2833579Z [     INFO ] lane[41] : sender 0 receiver 4608
2024-05-15T11:43:43.2834116Z [     INFO ] lane[42] : sender 0 receiver 4608
2024-05-15T11:43:43.2834656Z [     INFO ] lane[43] : sender 0 receiver 4608
2024-05-15T11:43:43.2835244Z [     INFO ] lane[44] : sender 0 receiver 4608
2024-05-15T11:43:43.2835780Z [     INFO ] lane[45] : sender 0 receiver 4608
2024-05-15T11:43:43.2836315Z [     INFO ] lane[46] : sender 0 receiver 4608
2024-05-15T11:43:43.2836852Z [     INFO ] lane[47] : sender 0 receiver 4608
2024-05-15T11:43:43.2837385Z [     INFO ] lane[48] : sender 0 receiver 4608
2024-05-15T11:43:43.2837919Z [     INFO ] lane[49] : sender 0 receiver 4608
2024-05-15T11:43:43.2838443Z [     INFO ] lane[50] : sender 0 receiver 4608
2024-05-15T11:43:43.2838981Z [     INFO ] lane[51] : sender 0 receiver 4608
2024-05-15T11:43:43.2839630Z [     INFO ] lane[52] : sender 0 receiver 4608
2024-05-15T11:43:43.2840165Z [     INFO ] lane[53] : sender 0 receiver 4608
2024-05-15T11:43:43.2840683Z [     INFO ] lane[54] : sender 0 receiver 4608
2024-05-15T11:43:43.2841231Z [     INFO ] lane[55] : sender 0 receiver 4608
2024-05-15T11:43:43.2841762Z [     INFO ] lane[56] : sender 0 receiver 4608
2024-05-15T11:43:43.2842291Z [     INFO ] lane[57] : sender 0 receiver 4096
2024-05-15T11:43:43.2842807Z [     INFO ] lane[58] : sender 0 receiver 0
2024-05-15T11:43:43.2843428Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1265: Failure
2024-05-15T11:43:43.2844103Z Expected: (sender_tx + receiver_tx) >= (chunk_size), actual: 0 vs 4096
2024-05-15T11:43:43.2844668Z [     INFO ] lane[59] : sender 0 receiver 0
2024-05-15T11:43:43.2845286Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1265: Failure
2024-05-15T11:43:43.2845874Z Expected: (sender_tx + receiver_tx) >= (chunk_size), actual: 0 vs 4096
2024-05-15T11:43:43.2846440Z [     INFO ] lane[60] : sender 0 receiver 0
2024-05-15T11:43:43.2847009Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1265: Failure
2024-05-15T11:43:43.2847603Z Expected: (sender_tx + receiver_tx) >= (chunk_size), actual: 0 vs 4096
2024-05-15T11:43:43.2848146Z [     INFO ] lane[61] : sender 0 receiver 0
2024-05-15T11:43:43.2848716Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1265: Failure
2024-05-15T11:43:43.2849306Z Expected: (sender_tx + receiver_tx) >= (chunk_size), actual: 0 vs 4096
2024-05-15T11:43:43.2849864Z [     INFO ] lane[62] : sender 0 receiver 0
2024-05-15T11:43:43.2850427Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1265: Failure
2024-05-15T11:43:43.2851004Z Expected: (sender_tx + receiver_tx) >= (chunk_size), actual: 0 vs 4096
2024-05-15T11:43:43.2851566Z [     INFO ] lane[63] : sender 0 receiver 0
2024-05-15T11:43:43.2852131Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1263: Failure
2024-05-15T11:43:43.2852709Z Expected: (sender_tx + receiver_tx) > (0), actual: 0 vs 0
2024-05-15T11:43:43.6603087Z [  FAILED  ] rc/multi_rail_max.max_lanes/7, where GetParam() = rc/proto_v1 (1096 ms)
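The two expectations printed in the log can be restated as a small per-lane check. This is a sketch reconstructed from the log messages only, not the code in `test_ucp_tag_xfer.cc`: the struct and function names are hypothetical, and skipping lane 0 is an assumption based on the log reporting no failure for it despite its small counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-lane counters, mirroring "sender N receiver M" in the log. */
typedef struct {
    uint64_t sender_tx;
    uint64_t receiver_tx;
} lane_tx_t;

/* Count lanes that violate the expectations shown in the log:
 *   - last lane:   (sender_tx + receiver_tx) > 0
 *   - other lanes: (sender_tx + receiver_tx) >= chunk_size
 * Lane 0 is skipped (assumption, see the text above). */
static size_t count_failed_lanes(const lane_tx_t *lanes, size_t num_lanes,
                                 uint64_t chunk_size)
{
    size_t failed = 0;
    size_t lane;

    assert((lanes != NULL) || (num_lanes == 0));

    for (lane = 1; lane < num_lanes; ++lane) {
        uint64_t total = lanes[lane].sender_tx + lanes[lane].receiver_tx;

        if (lane + 1 == num_lanes) {
            failed += (total == 0);
        } else {
            failed += (total < chunk_size);
        }
    }
    return failed;
}
```

Fed the 64 counter pairs from the log with `chunk_size` 4096, this returns 6: lanes 58 to 62 fall below the chunk size and lane 63 carried no traffic, matching the six reported failures.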

yosefe avatar May 15 '24 14:05 yosefe

@yosefe It seems related, but I tried to reproduce it 100 times and the test always passed. I also reran the CI and it passed too. That's very strange, but I think it could be a configuration issue. What do you think: can we merge it now, or should we keep trying to reproduce it?

ivankochin avatar May 16 '24 10:05 ivankochin

One weird thing I see here is that we expect 64 lanes in the protov1 test. Maybe that's wrong?

yosefe avatar May 16 '24 10:05 yosefe

We set MAX_RNDV_LANES=64 in that test case, so if I understand the logic correctly, it is OK to expect 64 lanes.

ivankochin avatar May 16 '24 11:05 ivankochin

I think we should limit protov1 tests to 16 lanes, since there may be places in the protov1 flows that we are not updating to support more lanes.

yosefe avatar May 16 '24 11:05 yosefe

/azp run

yosefe avatar May 17 '24 08:05 yosefe

Azure Pipelines successfully started running 4 pipeline(s).

azure-pipelines[bot] avatar May 17 '24 08:05 azure-pipelines[bot]

/azp run

yosefe avatar May 19 '24 05:05 yosefe

Azure Pipelines successfully started running 4 pipeline(s).

azure-pipelines[bot] avatar May 19 '24 05:05 azure-pipelines[bot]

we need to allow 64 lanes only starting from v1.18, to preserve wire compatibility

This can be done by adding a UCX_MAX_LANES control with an auto default value, which would also let users configure the lane limit.
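One possible reading of an "auto" default is sketched below: the effective limit is derived from the peer's version, and an explicit user value can only lower it. Everything here is an assumption for illustration, including the sentinel `MAX_LANES_AUTO`, the legacy limit of 16, the idea that the peer version is what gates the limit, and the clamping rule. It is not how UCX implements UCX_MAX_LANES.

```c
#include <assert.h>

#define LEGACY_MAX_LANES 16u /* assumed limit before this change */
#define NEW_MAX_LANES    64u /* limit proposed in this PR */
#define MAX_LANES_AUTO   0u  /* hypothetical sentinel meaning "auto" */

/* Hypothetical peer version as exchanged during wireup. */
typedef struct {
    unsigned major;
    unsigned minor;
} peer_version_t;

/* Peers at v1.18 or newer are assumed to understand 64 lanes on the wire. */
static int peer_supports_64_lanes(peer_version_t v)
{
    return (v.major > 1) || ((v.major == 1) && (v.minor >= 18));
}

/* Resolve the lane limit for one endpoint. "auto" takes the wire limit;
 * an explicit value is clamped so it never exceeds what the peer accepts. */
static unsigned effective_max_lanes(unsigned configured, peer_version_t peer)
{
    unsigned wire_limit = peer_supports_64_lanes(peer) ? NEW_MAX_LANES
                                                       : LEGACY_MAX_LANES;

    if (configured == MAX_LANES_AUTO) {
        return wire_limit;
    }
    return (configured < wire_limit) ? configured : wire_limit;
}
```

Clamping instead of rejecting an oversized explicit value is a design choice of this sketch; a real implementation might prefer to fail configuration loudly.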

ivankochin avatar May 21 '24 08:05 ivankochin

The failure seems relevant and is caused by changing the message size in the test. Some tests set RNDV_THRESH to a specific value and expect the message to be transferred by the eager protocol: https://github.com/openucx/ucx/blob/e8c7a6cac155bc801f5ae9e7adbe879f9d07158c/test/gtest/ucp/test_ucp_tag_xfer.cc#L1142

https://dev.azure.com/ucfconsort/0b36e3f0-8ab9-4a48-b68b-4b2350e02c88/_apis/build/builds/81138/logs/446

2024-05-24T13:39:43.3371935Z /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1086: Failure
2024-05-24T13:39:43.3373318Z Expected equality of these values:
2024-05-24T13:39:43.3373921Z   1ul
2024-05-24T13:39:43.3374364Z     Which is: 1
2024-05-24T13:39:43.3374768Z   cnt
2024-05-24T13:39:43.3375165Z     Which is: 0
2024-05-24T13:39:43.3375630Z TX counter
2024-05-24T13:39:43.3376237Z /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1088: Failure
2024-05-24T13:39:43.3376785Z Expected equality of these values:
2024-05-24T13:39:43.3377197Z   1ul
2024-05-24T13:39:43.3377596Z     Which is: 1
2024-05-24T13:39:43.3377968Z   cnt
2024-05-24T13:39:43.3378337Z     Which is: 0
2024-05-24T13:39:43.3378758Z RX counter
2024-05-24T13:39:43.3462848Z [  FAILED  ] tcp/test_ucp_tag_stats.eager_expected/0, where GetParam() = tcp (44 ms)
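The dependency between message size and the eager counter can be sketched as follows. This is a simplified model, not UCX's protocol selection: the names are hypothetical, and treating the threshold as inclusive (size >= threshold selects rendezvous) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    PROTO_EAGER,
    PROTO_RNDV
} proto_t;

/* Simplified selection: messages at or above the threshold go rendezvous. */
static proto_t select_proto(size_t msg_size, size_t rndv_thresh)
{
    return (msg_size >= rndv_thresh) ? PROTO_RNDV : PROTO_EAGER;
}

/* What an eager-only statistics counter would read after one send:
 * 1 if the message went eager, 0 if it was routed to rendezvous. */
static unsigned long eager_tx_count(size_t msg_size, size_t rndv_thresh)
{
    return (select_proto(msg_size, rndv_thresh) == PROTO_EAGER) ? 1ul : 0ul;
}
```

Under this model, a test that fixes the threshold and expects a count of 1 starts reading 0 as soon as the message size is raised to the threshold or beyond, which is the shape of the `cnt` mismatch in the log.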

yosefe avatar May 26 '24 08:05 yosefe

@ivankochin please check the code style failure: https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=81183&view=logs&j=cc064a77-22b5-56bf-ecc0-70b5fe764261&t=aedfd754-44a6-53e6-6843-8659d800fee2

yosefe avatar May 27 '24 07:05 yosefe

Done.

ivankochin avatar May 27 '24 11:05 ivankochin

@yosefe can I squash?

ivankochin avatar May 27 '24 12:05 ivankochin

@ivankochin @yosefe Did the changes from this PR go into v1.17.x?

Akshay-Venkatesh avatar Jul 02 '24 00:07 Akshay-Venkatesh