UCP/WIREUP: Increase UCP_MAX_LANES to 64
What
Increase the maximum number of lanes per EP to 64.
Why?
To allow collecting information about all lanes on systems with many transports/devices.
Would it make sense to have this parameter as a configure option?
I think not. We should be able to find a way to set the upper limit to 64 without extra overhead when the actual number of lanes is small.
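As an illustration of that idea, here is a minimal sketch in C of one way a 64-lane upper limit could stay cheap for small lane counts: keep the active lanes in a 64-bit map and visit only the bits that are set. All names (`SKETCH_MAX_LANES`, `sketch_lane_count`, `sketch_next_lane`) are hypothetical and are not taken from the UCX sources.

```c
#include <stdint.h>

/* Hypothetical upper limit; one bit per lane fits in a uint64_t. */
#define SKETCH_MAX_LANES 64

/* Count the lanes in use. The loop runs once per set bit, so the cost
 * follows the actual number of lanes, not SKETCH_MAX_LANES. */
static unsigned sketch_lane_count(uint64_t lane_map)
{
    unsigned count = 0;

    while (lane_map != 0) {
        lane_map &= lane_map - 1; /* clear the lowest set bit */
        ++count;
    }
    return count;
}

/* Return the index of the lowest active lane and remove it from the map.
 * The caller must pass a non-zero map. */
static unsigned sketch_next_lane(uint64_t *lane_map)
{
    uint64_t map = *lane_map;
    unsigned idx = 0;

    while ((map & 1) == 0) {
        map >>= 1;
        ++idx;
    }
    *lane_map &= *lane_map - 1;
    return idx;
}
```

With this layout, an endpoint that uses two lanes pays for two iterations, regardless of whether the compile-time limit is 16 or 64.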
test failure seems relevant (with ASAN):
2024-05-15T11:43:42.5740013Z [ RUN ] rc/multi_rail_max.max_lanes/7 <rc/proto_v1>
2024-05-15T11:43:43.2815747Z [ INFO ] lane[0] : sender 115 receiver 17
2024-05-15T11:43:43.2817074Z [ INFO ] lane[1] : sender 0 receiver 4608
2024-05-15T11:43:43.2817776Z [ INFO ] lane[2] : sender 0 receiver 4608
2024-05-15T11:43:43.2818286Z [ INFO ] lane[3] : sender 0 receiver 4608
2024-05-15T11:43:43.2818622Z [ INFO ] lane[4] : sender 0 receiver 4608
2024-05-15T11:43:43.2818900Z [ INFO ] lane[5] : sender 0 receiver 4608
2024-05-15T11:43:43.2819178Z [ INFO ] lane[6] : sender 0 receiver 4608
2024-05-15T11:43:43.2819441Z [ INFO ] lane[7] : sender 0 receiver 4608
2024-05-15T11:43:43.2819712Z [ INFO ] lane[8] : sender 0 receiver 4608
2024-05-15T11:43:43.2819985Z [ INFO ] lane[9] : sender 0 receiver 4608
2024-05-15T11:43:43.2820261Z [ INFO ] lane[10] : sender 0 receiver 4608
2024-05-15T11:43:43.2820527Z [ INFO ] lane[11] : sender 0 receiver 4608
2024-05-15T11:43:43.2820803Z [ INFO ] lane[12] : sender 0 receiver 4608
2024-05-15T11:43:43.2821089Z [ INFO ] lane[13] : sender 0 receiver 4608
2024-05-15T11:43:43.2821364Z [ INFO ] lane[14] : sender 0 receiver 4608
2024-05-15T11:43:43.2821619Z [ INFO ] lane[15] : sender 0 receiver 4608
2024-05-15T11:43:43.2821891Z [ INFO ] lane[16] : sender 0 receiver 4608
2024-05-15T11:43:43.2822163Z [ INFO ] lane[17] : sender 0 receiver 4608
2024-05-15T11:43:43.2822434Z [ INFO ] lane[18] : sender 0 receiver 4608
2024-05-15T11:43:43.2822707Z [ INFO ] lane[19] : sender 0 receiver 4608
2024-05-15T11:43:43.2822967Z [ INFO ] lane[20] : sender 0 receiver 4608
2024-05-15T11:43:43.2823339Z [ INFO ] lane[21] : sender 0 receiver 4608
2024-05-15T11:43:43.2823692Z [ INFO ] lane[22] : sender 0 receiver 4608
2024-05-15T11:43:43.2823979Z [ INFO ] lane[23] : sender 0 receiver 4608
2024-05-15T11:43:43.2824239Z [ INFO ] lane[24] : sender 0 receiver 4608
2024-05-15T11:43:43.2824508Z [ INFO ] lane[25] : sender 0 receiver 4608
2024-05-15T11:43:43.2825385Z [ INFO ] lane[26] : sender 0 receiver 4608
2024-05-15T11:43:43.2825933Z [ INFO ] lane[27] : sender 0 receiver 4608
2024-05-15T11:43:43.2826546Z [ INFO ] lane[28] : sender 0 receiver 4608
2024-05-15T11:43:43.2827195Z [ INFO ] lane[29] : sender 0 receiver 4608
2024-05-15T11:43:43.2827726Z [ INFO ] lane[30] : sender 0 receiver 4608
2024-05-15T11:43:43.2828259Z [ INFO ] lane[31] : sender 0 receiver 4608
2024-05-15T11:43:43.2828766Z [ INFO ] lane[32] : sender 0 receiver 4608
2024-05-15T11:43:43.2829287Z [ INFO ] lane[33] : sender 0 receiver 4608
2024-05-15T11:43:43.2829817Z [ INFO ] lane[34] : sender 0 receiver 4608
2024-05-15T11:43:43.2830376Z [ INFO ] lane[35] : sender 0 receiver 4608
2024-05-15T11:43:43.2830911Z [ INFO ] lane[36] : sender 0 receiver 4608
2024-05-15T11:43:43.2831429Z [ INFO ] lane[37] : sender 0 receiver 4608
2024-05-15T11:43:43.2831966Z [ INFO ] lane[38] : sender 0 receiver 4608
2024-05-15T11:43:43.2832500Z [ INFO ] lane[39] : sender 0 receiver 4608
2024-05-15T11:43:43.2833047Z [ INFO ] lane[40] : sender 0 receiver 4608
2024-05-15T11:43:43.2833579Z [ INFO ] lane[41] : sender 0 receiver 4608
2024-05-15T11:43:43.2834116Z [ INFO ] lane[42] : sender 0 receiver 4608
2024-05-15T11:43:43.2834656Z [ INFO ] lane[43] : sender 0 receiver 4608
2024-05-15T11:43:43.2835244Z [ INFO ] lane[44] : sender 0 receiver 4608
2024-05-15T11:43:43.2835780Z [ INFO ] lane[45] : sender 0 receiver 4608
2024-05-15T11:43:43.2836315Z [ INFO ] lane[46] : sender 0 receiver 4608
2024-05-15T11:43:43.2836852Z [ INFO ] lane[47] : sender 0 receiver 4608
2024-05-15T11:43:43.2837385Z [ INFO ] lane[48] : sender 0 receiver 4608
2024-05-15T11:43:43.2837919Z [ INFO ] lane[49] : sender 0 receiver 4608
2024-05-15T11:43:43.2838443Z [ INFO ] lane[50] : sender 0 receiver 4608
2024-05-15T11:43:43.2838981Z [ INFO ] lane[51] : sender 0 receiver 4608
2024-05-15T11:43:43.2839630Z [ INFO ] lane[52] : sender 0 receiver 4608
2024-05-15T11:43:43.2840165Z [ INFO ] lane[53] : sender 0 receiver 4608
2024-05-15T11:43:43.2840683Z [ INFO ] lane[54] : sender 0 receiver 4608
2024-05-15T11:43:43.2841231Z [ INFO ] lane[55] : sender 0 receiver 4608
2024-05-15T11:43:43.2841762Z [ INFO ] lane[56] : sender 0 receiver 4608
2024-05-15T11:43:43.2842291Z [ INFO ] lane[57] : sender 0 receiver 4096
2024-05-15T11:43:43.2842807Z [ INFO ] lane[58] : sender 0 receiver 0
2024-05-15T11:43:43.2843428Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1265: Failure
2024-05-15T11:43:43.2844103Z Expected: (sender_tx + receiver_tx) >= (chunk_size), actual: 0 vs 4096
2024-05-15T11:43:43.2844668Z [ INFO ] lane[59] : sender 0 receiver 0
2024-05-15T11:43:43.2845286Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1265: Failure
2024-05-15T11:43:43.2845874Z Expected: (sender_tx + receiver_tx) >= (chunk_size), actual: 0 vs 4096
2024-05-15T11:43:43.2846440Z [ INFO ] lane[60] : sender 0 receiver 0
2024-05-15T11:43:43.2847009Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1265: Failure
2024-05-15T11:43:43.2847603Z Expected: (sender_tx + receiver_tx) >= (chunk_size), actual: 0 vs 4096
2024-05-15T11:43:43.2848146Z [ INFO ] lane[61] : sender 0 receiver 0
2024-05-15T11:43:43.2848716Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1265: Failure
2024-05-15T11:43:43.2849306Z Expected: (sender_tx + receiver_tx) >= (chunk_size), actual: 0 vs 4096
2024-05-15T11:43:43.2849864Z [ INFO ] lane[62] : sender 0 receiver 0
2024-05-15T11:43:43.2850427Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1265: Failure
2024-05-15T11:43:43.2851004Z Expected: (sender_tx + receiver_tx) >= (chunk_size), actual: 0 vs 4096
2024-05-15T11:43:43.2851566Z [ INFO ] lane[63] : sender 0 receiver 0
2024-05-15T11:43:43.2852131Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1263: Failure
2024-05-15T11:43:43.2852709Z Expected: (sender_tx + receiver_tx) > (0), actual: 0 vs 0
2024-05-15T11:43:43.6603087Z [ FAILED ] rc/multi_rail_max.max_lanes/7, where GetParam() = rc/proto_v1 (1096 ms)
@yosefe It seems related, but I tried to reproduce it 100 times and the test always passes. I also reran the CI and it passed too. That's very strange, but I think it may be a configuration issue. What do you think: can we merge it now, or should we keep trying to reproduce it?
One weird thing I see here is that we expect 64 lanes with the protov1 test. Maybe that's wrong?
We set MAX_RNDV_LANES=64 in that test case, so if I understand the logic correctly, it is OK to expect 64 lanes.
I think we should limit protov1 tests to 16 lanes, since there may be places in protov1 flows that we have not updated to support more lanes.
/azp run
Azure Pipelines successfully started running 4 pipeline(s).
/azp run
Azure Pipelines successfully started running 4 pipeline(s).
We need to allow 64 lanes starting from v1.18 only, to preserve wire compatibility.
This can be done by adding a UCX_MAX_LANES control with an auto default value, which would also allow users to configure the lane limit.
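A rough sketch in C of how such an "auto" setting might be resolved, offered only as an illustration. It assumes that the limit depends on the peer's version, that the pre-1.18 limit is passed in as `legacy_limit`, and that invalid values fall back to the automatic choice. The function name, its parameters, and the parsing rules are all hypothetical and do not describe the actual UCX configuration code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical compile-time upper limit. */
#define SKETCH_MAX_LANES 64u

/* Resolve a lane limit from a config string ("auto" or a number).
 * Assumption: peers older than v1.18 only understand legacy_limit lanes,
 * so the result never exceeds what the peer version allows. */
static unsigned sketch_resolve_max_lanes(const char *value,
                                         unsigned peer_major,
                                         unsigned peer_minor,
                                         unsigned legacy_limit)
{
    unsigned version_limit;
    unsigned long requested;
    char *end;

    if ((peer_major > 1) || ((peer_major == 1) && (peer_minor >= 18))) {
        version_limit = SKETCH_MAX_LANES;
    } else {
        version_limit = legacy_limit;
    }

    if ((value == NULL) || (strcmp(value, "auto") == 0)) {
        return version_limit;
    }

    requested = strtoul(value, &end, 10);
    if ((end == value) || (*end != '\0') || (requested == 0)) {
        return version_limit; /* unparsable or zero: fall back to auto */
    }

    /* An explicit value can lower the limit but never raise it. */
    return (requested < version_limit) ? (unsigned)requested : version_limit;
}
```

Under these assumptions, `sketch_resolve_max_lanes("auto", 1, 17, 16)` yields 16, while the same call against a 1.18 peer yields 64.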
The failure seems relevant and is caused by changing the message size in the test. Some tests set RNDV_THRESH to a specific value and expect the message to be transferred by the eager protocol: https://github.com/openucx/ucx/blob/e8c7a6cac155bc801f5ae9e7adbe879f9d07158c/test/gtest/ucp/test_ucp_tag_xfer.cc#L1142
https://dev.azure.com/ucfconsort/0b36e3f0-8ab9-4a48-b68b-4b2350e02c88/_apis/build/builds/81138/logs/446
2024-05-24T13:39:43.3371935Z /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1086: Failure
2024-05-24T13:39:43.3373318Z Expected equality of these values:
2024-05-24T13:39:43.3373921Z 1ul
2024-05-24T13:39:43.3374364Z Which is: 1
2024-05-24T13:39:43.3374768Z cnt
2024-05-24T13:39:43.3375165Z Which is: 0
2024-05-24T13:39:43.3375630Z TX counter
2024-05-24T13:39:43.3376237Z /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1088: Failure
2024-05-24T13:39:43.3376785Z Expected equality of these values:
2024-05-24T13:39:43.3377197Z 1ul
2024-05-24T13:39:43.3377596Z Which is: 1
2024-05-24T13:39:43.3377968Z cnt
2024-05-24T13:39:43.3378337Z Which is: 0
2024-05-24T13:39:43.3378758Z RX counter
2024-05-24T13:39:43.3462848Z [ FAILED ] tcp/test_ucp_tag_stats.eager_expected/0, where GetParam() = tcp (44 ms)
@ivankochin pls check https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=81183&view=logs&j=cc064a77-22b5-56bf-ecc0-70b5fe764261&t=aedfd754-44a6-53e6-6843-8659d800fee2 - code style
Done.
@yosefe can I squash?
@ivankochin @yosefe Did the changes from this PR go into v1.17.x?