add DC config to set lag_tx_port_affinity according to lag mode info
Signed-off-by: Changcheng Liu [email protected]
What
Check the lag mode info to decide whether to set lag_tx_port_affinity.
Add a configuration option to force setting, force not setting, or auto-detect whether to configure the DCI lag_tx_port_affinity.
Why
In hash mode under non-switchdev, it is better not to set lag_tx_port_affinity, to avoid the overhead of the explicit port flow table.
How
Define a new IFC bit field, uct_ib_mlx5_lag_context_bits::port_select_mode, to query the lag mode info.
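UCX's mlx5 IFC header describes PRM structures as arrays of named bit ranges, following the kernel mlx5_ifc.h convention (each uint8_t array element stands for one bit). A minimal sketch of how such a field could be declared is below; the offsets, widths, and mode values are illustrative placeholders, not the real LAG context layout.

```c
#include <stdint.h>

/*
 * Illustrative sketch only: mlx5_ifc-style declaration where each array
 * element represents one bit. The offsets/widths below are placeholders
 * and do not reflect the actual PRM lag context layout.
 */
struct uct_ib_mlx5_lag_context_bits {
    uint8_t reserved_at_0[0x1d];   /* bits before the mode field (placeholder) */
    uint8_t port_select_mode[0x3]; /* e.g. 0 = queue_affinity, 1 = hash (illustrative values) */
    uint8_t reserved_at_20[0x60];  /* remainder of the context (placeholder) */
};
```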
For a lag device (a sketch of the resulting decision logic follows this list):
- When UCX_DC_MLX5_LAG_PORT_SELECT=auto (default):
  - In hash mode under non-switchdev, this PR does not set the DCI lag_tx_port_affinity.
    - If the lag device supports bypassing the port-select flow table, it is better not to set lag_tx_port_affinity, to avoid the explicit port flow table overhead.
    - If the lag device does not support bypassing the port-select flow table, setting lag_tx_port_affinity is not effective anyway.
  - In queue_affinity mode, this PR sets the lag_tx_port_affinity.
- When UCX_DC_MLX5_LAG_PORT_SELECT=affinity:
  - In hash mode under non-switchdev, this PR does set the DCI lag_tx_port_affinity.
    - If the lag device supports bypassing the port-select flow table, setting the affinity is effective.
    - If the lag device does not support bypassing the port-select flow table, setting lag_tx_port_affinity is not effective, but this does no harm.
  - In queue_affinity mode, setting the affinity is effective.
- When UCX_DC_MLX5_LAG_PORT_SELECT=hash:
  - In hash mode under non-switchdev, this PR does not set the DCI lag_tx_port_affinity.
    - If the lag device supports bypassing the port-select flow table, this is the expected configuration, avoiding the load on the explicit flow table.
    - If the lag device does not support bypassing the port-select flow table, there is no extra effect, because setting lag_tx_port_affinity would not be effective in this case anyway.
  - In queue_affinity mode, since lag_tx_port_affinity is not set explicitly, the firmware assigns ports in a round-robin manner.
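To make the three option values concrete, here is a minimal, self-contained C sketch of the decision above. The enum and function names are hypothetical, and the switchdev/bypass sub-cases are folded into the final outcome; this is not the actual UCX implementation.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only */
typedef enum {
    LAG_PORT_SELECT_AUTO,       /* UCX_DC_MLX5_LAG_PORT_SELECT=auto (default) */
    LAG_PORT_SELECT_AFFINITY,   /* UCX_DC_MLX5_LAG_PORT_SELECT=affinity       */
    LAG_PORT_SELECT_HASH        /* UCX_DC_MLX5_LAG_PORT_SELECT=hash           */
} lag_port_select_t;

typedef enum {
    LAG_MODE_QUEUE_AFFINITY,    /* firmware steers by per-QP affinity */
    LAG_MODE_HASH               /* firmware steers by packet hash     */
} lag_mode_t;

/* Returns true when the DCI should be created with lag_tx_port_affinity set */
static bool dci_should_set_port_affinity(lag_port_select_t cfg, lag_mode_t mode)
{
    switch (cfg) {
    case LAG_PORT_SELECT_AFFINITY:
        return true;                          /* force-set; harmless if ineffective */
    case LAG_PORT_SELECT_HASH:
        return false;                         /* never set; rely on hash steering   */
    case LAG_PORT_SELECT_AUTO:
    default:
        /* auto: skip the affinity in hash mode, set it in queue_affinity mode */
        return mode == LAG_MODE_QUEUE_AFFINITY;
    }
}
```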
In the UCX PR CI (io_demo test, tag match on CX4), the log shows a failure that is related to this PR:
2022-07-14T06:22:22.7512823Z swx-rain01: [1657779742.748449] [DEMO] read 1049.05 MBs min:8372(2.1.5.2:53090) max:8372 total:8372 | write 1058.05 MBs min:8382(2.1.5.2:53090) max:8382 total:8382 | active: 1/1, buffers:16
2022-07-14T06:22:24.7503984Z swx-rain01: [1657779744.748456] [DEMO] read 0 MBs min:0(2.1.5.2:53090) max:0 total:0 | write 0 MBs min:0(2.1.5.2:53090) max:0 total:0 | active: 1/1, buffers:16
2022-07-14T06:22:26.7502751Z swx-rain01: [1657779746.748470] [DEMO] read 0 MBs min:0(2.1.5.2:53090) max:0 total:0 | write 0 MBs min:0(2.1.5.2:53090) max:0 total:0 | active: 1/1, buffers:16
2022-07-14T06:22:28.7502709Z swx-rain01: [1657779748.748483] [DEMO] read 0 MBs min:0(2.1.5.2:53090) max:0 total:0 | write 0 MBs min:0(2.1.5.2:53090) max:0 total:0 | active: 1/1, buffers:16
2022-07-14T06:22:30.7512711Z swx-rain01: [1657779750.748492] [DEMO] read 0 MBs min:0(2.1.5.2:53090) max:0 total:0 | write 0 MBs min:0(2.1.5.2:53090) max:0 total:0 | active: 1/1, buffers:16
2022-07-14T06:22:32.7502750Z swx-rain01: [1657779752.748505] [DEMO] read 0 MBs min:0(2.1.5.2:53090) max:0 total:0 | write 0 MBs min:0(2.1.5.2:53090) max:0 total:0 | active: 1/1, buffers:16
2022-07-14T06:22:33.4442553Z swx-rain01: [1657779753.440905] [UCX-connection 0xe3e0c0: #2 2.1.5.2:53090] detected error: Endpoint timeout
2022-07-14T06:22:33.4472611Z swx-rain01: [1657779753.440932] [UCX] removed [UCX-connection 0xe3e0c0: #2 2.1.5.2:53090] from connection map
2022-07-14T06:22:33.4473964Z swx-rain01: [1657779753.441203] [swx-rain01:83740:0] ib_mlx5_log.c:177 UCX DIAG Transport retry count exceeded on mlx5_bond_0:1/RoCE (synd 0x15 vend 0x81 hw_synd 0/0)
2022-07-14T06:22:33.4475678Z swx-rain01: [1657779753.441203] [swx-rain01:83740:0] ib_mlx5_log.c:177 UCX DIAG RC QP 0x280e wqe[22589]: RDMA_READ s-- [rva 0x7f6341062000 rkey 0x16e1c0] [va 0x7fbacaf6e000 len 213504 lkey 0xf6f4e] [rqpn 0x7488 dlid=0 sl=0 port=1 src_path_bits=0 dgid=::ffff:2.1.5.2 sgid_index=3 traffic_class=0]
2022-07-14T06:22:33.4477063Z swx-rain01: [1657779753.442664] [DEMO] disconnecting connection [UCX-connection 0xe3e0c0: #2 2.1.5.2:53090] with status Endpoint timeout
@yosefe Please help review this PR.
@yosefe Please help review this PR when you're available.
@yosefe
Is it relevant for RC QP as well? Or do we have to set the tx port affinity for it anyway?
In lag hash mode (which will be the default lag mode in both OFED and upstream Linux), it also affects RC QP performance when the RC QP is configured with a port affinity. However, I think this cannot be avoided.
An RC QP is a one-to-one reliable connection. For an HCA with multiple ports, a ucp_ep has multiple lanes in lag mode. If no port affinity is set on the RC QPs, the RC QPs connected between two hosts may always use the same port, which is not what we want for messages that are split into multiple segments in order to use different lanes.
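To illustrate why the port affinity stays on RC QPs: with multiple lanes per ucp_ep, spreading consecutive lanes across the physical ports keeps the segments of a spliced message on different ports. A minimal sketch with a hypothetical round-robin helper (not the actual UCX code; it assumes the common mlx5 convention that affinity values are 1-based and 0 means "let the firmware decide"):

```c
#include <stdint.h>

/* Hypothetical helper: lane i egresses on port (i % num_lag_ports) + 1, so the
 * two RC QP lanes of a dual-port lag device are pinned to different ports. */
static uint8_t rc_lane_tx_port_affinity(unsigned lane_index, unsigned num_lag_ports)
{
    if (num_lag_ports == 0) {
        return 0; /* 0 = leave unset, firmware decides */
    }
    return (uint8_t)((lane_index % num_lag_ports) + 1);
}
```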
Without this PR, performance is affected a little in lag hash mode. After applying this PR, for DC transports:
- RNDV protocol: without setting the DCI lag_tx_port_affinity in lag hash mode, the firmware avoids checking an extra steering domain (one hop less), so performance may improve a little.
- EAGER protocol: a message that needs to be split into multiple segments may use the same port to send/receive data between two hosts, because hash mode steers the traffic to the same port when the DCIs are not configured with different port affinities. So performance may degrade.
I'm addressing all the review comments and will push the updates after they pass the CI checks.
The CI errors are not related to this PR:
2022-10-25T15:51:21.6848238Z ##[warning]Module dev/go-latest cannot be loaded
2022-10-25T15:51:21.6863122Z ##[error]Bash exited with code '1'.
2022-10-25T17:31:38.6064772Z [----------] 9 tests from tcp/test_ucp_sockaddr_protocols
2022-10-25T17:31:38.6066046Z [ RUN ] tcp/test_ucp_sockaddr_protocols.tag_zcopy_64k_unexp/0 <tcp,cuda_copy,rocm_copy/mt>
2022-10-25T17:31:39.2793128Z [ INFO ] server listening on 10.224.36.92:36625
2022-10-25T17:31:39.6394090Z [ INFO ] ignoring error Connection reset by remote peer on endpoint 0xfa33140
2022-10-25T17:31:39.8993134Z [ OK ] tcp/test_ucp_sockaddr_protocols.tag_zcopy_64k_unexp/0 (1295 ms)
2022-10-25T17:31:39.8994289Z [ RUN ] tcp/test_ucp_sockaddr_protocols.am_rndv_64k_recv_prereg_single_rndv_put_zcopy_lane/0 <tcp,cuda_copy,rocm_copy/mt>
2022-10-25T17:31:40.6027321Z [ INFO ] server listening on 12.10.44.11:52368
2022-10-25T17:31:45.8785997Z ==26453== Invalid read of size 8
2022-10-25T17:31:45.8787666Z ==26453== at 0x52C7921: uct_md_mem_query (uct_md.c:624)
2022-10-25T17:31:45.8788782Z ==26453== by 0x574DEB1: ucp_memory_detect_slowpath (ucp_context.c:2189)
2022-10-25T17:31:45.8790021Z ==26453== by 0x575BB6B: ucp_memory_detect_internal (ucp_context.h:612)
2022-10-25T17:31:45.8791186Z ==26453== by 0x575BB6B: ucp_memory_detect (ucp_context.h:629)
2022-10-25T17:31:45.8792471Z ==26453== by 0x575BB6B: ucp_request_get_memory_type (ucp_request.inl:986)
2022-10-25T17:31:45.8793672Z ==26453== by 0x575BB6B: ucp_am_send_req_init (ucp_am.c:818)
2022-10-25T17:31:45.8794698Z ==26453== by 0x575BB6B: ucp_am_send_nbx_inner (ucp_am.c:1040)
2022-10-25T17:31:45.8795733Z ==26453== by 0x575BB6B: ucp_am_send_nbx (ucp_am.c:952)
2022-10-25T17:31:45.8797011Z ==26453== by 0xA2F4C4: test_ucp_sockaddr_protocols::test_am_send_recv(unsigned long, unsigned long, unsigned long, bool, bool) (test_ucp_sockaddr.cc:2542)
2022-10-25T17:31:45.8798408Z ==26453== by 0x6086D5: run (test.cc:378)
2022-10-25T17:31:45.8799643Z ==26453== by 0x6086D5: ucs::test_base::TestBodyProxy() (test.cc:404)
2022-10-25T17:31:45.8800813Z ==26453== by 0xCFDFB8: HandleSehExceptionsInMethodIfSupported<testing::Test, void> (gtest.cc:2433)
2022-10-25T17:31:45.8802659Z ==26453== by 0xCFDFB8: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2469)
2022-10-25T17:31:45.8804104Z ==26453== by 0xCF4BC8: testing::Test::Run() (gtest.cc:2509)
2022-10-25T17:31:45.8805199Z ==26453== by 0xCF4CF0: testing::TestInfo::Run() (gtest.cc:2687)
2022-10-25T17:31:45.8806250Z ==26453== by 0xCF4DB4: testing::TestSuite::Run() (gtest.cc:2819)
2022-10-25T17:31:45.8807361Z ==26453== by 0xCF58D9: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:5350)
2022-10-25T17:31:45.8808770Z ==26453== by 0xCF5A90: HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (gtest.cc:2433)
2022-10-25T17:31:45.8810160Z ==26453== by 0xCF5A90: HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (gtest.cc:2469)
2022-10-25T17:31:45.8811477Z ==26453== by 0xCF5A90: testing::UnitTest::Run() (gtest.cc:4940)
2022-10-25T17:31:45.8812667Z ==26453== by 0x58C468: RUN_ALL_TESTS (gtest.h:2473)
2022-10-25T17:31:45.8813861Z ==26453== by 0x58C468: main (main.cc:106)
2022-10-25T17:31:45.8815922Z ==26453== Address 0x0 is not stack'd, malloc'd or (recently) free'd
2022-10-25T17:31:45.8816978Z ==26453==
2022-10-25T17:31:45.8895540Z [swx-rdmz-ucx-gpu-01:26453:4:26453] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
2022-10-25T17:38:36.2314345Z [----------] 1 test from tcp_ib/select_transport_rma_bw
2022-10-25T17:38:36.2321612Z [ RUN ] tcp_ib/select_transport_rma_bw.select_rc/0 <tcp,ib/rma>
2022-10-25T17:38:36.8903144Z /scrap/azure/agent-03/AZP_WORKSPACE/1/s/contrib/../test/gtest/ucp/test_ucp_wireup.cc:1348: Failure
2022-10-25T17:38:36.8906526Z Expected equality of these values:
2022-10-25T17:38:36.8908270Z "rc_mlx5"
2022-10-25T17:38:36.8911348Z ucp_ep_get_tl_rsc(sender().ep(), lane)->tl_name
2022-10-25T17:38:36.8913447Z Which is: "tcp"
2022-10-25T17:38:37.1032479Z [ FAILED ] tcp_ib/select_transport_rma_bw.select_rc/0, where GetParam() = tcp,ib/rma (874 ms)
2022-10-25T15:31:47.0174179Z + ./bin/ucx_perftest -t tag_bw -p 14000
2022-10-25T15:31:47.0432904Z [1666711907.042376] [swx-rdmz-instinct02:3013965:0] perftest.c:922 UCX WARN CPU affinity is not set (bound to 4 cpus). Performance may be impacted.
2022-10-25T15:31:47.0435601Z [1666711907.042439] [swx-rdmz-instinct02:3013965:0] perftest.c:407 UCX ERROR server failed. bind() failed: Address already in use
2022-10-25T15:31:48.0193638Z + ./bin/ucx_perftest -t tag_bw -p 14000 127.0.0.1
2022-10-25T15:31:48.0195572Z + tee perf.txt
2022-10-25T15:31:48.0277774Z [1666711908.025866] [swx-rdmz-instinct02:3014365:0] perftest.c:922 UCX WARN CPU affinity is not set (bound to 4 cpus). Performance may be impacted.
2022-10-25T15:31:48.0280657Z [1666711908.026001] [swx-rdmz-instinct02:3014365:0] perftest.c:407 UCX ERROR client failed. connect() failed: Connection refused
2022-10-25T15:31:48.0318114Z + wait 3013965
2022-10-25T15:31:48.0363212Z ##[error]Bash exited with code '255'.
2022-10-25T15:31:48.0383749Z ##[section]Finishing: Test ucx_perftest
2022-10-25T15:31:48.0556311Z ##[section]Starting: Checkout
@yosefe Let me know if there are other places that need to be changed. If needed, this PR should be tested at large scale to check whether there is an obvious performance improvement.
@yosefe What's the next step?
To avoid misunderstanding, let me add some info here: without this PR, DC already works as expected in lag hash mode.
In lag hash mode, if a QP has its affinity set explicitly, the firmware takes time to look up the affinity port. This PR is an optimization around that:
- What is the advantage of not setting the DCI affinity? For a QP with port affinity set under lag hash mode, the firmware implements a mechanism to look up the affinity port for that QP, and this lookup takes time. Not setting the DCI port affinity avoids this behavior.
- What problem is raised if UCX does not set the DCI affinity under lag hash mode? For the eager protocol, if a message is segmented into two segments, it is better for the two segments to be sent from the two ports (DCIs in different groups). However, if the DCI affinity is not set after this PR, the two segments will be sent from one port.
- Why not change the RC QPs in the same way as this PR changes the DCIs? It is always expected that messages sent from the two RC QPs egress on two ports. I don't want to change that behavior even though it takes time to look up the affinity port under lag hash mode; otherwise RC performance may degrade a lot.
- For DCIs after this PR, the default "auto" behavior is to detect port_select_mode and 1) not set the DCI affinity under lag hash mode, 2) set the DCI affinity under other lag modes.
- If the port affinity really needs to be set by force, the "ON" option can be used (a small parsing sketch follows this list).
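A tiny self-contained sketch of how such a tri-state option could be read. The variable name comes from this PR's description; the value spellings (auto/affinity/hash) follow the earlier list, and the parsing itself is illustrative, since UCX parses options through its own ucs_config machinery rather than getenv():

```c
#include <stdlib.h>
#include <string.h>

/* Same hypothetical tri-state as in the earlier sketch */
typedef enum {
    LAG_PORT_SELECT_AUTO,      /* detect port_select_mode and decide  */
    LAG_PORT_SELECT_AFFINITY,  /* force-set DCI lag_tx_port_affinity  */
    LAG_PORT_SELECT_HASH       /* never set it, rely on hash steering */
} lag_port_select_t;

static lag_port_select_t parse_lag_port_select(void)
{
    const char *val = getenv("UCX_DC_MLX5_LAG_PORT_SELECT");

    if ((val == NULL) || (strcmp(val, "auto") == 0)) {
        return LAG_PORT_SELECT_AUTO;      /* default */
    } else if (strcmp(val, "affinity") == 0) {
        return LAG_PORT_SELECT_AFFINITY;
    } else {
        return LAG_PORT_SELECT_HASH;
    }
}
```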
According to the failure logs below, the failed test cases are not related to this PR:
2022-12-01T08:54:52.3549594Z [1669884892.354152] [swx-rdmz-ucx-gpu-02:20922:0] memtrack.c:328 UCX WARN allocated zero-size block 0x7f74f0e13840 for temp mds
2022-12-01T08:54:52.3552287Z [1669884892.354211] [swx-rdmz-ucx-gpu-02:20922:0] ucp_context.c:1080 UCX WARN transport 'abcd' is not available, please use one or more of: cma, cuda, cuda_copy, cuda_ipc, dc, dc_mlx5, dc_x, ib, knem, mm, posix, rc, rc_mlx5, rc_v, rc_verbs, rc_x, self, shm, sm, sysv, tcp, ud, ud_mlx5, ud_v, ud_verbs, ud_x
2022-12-01T08:54:52.3554805Z [1669884892.354221] [swx-rdmz-ucx-gpu-02:20922:0] ucp_context.c:1339 UCX ERROR no usable transports/devices (asked abcd on all devices)
2022-12-01T08:54:52.4084845Z Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.14 sec
Running org.openucx.jucx.UcpWorkerTest
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.622 sec
Running org.openucx.jucx.UcpContextTest
[1669884892.354152] [swx-rdmz-ucx-gpu-02:20922:0] memtrack.c:328 UCX WARN allocated zero-size block 0x7f74f0e13840 for temp mds
[1669884892.354211] [swx-rdmz-ucx-gpu-02:20922:0] ucp_context.c:1080 UCX WARN transport 'abcd' is not available, please use one or more of: cma, cuda, cuda_copy, cuda_ipc, dc, dc_mlx5, dc_x, ib, knem, mm, posix, rc, rc_mlx5, rc_v, rc_verbs, rc_x, self, shm, sm, sysv, tcp, ud, ud_mlx5, ud_v, ud_verbs, ud_x
[1669884892.354221] [swx-rdmz-ucx-gpu-02:20922:0] ucp_context.c:1339 UCX ERROR no usable transports/devices (asked abcd on all devices)
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.14 sec
Running org.openucx.jucx.UcpEndpointTest
@yosefe It has passed CI after squashing the commits.