assertion after connection made with ethernet DPU
A UCP endpoint is created between an Ethernet DPU and an x86 server. UCX_NET_DEVICES points to Ethernet devices on both machines. Soon after the connection is established, the program aborts, apparently because of this assertion:
[1646206066.192445] [swx-proton02-bf1:957999:async] ib_iface.c:750 UCX DEBUG iface 0xaaaafacce390: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:2.1.1.4 sgid_index=1 traffic_class=106
[swx-proton02-bf1:957999:async:958002] ud_ep.c:741 Assertion `ep->rx.ooo_pkts.head_sn == neth->psn' failed: iface=0xaaaafacce390 ep=0xaaaafac55a30 conn_sn=0 ep_id=0, dest_ep_id=0 rx_psn=3 neth_psn=1 ep_flags=0xd8 ctl_ops=0x0 rx_creq_count=1
==== backtrace (tid: 958002) ====
0 0x0000000000053144 uct_ud_ep_process_rx() ???:0
1 0x0000000000058be4 uct_ud_mlx5_ep_t_delete() ???:0
2 0x000000000004dcc8 uct_dc_mlx5_iface_devx_set_srq_dc_params() ???:0
3 0x0000000000012a84 ucs_cpu_get_memcpy_bw() ???:0
4 0x0000000000013590 ucs_async_dispatch_handlers() ???:0
5 0x0000000000016530 ucs_async_pipe_drain() ???:0
6 0x000000000002d590 ucs_event_set_wait() ???:0
7 0x00000000000168ac ucs_async_pipe_drain() ???:0
8 0x0000000000008604 start_thread() /build/glibc-MwsH5o/glibc-2.31/nptl/pthread_create.c:477
9 0x00000000000d45fc clone() ???:0
=================================
Aborted (core dumped)
Steps to Reproduce
- UCX environment variables used:
  - UCX_LOG_LEVEL=DATA
  - UCX_TCP_CONN_NB=y
  - UCX_NET_DEVICES=<ethernet device>
Setup and versions
- DPU
  - ubuntu20
  - ofed_info: MLNX_OFED_LINUX-5.4-1.0.3.0
  - exported UCX_NET_DEVICES=mlx5_2:1
  - ibstat mlx5_2:
    CA 'mlx5_2'
        CA type: MT41686
        Number of ports: 1
        Firmware version: 24.31.2006
        Hardware version: 1
        Node GUID: 0x028ea9fffe18a28e
        System image GUID: 0x08c0eb030053eb1c
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 25
            Base lid: 0
            LMC: 0
            SM lid: 0
            Capability mask: 0x00010000
            Port GUID: 0x008ea9fffe18a28e
            Link layer: Ethernet
- x86 server
  - ubuntu20
  - ofed_info: MLNX_OFED_LINUX-5.5-0.2.6.0
  - exported UCX_NET_DEVICES=mlx5_0:1
  - ibstat mlx5_0:
    CA 'mlx5_0'
        CA type: MT41682
        Number of ports: 1
        Firmware version: 18.28.1002
        Hardware version: 0
        Node GUID: 0xb8599f03002f1d7c
        System image GUID: 0xb8599f03002f1d7c
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 25
            Base lid: 0
            LMC: 0
            SM lid: 0
            Capability mask: 0x00010000
            Port GUID: 0xba599ffffe2f1d7c
            Link layer: Ethernet
Additional information (depending on the issue)
- ping -I <net device interface> <remote machine ethernet device IP> works in both directions
- This bug doesn't happen when running the same configuration on two x86 servers with an Ethernet connection
- Logs: dpu_ucx_logs.txt, x86_ucx_logs.txt
Which UCX version are you using?
I think that the neth->psn < ep->rx.ooo_pkts.head_sn (i.e. 1 < 3) case should be handled here:
https://github.com/openucx/ucx/blob/488cb507f12dbb6904bd7e7058bb1e7206ddc525/src/uct/ib/ud/base/ud_ep.c#L815
See the description of PR https://github.com/openucx/ucx/pull/7353, which fixes the issue.
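To illustrate the case being discussed, here is a minimal sketch (not the actual UCX code, and not necessarily what the PR does). It classifies an incoming PSN against the expected head sequence number with wraparound-safe arithmetic, so that a stale packet such as psn=1 with head_sn=3 could be treated as a duplicate instead of tripping an assertion. The names psn_t, psn_class_t and classify_psn are hypothetical, and the 16-bit PSN width is an assumption.

```c
#include <stdint.h>

/* Hypothetical PSN type; a 16-bit sequence number is assumed here. */
typedef uint16_t psn_t;

typedef enum {
    PSN_IN_ORDER,  /* psn == head_sn: the packet the receiver expects next */
    PSN_DUPLICATE, /* psn is behind head_sn: already received earlier */
    PSN_FUTURE     /* psn is ahead of head_sn: arrived out of order */
} psn_class_t;

/*
 * Compare psn with head_sn modulo 2^16. The difference is reduced to
 * 16 bits and reinterpreted as signed (two's complement is assumed),
 * so the result stays correct when the counter wraps around.
 */
static psn_class_t classify_psn(psn_t head_sn, psn_t psn)
{
    int16_t diff = (int16_t)(uint16_t)(psn - head_sn);

    if (diff == 0) {
        return PSN_IN_ORDER;
    }
    return (diff < 0) ? PSN_DUPLICATE : PSN_FUTURE;
}
```

With the values from the assertion message, classify_psn(3, 1) returns PSN_DUPLICATE, which a receive path could drop (and possibly re-acknowledge) rather than abort on.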
@jsofri did you have a chance to check v1.12 or master branch?
@dmitrygx not fully; we are restricted to a specific MOFED version. This needs further investigation.
We believe the error should be fixed by the PR above, in UCX v1.12.x.
@yosefe @dmitrygx happy to report that you are correct: using v1.12.0 solved it.
root@bff1c4b09bf0:/# ucx_info -v
# UCT version=1.12.0 revision 6ab55c4
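For anyone who wants to script this check, a minimal sketch follows. It parses a version line in the format shown by ucx_info -v and reports whether it is at least 1.12.0, assuming (as this thread suggests) that the fix is present from that release on. The function name ucx_version_at_least_1_12 is hypothetical.

```c
#include <stdio.h>

/*
 * Parse a line such as "# UCT version=1.12.0 revision 6ab55c4".
 * Returns  1 if the version is >= 1.12.0,
 *          0 if it is older,
 *         -1 if the line does not match the expected format.
 */
static int ucx_version_at_least_1_12(const char *line)
{
    int major, minor, patch;

    if (sscanf(line, "# UCT version=%d.%d.%d", &major, &minor, &patch) != 3) {
        return -1;
    }
    if (major != 1) {
        return major > 1;
    }
    return minor >= 12; /* any 1.12.x patch level qualifies */
}
```

The line to check can be taken from the output of ucx_info -v on each machine.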
@jsofri great, thanks for updating us!