assertion after connection made with ethernet DPU

Open jsofri opened this issue 3 years ago • 4 comments

A UCP endpoint is created between an Ethernet DPU and an x86 server, with UCX_NET_DEVICES pointing to Ethernet devices on both machines. Soon after the connection is established, the program aborts, probably because of this assertion:

[1646206066.192445] [swx-proton02-bf1:957999:async]        ib_iface.c:750  UCX  DEBUG iface 0xaaaafacce390: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:2.1.1.4 sgid_index=1 traffic_class=106
[swx-proton02-bf1:957999:async:958002]       ud_ep.c:741  Assertion `ep->rx.ooo_pkts.head_sn == neth->psn' failed: iface=0xaaaafacce390 ep=0xaaaafac55a30 conn_sn=0 ep_id=0, dest_ep_id=0 rx_psn=3 neth_psn=1 ep_flags=0xd8 ctl_ops=0x0 rx_creq_count=1
==== backtrace (tid: 958002) ====
 0 0x0000000000053144 uct_ud_ep_process_rx()  ???:0
 1 0x0000000000058be4 uct_ud_mlx5_ep_t_delete()  ???:0
 2 0x000000000004dcc8 uct_dc_mlx5_iface_devx_set_srq_dc_params()  ???:0
 3 0x0000000000012a84 ucs_cpu_get_memcpy_bw()  ???:0
 4 0x0000000000013590 ucs_async_dispatch_handlers()  ???:0
 5 0x0000000000016530 ucs_async_pipe_drain()  ???:0
 6 0x000000000002d590 ucs_event_set_wait()  ???:0
 7 0x00000000000168ac ucs_async_pipe_drain()  ???:0
 8 0x0000000000008604 start_thread()  /build/glibc-MwsH5o/glibc-2.31/nptl/pthread_create.c:477
 9 0x00000000000d45fc clone()  ???:0
=================================
Aborted (core dumped)

Steps to Reproduce

  • UCX environment variables used:
    • UCX_LOG_LEVEL=DATA
    • UCX_TCP_CONN_NB=y
    • UCX_NET_DEVICES=<ethernet device>

Setup and versions

  • DPU
    • ubuntu20
    • ofed_info: MLNX_OFED_LINUX-5.4-1.0.3.0
    • exported UCX_NET_DEVICES=mlx5_2:1
    • ibstat mlx5_2:
    CA 'mlx5_2'
            CA type: MT41686
            Number of ports: 1
            Firmware version: 24.31.2006
            Hardware version: 1
            Node GUID: 0x028ea9fffe18a28e
            System image GUID: 0x08c0eb030053eb1c
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 25
                    Base lid: 0
                    LMC: 0
                    SM lid: 0
                    Capability mask: 0x00010000
                    Port GUID: 0x008ea9fffe18a28e
                    Link layer: Ethernet
    
  • x86 server
    • ubuntu20
    • ofed_info: MLNX_OFED_LINUX-5.5-0.2.6.0
    • exported UCX_NET_DEVICES=mlx5_0:1
    • ibstat mlx5_0:
    CA 'mlx5_0'
          CA type: MT41682
          Number of ports: 1
          Firmware version: 18.28.1002
          Hardware version: 0
          Node GUID: 0xb8599f03002f1d7c
          System image GUID: 0xb8599f03002f1d7c
          Port 1:
                  State: Active
                  Physical state: LinkUp
                  Rate: 25
                  Base lid: 0
                  LMC: 0
                  SM lid: 0
                  Capability mask: 0x00010000
                  Port GUID: 0xba599ffffe2f1d7c
                  Link layer: Ethernet
    
    

Additional information (depending on the issue)

  • ping -I <net device interface> <remote machine ethernet device IP> works in both directions
  • This bug does not occur when running the same configuration on two x86 servers connected over Ethernet
  • Attached logs: dpu_ucx_logs.txt, x86_ucx_logs.txt

jsofri avatar Mar 02 '22 09:03 jsofri

Which UCX version are you using? I think the case neth->psn < ep->rx.ooo_pkts.head_sn (here, 1 < 3) should be handled here: https://github.com/openucx/ucx/blob/488cb507f12dbb6904bd7e7058bb1e7206ddc525/src/uct/ib/ud/base/ud_ep.c#L815 See the description of PR https://github.com/openucx/ucx/pull/7353, which fixes the issue.
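To illustrate the case described above, here is a minimal sketch of a receive path that treats a packet whose PSN is behind the expected one as a duplicate and drops it, instead of asserting. This is not the code from the PR or from ud_ep.c: the names `psn_t`, `rx_action_t`, `psn_before` and `rx_process` are hypothetical, and the 16-bit PSN width is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration only; not the UCX data structures. */
typedef uint16_t psn_t; /* assumed 16-bit packet sequence number */

typedef enum {
    RX_DELIVER,        /* packet carries the next expected PSN */
    RX_DROP_DUPLICATE, /* PSN is behind the expected one: already received */
    RX_OUT_OF_ORDER    /* PSN is ahead of the expected one: there is a gap */
} rx_action_t;

/* Wraparound-safe test for "a comes before b" in 16-bit sequence space. */
static int psn_before(psn_t a, psn_t b)
{
    return (int16_t)(uint16_t)(a - b) < 0;
}

/*
 * Classify an incoming packet against the next expected PSN.
 * Only an in-order packet advances the expected PSN; a stale PSN
 * (for example 1 when 3 is expected) is reported as a duplicate
 * rather than triggering an assertion.
 */
static rx_action_t rx_process(psn_t *expected_psn, psn_t pkt_psn)
{
    assert(expected_psn != NULL);

    if (pkt_psn == *expected_psn) {
        (*expected_psn)++;
        return RX_DELIVER;
    }
    if (psn_before(pkt_psn, *expected_psn)) {
        return RX_DROP_DUPLICATE;
    }
    return RX_OUT_OF_ORDER;
}
```

With the values from the log above (rx_psn=3, neth_psn=1), `rx_process` would return `RX_DROP_DUPLICATE` and leave the expected PSN unchanged.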

dmitrygx avatar Mar 02 '22 13:03 dmitrygx

@jsofri did you have a chance to check v1.12 or master branch?

dmitrygx avatar Mar 16 '22 09:03 dmitrygx

@dmitrygx Not fully. We are restricted to a specific MOFED version and need to investigate this further.

jsofri avatar Mar 24 '22 11:03 jsofri

We believe the error should be fixed by the PR above, in UCX v1.12.x.

yosefe avatar Mar 24 '22 12:03 yosefe

@yosefe @dmitrygx Happy to report that you were right: using v1.12.0 solved it.

root@bff1c4b09bf0:/# ucx_info -v
# UCT version=1.12.0 revision 6ab55c4
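
For anyone who wants to guard against running an older build, a rough sketch of checking the version line printed above is shown here. It assumes the `# UCT version=X.Y.Z revision ...` format seen in this output; the type `ucx_version_t` and the helpers `parse_ucx_info_version` and `version_at_least` are hypothetical names for illustration and are not part of UCX. A program linked against UCX could probably take the numbers from `ucp_get_version()` instead of parsing text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper type; not part of the UCX API. */
typedef struct {
    unsigned maj;
    unsigned min;
    unsigned rel;
} ucx_version_t;

/*
 * Parse a line in the format shown above, e.g.
 * "# UCT version=1.12.0 revision 6ab55c4".
 * Returns 1 on success, 0 if the line does not match.
 */
static int parse_ucx_info_version(const char *line, ucx_version_t *v)
{
    assert(line != NULL && v != NULL);
    return sscanf(line, "# UCT version=%u.%u.%u",
                  &v->maj, &v->min, &v->rel) == 3;
}

/* Return 1 if v is at least maj.min.rel, comparing field by field. */
static int version_at_least(const ucx_version_t *v,
                            unsigned maj, unsigned min, unsigned rel)
{
    if (v->maj != maj) {
        return v->maj > maj;
    }
    if (v->min != min) {
        return v->min > min;
    }
    return v->rel >= rel;
}
```

Calling `version_at_least(&v, 1, 12, 0)` on the parsed line is one way to confirm that the build is at or above the release in which the fix was reported to work.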

jsofri avatar Aug 17 '22 07:08 jsofri

@jsofri great, thanks for updating us!

dmitrygx avatar Aug 17 '22 08:08 dmitrygx