Engine crashes with UCX ibv_create_ah "No such device" errors on "UD mlx5" and "RC DEVX QP" connect

Open dczhu opened this issue 1 month ago • 0 comments

🐛 Describe the bug

Error logs from the 2p2d deployment (two prefill pods, P0 and P1, and two decode pods, D0 and D1):

  • P0 log: ib_device.c:1380 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=fe80::9e63:c0ff:fe73:bf12 flow_label=0xffffffff sgid_index=0 traffic_class=0) for UD mlx5 connect on mlx5_1 failed: No such device
  • P1 log: ib_device.c:1380 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=fe80::9e63:c0ff:fe72:76e6 flow_label=0xffffffff sgid_index=0 traffic_class=0) for UD mlx5 connect on mlx5_3 failed: No such device
  • D0 log: ib_device.c:1380 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=fe80::9e63:c0ff:fe74:ba44 flow_label=0xffffffff sgid_index=0 traffic_class=0) for UD mlx5 connect on mlx5_6 failed: No such device
  • D1 log: ib_device.c:1380 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=fe80::9e63:c0ff:fe76:303c flow_label=0xffffffff sgid_index=0 traffic_class=0) for RC DEVX QP connect on mlx5_8 failed: No such device
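For triage, the error lines above can be reduced to the fields that differ between pods (transport, device, destination GID). The following is a minimal sketch, assuming the log lines keep exactly the format quoted above; `AH_ERROR` and `parse_ah_error` are hypothetical helper names for illustration, not part of UCX, SGLang, or AIBrix.

```python
import re

# Matches the UCX ibv_create_ah failure lines quoted in this issue.
# Assumption: the line layout is exactly as in the logs above.
AH_ERROR = re.compile(
    r"ibv_create_ah\((?P<args>[^)]*)\) for (?P<transport>.+?) connect on "
    r"(?P<device>\S+) failed: (?P<reason>.+)$"
)


def parse_ah_error(line):
    """Return the key fields of one ibv_create_ah error line, or None."""
    match = AH_ERROR.search(line.strip())
    if match is None:
        return None
    # The argument list is a run of space-separated key=value pairs.
    args = dict(pair.split("=", 1) for pair in match.group("args").split())
    return {
        "transport": match.group("transport"),
        "device": match.group("device"),
        "reason": match.group("reason"),
        "dgid": args.get("dgid"),
        "sgid_index": args.get("sgid_index"),
    }


# Two of the lines from this issue, copied verbatim.
SAMPLE_LINES = [
    "P0 log: ib_device.c:1380 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 "
    "src_path_bits=0 dgid=fe80::9e63:c0ff:fe73:bf12 flow_label=0xffffffff "
    "sgid_index=0 traffic_class=0) for UD mlx5 connect on mlx5_1 failed: "
    "No such device",
    "D1 log: ib_device.c:1380 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 "
    "src_path_bits=0 dgid=fe80::9e63:c0ff:fe76:303c flow_label=0xffffffff "
    "sgid_index=0 traffic_class=0) for RC DEVX QP connect on mlx5_8 failed: "
    "No such device",
]

if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(parse_ah_error(line))
```

Run over the four lines above, this makes it easy to see what they have in common: each pod fails on a different device (mlx5_1, mlx5_3, mlx5_6, mlx5_8), but all use `sgid_index=0` and a link-local `fe80::` destination GID.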

Steps to Reproduce

  1. Deploy SGLang in 2p2d mode with tp=2 for Qwen3-32B.
  2. Allocate 2 RDMA devices per pod.
  3. Run all pods on the same node.

Expected behavior

The engines run without crashing.

Environment

  • AIBrix 0.5.0
  • VKE
  • Node configuration: 8 GPUs, 8 RDMA devices

dczhu, Nov 19 '25 21:11