Engine crashes with UCX ibv_create_ah "No such device" errors on "UD mlx5" and "RC DEVX QP" connect
🐛 Describe the bug
Error logs from the 2p2d deployment (two prefill pods P0/P1, two decode pods D0/D1):
- P0 log: ib_device.c:1380 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=fe80::9e63:c0ff:fe73:bf12 flow_label=0xffffffff sgid_index=0 traffic_class=0) for UD mlx5 connect on mlx5_1 failed: No such device
- P1 log: ib_device.c:1380 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=fe80::9e63:c0ff:fe72:76e6 flow_label=0xffffffff sgid_index=0 traffic_class=0) for UD mlx5 connect on mlx5_3 failed: No such device
- D0 log: ib_device.c:1380 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=fe80::9e63:c0ff:fe74:ba44 flow_label=0xffffffff sgid_index=0 traffic_class=0) for UD mlx5 connect on mlx5_6 failed: No such device
- D1 log: ib_device.c:1380 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=fe80::9e63:c0ff:fe76:303c flow_label=0xffffffff sgid_index=0 traffic_class=0) for RC DEVX QP connect on mlx5_8 failed: No such device
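For triage, the four lines above can be reduced to the fields that differ between pods (transport, device, destination GID). The following is a minimal sketch, not part of AIBrix or UCX: the function name `parse_ucx_ah_error` and the regular expression are hypothetical and only assume the message layout shown in the logs above.

```python
import re

# Hypothetical pattern, written against the UCX error lines quoted in this issue.
UCX_AH_ERROR = re.compile(
    r"ibv_create_ah\((?P<args>[^)]*)\) for (?P<transport>.+?) connect on "
    r"(?P<device>\S+) failed: (?P<reason>.+)$"
)


def parse_ucx_ah_error(line):
    """Return transport, device, reason and the key=value args, or None if no match."""
    match = UCX_AH_ERROR.search(line)
    if match is None:
        return None
    # The argument list is space-separated key=value pairs (dlid=..., dgid=..., ...).
    args = dict(pair.split("=", 1) for pair in match.group("args").split())
    return {
        "transport": match.group("transport"),
        "device": match.group("device"),
        "reason": match.group("reason").strip(),
        "args": args,
    }


if __name__ == "__main__":
    # The D1 line from this report, copied verbatim.
    sample = (
        "ib_device.c:1380 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 "
        "src_path_bits=0 dgid=fe80::9e63:c0ff:fe76:303c flow_label=0xffffffff "
        "sgid_index=0 traffic_class=0) for RC DEVX QP connect on mlx5_8 "
        "failed: No such device"
    )
    print(parse_ucx_ah_error(sample))
```

Run over all four pod logs, this makes it easy to see that every pod fails on a different device (mlx5_1, mlx5_3, mlx5_6, mlx5_8) with the same reason.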
Steps to Reproduce
- Deploy SGLang for Qwen3-32B in a 2p2d configuration with tp2.
- Allocate 2 RDMA devices to each pod.
- Schedule all pods on the same node.
Expected behavior
The engines start and connect over RDMA without crashing.
Environment
- AIBrix 0.5.0
- VKE
- Node configuration: 8 GPUs, 8 RDMA devices