NIXL/RDMA-enabled StormService: engine crashes at boot with UCX ERROR "failed to modify to RTR"
🐛 Describe the bug
The engine crashes during boot. The pod log shows:
ib_mlx5dv_md.c:1100 UCX ERROR mlx5_1: ibv_modify_qp(UMR QP 0x2c5c) failed to modify to RTR: No such device
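A possible check to run inside the failing pod, offered as a sketch and not a confirmed diagnosis: the error names `mlx5_1` and ends in "No such device", so it may help to see which RDMA devices are listed in sysfs and whether each one has a matching uverbs node under `/dev/infiniband`. The paths are the usual Linux RDMA locations and are assumed to apply to this image; `usable_rdma_devices` is a hypothetical helper name, not part of AIBrix, NIXL, or UCX.

```python
from pathlib import Path


def usable_rdma_devices(sysfs_verbs="/sys/class/infiniband_verbs",
                        dev_dir="/dev/infiniband"):
    """Map each RDMA device name (e.g. 'mlx5_1') to True when its
    uverbs character device is present in dev_dir, else False.

    Assumption: sysfs may list host devices whose /dev node was not
    mounted into the container.
    """
    result = {}
    root = Path(sysfs_verbs)
    if not root.is_dir():
        return result  # no RDMA verbs entries visible at all
    for entry in sorted(root.glob("uverbs*")):
        ibdev = entry / "ibdev"  # file holding the device name
        if not ibdev.is_file():
            continue
        name = ibdev.read_text().strip()
        result[name] = (Path(dev_dir) / entry.name).exists()
    return result


if __name__ == "__main__":
    for name, usable in usable_rdma_devices().items():
        status = "device node present" if usable else "listed in sysfs, no /dev node"
        print(f"{name}: {status}")
```

If `mlx5_1` is reported without a device node while only one RDMA device is allocated to the pod, that would be consistent with UCX probing a device the container cannot open. This is a hypothesis to verify, not an established cause.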
Steps to Reproduce
- Deploy SGLang for Qwen3-32B in a 2P2D layout (2 prefill pods, 2 decode pods) with tensor parallelism 2 (tp2).
- Allocate 1 RDMA device per pod.
Expected behavior
The engine boots without crashing.
Environment
- AIBrix 0.5.0
- VKE