
NIXL/RDMA-enabled StormService: engine crashes at boot with UCX ERROR "failed to modify to RTR"

Open dczhu opened this issue 1 month ago • 0 comments

🐛 Describe the bug

Pod log says: ib_mlx5dv_md.c:1100 UCX ERROR mlx5_1: ibv_modify_qp(UMR QP 0x2c5c) failed to modify to RTR: No such device
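
A minimal triage sketch for the error line above, in case it helps anyone reproducing this. It is not part of AIBrix, UCX, or NIXL. The helper names `parse_ucx_rtr_error` and `visible_rdma_devices` are hypothetical, and the regex is inferred from this single log line. It extracts the device name and QP number, then lists the RDMA devices visible inside the pod, assuming the standard Linux sysfs location `/sys/class/infiniband`. Whether the named device being absent from the pod is the actual cause of this crash is unconfirmed.

```python
import os
import re

# Pattern inferred from the one UCX error line in this report (an assumption,
# not an official UCX log format specification).
UCX_RTR_ERROR = re.compile(
    r"UCX ERROR (?P<device>\S+): ibv_modify_qp\(UMR QP (?P<qp>0x[0-9a-fA-F]+)\) "
    r"failed to modify to RTR: (?P<reason>.+)"
)


def parse_ucx_rtr_error(line):
    """Return device, QP number, and reason from the error line, or None."""
    match = UCX_RTR_ERROR.search(line)
    if match is None:
        return None
    return {
        "device": match.group("device"),
        "qp": int(match.group("qp"), 16),
        "reason": match.group("reason").strip(),
    }


def visible_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """List RDMA device names under sysfs_root; empty list if it is missing."""
    if not os.path.isdir(sysfs_root):
        return []
    return sorted(os.listdir(sysfs_root))


if __name__ == "__main__":
    log_line = (
        "ib_mlx5dv_md.c:1100 UCX ERROR mlx5_1: ibv_modify_qp(UMR QP 0x2c5c) "
        "failed to modify to RTR: No such device"
    )
    info = parse_ucx_rtr_error(log_line)
    devices = visible_rdma_devices()
    print("device named in error:", info["device"])
    print("devices visible in this pod:", devices)
    print("named device visible:", info["device"] in devices)
```

Running this inside the failing pod shows whether the device UCX tried to use is among those exposed to the container. UCX also reads the `UCX_NET_DEVICES` environment variable to restrict which devices it opens, which may be worth comparing against the listing.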

Steps to Reproduce

  1. Deploy SGLang in a 2P2D layout (2 prefill, 2 decode) with tp2 for Qwen3-32B.
  2. Allocate 1 RDMA device per pod.

Expected behavior

No crash during boot.

Environment

  • AIBrix 0.5.0
  • VKE

dczhu, Nov 19 '25 21:11