[Bug] Two-node H20 deployment with RoCE fails to start
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
sglang can't start when deployed across two H20 nodes; it hangs.
Reproduction
I'm trying to improve inference efficiency on two H20 nodes by using RoCE. There were no problems using socket mode for NCCL.
I'm now using the official 0.4.2 image, and I have added some RDMA packages to it according to this commit: https://github.com/FrankLeeeee/sglang/commit/cd837ab34a895976dc3176e0272a870090c59e36
With this setup, sglang can't start: it hangs during startup while GPU utilization sits at 100%.
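For anyone comparing the two modes, the difference between the working socket mode and the RoCE mode usually comes down to a handful of NCCL environment variables. Below is a minimal sketch that builds both sets so they can be diffed or exported before launching. The helper name `nccl_env` and the default values (`mlx5_0`, GID index `3`, interface `eth0`) are assumptions for illustration, not values taken from this setup; the variable names themselves (`NCCL_DEBUG`, `NCCL_SOCKET_IFNAME`, `NCCL_IB_DISABLE`, `NCCL_IB_HCA`, `NCCL_IB_GID_INDEX`) are standard NCCL settings.

```python
def nccl_env(use_roce, hca="mlx5_0", gid_index=3, ifname="eth0"):
    """Return NCCL environment variables for socket mode or RoCE mode.

    hca, gid_index and ifname are placeholder defaults (assumptions);
    replace them with the devices reported on the actual nodes.
    """
    env = {
        "NCCL_DEBUG": "INFO",          # log which transport NCCL selects
        "NCCL_SOCKET_IFNAME": ifname,  # interface used for bootstrap traffic
    }
    if use_roce:
        env["NCCL_IB_DISABLE"] = "0"                # allow the IB/RoCE transport
        env["NCCL_IB_HCA"] = hca                    # RDMA device(s) to use
        env["NCCL_IB_GID_INDEX"] = str(gid_index)   # GID entry used for RoCE
    else:
        env["NCCL_IB_DISABLE"] = "1"                # force plain TCP sockets
    return env


if __name__ == "__main__":
    # Print shell exports for the RoCE variant.
    for key, value in sorted(nccl_env(use_roce=True).items()):
        print(f"export {key}={value}")
```

Running the same launch once with each variant and reading the `NCCL_DEBUG=INFO` output is one way to see whether NCCL actually picks the RDMA transport before the hang.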
Environment
containerd, sglang 0.4.2, RoCE 200Gb*4
cc @FrankLeeeee @zhyncs
Looking forward to your PR tomorrow! RDMA wins!
Great!
It's an NCCL problem.
Fixed. https://docs.sglang.ai/references/multi_node_inference_k8s_lws.html