[Bug] Two Node H20 with ROCE, can't startup

Open whybeyoung opened this issue 10 months ago • 3 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • [x] 5. Please use English, otherwise it will be closed.

Describe the bug

sglang can't start when deployed on two H20 nodes; it hangs during startup.

Reproduction

I'm trying to improve inference efficiency on my two H20 nodes by using RoCE. There were no problems when NCCL used socket mode.
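For readers comparing the two transports, here is a minimal sketch of how the NCCL environment might differ between socket mode and RoCE. The environment variable names are standard NCCL settings, but the helper `nccl_env` and the default values (`mlx5_0`, GID index `3`, `eth0`) are hypothetical placeholders, not taken from this setup; the right values depend on the actual NICs and fabric.

```python
from typing import Dict


def nccl_env(use_roce: bool,
             hca: str = "mlx5_0",      # hypothetical RDMA device name
             gid_index: int = 3,       # hypothetical GID index (often RoCE v2)
             ifname: str = "eth0") -> Dict[str, str]:  # hypothetical bootstrap NIC
    """Build NCCL environment variables for socket or RoCE transport."""
    env = {
        "NCCL_DEBUG": "INFO",          # verbose logs help diagnose startup hangs
        "NCCL_SOCKET_IFNAME": ifname,  # interface used for bootstrap traffic
    }
    if use_roce:
        env["NCCL_IB_DISABLE"] = "0"   # allow the IB/RoCE transport
        env["NCCL_IB_HCA"] = hca
        env["NCCL_IB_GID_INDEX"] = str(gid_index)
    else:
        env["NCCL_IB_DISABLE"] = "1"   # force plain TCP sockets
    return env


if __name__ == "__main__":
    for key, value in sorted(nccl_env(use_roce=True).items()):
        print(f"export {key}={value}")
```

Running with `NCCL_DEBUG=INFO` in both modes and comparing the transport NCCL reports at initialization is one way to check whether the RoCE path is actually being selected.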

I'm now using the official 0.4.2 image, to which I added some RDMA packages following this commit: https://github.com/FrankLeeeee/sglang/commit/cd837ab34a895976dc3176e0272a870090c59e36

With this image, sglang can't start.

Image

GPU utilization also stays at 100%.

Image

Need help.

Environment

containerd, sglang 0.4.2, RoCE 200Gb*4

whybeyoung, Feb 16 '25 06:02

cc @FrankLeeeee @zhyncs

zhaochenyang20, Feb 16 '25 08:02

Looking forward to your PR tomorrow! RDMA wins!

zhyncs, Feb 16 '25 20:02

great!

zhaochenyang20, Feb 16 '25 22:02

It was an NCCL problem.

Fixed. https://docs.sglang.ai/references/multi_node_inference_k8s_lws.html

whybeyoung, Feb 18 '25 05:02