
Hang at around 4% during the CUDA graph loading process

Open BoBo0037 opened this issue 10 months ago • 9 comments

Hi, I'm trying to run deepseek-r1 on two H20 server nodes, but it hangs at around 4% during the CUDA graph loading process, with no error message. How can I fix this? If I add the --disable-cuda-graph flag, the problem doesn't occur.

Image

BoBo0037 avatar Feb 13 '25 10:02 BoBo0037

Thanks for raising this issue. It looks similar to #3538.

minleminzui avatar Feb 13 '25 15:02 minleminzui

It looks like you are using the official Docker image?

LJL36 avatar Feb 14 '25 03:02 LJL36

Yes, I'm using the sglang Docker image.

BoBo0037 avatar Feb 14 '25 03:02 BoBo0037

Try reducing --cuda-graph-max-bs, e.g. --cuda-graph-max-bs=32.

zhyncs avatar Feb 14 '25 03:02 zhyncs
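
When trying several values in a row, it can help to assemble the launch command programmatically. The following is a minimal sketch, not the project's own tooling. Only --cuda-graph-max-bs and --disable-cuda-graph come from this thread; the other flags, the model path, the address, and the tensor-parallel size are assumptions for a two-node setup and should be checked against the installed sglang version.

```python
import shlex


def build_launch_cmd(model_path, node_rank, dist_init_addr,
                     cuda_graph_max_bs=None, disable_cuda_graph=False,
                     tp=16, nnodes=2):
    """Return the argv list for one node of a multi-node launch.

    tp=16 and nnodes=2 assume two nodes with 8 GPUs each (an assumption,
    not stated in the thread).
    """
    cmd = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp),
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),
        "--dist-init-addr", dist_init_addr,
        "--trust-remote-code",
    ]
    if disable_cuda_graph:
        # Workaround reported in the thread: skip CUDA graph capture entirely.
        cmd.append("--disable-cuda-graph")
    elif cuda_graph_max_bs is not None:
        # Limit the largest batch size for which a CUDA graph is captured.
        cmd.append(f"--cuda-graph-max-bs={cuda_graph_max_bs}")
    return cmd


if __name__ == "__main__":
    # Hypothetical model path and head-node address, for illustration only.
    for bs in (32, 16, 8):
        cmd = build_launch_cmd("/models/deepseek-r1", 0, "10.0.0.1:5000",
                               cuda_graph_max_bs=bs)
        print(shlex.join(cmd))
```

Each node would run the same command with its own --node-rank value. Printing the command first makes it easy to confirm that every node uses the same settings before launching.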

I'm trying to run deepseek-r1 with 48A100, and it hangs at around 17% during the CUDA graph loading process.

Image

What should I do?

zhaotyer avatar Feb 14 '25 03:02 zhaotyer

Hi, if I use --cuda-graph-max-bs=32, it hangs at 14%. If I use --cuda-graph-max-bs=16, it hangs at 20%.

Image

BoBo0037 avatar Feb 14 '25 04:02 BoBo0037

me too

zhaotyer avatar Feb 14 '25 04:02 zhaotyer

Update NCCL to 2.24. Its release notes list a fix for hangs when running with different CPU architectures: https://docs.nvidia.com/deeplearning/nccl/release-notes/rel_2-24-3.html#rel_2-24-3

Image

desertchen avatar Feb 17 '25 09:02 desertchen
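
To see whether an environment already has a new enough NCCL, a small check along these lines may help. It is a sketch under two assumptions: that the NCCL in use is the one PyTorch reports through torch.cuda.nccl.version(), and that this call returns a (major, minor, patch) tuple, as it does in recent PyTorch releases. The 2.24.3 threshold is taken from the release notes linked above.

```python
def nccl_at_least(installed, required=(2, 24, 3)):
    """Return True if the installed NCCL version is >= the required one."""
    return tuple(installed[:3]) >= tuple(required)


def installed_nccl_version():
    """Return the NCCL version PyTorch reports, or None if unavailable."""
    try:
        import torch  # imported lazily so the comparison helper works without it
        return tuple(torch.cuda.nccl.version())[:3]
    except Exception:
        # PyTorch missing, built without NCCL, or an older return format.
        return None


if __name__ == "__main__":
    version = installed_nccl_version()
    if version is None:
        print("Could not determine the NCCL version from PyTorch.")
    elif nccl_at_least(version):
        print("NCCL", version, "is at least 2.24.3.")
    else:
        print("NCCL", version, "is older than 2.24.3; consider upgrading.")
```

Running this on every node is worthwhile, since a version mismatch between nodes would not show up when checking only one of them.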

Thanks, it solved my problem!

LJL36 avatar Feb 19 '25 14:02 LJL36