
Instruction for Running DeepSeek with Large-scale PD and EP

Open fzyzcjy opened this issue 7 months ago • 170 comments

Environment Preparation

  • Install SGLang on branch https://github.com/sgl-project/sglang/tree/deepseek_ep
    • ~~https://github.com/sgl-project/sglang/pull/5524~~ (EDIT: do not use this branch since I am adding more code to it after the blog, please use deepseek_ep instead)
  • ~~Install DeepEP on branch https://github.com/deepseek-ai/DeepEP/pull/142~~
    • 2025.05.08 UPDATE: Directly using the latest DeepEP main is enough, since my PR has been merged
  • Install latest mooncake

It is suggested to use this Dockerfile https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.deepep to prepare the dependencies of DeepEP.
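For reference, the steps above roughly correspond to the command-line sketch below. This is only an assumption-based outline, not the authoritative procedure; DeepEP additionally requires NVSHMEM, and the mooncake package name here is assumed, so when in doubt follow the Dockerfile instead.

# SGLang from the deepseek_ep branch
git clone -b deepseek_ep https://github.com/sgl-project/sglang.git
cd sglang && pip install -e "python[all]" && cd ..

# DeepEP: the latest main is enough; NVSHMEM must already be installed (see the DeepEP README)
git clone https://github.com/deepseek-ai/DeepEP.git
cd DeepEP && NVSHMEM_DIR=/path/to/nvshmem python setup.py install && cd ..

# latest mooncake transfer engine (package name assumed; install from source if unavailable)
pip install --upgrade mooncake-transfer-engine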

Stress-testing Prefill Nodes

  • It is suggested to use 4 prefill nodes and 8 decode nodes to reproduce our results, since 4 prefill nodes is the setting in DeepSeek’s blog.
# prefill nodes
MC_TE_METRIC=true SGLANG_HACK_DEEPEP_NEW_MODE=0 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode prefill --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5757 --nnodes ${num_prefill} --node-rank ${node_rank} --tp-size $((${num_prefill}*8)) --dp-size $((${num_prefill}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size $((${num_prefill}*131072)) --max-running-requests $((${num_prefill}*2048)) --max-total-tokens 131072 --context-length 8192 --init-expert-location YOUR_EXPERT_LOCATION_HERE --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --disable-radix-cache --ep-dispatch-algorithm random

# decode nodes
SGLANG_HACK_DEEPEP_NEW_MODE=0 SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=102 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode decode --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5757 --nnodes ${num_decode} --node-rank ${node_rank} --tp-size $((${num_decode}*8)) --dp-size $((${num_decode}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.82 --max-running-requests $((${num_decode}*1024)) --context-length 4500 --init-expert-location YOUR_EXPERT_LOCATION_HERE --enable-two-batch-overlap --moe-dense-tp-size 1 --cuda-graph-bs 128 --disable-radix-cache --decode-log-interval 1

# load balancer
python3 -m sglang.srt.disaggregation.mini_lb --prefill "http://YOUR_FIRST_PREFILL_NODE_IP:30000" --decode "http://YOUR_FIRST_DECODE_NODE_IP:30000"

# benchmark
python3 -m sglang.bench_one_batch_server --model-path ${model_path} --base-url http://YOUR_IP:8000 --batch-size 8192 --input-len 4096 --output-len 5 --skip-warmup

Stress-testing Decode Nodes

  • It is suggested to use 3 prefill nodes and 9 decode nodes to reproduce our results, since 9 decode nodes is half the number used in DeepSeek’s blog.
  • SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS can be set to benchmark-output-len + 2 to maximize batch size (e.g. 102 when --output-len is 100, as in the benchmark below).
  • The example below demonstrates how to use the slow_down debug feature to stress-test decode nodes when there are not enough prefill nodes. If your test workload has enough prefill nodes, this step can be omitted.
# prefill nodes
SGLANG_HACK_DEEPEP_NEW_MODE=0 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode prefill --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5050 --nnodes ${num_prefill} --node-rank ${node_rank} --tp-size $((${num_prefill}*8)) --dp-size $((${num_prefill}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size $((${num_prefill}*65536)) --max-running-requests $((${num_prefill}*2048)) --max-total-tokens 131076 --context-length 8192 --init-expert-location YOUR_EXPERT_LOCATION_HERE --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --disable-radix-cache

# decode nodes
SGLANG_HACK_DEEPEP_NEW_MODE=0 SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=YOUR_NUM_HERE SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode decode --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5050 --nnodes ${num_decode} --node-rank ${node_rank} --tp-size $((${num_decode}*8)) --dp-size $((${num_decode}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.846 --chunked-prefill-size 81920 --max-running-requests $((${num_decode}*2048)) --context-length 4096 --init-expert-location YOUR_EXPERT_LOCATION_HERE --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --cuda-graph-bs 256 --disable-radix-cache --decode-log-interval 1

# load balancer
python3 -m sglang.srt.disaggregation.mini_lb --prefill "http://YOUR_FIRST_PREFILL_NODE_IP:30000" --decode "http://YOUR_FIRST_DECODE_NODE_IP:30000"

# slow down D nodes
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": 90.0}' -X POST "http://YOUR_FIRST_DECODE_NODE_IP:30000/slow_down"

# start benchmark; do not wait for this to finish before running the next line
python3 -m sglang.bench_one_batch_server --model-path /dev/shm/DeepSeek-V3-0324 --base-url http://10.10.37.16:7000 --batch-size 40000 --input-len 2000 --output-len 100 --skip-warmup

# after some time (e.g. 10 minutes) the D nodes are saturated; then run the command below (one way to check saturation is sketched after these commands)
# finish slowing down D nodes
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": null}' -X POST "http://YOUR_FIRST_DECODE_NODE_IP:30000/slow_down"

Analyzing Results

Since we are stress-testing only one side (P or D), we need to look at the server logs instead of the benchmark script output.

  • Prefill: For log lines like Prefill batch. ... #new-token: 16384 ... gap_latency: 2.561, the throughput is 16384 / 2.561 ≈ 6398 token/second/device (a shell sketch of this calculation follows this list).
  • Decode: The result can be read from gen throughput (token/s) in the logs.
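A small shell sketch of the prefill calculation (hedged: it assumes the prefill server output was redirected to prefill.log and that the log line looks exactly like the sample above; adjust the patterns to your actual log format):

# take the latest "Prefill batch" line, extract #new-token and gap_latency, and divide
grep "Prefill batch" prefill.log | tail -n 1 \
  | sed 's/.*#new-token: \([0-9]*\).*gap_latency: \([0-9.]*\).*/\1 \2/' \
  | awk '{printf "%.1f token/second/device\n", $1 / $2}'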

Remarks

  • Please ensure the batch size is full and avoid padding, because the performance is suboptimal otherwise due to a bug we will address soon.
    • For example, to keep a batch size of 256 full on 72 decode GPUs (256 × 72 ≈ 18k concurrent requests), it is reasonable to send 40000 requests.
  • The sample command above only captures a CUDA graph of size 256 to save memory, which can be modified to suit your scenarios.
  • For optimal performance, you may need to tune components such as DeepEP on your cluster.
  • DeepGEMM warmup during execution will make the overall performance look slower than it is, and should be excluded from the analysis.
  • We rushed in the last few days, so the code is really ugly now with many hacks. We will make it elegant when merging into master.
  • For expert distribution statistics, our experiments use ones matching the input/output data, and we provide them here for reproducibility: attachment_ep_statistics.zip
  • To debug prefill performance, it may be useful to temporarily use --ep-dispatch-algorithm fake_grouped_uniform to simulate a fake perfect EPLB; the result should match the corresponding performance reported in the blog.
  • To analyze performance, it is suggested to use the logs instead of the benchmark script output, because the script output includes the starting and ending phases, where the system is not fully utilized and runs slowly.

Report Template

If you face any issues, feel free to discuss here or in the Slack channel, and it would be great if you could provide the following information:

  • Full command to start server and benchmark
  • Logs of all server nodes and benchmark

fzyzcjy avatar May 05 '25 04:05 fzyzcjy

Feel free to join https://slack.sglang.ai #deepseek-large-scale-serving to discuss. Cheers!

zhyncs avatar May 05 '25 09:05 zhyncs

How can we get the --init-expert-location file?

PheasantX avatar May 06 '25 07:05 PheasantX

@PheasantX Either use the one I have already made by downloading attachment_ep_statistics.zip, or create one yourself (I will write guidance for the latter, but I want to first merge everything into master so that users can use it more easily)

fzyzcjy avatar May 06 '25 07:05 fzyzcjy

@fzyzcjy Thanks! Looking forward to it.

PheasantX avatar May 06 '25 07:05 PheasantX

You are welcome!

fzyzcjy avatar May 06 '25 07:05 fzyzcjy

Hmm I do not encounter this error, but maybe try to change that line to from sglang.srt.server_args import ServerArgs

fzyzcjy avatar May 07 '25 03:05 fzyzcjy

Hi, I built the base image from the deepseek_ep branch, and then used Dockerfile.deepep to build the deepep image, but when starting sglang, I get the following error. How can I solve it? Thank you!

[screenshot of the error]

Reinstalling sglang from the deepseek_ep branch might resolve this issue.

Nekofish-L avatar May 07 '25 03:05 Nekofish-L

During my request to the 'disaggregation.mini_lb' proxy service, an error occurred: the decode instance was unable to acquire the prefill instance's TP size, generating the following error message:

[2025-05-07 03:21:07 DP0 TP0] Error fetching prefill parallel info from bootstrap: Failed to parse: http://localhost:None/route?engine_rank=-1&target_dp_group=-1
self._get_prefill_dp_size_from_server() None
[2025-05-07 03:21:07 DP0 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/data/root/sglang/python/sglang/srt/managers/scheduler.py", line 2233, in run_scheduler_process
    scheduler.event_loop_overlap_disagg_decode()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/root/sglang/python/sglang/srt/disaggregation/decode.py", line 530, in event_loop_overlap_disagg_decode
    self.process_input_requests(recv_reqs)
  File "/data/root/sglang/python/sglang/srt/managers/scheduler.py", line 810, in process_input_requests
    output = self._request_dispatcher(recv_req)
  File "/data/root/sglang/python/sglang/utils.py", line 471, in __call__
    return fn(obj)
  File "/data/root/sglang/python/sglang/srt/managers/scheduler.py", line 977, in handle_generate_request
    self._add_request_to_queue(req)
  File "/data/root/sglang/python/sglang/srt/managers/scheduler.py", line 984, in _add_request_to_queue
    self.disagg_decode_prealloc_queue.add(req)
  File "/data/root/sglang/python/sglang/srt/disaggregation/decode.py", line 142, in add
    kv_receiver = kv_receiver_class(
  File "/data/root/sglang/python/sglang/srt/disaggregation/mooncake/conn.py", line 461, in __init__
    self.prefill_dp_size, tp_size_per_dp_rank = data
TypeError: cannot unpack non-iterable NoneType object

Here is my request command:

curl localhost:8000/v1/chat/completions  \
    -H "Content-Type: application/json"   \
    -d '{
    "model": "test",
    "messages": [
    {
        "role": "user",
        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
    }
    ],
    "stream":false,
    "max_tokens": 30
  }'

Are there any potential solutions for this issue? Could it be resolved by specifying additional parameters in the startup command, such as explicitly defining the IP and port of the prefill node?

Nekofish-L avatar May 07 '25 03:05 Nekofish-L

Hmm I do not encounter this error, but maybe try to change that line to from sglang.srt.server_args import ServerArgs

After this modification, it is OK, Thanks.

mgw2168-1 avatar May 07 '25 05:05 mgw2168-1

@Nekofish-L Your log

[2025-05-07 03:21:07 DP0 TP0] Error fetching prefill parallel info from bootstrap: Failed to parse: http://localhost:None/route?engine_rank=-1&target_dp_group=-1

It seems to say localhost:None. Could you please check whether there is somewhere the port should be set? Or maybe post the full start commands.

fzyzcjy avatar May 07 '25 06:05 fzyzcjy

attachment_ep_statistics.zip

@fzyzcjy May I know where to download attachment_ep_statistics.zip? I would like to have a quick try. Besides, do these EP statistics suit different node configurations? I only have 2 nodes of Hxx with 141G at hand. Thanks in advance.

mingxiao666 avatar May 07 '25 08:05 mingxiao666

attachment_ep_statistics.zip

@fzyzcjy May I know where to download attachment_ep_statistics.zip? I would like to have a quick try. Besides, do these EP statistics suit different node configurations? I only have 2 nodes of Hxx with 141G at hand. Thanks in advance.

Here, https://github.com/user-attachments/files/20036217/attachment_ep_statistics.zip
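A minimal sketch of using it (hedged: the decode_in1000out1000.json file name is taken from commands posted later in this thread; inspect the zip contents yourself):

# download and unpack the pre-made expert-location statistics
wget https://github.com/user-attachments/files/20036217/attachment_ep_statistics.zip
unzip attachment_ep_statistics.zip -d attachment_ep_statistics
ls attachment_ep_statistics
# then point the server at one of the files, e.g.
#   --init-expert-location attachment_ep_statistics/decode_in1000out1000.json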

Z-NAVY avatar May 07 '25 08:05 Z-NAVY

It is suggested to use this Dockerfile https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.deepep to prepare dependencies of DeepEP.

It seems that there was already a pre-installed sglang (0.4.6.post2, /sgl-workspace/sglang/python) in the base image lmsysorg/sglang:latest, so I should also replace this sglang and reinstall from the deepseek_ep branch. Right?

Huixxi avatar May 07 '25 08:05 Huixxi

so I should also replace this sglang and reinstall from the deepseek_ep branch. Right?

Yes

fzyzcjy avatar May 07 '25 09:05 fzyzcjy

Nice work! I ran this on 4 H20 nodes (8 GPUs per node) with 2 prefill nodes and 2 decode nodes, and got the error below:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 115, in forward_thread_func
    with torch.get_device_module(self.device).stream(self.forward_stream):
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 595, in __exit__
    torch.cuda.set_stream(self.src_prev_stream)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 636, in set_stream
    _set_stream_by_id(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 618, in _set_stream_by_id
    torch._C._cuda_setStream(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


[rank10]:[W507 09:29:36.399571193 CUDAGuardImpl.h:119] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
[rank10]:[W507 09:29:36.399592190 CUDAGuardImpl.h:120] Warning: CUDA warning: CUDA-capable device(s) is/are busy or unavailable (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f4ca996c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4ca9915a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f4ca9d71918 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x21006 (0x7f4ca9d38006 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x22507 (0x7f4ca9d39507 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x2270f (0x7f4ca9d3970f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x6417b2 (0x7f4ca16d07b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f30f (0x7f4ca994d30f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f4ca994633b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f4ca99464e9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8fefb8 (0x7f4ca198dfb8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2f6 (0x7f4ca198e306 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x181370 (0x556208c8d370 in sglang::scheduler_DP10_TP10)
frame #13: <unknown function> + 0x194588 (0x556208ca0588 in sglang::scheduler_DP10_TP10)
frame #14: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #15: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #16: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #17: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #18: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #19: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #20: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #21: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #22: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #23: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #24: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #25: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #26: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #27: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #28: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #29: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #30: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #31: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #32: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #33: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #34: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #35: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #36: <unknown function> + 0x181370 (0x556208c8d370 in sglang::scheduler_DP10_TP10)
frame #37: <unknown function> + 0x194588 (0x556208ca0588 in sglang::scheduler_DP10_TP10)
frame #38: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #39: _PyEval_EvalFrameDefault + 0x2c07 (0x556208c83e77 in sglang::scheduler_DP10_TP10)
frame #40: <unknown function> + 0x1989f1 (0x556208ca49f1 in sglang::scheduler_DP10_TP10)
frame #41: _PyEval_EvalFrameDefault + 0x2a83 (0x556208c83cf3 in sglang::scheduler_DP10_TP10)
frame #42: _PyFunction_Vectorcall + 0x7c (0x556208c9766c in sglang::scheduler_DP10_TP10)
frame #43: _PyEval_EvalFrameDefault + 0x804 (0x556208c81a74 in sglang::scheduler_DP10_TP10)
frame #44: _PyFunction_Vectorcall + 0x7c (0x556208c9766c in sglang::scheduler_DP10_TP10)
frame #45: _PyEval_EvalFrameDefault + 0x804 (0x556208c81a74 in sglang::scheduler_DP10_TP10)
frame #46: <unknown function> + 0x1989f1 (0x556208ca49f1 in sglang::scheduler_DP10_TP10)
frame #47: <unknown function> + 0x2acfca (0x556208db8fca in sglang::scheduler_DP10_TP10)
frame #48: <unknown function> + 0x2a28e8 (0x556208dae8e8 in sglang::scheduler_DP10_TP10)
frame #49: <unknown function> + 0x94ac3 (0x7f4d4f37eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #50: <unknown function> + 0x126850 (0x7f4d4f410850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Fatal Python error: Aborted

Thread 0x00007f3587fff640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/queue.py", line 180 in get
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mooncake/conn.py", line 265 in transfer_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f359cff9640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/socket.py", line 799 in recv_multipart
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mooncake/conn.py", line 244 in bootstrap_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f486c894640 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1752 in watchdog_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f486d336640 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 120 in forward_thread_func
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f48bb7fe640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f49bedff640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 53 in _recv_msg
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 161 in _read_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f4d4f2e9480 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 320 in wait
  File "/usr/lib/python3.10/queue.py", line 171 in get
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 176 in resolve_batch_result
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 303 in process_batch_result_disagg_prefill
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 256 in event_loop_overlap_disagg_prefill
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2228 in run_scheduler_process
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
  File "<string>", line 1 in <module>

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, charset_normalizer.md, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, zmq.backend.cython._zmq, PIL._imaging, yaml._yaml, markupsafe._speedups, PIL._imagingft, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, sentencepiece._sentencepiece[2025-05-07 09:29:36 DP11 TP11] TpModelWorkerClient hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 116, in forward_thread_func
    self.forward_thread_func_()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 147, in forward_thread_func_
    logits_output, next_token_ids = self.worker.forward_batch_generation(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 181, in forward_batch_generation
    logits_output = self.model_runner.forward(forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1122, in forward
    return self._forward_raw(forward_batch, skip_attn_backend_init)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1155, in _forward_raw
    return self.forward_extend(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1064, in forward_extend
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2099, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1916, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1462, in forward
    return self.forward_ffn_with_scattered_input(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1617, in forward_ffn_with_scattered_input
    hidden_states = self.mlp(hidden_states, forward_batch.forward_mode)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 355, in forward
    return self.forward_deepep(hidden_states, forward_mode)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 398, in forward_deepep
    ) = self.deepep_dispatcher.dispatch_b()
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 724, in dispatch_b
    return self._get_impl(forward_mode).dispatch_b(*inner_state)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 275, in dispatch_b
    ) = self._dispatch_core(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 360, in _dispatch_core
    ) = buffer.dispatch(
  File "/usr/local/lib/python3.10/dist-packages/deep_ep/buffer.py", line 282, in dispatch
    return self.internode_dispatch(x, handle, num_tokens_per_rank, num_tokens_per_rdma_rank, is_token_in_rank, num_tokens_per_expert,
  File "/usr/local/lib/python3.10/dist-packages/deep_ep/buffer.py", line 390, in internode_dispatch
    recv_src_meta, send_rdma_head, send_nvl_head, event = self.runtime.internode_dispatch(
RuntimeError: DeepEP error: timeout (dispatch CPU)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 115, in forward_thread_func
    with torch.get_device_module(self.device).stream(self.forward_stream):
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 595, in __exit__
    torch.cuda.set_stream(self.src_prev_stream)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 636, in set_stream
    _set_stream_by_id(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 618, in _set_stream_by_id
    torch._C._cuda_setStream(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


[rank11]:[W507 09:29:36.416755948 CUDAGuardImpl.h:119] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
[rank11]:[W507 09:29:36.416774665 CUDAGuardImpl.h:120] Warning: CUDA warning: CUDA-capable device(s) is/are busy or unavailable (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8e3596c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8e35915a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8e35d0a918 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x21006 (0x7f8e35cd1006 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x22507 (0x7f8e35cd2507 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x2270f (0x7f8e35cd270f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x6417b2 (0x7f8e2d2d07b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f30f (0x7f8e3594d30f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f8e3594633b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f8e359464e9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8fefb8 (0x7f8e2d58dfb8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2f6 (0x7f8e2d58e306 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x181370 (0x561e425e5370 in sglang::scheduler_DP11_TP11)
frame #13: <unknown function> + 0x194588 (0x561e425f8588 in sglang::scheduler_DP11_TP11)
frame #14: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #15: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #16: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #17: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #18: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #19: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #20: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #21: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #22: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #23: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #24: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #25: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #26: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #27: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #28: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #29: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #30: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #31: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #32: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #33: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #34: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #35: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #36: <unknown function> + 0x181370 (0x561e425e5370 in sglang::scheduler_DP11_TP11)
frame #37: <unknown function> + 0x194588 (0x561e425f8588 in sglang::scheduler_DP11_TP11)
frame #38: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #39: _PyEval_EvalFrameDefault + 0x2c07 (0x561e425dbe77 in sglang::scheduler_DP11_TP11)
frame #40: <unknown function> + 0x1989f1 (0x561e425fc9f1 in sglang::scheduler_DP11_TP11)
frame #41: _PyEval_EvalFrameDefault + 0x2a83 (0x561e425dbcf3 in sglang::scheduler_DP11_TP11)
frame #42: _PyFunction_Vectorcall + 0x7c (0x561e425ef66c in sglang::scheduler_DP11_TP11)
frame #43: _PyEval_EvalFrameDefault + 0x804 (0x561e425d9a74 in sglang::scheduler_DP11_TP11)
frame #44: _PyFunction_Vectorcall + 0x7c (0x561e425ef66c in sglang::scheduler_DP11_TP11)
frame #45: _PyEval_EvalFrameDefault + 0x804 (0x561e425d9a74 in sglang::scheduler_DP11_TP11)
frame #46: <unknown function> + 0x1989f1 (0x561e425fc9f1 in sglang::scheduler_DP11_TP11)
frame #47: <unknown function> + 0x2acfca (0x561e42710fca in sglang::scheduler_DP11_TP11)
frame #48: <unknown function> + 0x2a28e8 (0x561e427068e8 in sglang::scheduler_DP11_TP11)
frame #49: <unknown function> + 0x94ac3 (0x7f8edb063ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #50: <unknown function> + 0x126850 (0x7f8edb0f5850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Fatal Python error: Aborted

Thread 0x00007f775a7fc640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/queue.py", line 180 in get
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mooncake/conn.py", line 265 in transfer_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f775affd640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/socket.py", line 799 in recv_multipart
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mooncake/conn.py", line 244 in bootstrap_thread
  File "/usr/lib/python3.10/threading.py", line 953 in , runuvloop.loop
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f8a1caff640 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1752 in watchdog_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f8a1dfff640 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 120 in forward_thread_func
  File "/usr/lib/python3.10/threading.py", line 953,  in setproctitle._setproctitlerun
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f8a64ff9640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f8b46aff640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 53 in _recv_msg
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 161 in _read_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f8edafce480 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 320 in wait
  File "/usr/lib/python3.10/queue.py", line 171 in get
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 176 in resolve_batch_result
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 303 in process_batch_result_disagg_prefill
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 256 in event_loop_overlap_disagg_prefill
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2228 in run_scheduler_process
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
  File "<string>", line 1 in <module>

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, charset_normalizer.md, cuda_utils, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, regex._regex, psutil._psutil_linux, psutil._psutil_posix, zmq.backend.cython._zmq, PIL._imaging, yaml._yaml, markupsafe._speedups, PIL._imagingft, __triton_launcher (total: 43)
, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, sentencepiece._sentencepiece, uvloop.loop, setproctitle._setproctitle, cuda_utils, regex._regex, __triton_launcher (total: 43)
[2025-05-07 09:29:36 DP9 TP9] TpModelWorkerClient hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 116, in forward_thread_func
    self.forward_thread_func_()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 147, in forward_thread_func_
    logits_output, next_token_ids = self.worker.forward_batch_generation(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 181, in forward_batch_generation
    logits_output = self.model_runner.forward(forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1122, in forward
    return self._forward_raw(forward_batch, skip_attn_backend_init)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1155, in _forward_raw
    return self.forward_extend(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1064, in forward_extend
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2099, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1916, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1462, in forward
    return self.forward_ffn_with_scattered_input(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1617, in forward_ffn_with_scattered_input
    hidden_states = self.mlp(hidden_states, forward_batch.forward_mode)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 355, in forward
    return self.forward_deepep(hidden_states, forward_mode)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 398, in forward_deepep
    ) = self.deepep_dispatcher.dispatch_b()
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 724, in dispatch_b
    return self._get_impl(forward_mode).dispatch_b(*inner_state)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 275, in dispatch_b
    ) = self._dispatch_core(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 360, in _dispatch_core
    ) = buffer.dispatch(
  File "/usr/local/lib/python3.10/dist-packages/deep_ep/buffer.py", line 282, in dispatch
    return self.internode_dispatch(x, handle, num_tokens_per_rank, num_tokens_per_rdma_rank, is_token_in_rank, num_tokens_per_expert,
  File "/usr/local/lib/python3.10/dist-packages/deep_ep/buffer.py", line 390, in internode_dispatch
    recv_src_meta, send_rdma_head, send_nvl_head, event = self.runtime.internode_dispatch(
RuntimeError: DeepEP error: timeout (dispatch CPU)

Could you help me solve it? Thanks!

====================== how to reproduce ======================

  1. Prefill node 0/1
model_path="/mnt/nvme0/models/DeepSeek-R1"
device_name="mlx5_0,mlx5_3,mlx5_4,mlx5_5"
num_prefill=2
node_rank=0/1
master_ip="xxxx"

MC_TE_METRIC=true \
SGLANG_HACK_DEEPEP_NEW_MODE=0 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
  python3 -m sglang.launch_server --model-path ${model_path} \
  --disaggregation-mode prefill \
  --dist-timeout 3600 \
  --dist-init-addr ${master_ip}:5757 \
  --trust-remote-code \
  --nnodes ${num_prefill} --node-rank ${node_rank} --tp-size $((${num_prefill}*8)) \
  --dp-size $((${num_prefill}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode normal \
  --ep-num-redundant-experts 32 \
  --mem-fraction-static 0.8 --chunked-prefill-size $((${num_prefill}*131072)) \
  --max-running-requests $((${num_prefill}*2048)) --max-total-tokens 131072 --context-length 8192 \
  --host 127.0.0.1 --port 40000 \
  --disaggregation-ib-device ${device_name}
  2. Decode node 0/1
model_path="/mnt/nvme0/models/DeepSeek-R1"
device_name="mlx5_0,mlx5_3,mlx5_4,mlx5_5"
num_decode=2
node_rank=0/1
master_ip="xxxx"

SGLANG_HACK_DEEPEP_NEW_MODE=0 \
SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=102 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
  python3 -m sglang.launch_server --model-path ${model_path} \
  --disaggregation-mode decode \
  --dist-timeout 3600 \
  --disaggregation-transfer-backend mooncake \
  --trust-remote-code \
  --dist-init-addr ${master_ip}:5757 \
  --nnodes ${num_decode} --node-rank ${node_rank} --tp-size $((${num_decode}*8)) \
  --dp-size $((${num_decode}*8)) --enable-dp-attention \
  --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.82 \
  --max-running-requests $((${num_decode}*1024)) --context-length 8192 \
  --enable-two-batch-overlap \
  --moe-dense-tp-size 1 --cuda-graph-bs 128 --disable-radix-cache --decode-log-interval 1 \
  --host 0.0.0.0 --port 40000 \
  --disaggregation-ib-device ${device_name}
  3. load balancer
prefill_master_ip="xxxx"
prefill_port="40000"
decode_master_ip="xxxx"
decode_port="40000"

python3 -m sglang.srt.disaggregation.mini_lb \
        --prefill "http://${prefill_master_ip}:${prefill_port}" \
        --decode "http://${decode_master_ip}:${decode_port}"
  4. benchmark
model_path="/mnt/nvme0/models/DeepSeek-R1"
base_url="http://xxxx:8000"
python3 -m sglang.bench_one_batch_server --model-path ${model_path} \
        --base-url ${base_url} \
        --batch-size 8192 --input-len 4096 --output-len 5 --skip-warmup

titus-hpc avatar May 07 '25 09:05 titus-hpc

Looks like DeepEP error: timeout. Could you please check all nodes' logs to see whether there are other errors before this? Often it is caused by, e.g., one node failing.

fzyzcjy avatar May 07 '25 10:05 fzyzcjy

@fzyzcjy Could you please help me check what the problem is here? Thanks in advance.

I successfully completed the first 3 steps: step1 on node1 : SGLANG_HACK_DEEPEP_NEW_MODE=0 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ~/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V3/snapshots/1d044fd82b15f1cedb197a288e50cc96a2c27205/ --disaggregation-mode prefill --trust-remote-code --dist-init-addr 10.6.131.1:5757 --nnodes 1 --node-rank 0 --tp-size 8 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size 65536 --max-running-requests 2048 --max-total-tokens 131076 --context-length 8192 --init-expert-location /root/.cache/huggingface/attachment_ep_statistics/decode_in1000out1000.json --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --disable-radix-cache > prefill.log 2>&1 &

step2 on node2 : SGLANG_HACK_DEEPEP_NEW_MODE=0 SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=102 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ~/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V3/snapshots/1d044fd82b15f1cedb197a288e50cc96a2c27205/ --disaggregation-mode decode --trust-remote-code --dist-init-addr 10.6.131.2:5757 --nnodes 1 --node-rank 0 --tp-size 8 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.82 --max-running-requests 1024 --context-length 4500 --init-expert-location /root/.cache/huggingface/attachment_ep_statistics/decode_in1000out1000.json --enable-two-batch-overlap --moe-dense-tp-size 1 --cuda-graph-bs 128 --disable-radix-cache --decode-log-interval 1 > decoder.log 2>&1 &

step3 on node1: python3 -m sglang.srt.disaggregation.mini_lb --prefill "http://10.6.131.1:30000" --decode "http://10.6.131.2:30000" > loader.log 2>&1 &

However, when I tried to run step4 on node1: python3 -m sglang.bench_one_batch_server --model-path ~/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V3/snapshots/1d044fd82b15f1cedb197a288e50cc96a2c27205/ --base-url http://10.6.131.1:8000 --batch-size 256 --input-len 4096 --output-len 5 --skip-warmup > bench.log 2>&1 &

I met the following error:

File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 974, in json
    return complexjson.loads(self.text, **kwargs)
File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

If I rerun step1,2,3 and try to run bench in another way: python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 3000 --random-input 1000 --random-output 1000 --max-concurrency 64 --random-range-ratio 1 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl --host 127.0.0.1 --port 30000

I met this error "aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data for satisfy transfer length header.'>"

SW version info:

sglang commit 6fa2c029c23659615e2757aa6d10ac9d95d28f25 (HEAD -> feat/dev_branch, origin/feat/dev_branch), Author: fzyzcjy, Date: Sun May 4 20:12:10 2025 +0800, message: "chore"

DeepEP commit 23ded3bd8d692755674ffb9ba18794701b6090e6 (HEAD -> patch-3, origin/patch-3), Author: fzyzcjy, Date: Tue Apr 29 09:58:31 2025 +0800, message: "Update deep_ep.cpp"

Mooncake commit 168cc22f31d91e1272661372cdc262a0157d761a (HEAD -> main, origin/main, origin/HEAD), Author: Feng Ren, Date: Wed May 7 10:19:48 2025 +0800, message: "[DOC] Update README components (#331)"

mingxiao666 avatar May 07 '25 10:05 mingxiao666

@mingxiao666 Hi, could you please provide more complete logs? Also I cannot see whether that error comes from your bench command or the server... If it is the bench command, could you please first try to curl the mini_lb and send an example prompt to it to see whether it provides a response?
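For reference, a minimal sketch of such a check (hedged assumptions: the mini_lb is reachable at the base-url port 8000 used by the benchmark above, and the request shape follows the chat-completions example posted earlier in this thread):

curl http://YOUR_LB_IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "test", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16, "stream": false}'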

fzyzcjy avatar May 07 '25 10:05 fzyzcjy

@mingxiao666 Hi, could you please provide more complete logs? Also I cannot see whether that error comes from your bench command or the server... If it is the bench command, could you please first try to curl the mini_lb and send an example prompt to it to see whether it provides a response?

Thanks for the quick reply; your guess makes sense.

The error above is on the bench side. For the loader (root@H20-GPU-01:~/.cache/huggingface/sglang-deepep# tail -f loader.log), the error is as below:

    proto = await self._create_connection(req, traces, timeout)
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/connector.py", line 1056, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/connector.py", line 1406, in _create_direct_connection
    raise last_exc
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/connector.py", line 1375, in _create_direct_connection
    transp, proto = await self._wrap_create_connection(
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/connector.py", line 1130, in _wrap_create_connection
    raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.6.131.1:30000 ssl:default [Connect call failed ('10.6.131.1', 30000)]

No error logs are printed on the prefill & decoder side.

I feel quite confused about the port numbers in the above 4 steps (I basically copied them from your guide): for steps 1 & 2 the port number is 5757, for step 3 it is 30000, and for step 4 it is 8000 (http://10.6.131.1:8000/). I guess something is wrong with the port numbers, but when I changed the port from 8000 to 30000 in step 4, it still did not work. Could you please help confirm whether the port number settings are okay? Thanks in advance.

mingxiao666 avatar May 07 '25 10:05 mingxiao666

I tried to run it on H20, but I encountered the following error when capturing the CUDA graph on the decode nodes. Adding --disable-cuda-graph can fix it, but then the decode speed is too slow. Could you please tell me how to solve this with the CUDA graph enabled?

free(): invalid pointer
Fatal Python error: Aborted

Thread 0x00007f6e867fc640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f6fa9bff640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 53 in _recv_msg
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 161 in _read_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f7344ab01c0 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/deep_ep/buffer.py", line 88 in __init__
  File "/root/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 155 in get_deepep_buffer
  File "/root/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 653 in _get_buffer
  File "/root/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 513 in dispatch_a
  File "/root/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 713 in dispatch_a
  File "/root/sglang/python/sglang/srt/models/deepseek_v2.py", line 518 in _forward_deepep_dispatch_a_part_two
  File "/root/sglang/python/sglang/srt/models/deepseek_v2.py", line 553 in _forward_tbo_op_dispatch_a_part_two
  File "/root/sglang/python/sglang/srt/two_batch_overlap.py", line 192 in next
  File "/root/sglang/python/sglang/srt/two_batch_overlap.py", line 166 in _execute_two_batch_raw
  File "/root/sglang/python/sglang/srt/two_batch_overlap.py", line 148 in model_forward_execute_two_batch
  File "/root/sglang/python/sglang/srt/models/deepseek_v2.py", line 2016 in _forward_tbo_layers
  File "/root/sglang/python/sglang/srt/models/deepseek_v2.py", line 1927 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
  File "/root/sglang/python/sglang/srt/models/deepseek_v2.py", line 2099 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/root/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 476 in run_once
  File "/root/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 483 in capture_one_batch_size
  File "/root/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 374 in capture
  File "/root/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 283 in __init__
  File "/root/sglang/python/sglang/srt/model_executor/model_runner.py", line 1036 in init_cuda_graphs
  File "/root/sglang/python/sglang/srt/model_executor/model_runner.py", line 272 in initialize
  File "/root/sglang/python/sglang/srt/model_executor/model_runner.py", line 220 in __init__
  File "/root/sglang/python/sglang/srt/managers/tp_worker.py", line 78 in __init__
  File "/root/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 65 in __init__
  File "/root/sglang/python/sglang/srt/managers/scheduler.py", line 291 in __init__
  File "/root/sglang/python/sglang/srt/managers/scheduler.py", line 2209 in run_scheduler_process
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
  File "<string>", line 1 in <module>

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, charset_normalizer.md, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, zmq.backend.cython._zmq, PIL._imaging, yaml._yaml, markupsafe._speedups, PIL._imagingft, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, sentencepiece._sentencepiece, cython.cimports.libc.math, Cython.Utils, Cython.Plex.Actions, Cython.Plex.Transitions, Cython.Plex.Machines, Cython.Plex.DFA, Cython.Plex.Scanners, Cython.Compiler.Scanning, Cython.StringIOTree, Cython.Compiler.Code, uvloop.loop, setproctitle._setproctitle, cuda_utils, regex._regex (total: 52)
munmap_chunk(): invalid pointer
Fatal Python error: Aborted

tanconghui avatar May 07 '25 11:05 tanconghui

Looks like a DeepEP timeout error. Could you please check all nodes' logs to see whether there are other errors before this? It is often caused by, e.g., one node failing.

Yeah, you're right. The error is that one H20 node is out of CUDA memory.
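For anyone hitting the same failure mode, a minimal sketch for spotting that kind of per-node OOM before the DeepEP timeout appears (assumption: plain nvidia-smi access on every node; run it on each node, e.g. via your parallel-ssh tooling):

# show per-GPU memory usage; a node whose GPUs are already close to
# memory.total is the likely culprit behind the later DeepEP timeout
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv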

titus-hpc avatar May 07 '25 12:05 titus-hpc

I feel quite confused about the port numbers in the 4 steps above (I basically copied them from your guide): for step 1 & step 2 the port number is 5757, for step 3 it is 30000, and for step 4 it is 8000 (http://10.6.131.1:8000/). I guess something is wrong with the port numbers, but when I changed the port from 8000 to 30000 in step 4, it still did not work. Could you please help confirm whether the port settings are okay? Thanks in advance.

  • --dist-init-addr: the address and port used internally for multi-node coordination
  • --port: the port exposed to end users
    • but in this case, the ports on the prefill nodes and decode nodes should NOT be used directly; instead, users should contact the mini_lb
  • mini_lb listens on a port of its own (8000 by default) that normal users talk to, and it contacts the prefill nodes and decode nodes internally

So at a quick glance the port settings do not look wrong. But feel free to use the curl trick to check whether something is off, as sketched below.
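For reference, a minimal sketch of that check, assuming the IPs from your commands (10.6.131.1 for prefill, 10.6.131.2 for decode), the default --port 30000, and the usual SGLang endpoints (adjust if your version differs):

# the prefill and decode servers should each answer on their own port
curl http://10.6.131.1:30000/health
curl http://10.6.131.2:30000/health
# should return JSON describing the server's configuration
curl http://10.6.131.1:30000/get_server_info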

fzyzcjy avatar May 07 '25 12:05 fzyzcjy

@tanconghui Hi, could you please show full logs? (probably in a gist etc)

Also I am wondering whether it can be caused by e.g. OOM

fzyzcjy avatar May 07 '25 12:05 fzyzcjy

@fzyzcjy Could you please help me check what the problem is here? Thanks in advance.

I could successfully complete the first 3 steps.

step 1 on node1: SGLANG_HACK_DEEPEP_NEW_MODE=0 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ~/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V3/snapshots/1d044fd82b15f1cedb197a288e50cc96a2c27205/ --disaggregation-mode prefill --trust-remote-code --dist-init-addr 10.6.131.1:5757 --nnodes 1 --node-rank 0 --tp-size 8 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size 65536 --max-running-requests 2048 --max-total-tokens 131076 --context-length 8192 --init-expert-location /root/.cache/huggingface/attachment_ep_statistics/decode_in1000out1000.json --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --disable-radix-cache > prefill.log 2>&1 &

step2 on node2 : SGLANG_HACK_DEEPEP_NEW_MODE=0 SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=102 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ~/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V3/snapshots/1d044fd82b15f1cedb197a288e50cc96a2c27205/ --disaggregation-mode decode --trust-remote-code --dist-init-addr 10.6.131.2:5757 --nnodes 1 --node-rank 0 --tp-size 8 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.82 --max-running-requests 1024 --context-length 4500 --init-expert-location /root/.cache/huggingface/attachment_ep_statistics/decode_in1000out1000.json --enable-two-batch-overlap --moe-dense-tp-size 1 --cuda-graph-bs 128 --disable-radix-cache --decode-log-interval 1 > decoder.log 2>&1 &

step3 on node1: python3 -m sglang.srt.disaggregation.mini_lb --prefill "http://10.6.131.1:30000" --decode "http://10.6.131.2:30000" > loader.log 2>&1 &

However, I hit a problem when I tried to run step 4.

step 4 on node1: python3 -m sglang.bench_one_batch_server --model-path ~/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V3/snapshots/1d044fd82b15f1cedb197a288e50cc96a2c27205/ --base-url http://10.6.131.1:8000 --batch-size 256 --input-len 4096 --output-len 5 --skip-warmup > bench.log 2>&1 &

I met the following error:

  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 974, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

If I rerun steps 1, 2, 3 and try to run the benchmark in another way: python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 3000 --random-input 1000 --random-output 1000 --max-concurrency 64 --random-range-ratio 1 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl --host 127.0.0.1 --port 30000

I met this error "aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data for satisfy transfer length header.'>"

SW version info is below:

sglang commit 6fa2c02 (HEAD -> feat/dev_branch, origin/feat/dev_branch)
Author: fzyzcjy [email protected]
Date: Sun May 4 20:12:10 2025 +0800
    chore

DeepEP commit 23ded3bd8d692755674ffb9ba18794701b6090e6 (HEAD -> patch-3, origin/patch-3)
Author: fzyzcjy [email protected]
Date: Tue Apr 29 09:58:31 2025 +0800
    Update deep_ep.cpp

Mooncake commit 168cc22f31d91e1272661372cdc262a0157d761a (HEAD -> main, origin/main, origin/HEAD)
Author: Feng Ren [email protected]
Date: Wed May 7 10:19:48 2025 +0800
    [DOC] Update README components (#331)

I also met this error. :(

Z-NAVY avatar May 07 '25 13:05 Z-NAVY

@Z-NAVY Hi, could you please try to curl it to see what is happening?

fzyzcjy avatar May 07 '25 13:05 fzyzcjy

I tried to run it on H20, and I also encountered the CUDA graph error, as follows:

[2025-05-07 13:37:52 DP7 TP7] DeepGEMM JIT Compiling for <gemm_fp8_fp8_bf16_nt> M=32, N=7168, K=2048. Please wait.
[2025-05-07 13:38:35 DP5 TP5] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 283, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 374, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 483, in capture_one_batch_size
    run_once()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 476, in run_once
    logits_output = forward(input_ids, forward_batch.positions, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2099, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1927, in forward
    hidden_states, residual = self._forward_tbo_layers(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2016, in _forward_tbo_layers
    return two_batch_overlap.model_forward_execute_two_batch(
  File "/sgl-workspace/sglang/python/sglang/srt/two_batch_overlap.py", line 148, in model_forward_execute_two_batch
    output_a, output_b = _execute_two_batch_raw(
  File "/sgl-workspace/sglang/python/sglang/srt/two_batch_overlap.py", line 166, in _execute_two_batch_raw
    executor_a.next()
  File "/sgl-workspace/sglang/python/sglang/srt/two_batch_overlap.py", line 192, in next
    self._stage_output = op.fn(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 582, in _forward_tbo_op_combine_a
    self.tbo_deepep_dispatchers[state.tbo_subbatch_index].combine_a(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 737, in combine_a
    inner_state = self._get_impl(forward_mode).combine_a(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 622, in combine_a
    hidden_states, event, hook = self._combine_core(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 640, in _combine_core
    combined_hidden_states, event, hook = buffer.low_latency_combine(
  File "/usr/local/lib/python3.10/dist-packages/deep_ep/buffer.py", line 533, in low_latency_combine
    combined_x, event, hook = self.runtime.low_latency_combine(x, topk_idx, topk_weights, src_info, layout_range,
RuntimeError: Failed: CUDA error /sgl-workspace/DeepEP/csrc/kernels/internode_ll.cu:532 'too many blocks in cooperative launch'
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2209, in run_scheduler_process
    scheduler = Scheduler(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 291, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 65, in __init__
    self.worker = TpModelWorker(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 78, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 220, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 272, in initialize
    self.init_cuda_graphs()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1036, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 285, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Failed: CUDA error /sgl-workspace/DeepEP/csrc/kernels/internode_ll.cu:532 'too many blocks in cooperative launch'
Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable cuda graph by --disable-cuda-graph. (Not recommonded. Huge perf loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 

feng397 avatar May 07 '25 13:05 feng397

@feng397 For that error, could you please try https://github.com/sgl-project/sglang/blob/38053c3372dd220911987bd8cb55b27448366497/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py#L441
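As a quick sanity check before applying that workaround, here is a hedged one-liner (assumption: the 'too many blocks in cooperative launch' failure is tied to the H20 having fewer SMs than H800/H100, which is what the comment at that line hints at):

# cooperative launches need all blocks resident at once, so the limit scales
# with the SM count; print the GPU name and SM count of device 0
python3 -c "import torch; p = torch.cuda.get_device_properties(0); print(p.name, p.multi_processor_count, 'SMs')"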

fzyzcjy avatar May 07 '25 14:05 fzyzcjy

@feng397 For that error, could you please try

sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py, line 441 in 38053c3:
     # For H20, there will be an CUDA error: DeepEP/csrc/kernels/internode_ll.cu:337 'too many blocks in cooperative launch'.

Thanks! It works! However, after I sent a test request, the decode node reported the following error:

INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:20000 (Press CTRL+C to quit)
[2025-05-07 14:43:04 DP9 TP9] Error fetching prefill parallel info from bootstrap: Failed to parse: http://192.168.0.108:None/route?engine_rank=-1&target_dp_group=-1
[2025-05-07 14:43:04 DP9 TP9] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2233, in run_scheduler_process
    scheduler.event_loop_overlap_disagg_decode()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 530, in event_loop_overlap_disagg_decode
    self.process_input_requests(recv_reqs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 810, in process_input_requests
    output = self._request_dispatcher(recv_req)
  File "/sgl-workspace/sglang/python/sglang/utils.py", line 471, in __call__
    return fn(obj)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 977, in handle_generate_request
    self._add_request_to_queue(req)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 984, in _add_request_to_queue
    self.disagg_decode_prealloc_queue.add(req)
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 142, in add
    kv_receiver = kv_receiver_class(
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mooncake/conn.py", line 459, in __init__
    self.prefill_dp_size, tp_size_per_dp_rank = (
TypeError: cannot unpack non-iterable NoneType object

[2025-05-07 14:43:05] Child process unexpectedly failed with an exit code 131. pid=13
[2025-05-07 14:43:05] Child process unexpectedly failed with an exit code 9. pid=177
[2025-05-07 14:43:05] Child process unexpectedly failed with an exit code 9. pid=754
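
A small diagnostic sketch based on the URL in the log above: the port printed as None is exactly what is failing, so substitute the bootstrap port your prefill node reports in its startup log for the hypothetical ${PREFILL_BOOTSTRAP_PORT} placeholder below.

# the decode side fetches prefill parallel info from this bootstrap route;
# if this returns nothing useful, the TypeError above is expected
curl "http://192.168.0.108:${PREFILL_BOOTSTRAP_PORT}/route?engine_rank=-1&target_dp_group=-1"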

feng397 avatar May 07 '25 14:05 feng397

@fzyzcjy After rerunning steps 1, 2, 3, curl to the server fails:

curl http://10.6.131.1:30000/server_info

curl: (7) Failed to connect to 10.6.131.1 port 30000 after 0 ms: Connection refused

mingxiao666 avatar May 07 '25 14:05 mingxiao666

@mingxiao666 Try connecting to port 8000 (the mini_lb) instead, for example:
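A sketch, assuming mini_lb is still running on node1 with its default port 8000 and forwards /generate requests like a regular SGLang server:

# talk to the load balancer, not the prefill/decode servers directly
curl -X POST http://10.6.131.1:8000/generate -H "Content-Type: application/json" -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 8}}'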

fzyzcjy avatar May 07 '25 15:05 fzyzcjy