Instructions for Running DeepSeek with Large-scale PD and EP
Environment Preparation
- Install SGLang on branch https://github.com/sgl-project/sglang/tree/deepseek_ep
- ~~https://github.com/sgl-project/sglang/pull/5524~~ (EDIT: do not use this branch since I am adding more code to it after the blog, please use deepseek_ep instead)
- ~~Install DeepEP on branch https://github.com/deepseek-ai/DeepEP/pull/142~~
- 2025.05.08 UPDATE: Directly using the latest DeepEP main is enough, since my PR has been merged
- Install the latest Mooncake
It is suggested to use this Dockerfile https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.deepep to prepare dependencies of DeepEP.
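If you prefer not to use the Dockerfile, a rough manual sketch of the same preparation is below; the NVSHMEM path and the exact pip extras are assumptions about your environment, so treat the Dockerfile as the reference.
# install SGLang from the deepseek_ep branch (a sketch; adjust paths/extras as needed)
git clone -b deepseek_ep https://github.com/sgl-project/sglang.git
cd sglang && pip install -e "python[all]" && cd ..
# DeepEP: the latest main is enough, since the referenced PR has been merged
git clone https://github.com/deepseek-ai/DeepEP.git
cd DeepEP && NVSHMEM_DIR=/path/to/nvshmem python setup.py install && cd ..
# latest Mooncake transfer engine
pip install --upgrade mooncake-transfer-engine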
Stress-testing Prefill Nodes
- It is suggested to use 4 prefill nodes and 8 decode nodes to reproduce our results, since 4 prefill nodes is the setting in DeepSeek's blog. (Example values for the shell variables used in the commands below are sketched after the benchmark command.)
# prefill nodes
MC_TE_METRIC=true SGLANG_HACK_DEEPEP_NEW_MODE=0 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode prefill --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5757 --nnodes ${num_prefill} --node-rank ${node_rank} --tp-size $((${num_prefill}*8)) --dp-size $((${num_prefill}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size $((${num_prefill}*131072)) --max-running-requests $((${num_prefill}*2048)) --max-total-tokens 131072 --context-length 8192 --init-expert-location YOUR_EXPERT_LOCATION_HERE --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --disable-radix-cache --ep-dispatch-algorithm random
# decode nodes
SGLANG_HACK_DEEPEP_NEW_MODE=0 SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=102 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode decode --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5757 --nnodes ${num_decode} --node-rank ${node_rank} --tp-size $((${num_decode}*8)) --dp-size $((${num_decode}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.82 --max-running-requests $((${num_decode}*1024)) --context-length 4500 --init-expert-location YOUR_EXPERT_LOCATION_HERE --enable-two-batch-overlap --moe-dense-tp-size 1 --cuda-graph-bs 128 --disable-radix-cache --decode-log-interval 1
# load balancer
python3 -m sglang.srt.disaggregation.mini_lb --prefill "http://YOUR_FIRST_PREFILL_NODE_IP:30000" --decode "http://YOUR_FIRST_DECODE_NODE_IP:30000"
# benchmark
python3 -m sglang.bench_one_batch_server --model-path ${model_path} --base-url http://YOUR_IP:8000 --batch-size 8192 --input-len 4096 --output-len 5 --skip-warmup
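For reference, here is one way the placeholder shell variables in the commands above could be set; the concrete values are assumptions, so adjust them to your cluster.
# example values for the placeholders above (assumptions; adjust to your cluster)
model_path=/path/to/DeepSeek-V3-0324          # local model directory
device_name=mlx5_0                            # RDMA NIC name, e.g. from `ibv_devices`
master_ip=10.0.0.1                            # IP of the node with --node-rank 0
node_ip=$(hostname -I | awk '{print $1}')     # IP of the current node
num_prefill=4                                 # number of prefill nodes
num_decode=8                                  # number of decode nodes
node_rank=0                                   # 0..nnodes-1, different on each node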
Stress-testing Decode Nodes
- It is suggested to use 3 prefill nodes and 9 decode nodes to reproduce our results, since 9 decode nodes is half the size of that in DeepSeek’s blog.
- SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS can be set to benchmark-output-len + 2 to maximize the batch size; see the example below.
- The example below also demonstrates how to use the slow_down debug feature to stress-test decode nodes when there are not enough prefill nodes. If your test workload has enough prefill nodes, this can be omitted.
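For example (an illustration of the rule above, using the --output-len 100 from the benchmark command further below), YOUR_NUM_HERE in the decode command would be:
# benchmark runs with --output-len 100, so reserve 100 + 2 = 102 decode tokens
SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=102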
# prefill nodes
SGLANG_HACK_DEEPEP_NEW_MODE=0 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode prefill --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5050 --nnodes ${num_prefill} --node-rank ${node_rank} --tp-size $((${num_prefill}*8)) --dp-size $((${num_prefill}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size $((${num_prefill}*65536)) --max-running-requests $((${num_prefill}*2048)) --max-total-tokens 131076 --context-length 8192 --init-expert-location YOUR_EXPERT_LOCATION_HERE --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --disable-radix-cache
# decode nodes
SGLANG_HACK_DEEPEP_NEW_MODE=0 SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=YOUR_NUM_HERE SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ${model_path} --disaggregation-mode decode --disaggregation-ib-device ${device_name} --host ${node_ip} --trust-remote-code --dist-init-addr ${master_ip}:5050 --nnodes ${num_decode} --node-rank ${node_rank} --tp-size $((${num_decode}*8)) --dp-size $((${num_decode}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.846 --chunked-prefill-size 81920 --max-running-requests $((${num_decode}*2048)) --context-length 4096 --init-expert-location YOUR_EXPERT_LOCATION_HERE --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --cuda-graph-bs 256 --disable-radix-cache --decode-log-interval 1
# load balancer
python3 -m sglang.srt.disaggregation.mini_lb --prefill "http://YOUR_FIRST_PREFILL_NODE_IP:30000" --decode "http://YOUR_FIRST_DECODE_NODE_IP:30000"
# slow down D nodes
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": 90.0}' -X POST "http://YOUR_FIRST_DECODE_NODE_IP:30000/slow_down"
# start benchmark; do not wait for this to finish before running the next line
python3 -m sglang.bench_one_batch_server --model-path /dev/shm/DeepSeek-V3-0324 --base-url http://10.10.37.16:7000 --batch-size 40000 --input-len 2000 --output-len 100 --skip-warmup
# after some time (e.g. 10 minutes), the D nodes become saturated; then run the following command
# stop slowing down D nodes
curl -H "Content-Type: application/json" -d '{"forward_sleep_time": null}' -X POST "http://YOUR_FIRST_DECODE_NODE_IP:30000/slow_down"
Analyzing Results
Since we are stress testing one side of P or D, we need to look at the server logs instead of benchmark script outputs.
- Prefill: For logs like `Prefill batch. ... #new-token: 16384 ... gap_latency: 2.561`, the performance is 16384 / 2.561 token/second/device (see the sketch after this list).
- Decode: The result can be read from `gen throughput (token/s)` in the logs.
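If you want to script the prefill calculation, here is a small sketch; it assumes the log line format shown above, so adjust the field names if your log differs.
# compute token/second/device from a prefill log line (format as in the example above)
line='Prefill batch. ... #new-token: 16384 ... gap_latency: 2.561'
echo "$line" | awk '{
  for (i = 1; i <= NF; i++) {
    if ($i == "#new-token:")  tokens = $(i + 1) + 0   # "+ 0" strips any trailing comma
    if ($i == "gap_latency:") gap    = $(i + 1) + 0
  }
  printf "%.1f token/second/device\n", tokens / gap    # 16384 / 2.561 ~ 6397.5
}'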
Remarks
- Please ensure the batch size is full and avoid padding; otherwise the performance is suboptimal due to a bug that we will address soon.
- For example, to keep a full batch size of 256 on each of 72 decode GPUs (256 × 72 = 18,432 concurrent requests), it is reasonable to send 40000 requests.
- The sample command above only captures a CUDA graph of size 256 to save memory, which can be modified to suit your scenarios.
- For optimal performance, you may need to tune components such as DeepEP on your cluster.
- DeepGEMM warmup during execution causes seemingly slow overall performance and should be excluded from the analysis.
- We rushed in the last few days, so the code is really ugly now with many hacks. We will make it elegant when merging into master.
- For expert distribution statistics, our experiments use ones matching the input/output data, and we provide them as follows for reproducibility: attachment_ep_statistics.zip
- To debug prefill performance, it may be useful to temporarily use --ep-dispatch-algorithm fake_grouped_uniform to simulate a fake perfect EPLB; the result should match the corresponding performance reported in the blog.
- To analyze performance, it is suggested to use the logs instead of the benchmark script output, because the script output also covers the starting and ending phases, where the system is not fully utilized and is slow.
Report Template
If you face any issues, feel free to discuss here or in the Slack channel; it would be great if you could provide the following information:
- Full command to start server and benchmark
- Logs of all server nodes and benchmark
Feel free to join https://slack.sglang.ai #deepseek-large-scale-serving to discuss. Cheers!
How can we get the --init-expert-location file?
@PheasantX Either use the one I have already made by downloading attachment_ep_statistics.zip, or create one yourself (I will write guidance for the latter, but I hope to first get everything into master so that users can use it more easily)
@fzyzcjy Thanks! Looking forward to it.
You are welcome!
Hmm, I do not encounter this error, but maybe try changing that line to `from sglang.srt.server_args import ServerArgs`.
Hi, I built the base image from the deepseek_ep branch, and then used Dockerfile.deepep to build the deepep image, but when starting sglang, I get the following error. How can I solve it? Thank you!
Reinstalling sglang from the deepseek_ep branch might resolve this issue.
During my request to the 'disaggregation.mini_lb' proxy service, an error occurred: the decode instance was unable to obtain the prefill instance's TP size, producing the following error message:
[2025-05-07 03:21:07 DP0 TP0] Error fetching prefill parallel info from bootstrap: Failed to parse: http://localhost:None/route?engine_rank=-1&target_dp_group=-1
self._get_prefill_dp_size_from_server() None
[2025-05-07 03:21:07 DP0 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/data/root/sglang/python/sglang/srt/managers/scheduler.py", line 2233, in run_scheduler_process
scheduler.event_loop_overlap_disagg_decode()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/data/root/sglang/python/sglang/srt/disaggregation/decode.py", line 530, in event_loop_overlap_disagg_decode
self.process_input_requests(recv_reqs)
File "/data/root/sglang/python/sglang/srt/managers/scheduler.py", line 810, in process_input_requests
output = self._request_dispatcher(recv_req)
File "/data/root/sglang/python/sglang/utils.py", line 471, in __call__
return fn(obj)
File "/data/root/sglang/python/sglang/srt/managers/scheduler.py", line 977, in handle_generate_request
self._add_request_to_queue(req)
File "/data/root/sglang/python/sglang/srt/managers/scheduler.py", line 984, in _add_request_to_queue
self.disagg_decode_prealloc_queue.add(req)
File "/data/root/sglang/python/sglang/srt/disaggregation/decode.py", line 142, in add
kv_receiver = kv_receiver_class(
File "/data/root/sglang/python/sglang/srt/disaggregation/mooncake/conn.py", line 461, in __init__
self.prefill_dp_size, tp_size_per_dp_rank = data
TypeError: cannot unpack non-iterable NoneType object
Here is my request command:
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "test",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream":false,
"max_tokens": 30
}'
Are there any potential solutions for this issue? Could it be resolved by specifying additional parameters in the startup command, such as explicitly defining the IP and port of the prefill node?
Hmm, I do not encounter this error, but maybe try changing that line to `from sglang.srt.server_args import ServerArgs`.

After this modification, it is OK. Thanks.
@Nekofish-L Your log
[2025-05-07 03:21:07 DP0 TP0] Error fetching prefill parallel info from bootstrap: Failed to parse: http://localhost:None/route?engine_rank=-1&target_dp_group=-1
seems to say localhost:None. Could you please check whether there is somewhere the port should be set? Or maybe post the full start commands.
attachment_ep_statistics.zip
@fzyzcjy May I know where to download attachment_ep_statistics.zip? I would like to have a quick try. Besides, do these EP statistics suit different node configurations? I only have 2 nodes of Hxx with 141G at hand. Thanks in advance.
Here, https://github.com/user-attachments/files/20036217/attachment_ep_statistics.zip
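For anyone else looking for it, here is a quick hedged sketch of fetching the archive and pointing the server at one of the JSON files inside; decode_in1000out1000.json is the file name used in a command later in this thread, so pick whichever statistics file matches your workload.
# download and unpack the expert-location statistics
wget https://github.com/user-attachments/files/20036217/attachment_ep_statistics.zip
unzip attachment_ep_statistics.zip -d attachment_ep_statistics
# then point the server at one of the extracted files, e.g.:
#   --init-expert-location attachment_ep_statistics/decode_in1000out1000.json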
It is suggested to use this Dockerfile https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.deepep to prepare dependencies of DeepEP.
It seems that there is already a pre-installed sglang (0.4.6.post2, /sgl-workspace/sglang/python) in the base image lmsysorg/sglang:latest, so I should also replace this sglang and reinstall from the deepseek_ep branch. Right?
so I should also replace this sglang and reinstall from the deepseek_ep branch. Right?
Yes
Nice work! I reproduced this on 4 H20 nodes (8 GPUs per node) with 2 prefill nodes and 2 decode nodes, and I got the error below:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 115, in forward_thread_func
with torch.get_device_module(self.device).stream(self.forward_stream):
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 595, in __exit__
torch.cuda.set_stream(self.src_prev_stream) # type: ignore[arg-type]
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 636, in set_stream
_set_stream_by_id(
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 618, in _set_stream_by_id
torch._C._cuda_setStream(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank10]:[W507 09:29:36.399571193 CUDAGuardImpl.h:119] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
[rank10]:[W507 09:29:36.399592190 CUDAGuardImpl.h:120] Warning: CUDA warning: CUDA-capable device(s) is/are busy or unavailable (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f4ca996c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4ca9915a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f4ca9d71918 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x21006 (0x7f4ca9d38006 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x22507 (0x7f4ca9d39507 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x2270f (0x7f4ca9d3970f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x6417b2 (0x7f4ca16d07b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f30f (0x7f4ca994d30f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f4ca994633b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f4ca99464e9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8fefb8 (0x7f4ca198dfb8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2f6 (0x7f4ca198e306 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x181370 (0x556208c8d370 in sglang::scheduler_DP10_TP10)
frame #13: <unknown function> + 0x194588 (0x556208ca0588 in sglang::scheduler_DP10_TP10)
frame #14: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #15: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #16: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #17: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #18: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #19: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #20: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #21: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #22: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #23: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #24: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #25: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #26: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #27: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #28: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #29: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #30: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #31: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #32: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #33: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #34: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #35: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #36: <unknown function> + 0x181370 (0x556208c8d370 in sglang::scheduler_DP10_TP10)
frame #37: <unknown function> + 0x194588 (0x556208ca0588 in sglang::scheduler_DP10_TP10)
frame #38: <unknown function> + 0x19459c (0x556208ca059c in sglang::scheduler_DP10_TP10)
frame #39: _PyEval_EvalFrameDefault + 0x2c07 (0x556208c83e77 in sglang::scheduler_DP10_TP10)
frame #40: <unknown function> + 0x1989f1 (0x556208ca49f1 in sglang::scheduler_DP10_TP10)
frame #41: _PyEval_EvalFrameDefault + 0x2a83 (0x556208c83cf3 in sglang::scheduler_DP10_TP10)
frame #42: _PyFunction_Vectorcall + 0x7c (0x556208c9766c in sglang::scheduler_DP10_TP10)
frame #43: _PyEval_EvalFrameDefault + 0x804 (0x556208c81a74 in sglang::scheduler_DP10_TP10)
frame #44: _PyFunction_Vectorcall + 0x7c (0x556208c9766c in sglang::scheduler_DP10_TP10)
frame #45: _PyEval_EvalFrameDefault + 0x804 (0x556208c81a74 in sglang::scheduler_DP10_TP10)
frame #46: <unknown function> + 0x1989f1 (0x556208ca49f1 in sglang::scheduler_DP10_TP10)
frame #47: <unknown function> + 0x2acfca (0x556208db8fca in sglang::scheduler_DP10_TP10)
frame #48: <unknown function> + 0x2a28e8 (0x556208dae8e8 in sglang::scheduler_DP10_TP10)
frame #49: <unknown function> + 0x94ac3 (0x7f4d4f37eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #50: <unknown function> + 0x126850 (0x7f4d4f410850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Fatal Python error: Aborted
Thread 0x00007f3587fff640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/queue.py", line 180 in get
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mooncake/conn.py", line 265 in transfer_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f359cff9640 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/socket.py", line 799 in recv_multipart
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mooncake/conn.py", line 244 in bootstrap_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f486c894640 (most recent call first):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1752 in watchdog_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Current thread 0x00007f486d336640 (most recent call first):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 120 in forward_thread_func
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f48bb7fe640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f49bedff640 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 53 in _recv_msg
File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 161 in _read_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f4d4f2e9480 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/usr/lib/python3.10/queue.py", line 171 in get
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 176 in resolve_batch_result
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 303 in process_batch_result_disagg_prefill
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 256 in event_loop_overlap_disagg_prefill
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2228 in run_scheduler_process
File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
File "<string>", line 1 in <module>
Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, charset_normalizer.md, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, zmq.backend.cython._zmq, PIL._imaging, yaml._yaml, markupsafe._speedups, PIL._imagingft, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, sentencepiece._sentencepiece[2025-05-07 09:29:36 DP11 TP11] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 116, in forward_thread_func
self.forward_thread_func_()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 147, in forward_thread_func_
logits_output, next_token_ids = self.worker.forward_batch_generation(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 181, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1122, in forward
return self._forward_raw(forward_batch, skip_attn_backend_init)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1155, in _forward_raw
return self.forward_extend(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1064, in forward_extend
return self.model.forward(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2099, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1916, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1462, in forward
return self.forward_ffn_with_scattered_input(
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1617, in forward_ffn_with_scattered_input
hidden_states = self.mlp(hidden_states, forward_batch.forward_mode)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 355, in forward
return self.forward_deepep(hidden_states, forward_mode)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 398, in forward_deepep
) = self.deepep_dispatcher.dispatch_b()
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 724, in dispatch_b
return self._get_impl(forward_mode).dispatch_b(*inner_state)
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 275, in dispatch_b
) = self._dispatch_core(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 360, in _dispatch_core
) = buffer.dispatch(
File "/usr/local/lib/python3.10/dist-packages/deep_ep/buffer.py", line 282, in dispatch
return self.internode_dispatch(x, handle, num_tokens_per_rank, num_tokens_per_rdma_rank, is_token_in_rank, num_tokens_per_expert,
File "/usr/local/lib/python3.10/dist-packages/deep_ep/buffer.py", line 390, in internode_dispatch
recv_src_meta, send_rdma_head, send_nvl_head, event = self.runtime.internode_dispatch(
RuntimeError: DeepEP error: timeout (dispatch CPU)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 115, in forward_thread_func
with torch.get_device_module(self.device).stream(self.forward_stream):
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 595, in __exit__
torch.cuda.set_stream(self.src_prev_stream) # type: ignore[arg-type]
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 636, in set_stream
_set_stream_by_id(
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 618, in _set_stream_by_id
torch._C._cuda_setStream(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank11]:[W507 09:29:36.416755948 CUDAGuardImpl.h:119] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
[rank11]:[W507 09:29:36.416774665 CUDAGuardImpl.h:120] Warning: CUDA warning: CUDA-capable device(s) is/are busy or unavailable (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8e3596c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8e35915a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8e35d0a918 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x21006 (0x7f8e35cd1006 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x22507 (0x7f8e35cd2507 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x2270f (0x7f8e35cd270f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x6417b2 (0x7f8e2d2d07b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f30f (0x7f8e3594d30f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f8e3594633b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f8e359464e9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8fefb8 (0x7f8e2d58dfb8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2f6 (0x7f8e2d58e306 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x181370 (0x561e425e5370 in sglang::scheduler_DP11_TP11)
frame #13: <unknown function> + 0x194588 (0x561e425f8588 in sglang::scheduler_DP11_TP11)
frame #14: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #15: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #16: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #17: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #18: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #19: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #20: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #21: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #22: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #23: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #24: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #25: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #26: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #27: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #28: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #29: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #30: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #31: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #32: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #33: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #34: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #35: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #36: <unknown function> + 0x181370 (0x561e425e5370 in sglang::scheduler_DP11_TP11)
frame #37: <unknown function> + 0x194588 (0x561e425f8588 in sglang::scheduler_DP11_TP11)
frame #38: <unknown function> + 0x19459c (0x561e425f859c in sglang::scheduler_DP11_TP11)
frame #39: _PyEval_EvalFrameDefault + 0x2c07 (0x561e425dbe77 in sglang::scheduler_DP11_TP11)
frame #40: <unknown function> + 0x1989f1 (0x561e425fc9f1 in sglang::scheduler_DP11_TP11)
frame #41: _PyEval_EvalFrameDefault + 0x2a83 (0x561e425dbcf3 in sglang::scheduler_DP11_TP11)
frame #42: _PyFunction_Vectorcall + 0x7c (0x561e425ef66c in sglang::scheduler_DP11_TP11)
frame #43: _PyEval_EvalFrameDefault + 0x804 (0x561e425d9a74 in sglang::scheduler_DP11_TP11)
frame #44: _PyFunction_Vectorcall + 0x7c (0x561e425ef66c in sglang::scheduler_DP11_TP11)
frame #45: _PyEval_EvalFrameDefault + 0x804 (0x561e425d9a74 in sglang::scheduler_DP11_TP11)
frame #46: <unknown function> + 0x1989f1 (0x561e425fc9f1 in sglang::scheduler_DP11_TP11)
frame #47: <unknown function> + 0x2acfca (0x561e42710fca in sglang::scheduler_DP11_TP11)
frame #48: <unknown function> + 0x2a28e8 (0x561e427068e8 in sglang::scheduler_DP11_TP11)
frame #49: <unknown function> + 0x94ac3 (0x7f8edb063ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #50: <unknown function> + 0x126850 (0x7f8edb0f5850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Fatal Python error: Aborted
Thread 0x00007f775a7fc640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/queue.py", line 180 in get
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mooncake/conn.py", line 265 in transfer_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f775affd640 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/socket.py", line 799 in recv_multipart
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mooncake/conn.py", line 244 in bootstrap_thread
File "/usr/lib/python3.10/threading.py", line 953 in , runuvloop.loop
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f8a1caff640 (most recent call first):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1752 in watchdog_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Current thread 0x00007f8a1dfff640 (most recent call first):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 120 in forward_thread_func
File "/usr/lib/python3.10/threading.py", line 953, in setproctitle._setproctitlerun
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f8a64ff9640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f8b46aff640 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 53 in _recv_msg
File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 161 in _read_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f8edafce480 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/usr/lib/python3.10/queue.py", line 171 in get
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 176 in resolve_batch_result
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 303 in process_batch_result_disagg_prefill
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 256 in event_loop_overlap_disagg_prefill
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2228 in run_scheduler_process
File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
File "<string>", line 1 in <module>
Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, charset_normalizer.md, cuda_utils, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, regex._regex, psutil._psutil_linux, psutil._psutil_posix, zmq.backend.cython._zmq, PIL._imaging, yaml._yaml, markupsafe._speedups, PIL._imagingft, __triton_launcher (total: 43)
, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, sentencepiece._sentencepiece, uvloop.loop, setproctitle._setproctitle, cuda_utils, regex._regex, __triton_launcher (total: 43)
[2025-05-07 09:29:36 DP9 TP9] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 116, in forward_thread_func
self.forward_thread_func_()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 147, in forward_thread_func_
logits_output, next_token_ids = self.worker.forward_batch_generation(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 181, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1122, in forward
return self._forward_raw(forward_batch, skip_attn_backend_init)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1155, in _forward_raw
return self.forward_extend(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1064, in forward_extend
return self.model.forward(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2099, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1916, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1462, in forward
return self.forward_ffn_with_scattered_input(
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1617, in forward_ffn_with_scattered_input
hidden_states = self.mlp(hidden_states, forward_batch.forward_mode)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 355, in forward
return self.forward_deepep(hidden_states, forward_mode)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 398, in forward_deepep
) = self.deepep_dispatcher.dispatch_b()
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 724, in dispatch_b
return self._get_impl(forward_mode).dispatch_b(*inner_state)
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 275, in dispatch_b
) = self._dispatch_core(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 360, in _dispatch_core
) = buffer.dispatch(
File "/usr/local/lib/python3.10/dist-packages/deep_ep/buffer.py", line 282, in dispatch
return self.internode_dispatch(x, handle, num_tokens_per_rank, num_tokens_per_rdma_rank, is_token_in_rank, num_tokens_per_expert,
File "/usr/local/lib/python3.10/dist-packages/deep_ep/buffer.py", line 390, in internode_dispatch
recv_src_meta, send_rdma_head, send_nvl_head, event = self.runtime.internode_dispatch(
RuntimeError: DeepEP error: timeout (dispatch CPU)
Could you help me solve it? Thanks!
====================== how to reproduce ======================
- Prefill node 0/1
model_path="/mnt/nvme0/models/DeepSeek-R1"
device_name="mlx5_0,mlx5_3,mlx5_4,mlx5_5"
num_prefill=2
node_rank=0/1
master_ip="xxxx"
MC_TE_METRIC=true \
SGLANG_HACK_DEEPEP_NEW_MODE=0 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
python3 -m sglang.launch_server --model-path ${model_path} \
--disaggregation-mode prefill \
--dist-timeout 3600 \
--dist-init-addr ${master_ip}:5757 \
--trust-remote-code \
--nnodes ${num_prefill} --node-rank ${node_rank} --tp-size $((${num_prefill}*8)) \
--dp-size $((${num_prefill}*8)) --enable-dp-attention --enable-deepep-moe --deepep-mode normal \
--ep-num-redundant-experts 32 \
--mem-fraction-static 0.8 --chunked-prefill-size $((${num_prefill}*131072)) \
--max-running-requests $((${num_prefill}*2048)) --max-total-tokens 131072 --context-length 8192 \
--host 127.0.0.1 --port 40000 \
--disaggregation-ib-device ${device_name}
- Decode node 0/1
model_path="/mnt/nvme0/models/DeepSeek-R1"
device_name="mlx5_0,mlx5_3,mlx5_4,mlx5_5"
num_decode=2
node_rank=0/1
master_ip="xxxx"
SGLANG_HACK_DEEPEP_NEW_MODE=0 \
SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=102 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
python3 -m sglang.launch_server --model-path ${model_path} \
--disaggregation-mode decode \
--dist-timeout 3600 \
--disaggregation-transfer-backend mooncake \
--trust-remote-code \
--dist-init-addr ${master_ip}:5757 \
--nnodes ${num_decode} --node-rank ${node_rank} --tp-size $((${num_decode}*8)) \
--dp-size $((${num_decode}*8)) --enable-dp-attention \
--enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.82 \
--max-running-requests $((${num_decode}*1024)) --context-length 8192 \
--enable-two-batch-overlap \
--moe-dense-tp-size 1 --cuda-graph-bs 128 --disable-radix-cache --decode-log-interval 1 \
--host 0.0.0.0 --port 40000 \
--disaggregation-ib-device ${device_name}
- load balancer
prefill_master_ip="xxxx"
prefill_port="40000"
decode_master_ip="xxxx"
decode_port="40000"
python3 -m sglang.srt.disaggregation.mini_lb \
--prefill "http://${prefill_master_ip}:${prefill_port}" \
--decode "http://${decode_master_ip}:${decode_port}"
- benchmark
model_path="/mnt/nvme0/models/DeepSeek-R1"
base_url="http://xxxx:8000"
python3 -m sglang.bench_one_batch_server --model-path ${model_path} \
--base-url ${base_url} \
--batch-size 8192 --input-len 4096 --output-len 5 --skip-warmup
Looks like "DeepEP error: timeout". Could you please check all nodes' logs to see whether there are other errors before this? Often it is caused by, e.g., one node failing.
@fzyzcjy Could you please help me check what the problem is? Thanks in advance.
I could successfully complete the first 3 steps. Step 1 on node1: SGLANG_HACK_DEEPEP_NEW_MODE=0 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ~/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V3/snapshots/1d044fd82b15f1cedb197a288e50cc96a2c27205/ --disaggregation-mode prefill --trust-remote-code --dist-init-addr 10.6.131.1:5757 --nnodes 1 --node-rank 0 --tp-size 8 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size 65536 --max-running-requests 2048 --max-total-tokens 131076 --context-length 8192 --init-expert-location /root/.cache/huggingface/attachment_ep_statistics/decode_in1000out1000.json --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --disable-radix-cache > prefill.log 2>&1 &
Step 2 on node2: SGLANG_HACK_DEEPEP_NEW_MODE=0 SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=102 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ~/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V3/snapshots/1d044fd82b15f1cedb197a288e50cc96a2c27205/ --disaggregation-mode decode --trust-remote-code --dist-init-addr 10.6.131.2:5757 --nnodes 1 --node-rank 0 --tp-size 8 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.82 --max-running-requests 1024 --context-length 4500 --init-expert-location /root/.cache/huggingface/attachment_ep_statistics/decode_in1000out1000.json --enable-two-batch-overlap --moe-dense-tp-size 1 --cuda-graph-bs 128 --disable-radix-cache --decode-log-interval 1 > decoder.log 2>&1 &
Step 3 on node1: python3 -m sglang.srt.disaggregation.mini_lb --prefill "http://10.6.131.1:30000" --decode "http://10.6.131.2:30000" > loader.log 2>&1 &
However, when I tried to run step 4 on node1: python3 -m sglang.bench_one_batch_server --model-path ~/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V3/snapshots/1d044fd82b15f1cedb197a288e50cc96a2c27205/ --base-url http://10.6.131.1:8000 --batch-size 256 --input-len 4096 --output-len 5 --skip-warmup > bench.log 2>&1 &
I met the following error:
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 974, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
If I rerun steps 1, 2, 3 and instead run the benchmark in another way: python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 3000 --random-input 1000 --random-output 1000 --max-concurrency 64 --random-range-ratio 1 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl --host 127.0.0.1 --port 30000
I met this error "aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data for satisfy transfer length header.'>"
SW version Info see below: sglang commit: commit 6fa2c029c23659615e2757aa6d10ac9d95d28f25 (HEAD -> feat/dev_branch, origin/feat/dev_branch) Author: fzyzcjy [email protected] Date: Sun May 4 20:12:10 2025 +0800
chore
DeepEP commit: commit 23ded3bd8d692755674ffb9ba18794701b6090e6 (HEAD -> patch-3, origin/patch-3) Author: fzyzcjy [email protected] Date: Tue Apr 29 09:58:31 2025 +0800
Update deep_ep.cpp
Mooncake commit: commit 168cc22f31d91e1272661372cdc262a0157d761a (HEAD -> main, origin/main, origin/HEAD) Author: Feng Ren [email protected] Date: Wed May 7 10:19:48 2025 +0800
[DOC] Update README components (#331)
@mingxiao666 Hi, could you please provide more complete logs? Also, I cannot tell whether that error comes from your bench command or the server... If it is the bench command, could you please first try to curl the mini_lb, send an example prompt to it, and see whether it provides a response?
Thanks for the quick reply, your guess makes sense.
The error above is on the bench side; for the loader, the error is as below:
root@H20-GPU-01:~/.cache/huggingface/sglang-deepep# tail -f loader.log
    proto = await self._create_connection(req, traces, timeout)
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/connector.py", line 1056, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/connector.py", line 1406, in _create_direct_connection
    raise last_exc
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/connector.py", line 1375, in _create_direct_connection
    transp, proto = await self._wrap_create_connection(
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/connector.py", line 1130, in _wrap_create_connection
    raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.6.131.1:30000 ssl:default [Connect call failed ('10.6.131.1', 30000)]
No error log is printed on the prefill & decoder side.
I feel quite confused about the port numbers in the 4 steps above (I basically copied them from your guide): for step 1 & step 2 the port number is 5757, for step 3 it is 30000, and for step 4 it is 8000 (http://10.6.131.1:8000/). I guess something is wrong with the port numbers, but when I changed the port from 8000 to 30000 in step 4, it still did not work. Could you please help confirm whether the port settings are okay? Thanks in advance.
I tried to run it on H20, but I encountered the following error when capturing the CUDA graph on the decode nodes. Adding --disable-cuda-graph can fix it, but then the decode speed is too slow. Could you please tell me how to solve this with CUDA graphs enabled?
free(): invalid pointer
Fatal Python error: Aborted
Thread 0x00007f6e867fc640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f6fa9bff640 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 53 in _recv_msg
File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 161 in _read_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Current thread 0x00007f7344ab01c0 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/deep_ep/buffer.py", line 88 in __init__
File "/root/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 155 in get_deepep_buffer
File "/root/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 653 in _get_buffer
File "/root/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 513 in dispatch_a
File "/root/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 713 in dispatch_a
File "/root/sglang/python/sglang/srt/models/deepseek_v2.py", line 518 in _forward_deepep_dispatch_a_part_two
File "/root/sglang/python/sglang/srt/models/deepseek_v2.py", line 553 in _forward_tbo_op_dispatch_a_part_two
File "/root/sglang/python/sglang/srt/two_batch_overlap.py", line 192 in next
File "/root/sglang/python/sglang/srt/two_batch_overlap.py", line 166 in _execute_two_batch_raw
File "/root/sglang/python/sglang/srt/two_batch_overlap.py", line 148 in model_forward_execute_two_batch
File "/root/sglang/python/sglang/srt/models/deepseek_v2.py", line 2016 in _forward_tbo_layers
File "/root/sglang/python/sglang/srt/models/deepseek_v2.py", line 1927 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
File "/root/sglang/python/sglang/srt/models/deepseek_v2.py", line 2099 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/root/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 476 in run_once
File "/root/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 483 in capture_one_batch_size
File "/root/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 374 in capture
File "/root/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 283 in __init__
File "/root/sglang/python/sglang/srt/model_executor/model_runner.py", line 1036 in init_cuda_graphs
File "/root/sglang/python/sglang/srt/model_executor/model_runner.py", line 272 in initialize
File "/root/sglang/python/sglang/srt/model_executor/model_runner.py", line 220 in __init__
File "/root/sglang/python/sglang/srt/managers/tp_worker.py", line 78 in __init__
File "/root/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 65 in __init__
File "/root/sglang/python/sglang/srt/managers/scheduler.py", line 291 in __init__
File "/root/sglang/python/sglang/srt/managers/scheduler.py", line 2209 in run_scheduler_process
File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
File "<string>", line 1 in <module>
Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, charset_normalizer.md, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, zmq.backend.cython._zmq, PIL._imaging, yaml._yaml, markupsafe._speedups, PIL._imagingft, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, sentencepiece._sentencepiece, cython.cimports.libc.math, Cython.Utils, Cython.Plex.Actions, Cython.Plex.Transitions, Cython.Plex.Machines, Cython.Plex.DFA, Cython.Plex.Scanners, Cython.Compiler.Scanning, Cython.StringIOTree, Cython.Compiler.Code, uvloop.loop, setproctitle._setproctitle, cuda_utils, regex._regex (total: 52)
munmap_chunk(): invalid pointer
Fatal Python error: Aborted
Looks like a DeepEP timeout error. Could you please check all nodes' logs to see whether there are other errors before this? Often it is caused by e.g. one node failing.
Yeah, you're right. The error is that one H20 node is out of CUDA memory.
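For anyone hitting the same symptom, a quick sketch of how to narrow this down (nvidia-smi is just the generic tool here, and the 0.75 below is an illustrative value, not a tuned recommendation):
# watch GPU memory on every node while the servers start up and the benchmark runs
watch -n 1 nvidia-smi
# if one node consistently runs out of memory, relaunch it with a smaller static memory fraction, e.g.
#   --mem-fraction-static 0.75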
I feel quite confused about the port numbers in the above 4 steps (I basically copied them from your guide): for step 1 & step 2 the port is 5757, for step 3 it is 30000, and for step 4 it is 8000 (http://10.6.131.1:8000/). I guess something is wrong with the port numbers, but when I changed the port from 8000 to 30000 in step 4, it still did not work. Could you please help confirm whether the port settings are okay? Thanks in advance.
- --dist-init-addr: the port used internally for the multi-node processes to coordinate
- --port: the port exposed to end users
- in this case, however, the ports of the prefill nodes and decode nodes should NOT be used directly; users should talk to the mini_lb instead
- the mini_lb opens a port (8000 by default) for normal users and contacts the prefill and decode nodes internally
So at a quick glance the ports do not look wrong. But feel free to use the curl trick shown below to check whether something is off.
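For example (a minimal sketch, assuming the workers use the default --port 30000 and the standard /health endpoint of the SGLang HTTP server; replace the IPs with your own):
# each prefill / decode worker should answer on its own HTTP port
curl http://10.6.131.1:30000/health
curl http://10.6.131.2:30000/health
# the benchmark itself, however, should target the mini_lb port (8000 by default), not the workers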
@tanconghui Hi, could you please show full logs? (probably in a gist etc)
Also I am wondering whether it could be caused by, e.g., OOM.
@fzyzcjy Could you please help me check what the problem is? Thanks in advance.
I could successfully complete the first 3 steps.
Step 1 on node 1:
SGLANG_HACK_DEEPEP_NEW_MODE=0 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ~/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V3/snapshots/1d044fd82b15f1cedb197a288e50cc96a2c27205/ --disaggregation-mode prefill --trust-remote-code --dist-init-addr 10.6.131.1:5757 --nnodes 1 --node-rank 0 --tp-size 8 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size 65536 --max-running-requests 2048 --max-total-tokens 131076 --context-length 8192 --init-expert-location /root/.cache/huggingface/attachment_ep_statistics/decode_in1000out1000.json --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --disable-radix-cache > prefill.log 2>&1 &
Step 2 on node 2:
SGLANG_HACK_DEEPEP_NEW_MODE=0 SGLANG_HACK_PD_DECODE_NUM_RESERVED_DECODE_TOKENS=102 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --model-path ~/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V3/snapshots/1d044fd82b15f1cedb197a288e50cc96a2c27205/ --disaggregation-mode decode --trust-remote-code --dist-init-addr 10.6.131.2:5757 --nnodes 1 --node-rank 0 --tp-size 8 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.82 --max-running-requests 1024 --context-length 4500 --init-expert-location /root/.cache/huggingface/attachment_ep_statistics/decode_in1000out1000.json --enable-two-batch-overlap --moe-dense-tp-size 1 --cuda-graph-bs 128 --disable-radix-cache --decode-log-interval 1 > decoder.log 2>&1 &
Step 3 on node 1:
python3 -m sglang.srt.disaggregation.mini_lb --prefill "http://10.6.131.1:30000" --decode "http://10.6.131.2:30000" > loader.log 2>&1 &
However, when I tried to run step 4 on node 1:
python3 -m sglang.bench_one_batch_server --model-path ~/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V3/snapshots/1d044fd82b15f1cedb197a288e50cc96a2c27205/ --base-url http://10.6.131.1:8000 --batch-size 256 --input-len 4096 --output-len 5 --skip-warmup > bench.log 2>&1 &
I met the following error:
File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 974, in json
return complexjson.loads(self.text, **kwargs)
File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
If I rerun steps 1, 2, 3 and try to run the benchmark in another way:
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 3000 --random-input 1000 --random-output 1000 --max-concurrency 64 --random-range-ratio 1 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl --host 127.0.0.1 --port 30000
I met this error: "aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data for satisfy transfer length header.'>"
SW version info, see below:
- sglang commit: 6fa2c02 (HEAD -> feat/dev_branch, origin/feat/dev_branch), Author: fzyzcjy [email protected], Date: Sun May 4 20:12:10 2025 +0800, message: "chore"
- DeepEP commit: 23ded3bd8d692755674ffb9ba18794701b6090e6 (HEAD -> patch-3, origin/patch-3), Author: fzyzcjy [email protected], Date: Tue Apr 29 09:58:31 2025 +0800, message: "Update deep_ep.cpp"
- Mooncake commit: 168cc22f31d91e1272661372cdc262a0157d761a (HEAD -> main, origin/main, origin/HEAD), Author: Feng Ren [email protected], Date: Wed May 7 10:19:48 2025 +0800, message: "[DOC] Update README components (#331)"
I also met this error. :(
@Z-NAVY Hi, could you please try to curl it to see what is happening?
I tried to run it on H20, and I also encountered the CUDA graph error as follows:
[2025-05-07 13:37:52 DP7 TP7] DeepGEMM JIT Compiling for <gemm_fp8_fp8_bf16_nt> M=32, N=7168, K=2048. Please wait.
[2025-05-07 13:38:35 DP5 TP5] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 283, in __init__
self.capture()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 374, in capture
) = self.capture_one_batch_size(bs, forward)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 483, in capture_one_batch_size
run_once()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 476, in run_once
logits_output = forward(input_ids, forward_batch.positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2099, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1927, in forward
hidden_states, residual = self._forward_tbo_layers(
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2016, in _forward_tbo_layers
return two_batch_overlap.model_forward_execute_two_batch(
File "/sgl-workspace/sglang/python/sglang/srt/two_batch_overlap.py", line 148, in model_forward_execute_two_batch
output_a, output_b = _execute_two_batch_raw(
File "/sgl-workspace/sglang/python/sglang/srt/two_batch_overlap.py", line 166, in _execute_two_batch_raw
executor_a.next()
File "/sgl-workspace/sglang/python/sglang/srt/two_batch_overlap.py", line 192, in next
self._stage_output = op.fn(
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 582, in _forward_tbo_op_combine_a
self.tbo_deepep_dispatchers[state.tbo_subbatch_index].combine_a(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 737, in combine_a
inner_state = self._get_impl(forward_mode).combine_a(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 622, in combine_a
hidden_states, event, hook = self._combine_core(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 640, in _combine_core
combined_hidden_states, event, hook = buffer.low_latency_combine(
File "/usr/local/lib/python3.10/dist-packages/deep_ep/buffer.py", line 533, in low_latency_combine
combined_x, event, hook = self.runtime.low_latency_combine(x, topk_idx, topk_weights, src_info, layout_range,
RuntimeError: Failed: CUDA error /sgl-workspace/DeepEP/csrc/kernels/internode_ll.cu:532 'too many blocks in cooperative launch'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2209, in run_scheduler_process
scheduler = Scheduler(
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 291, in __init__
self.tp_worker = TpWorkerClass(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 65, in __init__
self.worker = TpModelWorker(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 78, in __init__
self.model_runner = ModelRunner(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 220, in __init__
self.initialize(min_per_gpu_memory)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 272, in initialize
self.init_cuda_graphs()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1036, in init_cuda_graphs
self.cuda_graph_runner = CudaGraphRunner(self)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 285, in __init__
raise Exception(
Exception: Capture cuda graph failed: Failed: CUDA error /sgl-workspace/DeepEP/csrc/kernels/internode_ll.cu:532 'too many blocks in cooperative launch'
Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable cuda graph by --disable-cuda-graph. (Not recommonded. Huge perf loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose
@feng397 For that error, could you please try https://github.com/sgl-project/sglang/blob/38053c3372dd220911987bd8cb55b27448366497/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py#L441
The comment at that line (token_dispatcher.py, line 441 as of commit 38053c3) reads:
# For H20, there will be an CUDA error: DeepEP/csrc/kernels/internode_ll.cu:337 'too many blocks in cooperative launch'.
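If the dispatcher change alone is not enough on H20, the knobs listed in the exception message above are also worth trying (illustrative values only, not a recommendation):
# capture only smaller batch sizes for cuda graph, e.g.
#   --cuda-graph-max-bs 16
# or pick explicit smaller capture sizes (the decode command in this guide uses --cuda-graph-bs 128)
#   --cuda-graph-bs 64
# as a last resort, --disable-cuda-graph only to confirm the trigger (large perf loss)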
Thanks! It works! However, after I sent a test request, the decode node presented the following error:
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:20000 (Press CTRL+C to quit)
[2025-05-07 14:43:04 DP9 TP9] Error fetching prefill parallel info from bootstrap: Failed to parse: http://192.168.0.108:None/route?engine_rank=-1&target_dp_group=-1
[2025-05-07 14:43:04 DP9 TP9] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2233, in run_scheduler_process
scheduler.event_loop_overlap_disagg_decode()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 530, in event_loop_overlap_disagg_decode
self.process_input_requests(recv_reqs)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 810, in process_input_requests
output = self._request_dispatcher(recv_req)
File "/sgl-workspace/sglang/python/sglang/utils.py", line 471, in __call__
return fn(obj)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 977, in handle_generate_request
self._add_request_to_queue(req)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 984, in _add_request_to_queue
self.disagg_decode_prealloc_queue.add(req)
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 142, in add
kv_receiver = kv_receiver_class(
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mooncake/conn.py", line 459, in __init__
self.prefill_dp_size, tp_size_per_dp_rank = (
TypeError: cannot unpack non-iterable NoneType object
[2025-05-07 14:43:05] Child process unexpectedly failed with an exit code 131. pid=13
[2025-05-07 14:43:05] Child process unexpectedly failed with an exit code 9. pid=177
[2025-05-07 14:43:05] Child process unexpectedly failed with an exit code 9. pid=754
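The "http://192.168.0.108:None/route" in that log shows that the decode side ended up with no bootstrap port for the prefill node, so the prefill parallel info could never be fetched. A rough way to check the bootstrap server directly (the port here is an assumption: 8998 is the default --disaggregation-bootstrap-port in recent SGLang versions; adjust it if you override that flag):
# run from the decode node; the route is the same one that appears in the error message
curl "http://192.168.0.108:8998/route?engine_rank=-1&target_dp_group=-1"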
@fzyzcjy After rerunning steps 1, 2, 3, curling the server fails:
curl http://10.6.131.1:30000/server_info
curl: (7) Failed to connect to 10.6.131.1 port 30000 after 0 ms: Connection refused
@mingxiao666 Try connecting to port 8000 (the mini_lb) instead; the worker ports are not meant to be hit directly in this setup.
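For example (a sketch using the standard /generate API of the SGLang server, which the mini_lb forwards to the prefill and decode nodes):
# send a tiny request to the mini_lb port, not to 30000 on the workers
curl -X POST http://10.6.131.1:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 8}}'
# the benchmarks should likewise use --base-url http://10.6.131.1:8000 (bench_one_batch_server) or --port 8000 (bench_serving)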