[Bug] When dp < ep, DeepEP MoE raises an error when running on 4×H800 nodes
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
In the 4x8xH800 environment, when dp=8 and tp=32, the following error occurs:

[2025-04-23 03:12:13 DP1 TP5] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 275, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 359, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 451, in capture_one_batch_size
    run_once()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 444, in run_once
    logits_output = forward(input_ids, forward_batch.positions, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1466, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1390, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1173, in forward
    return self.forward_ffn_with_scattered_input(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1293, in forward_ffn_with_scattered_input
    hidden_states, residual = self.post_attention_layernorm(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/custom_op.py", line 18, in forward
    return self._forward_method(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/layernorm.py", line 71, in forward_cuda
    fused_add_rmsnorm(x, residual, self.weight.data, self.variance_epsilon)
  File "/usr/local/lib/python3.10/dist-packages/sgl_kernel/elementwise.py", line 74, in fused_add_rmsnorm
    torch.ops.sgl_kernel.fused_add_rmsnorm.default(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 723, in __call__
    return self._op(*args, **kwargs)
RuntimeError: CHECK_EQ(input.size(0), residual.size(0)) failed. 32 vs 128
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2001, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 261, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 75, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 181, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 219, in initialize
    self.init_cuda_graphs()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 980, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 277, in __init__
    raise Exception(
Exception: Capture cuda graph failed: CHECK_EQ(input.size(0), residual.size(0)) failed. 32 vs 128
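For context, the failing check lives in sgl_kernel's fused_add_rmsnorm, which fuses the residual add into RMSNorm and therefore expects hidden_states and residual to carry the same number of tokens along dim 0. A plain-PyTorch sketch of that invariant (a reference approximation for illustration only, not the actual CUDA kernel; the 7168 hidden size is just DeepSeek-V3's) shows how the 32-vs-128 mismatch trips it:

# Reference sketch of the fused_add_rmsnorm semantics: add the residual, then
# RMS-normalize. The assert mirrors the CHECK_EQ that fails during graph capture.
import torch

def fused_add_rmsnorm_ref(x, residual, weight, eps):
    assert x.size(0) == residual.size(0), f"{x.size(0)} vs {residual.size(0)}"
    residual = residual + x                                  # fused residual add
    var = residual.float().pow(2).mean(-1, keepdim=True)
    normed = (residual.float() * torch.rsqrt(var + eps)).to(residual.dtype)
    return normed * weight, residual

hidden = 7168                                  # DeepSeek-V3 hidden size, for illustration
x = torch.randn(32, hidden)                    # the 32-token tensor from the error
res = torch.randn(128, hidden)                 # the 128-token tensor from the error
# fused_add_rmsnorm_ref(x, res, torch.ones(hidden), 1e-6)   # -> AssertionError: 32 vs 128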
Reproduction
python3 -m sglang.launch_server --model-path /code/llm-benchmark-script/data/raw/DeepSeek-V3 --host 0.0.0.0 --port 6178 --tp 32 --dp 8 --enable-dp-attention --disable-radix-cache --trust-remote-code --chunked-prefill-size 4096 --enable-deepep-moe --max-running-requests 128 --disable-radix-cache --mem-fraction-static 0.8 --stream-output --deepep-mode low_latency --moe-dense-tp-size 1 --cuda-graph-max-bs 128 --dist-init-addr xxxx:20000 --nnodes 4 --node-rank 0
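To spell out the dp < ep condition from the title for this command (assuming that with --enable-deepep-moe the expert-parallel group defaults to the whole TP group when --ep-size is not given):

# Back-of-the-envelope arithmetic for the flags above; the ep_size default is an assumption.
tp_size = 32                  # --tp 32
dp_size = 8                   # --dp 8 with --enable-dp-attention
ep_size = tp_size             # assumed DeepEP default: EP group == TP group
print(dp_size < ep_size)      # True -> this run is in the dp < ep regime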
Environment
4x8xH800
Could you please check if #5657 resolves your issue? Here are the commands I used to verify this PR, and they work well:
# Node 1:
python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --trust-remote-code \
--tp 16 --dp 4 --host 0.0.0.0 --port 30000 --dist-init-addr 10.10.38.4:5000 --nnodes 2 --node-rank 0 \
--enable-dp-attention --enable-deepep-moe --deepep-mode normal \
--max-running-requests 2048 --disable-radix-cache --mem-fraction-static 0.9 --stream-output \
--disable-cuda-graph
# Node 2:
python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --trust-remote-code \
--tp 16 --dp 4 --host 0.0.0.0 --port 30000 --dist-init-addr 10.10.38.4:5000 --nnodes 2 --node-rank 1 \
--enable-dp-attention --enable-deepep-moe --deepep-mode normal \
--max-running-requests 2048 --disable-radix-cache --mem-fraction-static 0.9 --stream-output \
--disable-cuda-graph
@ch-wan Thank you very much. The error message disappeared after merging the PR, but the server gets stuck after a request is sent, while it runs normally when dp == ep.
Error stack after timeout:
Thread 1602908 (active): "MainThread"
synchronize (torch/cuda/streams.py:224)
resolve_batch_result (tp_worker_overlap_thread.py:173)
process_batch_result_prefill (scheduler_output_processor_mixin.py:46)
process_batch_result (scheduler.py:1413)
event_loop_overlap (scheduler.py:663)
decorate_context (torch/utils/_contextlib.py:116)
run_scheduler_process (scheduler.py:2021)
run (multiprocessing/process.py:108)
_bootstrap (multiprocessing/process.py:314)
_main (multiprocessing/spawn.py:129)
spawn_main (multiprocessing/spawn.py:116)
<module> (<string>:1)
Thread 1603557 (idle): "Thread-1 (_read_thread)"
_recv_msg (torch/_inductor/compile_worker/subproc_pool.py:53)
_read_thread (torch/_inductor/compile_worker/subproc_pool.py:161)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 1605095 (idle): "Thread-2"
wait (threading.py:324)
wait (threading.py:607)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 1605110 (idle): "Thread-3"
wait (threading.py:324)
wait (threading.py:607)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 1605577 (idle): "Thread-4 (forward_thread_func)"
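Reading the dump above, the scheduler's MainThread is blocked in a CUDA synchronize inside resolve_batch_result of the overlap TP worker while the background forward thread sits idle. A minimal sketch of that overlap pattern (purely illustrative, not sglang's actual wiring; only the names forward_thread_func and resolve_batch_result are taken from the dump) shows why a forward pass that never completes on some rank leaves the main thread stuck there:

# Illustrative sketch of the overlap-worker pattern: a background thread runs the
# forward pass and records a CUDA event, and the main thread blocks in synchronize()
# until that event fires. If the forward never completes (e.g. a collective whose
# peers never arrive), the main thread hangs exactly as in the dump above.
import queue
import threading
import torch

batch_q: queue.Queue = queue.Queue()
result_q: queue.Queue = queue.Queue()

def forward_thread_func():
    stream = torch.cuda.Stream()
    while True:
        batch = batch_q.get()
        with torch.cuda.stream(stream):
            out = batch * 2                    # stand-in for the real model forward
            done = torch.cuda.Event()
            done.record(stream)                # fires only once the forward finishes
        result_q.put((done, out))

def resolve_batch_result():
    done, out = result_q.get()
    done.synchronize()                         # <- where the MainThread is stuck above
    return out

threading.Thread(target=forward_thread_func, daemon=True).start()
batch_q.put(torch.ones(4, device="cuda"))
print(resolve_batch_result())                  # returns only if the forward completed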
Could you please share the command you used?
@ch-wan OK, both the 2-node and 4-node setups can reproduce the issue.
node0:
python3 -m sglang.launch_server --model-path /code/llm-benchmark-script/data/raw/DeepSeek-V3 --host 0.0.0.0 --port 6178 --tp 32 --dp 8 --enable-dp-attention --disable-radix-cache --trust-remote-code --chunked-prefill-size 4096 --enable-deepep-moe --max-running-requests 128 --disable-radix-cache --mem-fraction-static 0.8 --stream-output --deepep-mode low_latency --moe-dense-tp-size 1 --cuda-graph-max-bs 128 --dist-init-addr xxxxx:20000 --nnodes 4 --node-rank 0
node1:
python3 -m sglang.launch_server --model-path /code/llm-benchmark-script/data/raw/DeepSeek-V3 --host 0.0.0.0 --port 6178 --tp 32 --dp 8 --enable-dp-attention --disable-radix-cache --trust-remote-code --chunked-prefill-size 4096 --enable-deepep-moe --max-running-requests 128 --disable-radix-cache --mem-fraction-static 0.8 --stream-output --deepep-mode low_latency --moe-dense-tp-size 1 --cuda-graph-max-bs 128 --dist-init-addr xxxxx:20000 --nnodes 4 --node-rank 1
node2:
python3 -m sglang.launch_server --model-path /code/llm-benchmark-script/data/raw/DeepSeek-V3 --host 0.0.0.0 --port 6178 --tp 32 --dp 8 --enable-dp-attention --disable-radix-cache --trust-remote-code --chunked-prefill-size 4096 --enable-deepep-moe --max-running-requests 128 --disable-radix-cache --mem-fraction-static 0.8 --stream-output --deepep-mode low_latency --moe-dense-tp-size 1 --cuda-graph-max-bs 128 --dist-init-addr xxxxx:20000 --nnodes 4 --node-rank 2
node3:
python3 -m sglang.launch_server --model-path /code/llm-benchmark-script/data/raw/DeepSeek-V3 --host 0.0.0.0 --port 6178 --tp 32 --dp 8 --enable-dp-attention --disable-radix-cache --trust-remote-code --chunked-prefill-size 4096 --enable-deepep-moe --max-running-requests 128 --disable-radix-cache --mem-fraction-static 0.8 --stream-output --deepep-mode low_latency --moe-dense-tp-size 1 --cuda-graph-max-bs 128 --dist-init-addr xxxxx:20000 --nnodes 4 --node-rank 3