
Setting deepep to low_latency mode fails to start model serving

Open ch-tiger1 opened this issue 7 months ago • 5 comments

When I set the deepep mode to auto or normal, the stress test passes, but when I switch to low_latency mode, the following error occurs: RuntimeError: Failed: Assertion error /sgl-workspace/DeepEP/csrc/deep_ep.cpp:1025 'x.size(0) == topk_idx.size(0) and x.size(0) <= num_max_dispatch_tokens_per_rank' Has anyone encountered this before? How can I solve it?

[2025-05-14 03:58:33 DP0 TP0] TpModelWorkerClient hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 116, in forward_thread_func
    self.forward_thread_func_()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 147, in forward_thread_func_
    logits_output, next_token_ids = self.worker.forward_batch_generation(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 181, in forward_batch_generation
    logits_output = self.model_runner.forward(forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1122, in forward
    return self._forward_raw(forward_batch, skip_attn_backend_init)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1155, in _forward_raw
    return self.forward_extend(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1064, in forward_extend
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2099, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1916, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1462, in forward
    return self.forward_ffn_with_scattered_input(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1617, in forward_ffn_with_scattered_input
    hidden_states = self.mlp(hidden_states, forward_batch.forward_mode)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 355, in forward
    return self.forward_deepep(hidden_states, forward_mode)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 386, in forward_deepep
    self._forward_deepep_dispatch_a(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 466, in _forward_deepep_dispatch_a
    chosen_deepep_dispatcher.dispatch_a(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 713, in dispatch_a
    inner_state = self._get_impl(forward_mode).dispatch_a(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 519, in dispatch_a
    hidden_states, masked_m, event, hook = self._dispatch_core(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 603, in _dispatch_core
    buffer.low_latency_dispatch(
  File "/usr/local/lib/python3.10/dist-packages/deep_ep/buffer.py", line 487, in low_latency_dispatch
    self.runtime.low_latency_dispatch(x, topk_idx,
RuntimeError: Failed: Assertion error /sgl-workspace/DeepEP/csrc/deep_ep.cpp:1025 'x.size(0) == topk_idx.size(0) and x.size(0) <= num_max_dispatch_tokens_per_rank'

[2025-05-14 03:58:33] Child process unexpectedly failed with an exit code 131. pid=255
[2025-05-14 03:58:34] Child process unexpectedly failed with an exit code 9. pid=1401
[2025-05-14 03:58:34] Child process unexpectedly failed with an exit code 9. pid=2194
[2025-05-14 03:58:34] Child process unexpectedly failed with an exit code 9. pid=2198

cmd

export SGL_ENABLE_JIT_DEEPGEMM=1
python3 -m sglang.launch_server \
        --model $model \
        --trust-remote-code \
        --tp 8 \
        --dp 8 \
        --enable-dp-attention \
        --enable-deepep-moe \
        --deepep-mode low_latency \
        --enable-metrics

ch-tiger1 · May 14 '25 07:05

Maybe you should increase the value of num_max_dispatch_tokens_per_rank.

alpha-baby · May 14 '25 07:05

Maybe you should increase the value of num_max_dispatch_tokens_per_rank.

Thanks. Is there a standard I can refer to? What would be an appropriate setting? I see that the default value of this variable is 128, and the maximum is no more than 256.

ch-tiger1 · May 14 '25 09:05

num_max_dispatch_tokens_per_rank is the maximum number of tokens each rank sends in a single dispatch (it must be consistent across all ranks).

So the "appropriate setting" would be SGLang's batch size per dispatch; I think you should file a bug with the SGLang team :)

LyricZhao · May 14 '25 09:05
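The invariant behind the failing assertion can be sketched in plain Python. This is an illustrative helper, not part of the real DeepEP or SGLang API: on every rank, the hidden-states tensor x and topk_idx must agree on dimension 0, and that size must not exceed num_max_dispatch_tokens_per_rank.

```python
# Illustrative sketch of the check at DeepEP's csrc/deep_ep.cpp:1025.
# The function name and integer-shape interface are hypothetical; the
# real code asserts on tensor shapes inside low_latency_dispatch.

def dispatch_inputs_ok(x_rows: int, topk_idx_rows: int,
                       num_max_dispatch_tokens_per_rank: int) -> bool:
    """Return True iff a low-latency dispatch with these shapes would
    pass the DeepEP assertion on this rank."""
    return (x_rows == topk_idx_rows
            and x_rows <= num_max_dispatch_tokens_per_rank)

# With the default cap of 128, a 200-token batch per rank fails:
print(dispatch_inputs_ok(200, 200, 128))  # -> False
# Raising the cap to 256 makes the same batch pass:
print(dispatch_inputs_ok(200, 200, 256))  # -> True
```

This is why the error appears only in low_latency mode: the normal dispatch path does not enforce a fixed per-rank token cap, while the low-latency path preallocates buffers sized by num_max_dispatch_tokens_per_rank.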

num_max_dispatch_tokens_per_rank is the maximum number of tokens each rank sends in a single dispatch (it must be consistent across all ranks).

So the "appropriate setting" would be SGLang's batch size per dispatch; I think you should file a bug with the SGLang team :)

Yes, I don't know the specific cause at the moment, so I've also posted my question under this issue.

ch-tiger1 · May 14 '25 09:05

SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256

allowed me to get past the issue. Everything works fine now.

zheng1 · Oct 22 '25 15:10
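For reference, the workaround above amounts to exporting the variable before the launch command from the original report. This is a sketch assuming the same flags; the 256 cap matches the maximum mentioned earlier in the thread.

```shell
# Raise SGLang's per-rank dispatch-token cap for DeepEP low-latency mode
export SGL_ENABLE_JIT_DEEPGEMM=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256
python3 -m sglang.launch_server \
        --model $model \
        --trust-remote-code \
        --tp 8 \
        --dp 8 \
        --enable-dp-attention \
        --enable-deepep-moe \
        --deepep-mode low_latency \
        --enable-metrics
```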