
[Bug] run DeepSeek-R1 with --tp 2 --dp 2 --enable-dp-attention error

Open v-lmn opened this issue 10 months ago • 6 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • [x] 5. Please use English, otherwise it will be closed.

Describe the bug

Running DeepSeek-R1 with --enable-dp-attention fails with the following error:

loc("/workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands loc("/workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands [2025-02-18 09:50:33 DP0 TP0] Using default MoE config. Performance might be sub-optimal! Config file not found at /workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=256,N=1024,device_name=NVIDIA_H20,dtype=fp8_w8a8,block_shape=[128, 128].json [2025-02-18 09:50:33 DP1 TP1] Using default MoE config. Performance might be sub-optimal! Config file not found at /workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=256,N=1024,device_name=NVIDIA_H20,dtype=fp8_w8a8,block_shape=[128, 128].json 13%|██████████████████████▏ | 3/23 [00:11<00:55, 2.79s/it]Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:368 'an illegal memory access was encountered' 13%|██████████████████████▏ | 3/23 [00:12<01:21, 4.08s/it] Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:368 'an illegal memory access was encountered' [2025-02-18 09:50:38] DataParallelController hit an exception: Traceback (most recent call last): File "/workspace/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 236, in run_data_parallel_controller_process controller = DataParallelController(server_args, port_args) File "/workspace/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 88, in init dp_port_args = self.launch_dp_attention_schedulers(server_args, port_args) File "/workspace/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 139, in launch_dp_attention_schedulers self.launch_tensor_parallel_group(server_args, port_args, 0, None) File "/workspace/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 192, in launch_tensor_parallel_group scheduler_info.append(scheduler_pipe_readers[i].recv()) File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in recv buf = self._recv_bytes() File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes buf = self._recv(4) File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in _recv raise EOFError EOFError

[2025-02-18 09:50:38] Received sigquit from a child proces. It usually means the child failed.

Reproduction

python -m sglang.launch_server --model-path /data/models/DeepSeek-R1 --disable-radix-cache --trust-remote-code --tp 2 --dp 2 --enable-dp-attention --json-model-override-args '{"num_hidden_layers": 10}'

on 8×H20 GPUs

Environment

root@iv-ydp5an7thcay8n6jxmz4:/workspace# python -m sglang.check_env
INFO 02-18 09:55:36 __init__.py:194] No platform detected, vLLM is running on UnspecifiedPlatform
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.161.08
PyTorch: 2.5.1+cu124
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post1+cu124torch2.5
triton: 3.1.0
transformers: 4.48.3
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.9.5
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.23.0
orjson: 3.10.15
packaging: 24.0
psutil: 5.9.8
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.0.3
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.63.2
tiktoken: 0.9.0
anthropic: 0.45.2
decord: 0.6.0

NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  SYS   PIX   NODE  SYS   SYS   0-89          0              N/A
GPU1  NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  SYS   PIX   NODE  SYS   SYS   0-89          0              N/A
GPU2  NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  SYS   NODE  PIX   SYS   SYS   0-89          0              N/A
GPU3  NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  SYS   NODE  PIX   SYS   SYS   0-89          0              N/A
GPU4  NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   SYS   PIX   NODE  90-179        1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   SYS   PIX   NODE  90-179        1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   SYS   NODE  PIX   90-179        1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   SYS   NODE  PIX   90-179        1              N/A
NIC0  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   X     SYS   SYS   SYS   SYS
NIC1  PIX   PIX   NODE  NODE  SYS   SYS   SYS   SYS   SYS   X     NODE  SYS   SYS
NIC2  NODE  NODE  PIX   PIX   SYS   SYS   SYS   SYS   SYS   NODE  X     SYS   SYS
NIC3  SYS   SYS   SYS   SYS   PIX   PIX   NODE  NODE  SYS   SYS   SYS   X     NODE
NIC4  SYS   SYS   SYS   SYS   NODE  NODE  PIX   PIX   SYS   SYS   SYS   NODE  X

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4

Hypervisor vendor: KVM
ulimit soft: 1048576

v-lmn avatar Feb 18 '25 09:02 v-lmn

Does anybody have the same problem? Please help.

v-lmn avatar Feb 18 '25 09:02 v-lmn

If I set the command flags '--tp 4 --dp 2' or '--tp 8 --dp 8', the server launches successfully. Why?

v-lmn avatar Feb 18 '25 10:02 v-lmn

This error stems from a bug in our custom Triton kernel (in the decode_attention module) used for data-parallel attention. With the configuration --tp 2 --dp 2 --enable-dp-attention, the kernel ends up scheduling an operation before its operands are ready, which triggers an illegal memory access. In configurations with higher tensor parallelism (for example, --tp 4 or --tp 8), the work is partitioned differently so that the kernel's dependency ordering is maintained, and the error doesn't occur.

In short, the issue is specific to the lower TP setting (tp=2) combined with dp-attention, likely due to how the custom all-reduce and attention kernels manage dependencies on NVIDIA H20 GPUs. As a temporary workaround, using a higher TP value (or disabling dp-attention) avoids the problematic code path. We’re actively investigating this bug and hope to have a fix soon.
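For reference, those two workarounds correspond to launch commands like the following, mirroring the reproduction command earlier in this thread. They are illustrative, not verified fixes: the first keeps dp-attention but raises TP to 4 (reported above to launch successfully), the second keeps tp=2 and simply drops --enable-dp-attention.

python -m sglang.launch_server --model-path /data/models/DeepSeek-R1 --disable-radix-cache --trust-remote-code --tp 4 --dp 2 --enable-dp-attention --json-model-override-args '{"num_hidden_layers": 10}'

python -m sglang.launch_server --model-path /data/models/DeepSeek-R1 --disable-radix-cache --trust-remote-code --tp 2 --dp 2 --json-model-override-args '{"num_hidden_layers": 10}'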

jhinpan avatar Feb 18 '25 17:02 jhinpan

> This error stems from a bug in our custom Triton kernel (in the decode_attention module) used for data-parallel attention. With the configuration --tp 2 --dp 2 --enable-dp-attention, the kernel ends up scheduling an operation before its operands are ready, which triggers an illegal memory access. In configurations with higher tensor parallelism (for example, --tp 4 or --tp 8), the work is partitioned differently so that the kernel's dependency ordering is maintained, and the error doesn't occur.
>
> In short, the issue is specific to the lower TP setting (tp=2) combined with dp-attention, likely due to how the custom all-reduce and attention kernels manage dependencies on NVIDIA H20 GPUs. As a temporary workaround, using a higher TP value (or disabling dp-attention) avoids the problematic code path. We're actively investigating this bug and hope to have a fix soon.

Thanks. I have some other questions: why does '--tp 8 --dp 8' also work? I only have 8 GPUs, and I don't understand how the MLA part is executed when using TP8 + DP8. Does the MLA part get replicated 8 times? When I load the complete model (without --json-model-override-args '{"num_hidden_layers": 10}'), the load fails. Is it because the MoE layers with TP8 have already used up the GPU memory, leaving no memory for MLA with DP8?

And I found this in the code (screenshot attached):

[Image: code screenshot] Why is self.dp_size = self.tp_size? @jhinpan

v-lmn avatar Feb 19 '25 02:02 v-lmn

Mark! Same problem with tp 16 on 2 nodes of 8×H800, source version 3c7bfd7eabed5e29cf907dba3e2ed875d7a92fd4.

YEXINGZHE54 avatar Feb 20 '25 10:02 YEXINGZHE54

@v-lmn

> Why is self.dp_size = self.tp_size?

Currently, DP and TP attention cannot be combined. If you set --tp 8 --enable-dp-attention, it will only use 8-way data parallelism for the MLA part. It's in our plan to make this more flexible.
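To make that concrete, here is a minimal Python sketch (not the actual sglang source, just an illustration of the behavior shown in the screenshot above) of how enabling dp-attention forces the data-parallel size of the MLA part to the TP size:

```python
# Minimal sketch of the behavior described above; the real logic lives in
# sglang's server-args handling, this is only an illustration.
from dataclasses import dataclass


@dataclass
class ServerArgs:
    tp_size: int = 1
    dp_size: int = 1
    enable_dp_attention: bool = False

    def __post_init__(self):
        if self.enable_dp_attention:
            # DP and TP attention are not combinable yet: the attention (MLA)
            # part simply runs tp_size-way data parallel.
            self.dp_size = self.tp_size


args = ServerArgs(tp_size=8, dp_size=8, enable_dp_attention=True)
print(args.dp_size)  # 8 -> "--tp 8 --dp 8 --enable-dp-attention" is 8-way DP for MLA
```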

> Does the MLA part get replicated 8 times? When I load the complete model (without --json-model-override-args '{"num_hidden_layers": 10}'), the load fails. Is it because the MoE layers with TP8 have already used up the GPU memory, leaving no memory for MLA with DP8?

In DP attention, the weights of the attention part are replicated across GPUs, so GPU memory usage is a little higher. However, for most MoE models the attention weights take up only a small proportion of the total (e.g. for the DeepSeek-V2 model, it's only ~3%).
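As a rough back-of-the-envelope illustration of why the overhead stays small (the numbers are hypothetical apart from the ~3% figure quoted above):

```python
# Hypothetical per-GPU weight-memory estimate: attention weights are replicated
# under DP attention, everything else is assumed to be sharded tp_size ways.
def per_gpu_weight_fraction(attn_frac: float, tp_size: int) -> float:
    """Fraction of the total model weights that one GPU must hold."""
    return attn_frac + (1.0 - attn_frac) / tp_size


# With ~3% of parameters in attention (the V2 figure mentioned above) and tp=8:
print(per_gpu_weight_fraction(0.03, 8))  # ~0.151 of total weights per GPU
print(per_gpu_weight_fraction(0.0, 8))   # 0.125 if everything were sharded
```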

ispobock avatar Feb 21 '25 09:02 ispobock

mark

fyuan1316 avatar Apr 13 '25 14:04 fyuan1316

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

github-actions[bot] avatar Jun 13 '25 00:06 github-actions[bot]