[Bug] Why enable_dp_attention is much slower when running DeepSeekV3 on 8xH200
Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [ ] 5. Please use English, otherwise it will be closed.
Describe the bug
According to the dp-attention performance & usage notes, I enabled it with --enable-dp-attention when launching DeepSeek-V3 on 8xH200. The launch command is as follows:
docker run -d --gpus all --privileged --ipc=host --net=host -v /models:/models --name sgl_test --entrypoint /usr/bin/python3 -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 lmsysorg/sglang:v0.4.1.post4-cu124 -m sglang.launch_server --model-path /models/DeepSeek-V3 --served-model-name deepseek-v3 --tensor-parallel-size 8 --trust-remote-code --enable-dp-attention --host 0.0.0.0 --port 40000 --max-total-tokens 65536
When I run benchmark serving tests, the output throughput is only about half of the number without --enable-dp-attention. For example, TPOT is about 60 ms when --enable-dp-attention is set, but about 37 ms when it is removed. This does not match the official result.
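For reference, the benchmark was run with a serving benchmark script; the exact dataset and request rate are not recorded above, so the command below is only an illustrative sketch (the --dataset-name, --num-prompts, and --request-rate values are assumptions, not the original settings):

```bash
# Illustrative benchmark invocation against the server launched above (port 40000).
# Dataset, prompt count, and request rate are placeholders, not the original settings.
python3 -m sglang.bench_serving --backend sglang --port 40000 \
  --dataset-name random --num-prompts 512 --request-rate 16
```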
Reproduction
docker run -d --gpus all --privileged --ipc=host --net=host -v /models:/models --name sgl_test --entrypoint /usr/bin/python3 -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 lmsysorg/sglang:v0.4.1.post4-cu124 -m sglang.launch_server --model-path /models/DeepSeek-V3 --served-model-name deepseek-v3 --tensor-parallel-size 8 --trust-remote-code --enable-dp-attention --host 0.0.0.0 --port 40000 --max-total-tokens 65536
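For the baseline comparison described above (TPOT ≈ 37 ms), the same command is used with --enable-dp-attention removed:

```bash
docker run -d --gpus all --privileged --ipc=host --net=host -v /models:/models --name sgl_test --entrypoint /usr/bin/python3 -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 lmsysorg/sglang:v0.4.1.post4-cu124 -m sglang.launch_server --model-path /models/DeepSeek-V3 --served-model-name deepseek-v3 --tensor-parallel-size 8 --trust-remote-code --host 0.0.0.0 --port 40000 --max-total-tokens 65536
```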
Environment
H200*8
Python: 3.10.16 (main, Dec 4 2024, 08:53:37) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.127.05
PyTorch: 2.5.1+cu124
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.47.1
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.8
huggingface_hub: 0.27.0
interegular: 0.3.3
modelscope: 1.21.1
orjson: 3.10.13
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.4
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.59.3
anthropic: 0.42.0
decord: 0.6.0

NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS 0-23,96-119 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE PIX NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS 0-23,96-119 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS 0-23,96-119 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS 24-47,120-143 1 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS PIX NODE NODE SYS SYS SYS SYS SYS 48-71,144-167 2 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS NODE PIX NODE SYS SYS SYS SYS SYS 48-71,144-167 2 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS NODE NODE PIX SYS SYS SYS SYS SYS 48-71,144-167 2 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE PIX 72-95,168-191 3 N/A
NIC0 PIX NODE NODE SYS SYS SYS SYS SYS X NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC1 NODE PIX NODE SYS SYS SYS SYS SYS NODE X NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC2 NODE NODE PIX SYS SYS SYS SYS SYS NODE NODE X SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC3 SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS SYS SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS PIX NODE NODE SYS SYS SYS SYS SYS X NODE NODE SYS SYS SYS SYS SYS
NIC5 SYS SYS SYS SYS NODE PIX NODE SYS SYS SYS SYS SYS NODE X NODE SYS SYS SYS SYS SYS
NIC6 SYS SYS SYS SYS NODE NODE PIX SYS SYS SYS SYS SYS NODE NODE X SYS SYS SYS SYS SYS
NIC7 SYS SYS SYS SYS SYS SYS SYS NODE SYS SYS SYS SYS SYS SYS SYS X PIX PXB PXB NODE
NIC8 SYS SYS SYS SYS SYS SYS SYS NODE SYS SYS SYS SYS SYS SYS SYS PIX X PXB PXB NODE
NIC9 SYS SYS SYS SYS SYS SYS SYS NODE SYS SYS SYS SYS SYS SYS SYS PXB PXB X PIX NODE
NIC10 SYS SYS SYS SYS SYS SYS SYS NODE SYS SYS SYS SYS SYS SYS SYS PXB PXB PIX X NODE
NIC11 SYS SYS SYS SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
NIC10: mlx5_10
NIC11: mlx5_11
ulimit soft: 1048576
What request rate did you set in the benchmark? --enable-dp-attention can improve throughput for large QPS scenarios.
> What request rate did you set in the benchmark? --enable-dp-attention can improve throughput for large QPS scenarios.
@ispobock So at what QPS does --enable-dp-attention improve throughput?
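One way to answer this empirically (a sketch, not from the original thread; it assumes the sglang.bench_serving flags shown) is to sweep the request rate against both server configurations and compare throughput and TPOT at each point:

```bash
# Hypothetical QPS sweep: run the same benchmark at increasing request rates
# against each server config (with and without --enable-dp-attention) and
# look for the crossover point. Flag values are placeholders.
for qps in 1 2 4 8 16 32; do
  python3 -m sglang.bench_serving --backend sglang --port 40000 \
    --dataset-name random --num-prompts 512 --request-rate "$qps"
done
```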
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.