[Feature] DeepSeek V3 optimization
Checklist
- [ ] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [ ] 2. Please use English, otherwise it will be closed.
Adoption
SGLang adoption for DeepSeek V3 and R1
Usage
User Guide for Existing System (Installation & Launch)
https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3
Please use the latest version, v0.4.2.post4, and prefer the Docker image: docker pull lmsysorg/sglang:latest
For running on AMD MI300X, use this as a reference: Running DeepSeek-R1 on a single NDv5 MI300X VM
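A sketch of the Docker-based launch from the linked guide, assuming a single 8-GPU node; the mounted cache path, port, and model path are assumptions to adjust for your environment:

```bash
# Serve DeepSeek-V3 with 8-way tensor parallelism inside the official image.
# Host paths, port, and TP size are assumptions; adjust to your setup.
docker run --gpus all --shm-size 32g --ipc=host -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 --trust-remote-code --host 0.0.0.0 --port 30000
```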
Features
- [x] Support CUDA Graph @HandH1998 @ispobock
- [x] Support Torch compile @ispobock
- [x] Use BF16 for bmm @zhyncs
- [x] Improve the accuracy for FP8 @HandH1998 @zhyncs @ispobock
- [x] Tuning FP8 GEMM @HandH1998 @zhyncs
- [x] Replace `moe_align_block_size` @HandH1998 @zhyncs @BBuf
- [x] FusedMoE tuning for H200 (`E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8.json`) @BBuf
- [x] TP+DP Attention @Ying1123
- [x] Support overlap scheduler with DP attention @merrymercy
- [x] Fuse Sigmoid Gate moe_kernels.cu @NovTi @BBuf (torch compile is sufficient for this use case, so the priority and ROI to support it are not high. Closing for now.)
- [x] Support `nextn` speculative decoding @ispobock https://github.com/sgl-project/sglang/issues/3472
- [x] FP8 GEMM CUTLASS implementation @yizhang2077
- [x] Better fused_experts @bbuf @zhyncs
- [x] FlashInfer Prefill and MLA Decoding @zhyncs @ispobock
- [ ] FP8 GEMM Composable Kernel implementation @HaiShaw
- [ ] Support Pipeline Parallelism @Ying1123
Related resources
No response
Very quick response! I understand that the overlap scheduler is model-independent and is a general optimization that should be supported by default. So are at least some special optimizations still needed here?
The overlap scheduler is model-independent, but it is not yet supported when using DP attention. We have a private branch for this and will upstream it soon.
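For anyone trying this combination in the meantime, a minimal launch sketch with DP attention enabled; the model path and TP size are assumptions, and the `--disable-overlap-schedule` workaround is only an assumption for versions where the two features do not yet work together:

```bash
# Minimal sketch: enable TP+DP attention for DeepSeek-V3 (paths and sizes are assumptions).
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --enable-dp-attention \
  --trust-remote-code
# If your version does not yet support the overlap scheduler together with DP attention,
# passing --disable-overlap-schedule (if available in your version) is one way to work around it.
```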
Is the memory sufficient for an 8-GPU instance? This model is very large.
The 671B model works on 8× H200 with FP8: the FP8 weights take roughly 671 GB, which fits within 8 × 141 GB = 1128 GB of HBM (671 < 1128), leaving headroom for the KV cache.
Hi @fengyang95, you can also consider multi-node deployment.
If you do not have GPUs with large enough memory, please try multi-node tensor parallelism (help 1 help 2).
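A sketch of two-node tensor parallelism following the multi-node example in the benchmark/deepseek_v3 README; the IP address, port, and model path are placeholders:

```bash
# Node 0 (replace 10.0.0.1 with the first node's reachable IP; values are placeholders).
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 \
  --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code
# Node 1:
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 \
  --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code
```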
FYI Due to the tight schedule, SGLang v0.4.1 currently only provides preliminary support for DeepSeek V3. To make it run more cost-efficiently, we need to complete most of the optimizations mentioned above. If you are interested in any of the above optimizations, feel free to join the SGLang Slack for discussions or contribute a PR. We hope to complete these optimizations quickly and appreciate any discussion and contributions.
Update: SGLang v0.4.1.post1 supports CUDA Graph for DeepSeek V3, please use the latest version.
pip install "sglang[all]==0.4.1.post1" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer
Update: SGLang v0.4.1.post2 supports FP8 GEMM Tuning for DeepSeek V3, please use the latest version.
pip install "sglang[all]==0.4.1.post2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer
ref https://github.com/sgl-project/sglang/pull/2647
Any plan to support MTP?
It's on the roadmap and it's named nextn. We'll support it soon.
@zhyncs @Ying1123 @merrymercy Hello, regarding the item mentioned above:
TP+DP Attention @Ying1123
I have two questions, could you help me answer them?
1. Can we decouple TP and DP after this implementation, i.e., configure a scenario where DP is not equal to TP?
2. Is there a detailed schedule for the work mentioned above? Are there any related design documents that can be shared?
I had another question regarding DP attention. The sglang blog mentions that DP attention is effective because MLA has only 1 KV head, which otherwise causes unnecessary duplication of KV caches. DeepSeek-V3 MLA has more KV heads (16 attention, 128 KV), so do we still replicate KV caches if just using something like TP8 or TP16? I understand there might not be sufficient heads if the deployment is large.
+1 for shared design docs, if possible.
@zhyncs @Mutinifni
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/config.json lists "num_key_value_heads": 128. Does DeepSeek-V3 really have 128 KV heads??
Are there any data on inference-time batch size and token imbalance between experts? What's the total throughput like for an 8×H200 node?
Has there been any progress on nextn support?
The overlap scheduler with DP attention cannot be used on 4× A800, because it always OOMs.
Is there a plan to support TP + SP attention?
The paper says "The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP)"
Can you support the DeepSeek-R1 Q4_K_M GGUF file? https://huggingface.co/unsloth/DeepSeek-R1-GGUF
Any progress on nextn speculative decoding?
Hi @ispobock, I wonder if you have a timeline to share for when nextn (speculative decoding) will be supported? Thanks.
We have finished the spec module refactor and will support nextn in the next 1-2 weeks.
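Once nextn lands, the launch interface will presumably follow SGLang's existing speculative decoding flags; the sketch below is only a guess at that interface, and the flag names, values, and draft model path are all assumptions to check against the release notes:

```bash
# Hypothetical sketch only: flag names, values, and the draft model path are assumptions.
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code \
  --speculative-algorithm NEXTN \
  --speculative-draft-model-path <path-to-nextn-draft-weights> \
  --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
```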
Thanks! I wonder if your implementation will include any mechanism to generate the acceptance rates of the MTP head?
"DeepSeek MTP spec decode" (vLLM PR #12755) implements DeepSeek MTP (https://github.com/vllm-project/vllm/issues/12181) to support DeepSeek MTP layers for next-n prediction.
Does sglang now support deepseekv3 inference with EP>1? When I added --enable-ep-moe to the command to start the service, I found that the process would hang. I'm not sure if this is a problem caused by my environment or if this feature is not currently supported.
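For reference, the kind of command that reproduces the hang described above (a sketch; the model path and TP size are assumptions):

```bash
# Sketch of an EP-MoE launch for DeepSeek-V3; model path and TP size are assumptions.
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 --enable-ep-moe --trust-remote-code
```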
Me too, it seems EP-MoE is not supported in 0.4.2.
Need help here. If you are familiar with CUDA optimization and have ideas about this issue, feel free to contact me.
When will MTP (nextn speculative decoding) be supported?
This is CentML's (https://github.com/CentML) implementation of DeepSeek MTP modules that enables speculative decoding for DeepSeek-R1: https://github.com/vllm-project/vllm/pull/12915
Hi, I am new to sglang and need help deploying DeepSeek on two nodes: https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker
I have two H100 nodes. What are the best parameter settings for throughput?
Thanks.