[Feature] DeepSeek V3 optimization
Checklist
- [ ] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [ ] 2. Please use English, otherwise it will be closed.
Adoption
SGLang adoption for DeepSeek V3 and R1
Usage
User Guide for Existing System (Installation & Launch)
https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3
Please use the latest version, v0.4.2.post4, and prefer the Docker image: docker pull lmsysorg/sglang:latest
For running on AMD MI300X, use this as a reference: Running DeepSeek-R1 on a single NDv5 MI300X VM
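A sketch of the Docker-based launch from the linked guide, assuming a single 8-GPU node; the mounted cache path, port, and model path are assumptions to adjust for your environment:

```bash
# Serve DeepSeek-V3 with 8-way tensor parallelism inside the official image.
# Host paths, port, and TP size are assumptions; adjust to your setup.
docker run --gpus all --shm-size 32g --ipc=host -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 --trust-remote-code --host 0.0.0.0 --port 30000
```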
Features
- [x] Support CUDA Graph @HandH1998 @ispobock
- [x] Support Torch compile @ispobock
- [x] Use BF16 for bmm @zhyncs
- [x] Improve the accuracy for FP8 @HandH1998 @zhyncs @ispobock
- [x] Tuning FP8 GEMM @HandH1998 @zhyncs
- [x] Replace `moe_align_block_size` @HandH1998 @zhyncs @BBuf
- [x] FusedMoE tuning for H200 (`E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8.json`) @BBuf
- [x] TP+DP Attention @Ying1123
- [x] Support overlap scheduler with DP attention @merrymercy
- [x] Fuse Sigmoid Gate moe_kernels.cu @NovTi @BBuf (torch compile is sufficient for this use case, so the priority and ROI to support it are not high. Closing for now.)
- [x] Support `nextn` speculative decoding @ispobock https://github.com/sgl-project/sglang/issues/3472
- [x] FP8 GEMM CUTLASS implementation @yizhang2077
- [x] Better fused_experts @bbuf @zhyncs
- [x] FlashInfer Prefill and MLA Decoding @zhyncs @ispobock
- [ ] FP8 GEMM Composable Kernel implementation @HaiShaw
- [ ] Support Pipeline Parallelism @Ying1123
Related resources
No response
Very quick response! I understand that the overlap scheduler is model-independent and is a general optimization that should be supported by default. So are at least some special optimizations still needed here?
The overlap scheduler is model-independent, but it is not yet supported when using DP attention. We have a private branch for this and will upstream it soon.
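For anyone trying this combination in the meantime, a minimal launch sketch with DP attention enabled; the model path and TP size are assumptions, and the `--disable-overlap-schedule` workaround is only an assumption for versions where the two features do not yet work together:

```bash
# Minimal sketch: enable TP+DP attention for DeepSeek-V3 (paths and sizes are assumptions).
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --enable-dp-attention \
  --trust-remote-code
# If your version does not yet support the overlap scheduler together with DP attention,
# passing --disable-overlap-schedule (if available in your version) is one way to work around it.
```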
Is the memory sufficient for an 8-GPU instance? This model is very large.
The 671B model works on 8× H200 with FP8: the FP8 weights take roughly 671 GB, which fits within 8 × 141 GB = 1128 GB of HBM (671 < 1128), leaving headroom for the KV cache.
Hi @fengyang95, you can also consider multi-node deployment.
If you do not have GPUs with large enough memory, please try multi-node tensor parallelism (help 1 help 2).
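A sketch of two-node tensor parallelism following the multi-node example in the benchmark/deepseek_v3 README; the IP address, port, and model path are placeholders:

```bash
# Node 0 (replace 10.0.0.1 with the first node's reachable IP; values are placeholders).
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 \
  --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code
# Node 1:
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 \
  --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code
```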
FYI Due to the tight schedule, SGLang v0.4.1 currently only provides preliminary support for DeepSeek V3. To make it run more cost-efficiently, we need to complete most of the optimizations mentioned above. If you are interested in any of the above optimizations, feel free to join the SGLang Slack for discussions or contribute a PR. We hope to complete these optimizations quickly and appreciate any discussion and contributions.
Update: SGLang v0.4.1.post1 supports CUDA Graph for DeepSeek V3, please use the latest version.
pip install "sglang[all]==0.4.1.post1" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer
Update: SGLang v0.4.1.post2 supports FP8 GEMM Tuning for DeepSeek V3, please use the latest version.
pip install "sglang[all]==0.4.1.post2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer
ref https://github.com/sgl-project/sglang/pull/2647
Any plan to support MTP?
It's on the roadmap and it's named nextn. We'll support it soon.
@zhyncs @Ying1123 @merrymercy Hello, regarding the item mentioned above:
TP+DP Attention @Ying1123
I have two questions, could you help me answer them?
1. Can we decouple TP and DP after this implementation, i.e., configure a scenario where DP is not equal to TP?
2. Is there a detailed schedule for the work mentioned above? Are there any related design documents that can be shared?
I had another question regarding DP attention. The sglang blog mentions that DP attention is effective because MLA has only 1 KV head, which otherwise causes unnecessary duplication of KV caches. DeepSeek-V3 MLA has more KV heads (16 attention, 128 KV), so do we still replicate KV caches if just using something like TP8 or TP16? I understand there might not be sufficient heads if the deployment is large.
+1 for shared design docs, if possible.
@zhyncs @Mutinifni
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/config.json lists "num_key_value_heads": 128. Does DeepSeek-V3 really have 128 KV heads??
Are there any data on inference-time batch size and token imbalance between experts? What's the total throughput like for an 8×H200 node?
Has there been any progress on nextn support?
The overlap scheduler with DP attention cannot be used on 4× A800, because it always OOMs.
Is there a plan to support TP + SP attention?
The paper says "The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP)"
Can you support the DeepSeek-R1 Q4_K_M GGUF file? https://huggingface.co/unsloth/DeepSeek-R1-GGUF
Any progress on nextn speculative decoding?
Hi @ispobock, I wonder if you have a timeline to share for when nextn (speculative decoding) will be supported? Thanks.
We have finished the spec module refactor and will support nextn in the next 1-2 weeks.
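Once nextn lands, the launch interface will presumably follow SGLang's existing speculative decoding flags; the sketch below is only a guess at that interface, and the flag names, values, and draft model path are all assumptions to check against the release notes:

```bash
# Hypothetical sketch only: flag names, values, and the draft model path are assumptions.
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code \
  --speculative-algorithm NEXTN \
  --speculative-draft-model-path <path-to-nextn-draft-weights> \
  --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
```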
Thanks! I wonder if your implementation will include any mechanism to generate the acceptance rates of the MTP head?
"DeepSeek MTP spec decode" (vLLM PR #12755) implements DeepSeek MTP (https://github.com/vllm-project/vllm/issues/12181) to support DeepSeek MTP layers for next-n prediction.
Does sglang now support deepseekv3 inference with EP>1? When I added --enable-ep-moe to the command to start the service, I found that the process would hang. I'm not sure if this is a problem caused by my environment or if this feature is not currently supported.
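For reference, the kind of command that reproduces the hang described above (a sketch; the model path and TP size are assumptions):

```bash
# Sketch of an EP-MoE launch for DeepSeek-V3; model path and TP size are assumptions.
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 --enable-ep-moe --trust-remote-code
```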
Me too, it seems EP-MoE is not supported in 0.4.2.
Need help here. If you are familiar with CUDA optimization and have ideas about this issue, feel free to contact me.
When will MTP (nextn speculative decoding) be supported?
This is CentML's (https://github.com/CentML) implementation of DeepSeek MTP modules that enables speculative decoding for DeepSeek-R1: https://github.com/vllm-project/vllm/pull/12915
Hi, I am new to sglang and need help deploying DeepSeek on two nodes: https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker
I have two H100 nodes. What are the best parameter settings for throughput?
Thanks.