[Track] DeepSeek V3/R1 nextn progress
Triton Backend
@ispobock @pankajroark
- [x] support EAGLE 2
- [ ] support nextn (multi MTP heads) (WIP @pankajroark)
FlashInfer Backend
@zhyncs @yzh119
- [x] compatible with disabled MLA
- [x] support FlashInfer nightly MLA ragged prefill and CUDA Core MLA decoding
- [x] support FlashInfer v0.2.0.post3 MLA ragged/paged prefill and decoding (@zhyncs @yzh119)
- [ ] nextn parts can be shared with the Triton backend
EAGLE 2
@zhyncs @Ying1123
- [x] implement sampling kernel in sgl-kernel (drop cutex): kernel part, python part (see the sketch after the references below)
- [x] bunch of fixes: non-greedy fix, disable-cuda-graph fix 1, fix 2, cleanup 1, cleanup 2, fix cuda graph capture failure, fix 2, reduce one draft forward
- [ ] compatible with radix cache and chunked prefill (WIP @Ying1123)
ref: MTP support https://github.com/sgl-project/sglang/pull/3582, v0.4.3.post1 release https://github.com/sgl-project/sglang/pull/3638
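For readers unfamiliar with what the sampling kernel in the EAGLE 2 checklist does, below is a minimal greedy sketch of the draft-and-verify step at the core of MTP/EAGLE-style speculative decoding. This is illustrative pseudocode made runnable, not sglang's sgl-kernel implementation (which is a CUDA kernel and also handles non-greedy tree sampling); all names here are hypothetical.

```python
import torch

def verify_greedy(draft_tokens: torch.Tensor, target_logits: torch.Tensor) -> torch.Tensor:
    """Greedy verification of speculative draft tokens.

    draft_tokens:  (k,)      token ids proposed by the draft (MTP) head
    target_logits: (k+1, V)  target-model logits; row i scores position i,
                             with one extra row for a bonus token
    Returns the longest matching prefix of draft_tokens plus one token
    chosen by the target (a correction on mismatch, a bonus on full accept).
    """
    k = draft_tokens.shape[0]
    target_choice = target_logits.argmax(dim=-1)       # (k+1,) target's greedy picks
    match = target_choice[:k] == draft_tokens          # (k,) per-position agreement
    # Count the leading run of matches: cumprod zeroes out after the first mismatch.
    n_accept = int(torch.cumprod(match.long(), dim=0).sum().item())
    return torch.cat([draft_tokens[:n_accept], target_choice[n_accept:n_accept + 1]])

# Tiny demo with a vocabulary of 8 tokens.
draft = torch.tensor([3, 5, 2])
logits = torch.full((4, 8), -1.0)
logits[0, 3] = logits[1, 5] = 1.0    # target agrees with the first two draft tokens
logits[2, 7] = logits[3, 0] = 1.0    # target disagrees at position 2
print(verify_greedy(draft, logits))  # tensor([3, 5, 7]): two accepted + correction
```

The win is that one target forward pass over the k draft positions can emit several tokens at once, which is why multiple MTP heads (more draft tokens per step) are worth tracking above.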
SGLang supports MTP (nextn) in the Triton backend, achieving a speed of 77 tokens/s, twice as fast as other OSS LLM engines.
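For reference, a minimal sketch of enabling this through sglang's offline Engine API. The `speculative_*` argument names mirror the server flags added in the MTP PR linked above, but the numeric values are illustrative rather than a tuned configuration, and depending on the release a separate `speculative_draft_model_path` pointing at exported nextn weights may also be required.

```python
# Hedged sketch: MTP (nextn) speculative decoding via sglang's Engine API.
# Values below are illustrative, not a tuned configuration.
import sglang as sgl

llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-V3",
    speculative_algorithm="NEXTN",    # MTP draft head, per the PR above
    speculative_num_steps=2,          # draft steps per verification pass
    speculative_eagle_topk=4,         # branching factor of the draft tree
    speculative_num_draft_tokens=4,   # tokens sent to the target to verify
    trust_remote_code=True,
    tp_size=8,                        # illustrative; size to your cluster
)

# A single-prompt call returns the generated text for that prompt.
out = llm.generate("The capital of France is", {"temperature": 0, "max_new_tokens": 16})
print(out["text"])
llm.shutdown()
```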
Woo, thank you @zhyncs.
I just tried the new image lmsysorg/sglang:v0.4.3.post2-cu125.
The performance seems similar to 0.4.2 (on 16 x H20):
when running-req = 1, the gen throughput (token/s) is no higher than before.
What did I miss?
I see "compatible with radix cache and chunked prefill" in the list. How is that going?
Long-context scenarios require this feature. @zhyncs
The current EAGLE implementation has two issues:
- It does not support chunked prefill.
- The draft model follows the same distributed strategy as the target model.
Does the community have any plans to address these two issues?
@yukavio Chunked prefill support is on the way. @merrymercy
Will you support DP + MTP?
@zhyncs Hi, do we support multiple MTP heads now? Is there an example?
@zhyncs @pankajroark Hi, is there any progress on supporting multiple MTP heads?
Hi @pankajroark, do you have any updates or docs about multiple MTP heads? Thanks.
Still working on multiple MTP heads.