[Track] DeepSeek V3/R1 nextn progress
Triton Backend
@ispobock @pankajroark
- [x] support EAGLE 2
- [ ] support nextn (multi MTP heads) (WIP @pankajroark)
FlashInfer Backend
@zhyncs @yzh119
- [x] compatible with disabled MLA
- [x] support FlashInfer nightly MLA ragged prefill and CUDA Core MLA decoding
- [x] support FlashInfer v0.2.0.post3 MLA ragged/paged prefill and decoding (@zhyncs @yzh119)
- [ ] nextn parts can be shared with the Triton backend
EAGLE 2
@zhyncs @Ying1123
- [x] implement sampling kernel in sgl-kernel (drop cutex): kernel part, python part (see the sketch after the references below)
- [x] bunch of fixes: non-greedy fix, disable-cuda-graph fix 1, fix 2, cleanup 1, cleanup 2, fix cuda graph capture failure, fix 2, reduce one draft forward
- [ ] compatible with radix cache and chunked prefill (WIP @Ying1123)
ref: MTP support https://github.com/sgl-project/sglang/pull/3582, v0.4.3.post1 release https://github.com/sgl-project/sglang/pull/3638
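For readers unfamiliar with what the sampling kernel in the EAGLE 2 checklist does, below is a minimal greedy sketch of the draft-and-verify step at the core of MTP/EAGLE-style speculative decoding. This is illustrative pseudocode made runnable, not sglang's sgl-kernel implementation (which is a CUDA kernel and also handles non-greedy tree sampling); all names here are hypothetical.

```python
import torch

def verify_greedy(draft_tokens: torch.Tensor, target_logits: torch.Tensor) -> torch.Tensor:
    """Greedy verification of speculative draft tokens.

    draft_tokens:  (k,)      token ids proposed by the draft (MTP) head
    target_logits: (k+1, V)  target-model logits; row i scores position i,
                             with one extra row for a bonus token
    Returns the longest matching prefix of draft_tokens plus one token
    chosen by the target (a correction on mismatch, a bonus on full accept).
    """
    k = draft_tokens.shape[0]
    target_choice = target_logits.argmax(dim=-1)       # (k+1,) target's greedy picks
    match = target_choice[:k] == draft_tokens          # (k,) per-position agreement
    # Count the leading run of matches: cumprod zeroes out after the first mismatch.
    n_accept = int(torch.cumprod(match.long(), dim=0).sum().item())
    return torch.cat([draft_tokens[:n_accept], target_choice[n_accept:n_accept + 1]])

# Tiny demo with a vocabulary of 8 tokens.
draft = torch.tensor([3, 5, 2])
logits = torch.full((4, 8), -1.0)
logits[0, 3] = logits[1, 5] = 1.0    # target agrees with the first two draft tokens
logits[2, 7] = logits[3, 0] = 1.0    # target disagrees at position 2
print(verify_greedy(draft, logits))  # tensor([3, 5, 7]): two accepted + correction
```

The win is that one target forward pass over the k draft positions can emit several tokens at once, which is why multiple MTP heads (more draft tokens per step) are worth tracking above.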
SGLang supports MTP (nextn) in the Triton backend, achieving a speed of 77 tokens/s, twice as fast as other OSS LLM engines.
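For reference, a minimal sketch of enabling this through sglang's offline Engine API. The `speculative_*` argument names mirror the server flags added in the MTP PR linked above, but the numeric values are illustrative rather than a tuned configuration, and depending on the release a separate `speculative_draft_model_path` pointing at exported nextn weights may also be required.

```python
# Hedged sketch: MTP (nextn) speculative decoding via sglang's Engine API.
# Values below are illustrative, not a tuned configuration.
import sglang as sgl

llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-V3",
    speculative_algorithm="NEXTN",    # MTP draft head, per the PR above
    speculative_num_steps=2,          # draft steps per verification pass
    speculative_eagle_topk=4,         # branching factor of the draft tree
    speculative_num_draft_tokens=4,   # tokens sent to the target to verify
    trust_remote_code=True,
    tp_size=8,                        # illustrative; size to your cluster
)

# A single-prompt call returns the generated text for that prompt.
out = llm.generate("The capital of France is", {"temperature": 0, "max_new_tokens": 16})
print(out["text"])
llm.shutdown()
```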
Woo, thank you @zhyncs.
I just tried the new image lmsysorg/sglang:v0.4.3.post2-cu125.
The performance seems similar to 0.4.2 (on 16 x H20):
when running-req = 1, the gen throughput (token/s) is no higher than before.
What did I miss?
I see "compatible with radix cache and chunked prefill" in the list. How is that going?
Long-context scenarios require this feature. @zhyncs
The current EAGLE implementation has two issues:
- It does not support chunked prefill.
- The draft model follows the same distributed strategy as the target model.
Does the community have any plans to address these two issues?
@yukavio Chunked prefill support is on the way. @merrymercy
Will you support DP + MTP?
@zhyncs Hi, do we support multiple MTP heads now? Is there an example?
@zhyncs @pankajroark Hi, is there any progress on supporting multiple MTP heads?
Hi @pankajroark, do you have any updates or docs about multiple MTP heads? Thanks.
Still working on multiple MTP heads.