Tiny refactor: DeepSeek V3/R1 NextN shared experts fusion
## Motivation

Ref: https://github.com/sgl-project/sglang/pull/4918
Ref: https://github.com/sgl-project/sglang/pull/5707
Ref: https://github.com/sgl-project/sglang/pull/5793
## Modifications

- Extract the public method `compute_shared_experts_fusion_weights` and put it in `deepseek_v2.py` first.
- Add necessary unit tests.
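For illustration, the extracted helper could look roughly like the sketch below. This is a hypothetical, simplified version: the function name comes from the PR, but the signature, the list-of-weights representation, and the `num_fused_copies` parameter are assumptions, not the actual sglang implementation (which operates on real weight tensors during weight loading).

```python
# Hypothetical sketch of the extracted shared-experts fusion helper.
# The idea: append replicas of the shared-expert weights to the routed-expert
# weights, so the fused MoE kernel treats shared experts as extra routed experts.

def compute_shared_experts_fusion_weights(routed_weights, shared_weights,
                                          num_fused_copies=1):
    """Return routed-expert weights with `num_fused_copies` copies of the
    shared-expert weights appended at the end (illustrative only)."""
    fused = list(routed_weights)
    for _ in range(num_fused_copies):
        fused.extend(shared_weights)
    return fused

# Usage: 4 routed experts plus 1 shared expert fused in once -> 5 experts total.
routed = [f"routed_expert_{i}" for i in range(4)]
shared = ["shared_expert_0"]
fused = compute_shared_experts_fusion_weights(routed, shared)
print(len(fused))  # 5
```

Putting this logic in one public method lets both the main DeepSeek V2/V3 model and the NextN draft model reuse it instead of duplicating the fusion code.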
### Accuracy on A800

```shell
python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 128 --num-shots 8
```

```
Accuracy: 0.960
Invalid: 0.000
Latency: 14.804 s
Output throughput: 1451.247 token/s
```
### Benchmark on A800

```shell
# qps 16
python3 -m sglang.bench_serving --backend sglang --num-prompts 200 --dataset-name random --max-concurrency 16 --random-input 256 --random-output 256 --seed 42
```
```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 200
Benchmark duration (s): 57.65
Total input tokens: 26096
Total generated tokens: 26874
Total generated tokens (retokenized): 26763
Request throughput (req/s): 3.47
Input token throughput (tok/s): 452.70
Output token throughput (tok/s): 466.20
Total token throughput (tok/s): 918.90
Concurrency: 15.77
Accept length: 2.60
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 4546.43
Median E2E Latency (ms): 4602.09
---------------Time to First Token----------------
Mean TTFT (ms): 207.83
Median TTFT (ms): 174.89
P99 TTFT (ms): 476.63
---------------Inter-Token Latency----------------
Mean ITL (ms): 32.54
Median ITL (ms): 19.18
P95 ITL (ms): 90.16
P99 ITL (ms): 168.08
Max ITL (ms): 389.73
==================================================
```
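As a sanity check, the reported throughput figures follow directly from the totals in the log above (small differences in the last digit come from the duration being rounded to 57.65 s in the printout):

```python
# Recompute the throughput figures from the benchmark totals reported above.
num_requests = 200
duration_s = 57.65          # rounded in the log; the tool uses the exact value
input_tokens = 26096
output_tokens = 26874

req_throughput = num_requests / duration_s                  # ~3.47 req/s
input_tok_s = input_tokens / duration_s                     # ~452.7 tok/s
output_tok_s = output_tokens / duration_s                   # ~466.2 tok/s
total_tok_s = (input_tokens + output_tokens) / duration_s   # ~918.8 tok/s

print(round(req_throughput, 2))  # 3.47
```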
## Checklist
- [x] Format your code according to the Code Formatting with Pre-Commit.
- [ ] Add unit tests as outlined in the Running Unit Tests.
- [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [x] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
Will fused shared experts still improve performance with NextN?
> Will fused shared experts still improve performance with NextN?

Yes, I'm still experimenting to measure the current effect.
Can you add a test case?
> Can you add a test case?

OK, I will add it.
Maybe my PR can be merged first to make the commit history a bit clearer.
> Maybe my PR can be merged first to make the commit history a bit clearer.

Yes, I'm waiting for it to be merged. @fzyzcjy
@BBuf @merrymercy @zhyncs could you take a look?
Any update on this PR?
> Any update on this PR?

No, it can be merged now. @xihuai18
@BBuf @fzyzcjy