Tiny refactor DeepSeek V3/R1 NextN shared experts fusion

Open lambert0312 opened this issue 1 year ago • 9 comments

Motivation

Ref https://github.com/sgl-project/sglang/pull/4918
Ref https://github.com/sgl-project/sglang/pull/5707
Ref https://github.com/sgl-project/sglang/pull/5793

Modifications

  • Extract the common method compute_shared_experts_fusion_weights and place it in deepseek_v2.py for now.
  • Add necessary unit tests.
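
For readers unfamiliar with shared experts fusion, the sketch below illustrates the general idea as I understand it from the referenced PRs: the shared expert's weights are re-registered as extra routed-expert slots so that a single fused MoE kernel can handle both. This is not the code in this PR. The helper name remap_shared_expert_weights, its signature, and the checkpoint key layout are all assumptions made for illustration, and plain strings stand in for tensors.

```python
from typing import Dict, List


def remap_shared_expert_weights(
    weights: Dict[str, object],
    moe_layer_ids: List[int],
    n_routed_experts: int,
    n_share_experts_fusion: int,
) -> Dict[str, object]:
    """Hypothetical helper: re-register shared-expert weights as extra routed experts.

    Key names such as "model.layers.{i}.mlp.shared_experts.*" are assumed,
    not taken from this PR.
    """
    suffixes = ["gate_proj.weight", "up_proj.weight", "down_proj.weight"]
    remapped = dict(weights)  # shallow copy so the caller's dict is untouched
    for layer_id in moe_layer_ids:
        for suffix in suffixes:
            shared_name = f"model.layers.{layer_id}.mlp.shared_experts.{suffix}"
            if shared_name not in remapped:
                continue  # this projection is absent for this layer
            tensor = remapped.pop(shared_name)
            # Each replica occupies an expert index past the routed experts.
            for replica in range(n_share_experts_fusion):
                expert_id = n_routed_experts + replica
                new_name = f"model.layers.{layer_id}.mlp.experts.{expert_id}.{suffix}"
                remapped[new_name] = tensor
    return remapped


# Usage with placeholder values instead of real tensors.
example_weights = {
    "model.layers.3.mlp.shared_experts.up_proj.weight": "W_shared_up",
    "model.layers.3.mlp.experts.0.up_proj.weight": "W_expert0_up",
}
fused = remap_shared_expert_weights(
    example_weights,
    moe_layer_ids=[3],
    n_routed_experts=256,
    n_share_experts_fusion=2,
)
for name in sorted(fused):
    print(name)
```

A unit test for a helper of this shape could check that the shared-expert keys disappear, that the expected number of replica keys appears, and that untouched routed-expert keys are preserved.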

Accuracy on A800

python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 128 --num-shots 8 

Accuracy: 0.960
Invalid: 0.000
Latency: 14.804 s
Output throughput: 1451.247 token/s

Benchmark on A800

# qps 16
python3 -m sglang.bench_serving --backend sglang --num-prompts 200 --dataset-name random --max-concurrency 16 --random-input 256 --random-output 256 --seed 42

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max reqeuest concurrency:                16
Successful requests:                     200
Benchmark duration (s):                  57.65
Total input tokens:                      26096
Total generated tokens:                  26874
Total generated tokens (retokenized):    26763
Request throughput (req/s):              3.47
Input token throughput (tok/s):          452.70
Output token throughput (tok/s):         466.20
Total token throughput (tok/s):          918.90
Concurrency:                             15.77
Accept length:                           2.60
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4546.43
Median E2E Latency (ms):                 4602.09
---------------Time to First Token----------------
Mean TTFT (ms):                          207.83
Median TTFT (ms):                        174.89
P99 TTFT (ms):                           476.63
---------------Inter-Token Latency----------------
Mean ITL (ms):                           32.54
Median ITL (ms):                         19.18
P95 ITL (ms):                            90.16
P99 ITL (ms):                            168.08
Max ITL (ms):                            389.73
==================================================

Checklist

  • [x] Format your code according to the Code Formatting with Pre-Commit.
  • [ ] Add unit tests as outlined in the Running Unit Tests.
  • [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
  • [x] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
  • [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
  • [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

lambert0312 avatar Apr 08 '25 01:04 lambert0312

Will fused shared experts still improve performance with NextN?

xihuai18 avatar Apr 08 '25 11:04 xihuai18

Will fused shared experts still improve performance with NextN?

Yes, I'm still experimenting to measure the current effect.

lambert0312 avatar Apr 08 '25 15:04 lambert0312

Can you add a test case?

merrymercy avatar Apr 21 '25 01:04 merrymercy

Can you add a test case?

OK, I will add one.

lambert0312 avatar Apr 21 '25 08:04 lambert0312

Maybe my PR can be merged first to keep the commit history a bit clearer.

fzyzcjy avatar Apr 21 '25 09:04 fzyzcjy

Maybe my PR can be merged first to keep the commit history a bit clearer.

Yes, I'm waiting for it to be merged @fzyzcjy

lambert0312 avatar Apr 21 '25 09:04 lambert0312

@BBuf @merrymercy @zhyncs could you take a look?

lambert0312 avatar Apr 25 '25 09:04 lambert0312

Any update on this PR?

xihuai18 avatar May 07 '25 03:05 xihuai18

Any update on this PR?

No further updates; it can be merged. @xihuai18

lambert0312 avatar May 07 '25 23:05 lambert0312

@BBuf @fzyzcjy

zhyncs avatar Jun 09 '25 07:06 zhyncs