Optimize MLA/GQA/MQA Triton decoding
Motivation
Optimize memory access for MLA/GQA/MQA decoding.
Modification
Each block handles BLOCK_H query heads that share the same k/v head, so the shared k/v data is loaded once and reused across those heads. Inspired by https://github.com/InternLM/lmdeploy/pull/1649.
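For illustration, here is a minimal, self-contained Triton sketch of the grouping idea (not the actual PR kernel, and the names such as _grouped_decode_kernel are made up): one program serves all BLOCK_H query heads of a k/v group, so each K/V tile is fetched from global memory once and reused on-chip for every grouped head. A contiguous (non-paged) KV layout, unit stride on the head dimension, and power-of-2 head dims are simplifying assumptions.

import torch
import triton
import triton.language as tl

@triton.jit
def _grouped_decode_kernel(
    Q, K, V, O,                      # Q/O: [B, HQ, D], K/V: [B, S, HKV, D]
    stride_qb, stride_qh,
    stride_kb, stride_ks, stride_kh,
    stride_ob, stride_oh,
    seq_len, kv_group_num, sm_scale,
    BLOCK_H: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_D: tl.constexpr,
):
    pid_b = tl.program_id(0)         # batch index
    pid_kv = tl.program_id(1)        # shared k/v head index

    # The BLOCK_H query heads served by this program.
    h_off = pid_kv * kv_group_num + tl.arange(0, BLOCK_H)
    h_mask = tl.arange(0, BLOCK_H) < kv_group_num
    d_off = tl.arange(0, BLOCK_D)

    # Load all grouped query heads at once: [BLOCK_H, BLOCK_D].
    q = tl.load(Q + pid_b * stride_qb + h_off[:, None] * stride_qh + d_off[None, :],
                mask=h_mask[:, None], other=0.0)

    m_i = tl.full([BLOCK_H], float("-inf"), dtype=tl.float32)   # running max
    l_i = tl.zeros([BLOCK_H], dtype=tl.float32)                 # running sum
    acc = tl.zeros([BLOCK_H, BLOCK_D], dtype=tl.float32)

    for start_n in range(0, seq_len, BLOCK_N):
        n_off = start_n + tl.arange(0, BLOCK_N)
        n_mask = n_off < seq_len
        # K/V tile of the single shared head: loaded once, reused by all BLOCK_H heads.
        # v is assumed to share k's layout/strides.
        k = tl.load(K + pid_b * stride_kb + n_off[:, None] * stride_ks
                    + pid_kv * stride_kh + d_off[None, :],
                    mask=n_mask[:, None], other=0.0)
        v = tl.load(V + pid_b * stride_kb + n_off[:, None] * stride_ks
                    + pid_kv * stride_kh + d_off[None, :],
                    mask=n_mask[:, None], other=0.0)

        qk = tl.dot(q, tl.trans(k)) * sm_scale                  # [BLOCK_H, BLOCK_N]
        qk = tl.where(n_mask[None, :], qk, float("-inf"))

        # Online softmax update shared by all grouped heads.
        m_new = tl.maximum(m_i, tl.max(qk, 1))
        alpha = tl.exp(m_i - m_new)
        p = tl.exp(qk - m_new[:, None])
        l_i = l_i * alpha + tl.sum(p, 1)
        acc = acc * alpha[:, None] + tl.dot(p.to(v.dtype), v)
        m_i = m_new

    out = acc / l_i[:, None]
    tl.store(O + pid_b * stride_ob + h_off[:, None] * stride_oh + d_off[None, :],
             out.to(O.dtype.element_ty), mask=h_mask[:, None])

def grouped_decode(q, k, v, sm_scale):
    # q: [B, HQ, D], k/v: [B, S, HKV, D]; HQ must be a multiple of HKV.
    B, HQ, D = q.shape
    S, HKV = k.shape[1], k.shape[2]
    group = HQ // HKV
    o = torch.empty_like(q)
    # One program per (batch, kv head) instead of per (batch, q head).
    grid = (B, HKV)
    BLOCK_H = max(16, triton.next_power_of_2(group))  # tl.dot needs tiles >= 16
    _grouped_decode_kernel[grid](
        q, k, v, o,
        q.stride(0), q.stride(1),
        k.stride(0), k.stride(1), k.stride(2),
        o.stride(0), o.stride(1),
        S, group, sm_scale,
        BLOCK_H=BLOCK_H, BLOCK_N=64, BLOCK_D=D,
    )
    return o

The kernel in the PR additionally deals with details this sketch leaves out (for example the token-indexed KV cache layout), so treat this purely as an illustration of the grouping.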
Tested on A100-80G: DeepSeek-V2-Lite
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: 128.0
Successful requests: 5000
Benchmark duration (s): 238.01
Total input tokens: 1187865
Total generated tokens: 1089941
Total generated tokens (retokenized): 1088588
Request throughput (req/s): 21.01
Input token throughput (tok/s): 4990.76
Output token throughput (tok/s): 4579.34
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 82822.78
Median E2E Latency (ms): 79653.86
---------------Time to First Token----------------
Mean TTFT (ms): 7167.67
Median TTFT (ms): 4229.26
P99 TTFT (ms): 21327.09
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 1073.28
Median TPOT (ms): 473.77
P99 TPOT (ms): 7907.65
---------------Inter-token Latency----------------
Mean ITL (ms): 409.14
Median ITL (ms): 165.46
P99 ITL (ms): 1814.59
==================================================
subject: abstract_algebra, #q:100, acc: 0.270
subject: anatomy, #q:135, acc: 0.504
subject: astronomy, #q:152, acc: 0.572
subject: business_ethics, #q:100, acc: 0.600
subject: clinical_knowledge, #q:265, acc: 0.642
subject: college_biology, #q:144, acc: 0.653
subject: college_chemistry, #q:100, acc: 0.410
subject: college_computer_science, #q:100, acc: 0.440
subject: college_mathematics, #q:100, acc: 0.380
subject: college_medicine, #q:173, acc: 0.601
Total latency: 33.251
Average accuracy: 0.535
Llama-3-8B
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: 128.0
Successful requests: 1000
Benchmark duration (s): 49.48
Total input tokens: 213987
Total generated tokens: 199779
Total generated tokens (retokenized): 198032
Request throughput (req/s): 20.21
Input token throughput (tok/s): 4324.67
Output token throughput (tok/s): 4037.53
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 20671.67
Median E2E Latency (ms): 19467.73
---------------Time to First Token----------------
Mean TTFT (ms): 3234.54
Median TTFT (ms): 1188.96
P99 TTFT (ms): 14154.28
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 186.30
Median TPOT (ms): 90.10
P99 TPOT (ms): 1976.89
---------------Inter-token Latency----------------
Mean ITL (ms): 91.19
Median ITL (ms): 61.85
P99 ITL (ms): 308.93
==================================================
subject: abstract_algebra, #q:100, acc: 0.330
subject: anatomy, #q:135, acc: 0.696
subject: astronomy, #q:152, acc: 0.684
subject: business_ethics, #q:100, acc: 0.630
subject: clinical_knowledge, #q:265, acc: 0.751
subject: college_biology, #q:144, acc: 0.771
subject: college_chemistry, #q:100, acc: 0.460
subject: college_computer_science, #q:100, acc: 0.520
subject: college_mathematics, #q:100, acc: 0.340
subject: college_medicine, #q:173, acc: 0.636
Total latency: 41.592
Average accuracy: 0.618
Reproduce:
python3 -m sglang.launch_server --model-path DeepSeek-V2-Lite --port 30000 --trust-remote-code --disable-radix-cache --enable-mla --tp=1
python3 -m sglang.bench_serving --backend sglang --tokenizer DeepSeek-V2-Lite --dataset-path /workdir/datasets/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate 128
python3 benchmark/mmlu/bench_sglang.py --nsub 10
python3 -m sglang.launch_server --model-path Meta-Llama-3-8B --port 30000 --trust-remote-code --disable-radix-cache --disable-flashinfer --tp=1
python3 -m sglang.bench_serving --backend sglang --tokenizer Meta-Llama-3-8B --dataset-path /workdir/datasets/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate 128
python3 benchmark/mmlu/bench_sglang.py --nsub 10
Nice work! TL;DR: the reuse moves from L2 to the block level. Is that right? @ispobock
@zhyncs The previous version reuses the k/v data through the L2 cache. This version reuses the shared k/v head from SMEM.
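Rough, illustrative arithmetic of why this matters for memory traffic (all sizes below are assumptions, not measurements): with one program per query head, every program streams the whole K/V of its shared head from HBM and reuse only happens opportunistically in L2; with one program per k/v head, the tile is streamed once and reused from on-chip memory by all grouped heads.

# Illustrative arithmetic only; per layer, per request.
num_q_heads = 16      # e.g. DeepSeek-V2-Lite attention heads with TP=1 (assumed)
num_kv_heads = 1      # MLA's compressed KV behaves like MQA: one shared head
head_dim = 576        # assumed compressed dim: kv_lora_rank 512 + rope dim 64
seq_len = 2048
bytes_per_elem = 2    # fp16/bf16

kv_bytes = seq_len * num_kv_heads * head_dim * bytes_per_elem

# One program per q head: each of the 16 programs streams the same shared head from HBM.
per_q_head_traffic = num_q_heads * kv_bytes
# One program per kv head (this PR): the shared head is streamed once and reused on-chip.
grouped_traffic = num_kv_heads * kv_bytes

print(f"per-q-head programs: {per_q_head_traffic / 2**20:.1f} MiB per decode step")
print(f"grouped programs   : {grouped_traffic / 2**20:.1f} MiB per decode step")
print(f"reduction          : {per_q_head_traffic / grouped_traffic:.0f}x")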
ref https://github.com/sgl-project/sglang/pull/905#issuecomment-2267514917
After a brief look, the throughput has roughly doubled compared to the previous MLA version, great work! cc @merrymercy @Ying1123 @hnyls2002
@Xu-Chen @lxww302 I noticed that you have used SGLang's DeepSeek V2 TP8 MLA implementation before. Could you help verify the performance of the new version, for example on devices you have such as A100 TP8, A800 TP8, or H100 TP8? Thanks very much!
git clone -b decode_gqa_opt https://github.com/ispobock/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V2 --port 30000 --trust-remote-code --disable-radix-cache --enable-mla --tp=8
python3 -m sglang.bench_serving --backend sglang
I have 8x H100s; I ran your command:
Backend: sglang
Traffic request rate: inf
Successful requests: 1000
Benchmark duration (s): 182.31
Total input tokens: 236142
Total generated tokens: 215614
Total generated tokens (retokenized): 215037
Request throughput (req/s): 5.49
Input token throughput (tok/s): 1295.28
Output token throughput (tok/s): 1182.68
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 75887.79
Median E2E Latency (ms): 77685.35
---------------Time to First Token----------------
Mean TTFT (ms): 43446.36
Median TTFT (ms): 39279.88
P99 TTFT (ms): 104146.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 181.96
Median TPOT (ms): 161.47
P99 TPOT (ms): 653.64
---------------Inter-token Latency----------------
Mean ITL (ms): 152.74
Median ITL (ms): 99.26
P99 ITL (ms): 465.58
==================================================
Thanks! Is it H100 SXM or NVL? @81549361
Could you collect the env info with python3 -m sglang.check_env? @81549361
Not sure if this could be helpful or not, but I ran llmperf for both main branch and incoming branch. Overall this PR seems to make things much faster:
- GPU: NVIDIA A40 TP=2 DP=1
- Model: Qwen/Qwen2-72B-Instruct-AWQ
llmperf command used
python token_benchmark_ray.py \
--model "${MODEL}" \
--mean-input-tokens 1500 \
--stddev-input-tokens 150 \
--mean-output-tokens 245 \
--stddev-output-tokens 20 \
--max-num-completed-requests "64" \
--timeout 7200 \
--num-concurrent-requests "8" \
--llm-api openai \
--additional-sampling-params '{}'
main branch
{
"version": "2023-08-31",
"mean_input_tokens": 1500,
"stddev_input_tokens": 150,
"mean_output_tokens": 245,
"stddev_output_tokens": 20,
"num_concurrent_requests": 8,
"results_inter_token_latency_s_quantiles_p25": 0.03990331099470551,
"results_inter_token_latency_s_quantiles_p50": 0.057948063652443406,
"results_inter_token_latency_s_quantiles_p75": 0.08040066503004678,
"results_inter_token_latency_s_quantiles_p90": 0.08383243498141633,
"results_inter_token_latency_s_quantiles_p95": 0.08516111126646178,
"results_inter_token_latency_s_quantiles_p99": 0.10164050496592587,
"results_inter_token_latency_s_mean": 0.06027883582796916,
"results_inter_token_latency_s_min": 0.03675615620323733,
"results_inter_token_latency_s_max": 0.1020314351556132,
"results_inter_token_latency_s_stddev": 0.0211621866217624,
"results_ttft_s_quantiles_p25": 0.4133454477414489,
"results_ttft_s_quantiles_p50": 1.016814228380099,
"results_ttft_s_quantiles_p75": 11.284791270736605,
"results_ttft_s_quantiles_p90": 11.749069100199268,
"results_ttft_s_quantiles_p95": 11.803535583987832,
"results_ttft_s_quantiles_p99": 11.955875016311182,
"results_ttft_s_mean": 5.338054827436281,
"results_ttft_s_min": 0.2691499590873718,
"results_ttft_s_max": 12.148427874781191,
"results_ttft_s_stddev": 5.495650480946165,
"results_end_to_end_latency_s_quantiles_p25": 11.498506030999124,
"results_end_to_end_latency_s_quantiles_p50": 15.51382327103056,
"results_end_to_end_latency_s_quantiles_p75": 22.9230548851192,
"results_end_to_end_latency_s_quantiles_p90": 23.657817971240732,
"results_end_to_end_latency_s_quantiles_p95": 23.97725157707464,
"results_end_to_end_latency_s_quantiles_p99": 24.61372328522615,
"results_end_to_end_latency_s_mean": 16.84320118615142,
"results_end_to_end_latency_s_min": 3.5896931253373623,
"results_end_to_end_latency_s_max": 25.067169249989092,
"results_end_to_end_latency_s_stddev": 6.076063540076458,
"results_request_output_throughput_token_per_s_quantiles_p25": 12.432897921487776,
"results_request_output_throughput_token_per_s_quantiles_p50": 17.950591526918625,
"results_request_output_throughput_token_per_s_quantiles_p75": 25.023589881617227,
"results_request_output_throughput_token_per_s_quantiles_p90": 25.61754857375858,
"results_request_output_throughput_token_per_s_quantiles_p95": 26.080372795146523,
"results_request_output_throughput_token_per_s_quantiles_p99": 27.12744569799552,
"results_request_output_throughput_token_per_s_mean": 18.7890127702506,
"results_request_output_throughput_token_per_s_min": 9.773737854436295,
"results_request_output_throughput_token_per_s_max": 27.204481327432568,
"results_request_output_throughput_token_per_s_stddev": 6.462698432888159,
"results_number_input_tokens_quantiles_p25": 1419.75,
"results_number_input_tokens_quantiles_p50": 1513.5,
"results_number_input_tokens_quantiles_p75": 1585.25,
"results_number_input_tokens_quantiles_p90": 1726.1000000000001,
"results_number_input_tokens_quantiles_p95": 1812.2499999999998,
"results_number_input_tokens_quantiles_p99": 1942.5299999999997,
"results_number_input_tokens_mean": 1515.53125,
"results_number_input_tokens_min": "1125",
"results_number_input_tokens_max": "1986",
"results_number_input_tokens_stddev": 157.1251617922921,
"results_number_output_tokens_quantiles_p25": 271.25,
"results_number_output_tokens_quantiles_p50": 287.0,
"results_number_output_tokens_quantiles_p75": 304.5,
"results_number_output_tokens_quantiles_p90": 318.0,
"results_number_output_tokens_quantiles_p95": 326.4,
"results_number_output_tokens_quantiles_p99": 340.37,
"results_number_output_tokens_mean": 280.546875,
"results_number_output_tokens_min": "78",
"results_number_output_tokens_max": "341",
"results_number_output_tokens_stddev": 43.62427229119711,
"results_num_requests_started": 64,
"results_error_rate": 0.0,
"results_number_errors": 0,
"results_error_code_frequency": "{}",
"results_mean_output_throughput_token_per_s": 122.91809365087381,
"results_num_completed_requests": 64,
"results_num_completed_requests_per_min": 26.288247263678944,
"timestamp": 1723922364
}
incoming branch
{
"version": "2023-08-31",
"mean_input_tokens": 1500,
"stddev_input_tokens": 150,
"mean_output_tokens": 245,
"stddev_output_tokens": 20,
"num_concurrent_requests": 8,
"results_inter_token_latency_s_quantiles_p25": 0.04048058146969138,
"results_inter_token_latency_s_quantiles_p50": 0.04134249718749723,
"results_inter_token_latency_s_quantiles_p75": 0.042773683461634744,
"results_inter_token_latency_s_quantiles_p90": 0.04477736409998821,
"results_inter_token_latency_s_quantiles_p95": 0.04621570852103804,
"results_inter_token_latency_s_quantiles_p99": 0.04943066709057319,
"results_inter_token_latency_s_mean": 0.04202164194913325,
"results_inter_token_latency_s_min": 0.03828613981456747,
"results_inter_token_latency_s_max": 0.05096760665209523,
"results_inter_token_latency_s_stddev": 0.0023344492257422154,
"results_ttft_s_quantiles_p25": 0.3779949996387586,
"results_ttft_s_quantiles_p50": 0.403224729700014,
"results_ttft_s_quantiles_p75": 0.44007199979387224,
"results_ttft_s_quantiles_p90": 0.4766438877210021,
"results_ttft_s_quantiles_p95": 0.4872294148663059,
"results_ttft_s_quantiles_p99": 0.49447528753429654,
"results_ttft_s_mean": 0.4035295032663271,
"results_ttft_s_min": 0.2787872082553804,
"results_ttft_s_max": 0.49528229096904397,
"results_ttft_s_stddev": 0.05853017613187361,
"results_end_to_end_latency_s_quantiles_p25": 10.952284958562814,
"results_end_to_end_latency_s_quantiles_p50": 11.724067542003468,
"results_end_to_end_latency_s_quantiles_p75": 12.392438833485357,
"results_end_to_end_latency_s_quantiles_p90": 12.949160708626732,
"results_end_to_end_latency_s_quantiles_p95": 13.369823349895887,
"results_end_to_end_latency_s_quantiles_p99": 13.602660472076385,
"results_end_to_end_latency_s_mean": 11.063488117179077,
"results_end_to_end_latency_s_min": 2.310943207703531,
"results_end_to_end_latency_s_max": 13.658869832754135,
"results_end_to_end_latency_s_stddev": 2.5735290879206163,
"results_request_output_throughput_token_per_s_quantiles_p25": 23.376963498120137,
"results_request_output_throughput_token_per_s_quantiles_p50": 24.13135072660546,
"results_request_output_throughput_token_per_s_quantiles_p75": 24.70095651189223,
"results_request_output_throughput_token_per_s_quantiles_p90": 25.105406335351436,
"results_request_output_throughput_token_per_s_quantiles_p95": 25.318698051259776,
"results_request_output_throughput_token_per_s_quantiles_p99": 26.00064578019821,
"results_request_output_throughput_token_per_s_mean": 23.819321580789712,
"results_request_output_throughput_token_per_s_min": 19.61920693264775,
"results_request_output_throughput_token_per_s_max": 26.11816971864744,
"results_request_output_throughput_token_per_s_stddev": 1.3040854008387603,
"results_number_input_tokens_quantiles_p25": 1419.75,
"results_number_input_tokens_quantiles_p50": 1513.5,
"results_number_input_tokens_quantiles_p75": 1585.25,
"results_number_input_tokens_quantiles_p90": 1726.1000000000001,
"results_number_input_tokens_quantiles_p95": 1812.2499999999998,
"results_number_input_tokens_quantiles_p99": 1942.5299999999997,
"results_number_input_tokens_mean": 1515.53125,
"results_number_input_tokens_min": "1125",
"results_number_input_tokens_max": "1986",
"results_number_input_tokens_stddev": 157.1251617922921,
"results_number_output_tokens_quantiles_p25": 265.75,
"results_number_output_tokens_quantiles_p50": 285.0,
"results_number_output_tokens_quantiles_p75": 296.25,
"results_number_output_tokens_quantiles_p90": 317.0,
"results_number_output_tokens_quantiles_p95": 322.0,
"results_number_output_tokens_quantiles_p99": 338.84999999999997,
"results_number_output_tokens_mean": 265.484375,
"results_number_output_tokens_min": "47",
"results_number_output_tokens_max": "342",
"results_number_output_tokens_stddev": 66.06466101119273,
"results_num_requests_started": 64,
"results_error_rate": 0.0,
"results_number_errors": 0,
"results_error_code_frequency": "{}",
"results_mean_output_throughput_token_per_s": 162.73324599263228,
"results_num_completed_requests": 64,
"results_num_completed_requests_per_min": 36.77803923322394,
"timestamp": 1723922279
}
python3 -m sglang.check_env
Python: 3.12.3 | packaged by Anaconda, Inc. | (main, May 6 2024, 19:46:43) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 535.183.01
PyTorch: 2.4.0+cu121
flashinfer: 0.1.5+cu121torch2.4
triton: 3.0.0
transformers: 4.44.0
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.3
fastapi: 0.112.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.5
interegular: 0.3.3
packaging: 23.2
PIL: 10.3.0
psutil: 5.9.8
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.0.3
vllm: 0.5.4
multipart: 0.0.9
openai: 1.40.8
anthropic: 0.34.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 0-159 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 0-159 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 0-159 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 0-159 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 0-159 0 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 0-159 0 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 0-159 0 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X 0-159 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
ulimit soft: 1048576
What is your startup command? I don't see any noticeable improvement in llama3 8b FP8.
@81549361 The startup command I used for both branches is the same:
python3 -m sglang.launch_server \
--model-path "${MODEL}" \
--host 127.0.0.1 \
--port 8080 \
--context-length "4096" \
--max-prefill-tokens "16384" \
--mem-fraction-static "0.85" \
--schedule-conservativeness "0.05" \
--tp-size "2" \
--dp-size "1" \
--log-level-http warning
I don't see any noticeable improvement in llama3 8b FP8.
@81549361 Did you add --disable-flashinfer for both branches on llama3?
Awesome! Will test DeepSeek-V2-Chat on 8*A800 next week.
Tested on A800-80G: DeepSeek-V2-Lite
Main branch ( DeepSeek-V2-Lite-Chat on 1 * A800-80G )
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 1000
Benchmark duration (s): 90.75
Total input tokens: 236142
Total generated tokens: 215614
Total generated tokens (retokenized): 214087
Request throughput (req/s): 11.02
Input token throughput (tok/s): 2602.12
Output token throughput (tok/s): 2375.92
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 39248.93
Median E2E Latency (ms): 34872.34
---------------Time to First Token----------------
Mean TTFT (ms): 10523.55
Median TTFT (ms): 10943.01
P99 TTFT (ms): 15801.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 233.94
Median TPOT (ms): 151.23
P99 TPOT (ms): 1772.93
---------------Inter-token Latency----------------
Mean ITL (ms): 140.10
Median ITL (ms): 117.96
P99 ITL (ms): 385.24
==================================================
This PR ( DeepSeek-V2-Lite-Chat on 1 * A800-80G )
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 1000
Benchmark duration (s): 59.89
Total input tokens: 236142
Total generated tokens: 215614
Total generated tokens (retokenized): 214102
Request throughput (req/s): 16.70
Input token throughput (tok/s): 3942.60
Output token throughput (tok/s): 3599.87
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 25766.33
Median E2E Latency (ms): 23320.27
---------------Time to First Token----------------
Mean TTFT (ms): 9147.00
Median TTFT (ms): 9517.37
P99 TTFT (ms): 14099.85
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 161.66
Median TPOT (ms): 72.40
P99 TPOT (ms): 1690.97
---------------Inter-token Latency----------------
Mean ITL (ms): 82.39
Median ITL (ms): 56.74
P99 ITL (ms): 247.01
==================================================
Tested DeepSeek-V2-Chat-0628 on 8*A800
serve
python3 -m sglang.launch_server \
--model-path /data/model-cache/deepseek-ai/DeepSeek-V2-Chat-0628 \
--served-model-name deepseek-chat \
--tp 8 \
--enable-mla \
--disable-radix-cache \
--mem-fraction-static 0.87 \
--schedule-conservativeness 0.1 \
--chunked-prefill-size 32768 \
--max-prefill-tokens 163840 \
--trust-remote-code \
--host 0.0.0.0 \
--port 50521
test
python3 -m sglang.bench_serving \
--backend sglang \
--dataset-name sharegpt \
--dataset-path /data/model-cache/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json \
--model /data/model-cache/deepseek-ai/DeepSeek-V2-Chat-0628 \
--port 50521
result
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 1000
Benchmark duration (s): 604.96
Total input tokens: 236142
Total generated tokens: 215614
Total generated tokens (retokenized): 214714
Request throughput (req/s): 1.65
Input token throughput (tok/s): 390.34
Output token throughput (tok/s): 356.41
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 374607.65
Median E2E Latency (ms): 392302.17
---------------Time to First Token----------------
Mean TTFT (ms): 184913.93
Median TTFT (ms): 150008.79
P99 TTFT (ms): 424698.14
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 1651.19
Median TPOT (ms): 1100.21
P99 TPOT (ms): 10328.23
---------------Inter-token Latency----------------
Mean ITL (ms): 890.30
Median ITL (ms): 582.39
P99 ITL (ms): 3893.44
==================================================
Should I use the base model? Or are my params incorrect?
@halexan You don't need to set these:
--mem-fraction-static 0.87 \
--schedule-conservativeness 0.1 \
--chunked-prefill-size 32768 \
--max-prefill-tokens 163840 \
Tested DeepSeek-V2-Chat-0628 on 8*A800
serve
/opt/conda/bin/python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V2-Chat-0628 --tp 8 --trust-remote-code --enable-mla --disable-radix-cache
test
/opt/conda/bin/python -m sglang.bench_serving --backend sglang --num-prompts 3000
This PR ( DeepSeek-V2-Chat-0628 on 8 * A800-80G )
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 3000
Benchmark duration (s): 498.49
Total input tokens: 714456
Total generated tokens: 656556
Total generated tokens (retokenized): 653778
Request throughput (req/s): 6.02
Input token throughput (tok/s): 1433.23
Output token throughput (tok/s): 1317.08
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 204276.17
Median E2E Latency (ms): 205499.99
---------------Time to First Token----------------
Mean TTFT (ms): 165516.98
Median TTFT (ms): 164192.44
P99 TTFT (ms): 353364.53
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 187.10
Median TPOT (ms): 186.48
P99 TPOT (ms): 398.82
---------------Inter-token Latency----------------
Mean ITL (ms): 180.25
Median ITL (ms): 108.96
P99 ITL (ms): 567.61
==================================================
@Xu-Chen
Does your 8*A800 have NVLink?
Yes
H100 SXM TP8 with DeepSeek V2
current PR
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 5000
Benchmark duration (s): 581.84
Total input tokens: 1187865
Total generated tokens: 1089941
Total generated tokens (retokenized): 1086980
Request throughput (req/s): 8.59
Input token throughput (tok/s): 2041.57
Output token throughput (tok/s): 1873.27
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 266797.24
Median E2E Latency (ms): 272582.37
---------------Time to First Token----------------
Mean TTFT (ms): 239227.95
Median TTFT (ms): 248810.27
P99 TTFT (ms): 488867.21
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 132.70
Median TPOT (ms): 129.55
P99 TPOT (ms): 281.64
---------------Inter-token Latency----------------
Mean ITL (ms): 129.46
Median ITL (ms): 78.23
P99 ITL (ms): 453.92
==================================================
Compared to the main branch, it has improved by about 35%.
main branch
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 5000
Benchmark duration (s): 777.04
Total input tokens: 1187865
Total generated tokens: 1089941
Total generated tokens (retokenized): 1087011
Request throughput (req/s): 6.43
Input token throughput (tok/s): 1528.70
Output token throughput (tok/s): 1402.68
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 358316.01
Median E2E Latency (ms): 365857.50
---------------Time to First Token----------------
Mean TTFT (ms): 320752.33
Median TTFT (ms): 323528.82
P99 TTFT (ms): 670386.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 176.45
Median TPOT (ms): 176.47
P99 TPOT (ms): 272.25
---------------Inter-token Latency----------------
Mean ITL (ms): 175.99
Median ITL (ms): 128.65
P99 ITL (ms): 517.99
==================================================
I plan to merge this PR first, and the compatibility support for fp8 will be completed in another PR. @ispobock @merrymercy @Ying1123 @hnyls2002
To further improve performance, both W8A8 (FP8) and FP8 KV Cache are necessary and should be supported for DeepSeek V2.
Furthermore, we should pay attention to the MLA implementation in FlashInfer (https://github.com/flashinfer-ai/flashinfer/issues/237).
@jon-chuang When do you expect to complete the support for MLA in FlashInfer? Could you share the approximate timeline? Thanks.
@ispobock - do you mind telling a bit more about how you spotted this issue or this optimization? Did you see the potential issue when profiling something? Or were you directly inspired by https://github.com/InternLM/lmdeploy/pull/1649?
@microwish Yeah, we did the profiling first and found that the decoding kernel took most of the time. Then we checked the kernel with ncu and got some directions for optimizing the memory access.
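For readers who want to reproduce this kind of analysis, here is a generic sketch of the workflow (not the exact steps used for this PR): run a few decode iterations under torch.profiler, rank kernels by GPU time, then take the dominant kernel to Nsight Compute (ncu) to inspect its memory access. run_decode_step is a hypothetical stand-in for one batched decode iteration of the serving engine.

import torch
from torch.profiler import profile, ProfilerActivity

def run_decode_step():
    # Placeholder workload; in practice this would be one batched decode step.
    a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    return a @ b

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        run_decode_step()
    torch.cuda.synchronize()

# Kernels sorted by total GPU time; the dominant entry is the one worth
# inspecting further with ncu for memory-access behavior.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))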
Following from https://zhuanlan.zhihu.com/p/714761319
@zhyncs
A very interesting article. About a month ago we implemented the A_CC_ME version you mention in SGLang ("How do you view DeepSeek's MoE model DeepSeek-V2?").
We also did the MQA optimization (Optimize MLA/GQA/MQA Triton decoding by ispobock · Pull Request #1138 · sgl-project/sglang). In fact, your conclusion that "since MLA uses a compressed KV with only one head, it is equivalent to MQA, so it cannot be partitioned further, and each rank keeps a full copy of the compressed KV" matches ours. For DeepSeek's internal implementation, a KV cache memory pool might solve this problem: with TP 8, each GPU would no longer need to store a duplicate copy. [cool]
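A back-of-the-envelope sketch of the replication point above, with assumed DeepSeek-V2 sizes (60 layers, compressed KV dim 512 + 64) and an arbitrary token count; the numbers are illustrative only.

# Illustrative arithmetic with assumed DeepSeek-V2 sizes; not measurements.
num_layers = 60            # assumed
compressed_dim = 512 + 64  # assumed: kv_lora_rank + rope dim
bytes_per_elem = 2         # fp16/bf16
tp = 8
cached_tokens = 100_000    # arbitrary example

per_rank_bytes = cached_tokens * num_layers * compressed_dim * bytes_per_elem
print(f"compressed KV per rank        : {per_rank_bytes / 2**30:.1f} GiB")
print(f"replicated across TP{tp} ranks : {per_rank_bytes * tp / 2**30:.1f} GiB")
print(f"with one shared pool per node : {per_rank_bytes / 2**30:.1f} GiB")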
@pika-jy The TP split for the qkv projection is already done over the full dim (Dim = Head * Head_Dim), just like the MLP. Splitting the attention computation further along head_dim doesn't seem necessary, does it? That dimension is already small (128), and doing so would introduce extra communication.
My question is: if qkv is fully partitioned along the dim, then when computing attention, can't each TP rank compute its own qkv and perform the softmax at the end, just like the traditional attention computation? When computing attention, gather the qk results from all TP ranks (only the qk results are needed, there is no need to transfer kv), and then perform the softmax. I don't quite understand why this approach doesn't work.
@pipul Just ignore this and please use English
@zhyncs Why? Have you solved the problem?
@pipul https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models
@zhyncs It seems that when you compute attention, you only use data parallelism (DP) and not tensor parallelism (TP)? But I saw in the original paper that TP was used for the attention computation.
The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP), combined with 8-way Data Parallelism (DP8).