Optimize MLA/GQA/MQA Triton decoding
Motivation
Optimize memory access for MLA/GQA/MQA decoding.
Modification
Each block handles BLOCK_H query heads that share the same k/v head, so the shared k/v data is loaded once and reused across those heads. Inspired by https://github.com/InternLM/lmdeploy/pull/1649.
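For illustration, here is a minimal, self-contained Triton sketch of the grouping idea (not the actual PR kernel, and the names such as _grouped_decode_kernel are made up): one program serves all BLOCK_H query heads of a k/v group, so each K/V tile is fetched from global memory once and reused on-chip for every grouped head. A contiguous (non-paged) KV layout, unit stride on the head dimension, and power-of-2 head dims are simplifying assumptions.

import torch
import triton
import triton.language as tl

@triton.jit
def _grouped_decode_kernel(
    Q, K, V, O,                      # Q/O: [B, HQ, D], K/V: [B, S, HKV, D]
    stride_qb, stride_qh,
    stride_kb, stride_ks, stride_kh,
    stride_ob, stride_oh,
    seq_len, kv_group_num, sm_scale,
    BLOCK_H: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_D: tl.constexpr,
):
    pid_b = tl.program_id(0)         # batch index
    pid_kv = tl.program_id(1)        # shared k/v head index

    # The BLOCK_H query heads served by this program.
    h_off = pid_kv * kv_group_num + tl.arange(0, BLOCK_H)
    h_mask = tl.arange(0, BLOCK_H) < kv_group_num
    d_off = tl.arange(0, BLOCK_D)

    # Load all grouped query heads at once: [BLOCK_H, BLOCK_D].
    q = tl.load(Q + pid_b * stride_qb + h_off[:, None] * stride_qh + d_off[None, :],
                mask=h_mask[:, None], other=0.0)

    m_i = tl.full([BLOCK_H], float("-inf"), dtype=tl.float32)   # running max
    l_i = tl.zeros([BLOCK_H], dtype=tl.float32)                 # running sum
    acc = tl.zeros([BLOCK_H, BLOCK_D], dtype=tl.float32)

    for start_n in range(0, seq_len, BLOCK_N):
        n_off = start_n + tl.arange(0, BLOCK_N)
        n_mask = n_off < seq_len
        # K/V tile of the single shared head: loaded once, reused by all BLOCK_H heads.
        # v is assumed to share k's layout/strides.
        k = tl.load(K + pid_b * stride_kb + n_off[:, None] * stride_ks
                    + pid_kv * stride_kh + d_off[None, :],
                    mask=n_mask[:, None], other=0.0)
        v = tl.load(V + pid_b * stride_kb + n_off[:, None] * stride_ks
                    + pid_kv * stride_kh + d_off[None, :],
                    mask=n_mask[:, None], other=0.0)

        qk = tl.dot(q, tl.trans(k)) * sm_scale                  # [BLOCK_H, BLOCK_N]
        qk = tl.where(n_mask[None, :], qk, float("-inf"))

        # Online softmax update shared by all grouped heads.
        m_new = tl.maximum(m_i, tl.max(qk, 1))
        alpha = tl.exp(m_i - m_new)
        p = tl.exp(qk - m_new[:, None])
        l_i = l_i * alpha + tl.sum(p, 1)
        acc = acc * alpha[:, None] + tl.dot(p.to(v.dtype), v)
        m_i = m_new

    out = acc / l_i[:, None]
    tl.store(O + pid_b * stride_ob + h_off[:, None] * stride_oh + d_off[None, :],
             out.to(O.dtype.element_ty), mask=h_mask[:, None])

def grouped_decode(q, k, v, sm_scale):
    # q: [B, HQ, D], k/v: [B, S, HKV, D]; HQ must be a multiple of HKV.
    B, HQ, D = q.shape
    S, HKV = k.shape[1], k.shape[2]
    group = HQ // HKV
    o = torch.empty_like(q)
    # One program per (batch, kv head) instead of per (batch, q head).
    grid = (B, HKV)
    BLOCK_H = max(16, triton.next_power_of_2(group))  # tl.dot needs tiles >= 16
    _grouped_decode_kernel[grid](
        q, k, v, o,
        q.stride(0), q.stride(1),
        k.stride(0), k.stride(1), k.stride(2),
        o.stride(0), o.stride(1),
        S, group, sm_scale,
        BLOCK_H=BLOCK_H, BLOCK_N=64, BLOCK_D=D,
    )
    return o

The kernel in the PR additionally deals with details this sketch leaves out (for example the token-indexed KV cache layout), so treat this purely as an illustration of the grouping.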
Tested on A100-80G: DeepSeek-V2-Lite
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: 128.0
Successful requests: 5000
Benchmark duration (s): 238.01
Total input tokens: 1187865
Total generated tokens: 1089941
Total generated tokens (retokenized): 1088588
Request throughput (req/s): 21.01
Input token throughput (tok/s): 4990.76
Output token throughput (tok/s): 4579.34
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 82822.78
Median E2E Latency (ms): 79653.86
---------------Time to First Token----------------
Mean TTFT (ms): 7167.67
Median TTFT (ms): 4229.26
P99 TTFT (ms): 21327.09
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 1073.28
Median TPOT (ms): 473.77
P99 TPOT (ms): 7907.65
---------------Inter-token Latency----------------
Mean ITL (ms): 409.14
Median ITL (ms): 165.46
P99 ITL (ms): 1814.59
==================================================
subject: abstract_algebra, #q:100, acc: 0.270
subject: anatomy, #q:135, acc: 0.504
subject: astronomy, #q:152, acc: 0.572
subject: business_ethics, #q:100, acc: 0.600
subject: clinical_knowledge, #q:265, acc: 0.642
subject: college_biology, #q:144, acc: 0.653
subject: college_chemistry, #q:100, acc: 0.410
subject: college_computer_science, #q:100, acc: 0.440
subject: college_mathematics, #q:100, acc: 0.380
subject: college_medicine, #q:173, acc: 0.601
Total latency: 33.251
Average accuracy: 0.535
Llama-3-8B
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: 128.0
Successful requests: 1000
Benchmark duration (s): 49.48
Total input tokens: 213987
Total generated tokens: 199779
Total generated tokens (retokenized): 198032
Request throughput (req/s): 20.21
Input token throughput (tok/s): 4324.67
Output token throughput (tok/s): 4037.53
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 20671.67
Median E2E Latency (ms): 19467.73
---------------Time to First Token----------------
Mean TTFT (ms): 3234.54
Median TTFT (ms): 1188.96
P99 TTFT (ms): 14154.28
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 186.30
Median TPOT (ms): 90.10
P99 TPOT (ms): 1976.89
---------------Inter-token Latency----------------
Mean ITL (ms): 91.19
Median ITL (ms): 61.85
P99 ITL (ms): 308.93
==================================================
subject: abstract_algebra, #q:100, acc: 0.330
subject: anatomy, #q:135, acc: 0.696
subject: astronomy, #q:152, acc: 0.684
subject: business_ethics, #q:100, acc: 0.630
subject: clinical_knowledge, #q:265, acc: 0.751
subject: college_biology, #q:144, acc: 0.771
subject: college_chemistry, #q:100, acc: 0.460
subject: college_computer_science, #q:100, acc: 0.520
subject: college_mathematics, #q:100, acc: 0.340
subject: college_medicine, #q:173, acc: 0.636
Total latency: 41.592
Average accuracy: 0.618
Reproduce:
python3 -m sglang.launch_server --model-path DeepSeek-V2-Lite --port 30000 --trust-remote-code --disable-radix-cache --enable-mla --tp=1
python3 -m sglang.bench_serving --backend sglang --tokenizer DeepSeek-V2-Lite --dataset-path /workdir/datasets/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate 128
python3 benchmark/mmlu/bench_sglang.py --nsub 10
python3 -m sglang.launch_server --model-path Meta-Llama-3-8B --port 30000 --trust-remote-code --disable-radix-cache --disable-flashinfer --tp=1
python3 -m sglang.bench_serving --backend sglang --tokenizer Meta-Llama-3-8B --dataset-path /workdir/datasets/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate 128
python3 benchmark/mmlu/bench_sglang.py --nsub 10
Nice work! TL;DR: the reuse moves from L2 to the block level. Is that right? @ispobock
@zhyncs The previous version reuses the k/v data through the L2 cache. This version reuses the shared k/v head from SMEM.
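Rough, illustrative arithmetic of why this matters for memory traffic (all sizes below are assumptions, not measurements): with one program per query head, every program streams the whole K/V of its shared head from HBM and reuse only happens opportunistically in L2; with one program per k/v head, the tile is streamed once and reused from on-chip memory by all grouped heads.

# Illustrative arithmetic only; per layer, per request.
num_q_heads = 16      # e.g. DeepSeek-V2-Lite attention heads with TP=1 (assumed)
num_kv_heads = 1      # MLA's compressed KV behaves like MQA: one shared head
head_dim = 576        # assumed compressed dim: kv_lora_rank 512 + rope dim 64
seq_len = 2048
bytes_per_elem = 2    # fp16/bf16

kv_bytes = seq_len * num_kv_heads * head_dim * bytes_per_elem

# One program per q head: each of the 16 programs streams the same shared head from HBM.
per_q_head_traffic = num_q_heads * kv_bytes
# One program per kv head (this PR): the shared head is streamed once and reused on-chip.
grouped_traffic = num_kv_heads * kv_bytes

print(f"per-q-head programs: {per_q_head_traffic / 2**20:.1f} MiB per decode step")
print(f"grouped programs   : {grouped_traffic / 2**20:.1f} MiB per decode step")
print(f"reduction          : {per_q_head_traffic / grouped_traffic:.0f}x")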
ref https://github.com/sgl-project/sglang/pull/905#issuecomment-2267514917
After a brief look, the throughput has roughly doubled compared to the previous MLA version, great work! cc @merrymercy @Ying1123 @hnyls2002
@Xu-Chen @lxww302 I noticed that you have used SGLang's DeepSeek V2 TP8 MLA implementation before. Could you help verify the performance of the new version, for example on devices you have such as A100 TP8, A800 TP8, or H100 TP8? Thanks very much!
git clone -b decode_gqa_opt https://github.com/ispobock/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V2 --port 30000 --trust-remote-code --disable-radix-cache --enable-mla --tp=8
python3 -m sglang.bench_serving --backend sglang
I have 8x H100s; I ran your command:
Backend: sglang
Traffic request rate: inf
Successful requests: 1000
Benchmark duration (s): 182.31
Total input tokens: 236142
Total generated tokens: 215614
Total generated tokens (retokenized): 215037
Request throughput (req/s): 5.49
Input token throughput (tok/s): 1295.28
Output token throughput (tok/s): 1182.68
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 75887.79
Median E2E Latency (ms): 77685.35
---------------Time to First Token----------------
Mean TTFT (ms): 43446.36
Median TTFT (ms): 39279.88
P99 TTFT (ms): 104146.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 181.96
Median TPOT (ms): 161.47
P99 TPOT (ms): 653.64
---------------Inter-token Latency----------------
Mean ITL (ms): 152.74
Median ITL (ms): 99.26
P99 ITL (ms): 465.58
==================================================
Thanks! Is it H100 SXM or NVL? @81549361
Could you collect the env info with python3 -m sglang.check_env? @81549361
Not sure if this could be helpful or not, but I ran llmperf for both main branch and incoming branch. Overall this PR seems to make things much faster:
- GPU: NVIDIA A40 TP=2 DP=1
- Model: Qwen/Qwen2-72B-Instruct-AWQ
llmperf command used
python token_benchmark_ray.py \
--model "${MODEL}" \
--mean-input-tokens 1500 \
--stddev-input-tokens 150 \
--mean-output-tokens 245 \
--stddev-output-tokens 20 \
--max-num-completed-requests "64" \
--timeout 7200 \
--num-concurrent-requests "8" \
--llm-api openai \
--additional-sampling-params '{}'
main branch
{
"version": "2023-08-31",
"mean_input_tokens": 1500,
"stddev_input_tokens": 150,
"mean_output_tokens": 245,
"stddev_output_tokens": 20,
"num_concurrent_requests": 8,
"results_inter_token_latency_s_quantiles_p25": 0.03990331099470551,
"results_inter_token_latency_s_quantiles_p50": 0.057948063652443406,
"results_inter_token_latency_s_quantiles_p75": 0.08040066503004678,
"results_inter_token_latency_s_quantiles_p90": 0.08383243498141633,
"results_inter_token_latency_s_quantiles_p95": 0.08516111126646178,
"results_inter_token_latency_s_quantiles_p99": 0.10164050496592587,
"results_inter_token_latency_s_mean": 0.06027883582796916,
"results_inter_token_latency_s_min": 0.03675615620323733,
"results_inter_token_latency_s_max": 0.1020314351556132,
"results_inter_token_latency_s_stddev": 0.0211621866217624,
"results_ttft_s_quantiles_p25": 0.4133454477414489,
"results_ttft_s_quantiles_p50": 1.016814228380099,
"results_ttft_s_quantiles_p75": 11.284791270736605,
"results_ttft_s_quantiles_p90": 11.749069100199268,
"results_ttft_s_quantiles_p95": 11.803535583987832,
"results_ttft_s_quantiles_p99": 11.955875016311182,
"results_ttft_s_mean": 5.338054827436281,
"results_ttft_s_min": 0.2691499590873718,
"results_ttft_s_max": 12.148427874781191,
"results_ttft_s_stddev": 5.495650480946165,
"results_end_to_end_latency_s_quantiles_p25": 11.498506030999124,
"results_end_to_end_latency_s_quantiles_p50": 15.51382327103056,
"results_end_to_end_latency_s_quantiles_p75": 22.9230548851192,
"results_end_to_end_latency_s_quantiles_p90": 23.657817971240732,
"results_end_to_end_latency_s_quantiles_p95": 23.97725157707464,
"results_end_to_end_latency_s_quantiles_p99": 24.61372328522615,
"results_end_to_end_latency_s_mean": 16.84320118615142,
"results_end_to_end_latency_s_min": 3.5896931253373623,
"results_end_to_end_latency_s_max": 25.067169249989092,
"results_end_to_end_latency_s_stddev": 6.076063540076458,
"results_request_output_throughput_token_per_s_quantiles_p25": 12.432897921487776,
"results_request_output_throughput_token_per_s_quantiles_p50": 17.950591526918625,
"results_request_output_throughput_token_per_s_quantiles_p75": 25.023589881617227,
"results_request_output_throughput_token_per_s_quantiles_p90": 25.61754857375858,
"results_request_output_throughput_token_per_s_quantiles_p95": 26.080372795146523,
"results_request_output_throughput_token_per_s_quantiles_p99": 27.12744569799552,
"results_request_output_throughput_token_per_s_mean": 18.7890127702506,
"results_request_output_throughput_token_per_s_min": 9.773737854436295,
"results_request_output_throughput_token_per_s_max": 27.204481327432568,
"results_request_output_throughput_token_per_s_stddev": 6.462698432888159,
"results_number_input_tokens_quantiles_p25": 1419.75,
"results_number_input_tokens_quantiles_p50": 1513.5,
"results_number_input_tokens_quantiles_p75": 1585.25,
"results_number_input_tokens_quantiles_p90": 1726.1000000000001,
"results_number_input_tokens_quantiles_p95": 1812.2499999999998,
"results_number_input_tokens_quantiles_p99": 1942.5299999999997,
"results_number_input_tokens_mean": 1515.53125,
"results_number_input_tokens_min": "1125",
"results_number_input_tokens_max": "1986",
"results_number_input_tokens_stddev": 157.1251617922921,
"results_number_output_tokens_quantiles_p25": 271.25,
"results_number_output_tokens_quantiles_p50": 287.0,
"results_number_output_tokens_quantiles_p75": 304.5,
"results_number_output_tokens_quantiles_p90": 318.0,
"results_number_output_tokens_quantiles_p95": 326.4,
"results_number_output_tokens_quantiles_p99": 340.37,
"results_number_output_tokens_mean": 280.546875,
"results_number_output_tokens_min": "78",
"results_number_output_tokens_max": "341",
"results_number_output_tokens_stddev": 43.62427229119711,
"results_num_requests_started": 64,
"results_error_rate": 0.0,
"results_number_errors": 0,
"results_error_code_frequency": "{}",
"results_mean_output_throughput_token_per_s": 122.91809365087381,
"results_num_completed_requests": 64,
"results_num_completed_requests_per_min": 26.288247263678944,
"timestamp": 1723922364
}
incoming branch
{
"version": "2023-08-31",
"mean_input_tokens": 1500,
"stddev_input_tokens": 150,
"mean_output_tokens": 245,
"stddev_output_tokens": 20,
"num_concurrent_requests": 8,
"results_inter_token_latency_s_quantiles_p25": 0.04048058146969138,
"results_inter_token_latency_s_quantiles_p50": 0.04134249718749723,
"results_inter_token_latency_s_quantiles_p75": 0.042773683461634744,
"results_inter_token_latency_s_quantiles_p90": 0.04477736409998821,
"results_inter_token_latency_s_quantiles_p95": 0.04621570852103804,
"results_inter_token_latency_s_quantiles_p99": 0.04943066709057319,
"results_inter_token_latency_s_mean": 0.04202164194913325,
"results_inter_token_latency_s_min": 0.03828613981456747,
"results_inter_token_latency_s_max": 0.05096760665209523,
"results_inter_token_latency_s_stddev": 0.0023344492257422154,
"results_ttft_s_quantiles_p25": 0.3779949996387586,
"results_ttft_s_quantiles_p50": 0.403224729700014,
"results_ttft_s_quantiles_p75": 0.44007199979387224,
"results_ttft_s_quantiles_p90": 0.4766438877210021,
"results_ttft_s_quantiles_p95": 0.4872294148663059,
"results_ttft_s_quantiles_p99": 0.49447528753429654,
"results_ttft_s_mean": 0.4035295032663271,
"results_ttft_s_min": 0.2787872082553804,
"results_ttft_s_max": 0.49528229096904397,
"results_ttft_s_stddev": 0.05853017613187361,
"results_end_to_end_latency_s_quantiles_p25": 10.952284958562814,
"results_end_to_end_latency_s_quantiles_p50": 11.724067542003468,
"results_end_to_end_latency_s_quantiles_p75": 12.392438833485357,
"results_end_to_end_latency_s_quantiles_p90": 12.949160708626732,
"results_end_to_end_latency_s_quantiles_p95": 13.369823349895887,
"results_end_to_end_latency_s_quantiles_p99": 13.602660472076385,
"results_end_to_end_latency_s_mean": 11.063488117179077,
"results_end_to_end_latency_s_min": 2.310943207703531,
"results_end_to_end_latency_s_max": 13.658869832754135,
"results_end_to_end_latency_s_stddev": 2.5735290879206163,
"results_request_output_throughput_token_per_s_quantiles_p25": 23.376963498120137,
"results_request_output_throughput_token_per_s_quantiles_p50": 24.13135072660546,
"results_request_output_throughput_token_per_s_quantiles_p75": 24.70095651189223,
"results_request_output_throughput_token_per_s_quantiles_p90": 25.105406335351436,
"results_request_output_throughput_token_per_s_quantiles_p95": 25.318698051259776,
"results_request_output_throughput_token_per_s_quantiles_p99": 26.00064578019821,
"results_request_output_throughput_token_per_s_mean": 23.819321580789712,
"results_request_output_throughput_token_per_s_min": 19.61920693264775,
"results_request_output_throughput_token_per_s_max": 26.11816971864744,
"results_request_output_throughput_token_per_s_stddev": 1.3040854008387603,
"results_number_input_tokens_quantiles_p25": 1419.75,
"results_number_input_tokens_quantiles_p50": 1513.5,
"results_number_input_tokens_quantiles_p75": 1585.25,
"results_number_input_tokens_quantiles_p90": 1726.1000000000001,
"results_number_input_tokens_quantiles_p95": 1812.2499999999998,
"results_number_input_tokens_quantiles_p99": 1942.5299999999997,
"results_number_input_tokens_mean": 1515.53125,
"results_number_input_tokens_min": "1125",
"results_number_input_tokens_max": "1986",
"results_number_input_tokens_stddev": 157.1251617922921,
"results_number_output_tokens_quantiles_p25": 265.75,
"results_number_output_tokens_quantiles_p50": 285.0,
"results_number_output_tokens_quantiles_p75": 296.25,
"results_number_output_tokens_quantiles_p90": 317.0,
"results_number_output_tokens_quantiles_p95": 322.0,
"results_number_output_tokens_quantiles_p99": 338.84999999999997,
"results_number_output_tokens_mean": 265.484375,
"results_number_output_tokens_min": "47",
"results_number_output_tokens_max": "342",
"results_number_output_tokens_stddev": 66.06466101119273,
"results_num_requests_started": 64,
"results_error_rate": 0.0,
"results_number_errors": 0,
"results_error_code_frequency": "{}",
"results_mean_output_throughput_token_per_s": 162.73324599263228,
"results_num_completed_requests": 64,
"results_num_completed_requests_per_min": 36.77803923322394,
"timestamp": 1723922279
}
python3 -m sglang.check_env
Python: 3.12.3 | packaged by Anaconda, Inc. | (main, May 6 2024, 19:46:43) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 535.183.01
PyTorch: 2.4.0+cu121
flashinfer: 0.1.5+cu121torch2.4
triton: 3.0.0
transformers: 4.44.0
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.3
fastapi: 0.112.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.5
interegular: 0.3.3
packaging: 23.2
PIL: 10.3.0
psutil: 5.9.8
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.0.3
vllm: 0.5.4
multipart: 0.0.9
openai: 1.40.8
anthropic: 0.34.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 0-159 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 0-159 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 0-159 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 0-159 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 0-159 0 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 0-159 0 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 0-159 0 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X 0-159 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
ulimit soft: 1048576
What is your startup command? I don't see any noticeable improvement in llama3 8b FP8.
@81549361 The startup command I used for both branches is the same:
python3 -m sglang.launch_server \
--model-path "${MODEL}" \
--host 127.0.0.1 \
--port 8080 \
--context-length "4096" \
--max-prefill-tokens "16384" \
--mem-fraction-static "0.85" \
--schedule-conservativeness "0.05" \
--tp-size "2" \
--dp-size "1" \
--log-level-http warning
I don't see any noticeable improvement in llama3 8b FP8.
@81549361 Did you add --disable-flashinfer for both branches on llama3?
Awesome! Will test DeepSeek-V2-Chat on 8*A800 next week.
Tested on A800-80G: DeepSeek-V2-Lite
Main branch ( DeepSeek-V2-Lite-Chat on 1 * A800-80G )
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 1000
Benchmark duration (s): 90.75
Total input tokens: 236142
Total generated tokens: 215614
Total generated tokens (retokenized): 214087
Request throughput (req/s): 11.02
Input token throughput (tok/s): 2602.12
Output token throughput (tok/s): 2375.92
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 39248.93
Median E2E Latency (ms): 34872.34
---------------Time to First Token----------------
Mean TTFT (ms): 10523.55
Median TTFT (ms): 10943.01
P99 TTFT (ms): 15801.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 233.94
Median TPOT (ms): 151.23
P99 TPOT (ms): 1772.93
---------------Inter-token Latency----------------
Mean ITL (ms): 140.10
Median ITL (ms): 117.96
P99 ITL (ms): 385.24
==================================================
This PR ( DeepSeek-V2-Lite-Chat on 1 * A800-80G )
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 1000
Benchmark duration (s): 59.89
Total input tokens: 236142
Total generated tokens: 215614
Total generated tokens (retokenized): 214102
Request throughput (req/s): 16.70
Input token throughput (tok/s): 3942.60
Output token throughput (tok/s): 3599.87
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 25766.33
Median E2E Latency (ms): 23320.27
---------------Time to First Token----------------
Mean TTFT (ms): 9147.00
Median TTFT (ms): 9517.37
P99 TTFT (ms): 14099.85
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 161.66
Median TPOT (ms): 72.40
P99 TPOT (ms): 1690.97
---------------Inter-token Latency----------------
Mean ITL (ms): 82.39
Median ITL (ms): 56.74
P99 ITL (ms): 247.01
==================================================
Tested DeepSeek-V2-Chat-0628 on 8*A800
serve
python3 -m sglang.launch_server \
--model-path /data/model-cache/deepseek-ai/DeepSeek-V2-Chat-0628 \
--served-model-name deepseek-chat \
--tp 8 \
--enable-mla \
--disable-radix-cache \
--mem-fraction-static 0.87 \
--schedule-conservativeness 0.1 \
--chunked-prefill-size 32768 \
--max-prefill-tokens 163840 \
--trust-remote-code \
--host 0.0.0.0 \
--port 50521
test
python3 -m sglang.bench_serving \
--backend sglang \
--dataset-name sharegpt \
--dataset-path /data/model-cache/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json \
--model /data/model-cache/deepseek-ai/DeepSeek-V2-Chat-0628 \
--port 50521
result
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 1000
Benchmark duration (s): 604.96
Total input tokens: 236142
Total generated tokens: 215614
Total generated tokens (retokenized): 214714
Request throughput (req/s): 1.65
Input token throughput (tok/s): 390.34
Output token throughput (tok/s): 356.41
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 374607.65
Median E2E Latency (ms): 392302.17
---------------Time to First Token----------------
Mean TTFT (ms): 184913.93
Median TTFT (ms): 150008.79
P99 TTFT (ms): 424698.14
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 1651.19
Median TPOT (ms): 1100.21
P99 TPOT (ms): 10328.23
---------------Inter-token Latency----------------
Mean ITL (ms): 890.30
Median ITL (ms): 582.39
P99 ITL (ms): 3893.44
==================================================
Should I use the base model? Or are my params incorrect?
@halexan You don't need to set these:
--mem-fraction-static 0.87 \
--schedule-conservativeness 0.1 \
--chunked-prefill-size 32768 \
--max-prefill-tokens 163840 \
Tested DeepSeek-V2-Chat-0628 on 8*A800
serve
/opt/conda/bin/python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V2-Chat-0628 --tp 8 --trust-remote-code --enable-mla --disable-radix-cache
test
/opt/conda/bin/python -m sglang.bench_serving --backend sglang --num-prompts 3000
This PR ( DeepSeek-V2-Chat-0628 on 8 * A800-80G )
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 3000
Benchmark duration (s): 498.49
Total input tokens: 714456
Total generated tokens: 656556
Total generated tokens (retokenized): 653778
Request throughput (req/s): 6.02
Input token throughput (tok/s): 1433.23
Output token throughput (tok/s): 1317.08
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 204276.17
Median E2E Latency (ms): 205499.99
---------------Time to First Token----------------
Mean TTFT (ms): 165516.98
Median TTFT (ms): 164192.44
P99 TTFT (ms): 353364.53
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 187.10
Median TPOT (ms): 186.48
P99 TPOT (ms): 398.82
---------------Inter-token Latency----------------
Mean ITL (ms): 180.25
Median ITL (ms): 108.96
P99 ITL (ms): 567.61
==================================================
@Xu-Chen
Does your 8*A800 have NVLink?
Yes
H100 SXM TP8 with DeepSeek V2
current PR
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 5000
Benchmark duration (s): 581.84
Total input tokens: 1187865
Total generated tokens: 1089941
Total generated tokens (retokenized): 1086980
Request throughput (req/s): 8.59
Input token throughput (tok/s): 2041.57
Output token throughput (tok/s): 1873.27
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 266797.24
Median E2E Latency (ms): 272582.37
---------------Time to First Token----------------
Mean TTFT (ms): 239227.95
Median TTFT (ms): 248810.27
P99 TTFT (ms): 488867.21
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 132.70
Median TPOT (ms): 129.55
P99 TPOT (ms): 281.64
---------------Inter-token Latency----------------
Mean ITL (ms): 129.46
Median ITL (ms): 78.23
P99 ITL (ms): 453.92
==================================================
Compared to the main branch, it has improved by about 35%.
main branch
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 5000
Benchmark duration (s): 777.04
Total input tokens: 1187865
Total generated tokens: 1089941
Total generated tokens (retokenized): 1087011
Request throughput (req/s): 6.43
Input token throughput (tok/s): 1528.70
Output token throughput (tok/s): 1402.68
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 358316.01
Median E2E Latency (ms): 365857.50
---------------Time to First Token----------------
Mean TTFT (ms): 320752.33
Median TTFT (ms): 323528.82
P99 TTFT (ms): 670386.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 176.45
Median TPOT (ms): 176.47
P99 TPOT (ms): 272.25
---------------Inter-token Latency----------------
Mean ITL (ms): 175.99
Median ITL (ms): 128.65
P99 ITL (ms): 517.99
==================================================
I plan to merge this PR first, and the compatibility support for fp8 will be completed in another PR. @ispobock @merrymercy @Ying1123 @hnyls2002
To further improve performance, both W8A8 (FP8) and FP8 KV Cache are necessary and should be supported for DeepSeek V2.
Furthermore, we should pay attention to the MLA implementation in FlashInfer (https://github.com/flashinfer-ai/flashinfer/issues/237).
@jon-chuang When do you expect to complete the support for MLA in FlashInfer? Could you share the approximate timeline? Thanks.
@ispobock - do you mind telling a bit more about how you spotted this issue or this optimization? Did you see the potential issue when profiling something? Or were you directly inspired by https://github.com/InternLM/lmdeploy/pull/1649?
@microwish Yeah, we did the profiling first and found that the decoding kernel took most of the time. Then we checked the kernel with ncu and got some directions for optimizing the memory access.
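For readers who want to reproduce this kind of analysis, here is a generic sketch of the workflow (not the exact steps used for this PR): run a few decode iterations under torch.profiler, rank kernels by GPU time, then take the dominant kernel to Nsight Compute (ncu) to inspect its memory access. run_decode_step is a hypothetical stand-in for one batched decode iteration of the serving engine.

import torch
from torch.profiler import profile, ProfilerActivity

def run_decode_step():
    # Placeholder workload; in practice this would be one batched decode step.
    a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    return a @ b

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        run_decode_step()
    torch.cuda.synchronize()

# Kernels sorted by total GPU time; the dominant entry is the one worth
# inspecting further with ncu for memory-access behavior.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))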
Following from https://zhuanlan.zhihu.com/p/714761319
@zhyncs
A very interesting article. About a month ago we implemented the A_CC_ME version you mention in SGLang ("How do you view DeepSeek's MoE model DeepSeek-V2?").
We also did the MQA optimization (Optimize MLA/GQA/MQA Triton decoding by ispobock · Pull Request #1138 · sgl-project/sglang). In fact, your conclusion that "since MLA uses a compressed KV with only one head, it is equivalent to MQA, so it cannot be partitioned further, and each rank keeps a full copy of the compressed KV" matches ours. For DeepSeek's internal implementation, a KV cache memory pool might solve this problem: with TP 8, each GPU would no longer need to store a duplicate copy. [cool]
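A back-of-the-envelope sketch of the replication point above, with assumed DeepSeek-V2 sizes (60 layers, compressed KV dim 512 + 64) and an arbitrary token count; the numbers are illustrative only.

# Illustrative arithmetic with assumed DeepSeek-V2 sizes; not measurements.
num_layers = 60            # assumed
compressed_dim = 512 + 64  # assumed: kv_lora_rank + rope dim
bytes_per_elem = 2         # fp16/bf16
tp = 8
cached_tokens = 100_000    # arbitrary example

per_rank_bytes = cached_tokens * num_layers * compressed_dim * bytes_per_elem
print(f"compressed KV per rank        : {per_rank_bytes / 2**30:.1f} GiB")
print(f"replicated across TP{tp} ranks : {per_rank_bytes * tp / 2**30:.1f} GiB")
print(f"with one shared pool per node : {per_rank_bytes / 2**30:.1f} GiB")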
@pika-jy The TP split for the qkv projection is already done over the full dim (Dim = Head * Head_Dim), just like the MLP. Splitting the attention computation further along head_dim doesn't seem necessary, does it? That dimension is already small (128), and doing so would introduce extra communication.
My question is: if qkv is fully partitioned along the dim, then when computing attention, can't each TP rank compute its own qkv and perform the softmax at the end, just like the traditional attention computation? When computing attention, gather the qk results from all TP ranks (only the qk results are needed, there is no need to transfer kv), and then perform the softmax. I don't quite understand why this approach doesn't work.
@pipul Just ignore this and please use English
@zhyncs Why? Have you solved the problem?
@pipul https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models
@zhyncs It seems that when you compute attention, you only use data parallelism (DP) and not tensor parallelism (TP)? But I saw in the original paper that TP was used for the attention computation.
The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP), combined with 8-way Data Parallelism (DP8).