# Add `turbomind_rms_norm` to accelerate QK norm in Qwen3 models
## Motivation
For RMSNorm with `head_dim <= 128`, the QK-norm implementation from lmdeploy (TurboMind) performs better than the flashinfer RMSNorm implementation currently used. In a benchmark on H20 (`head_dim = 128`, `head_num = 48`, `token_num = 4096`), latency is reduced from 269 µs to 69 µs.
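For reference, RMSNorm scales each vector by the reciprocal of the root mean square of its elements; QK norm applies it per attention head over `head_dim`. A minimal PyTorch sketch (shapes match the benchmark configuration above; the function name is illustrative, not the kernel's API):

```python
import torch

def rms_norm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Normalize over the last dimension; accumulate in fp32 for stability.
    xf = x.float()
    rms = torch.rsqrt(xf.pow(2).mean(dim=-1, keepdim=True) + eps)
    return (xf * rms).to(x.dtype) * weight

# QK norm in Qwen3 normalizes q and k per attention head (over head_dim)
# before RoPE; shapes below mirror the H20 benchmark above.
q = torch.randn(4096, 48, 128, dtype=torch.bfloat16, device="cuda")
w = torch.ones(128, dtype=torch.bfloat16, device="cuda")
q_normed = rms_norm_ref(q, w)
```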
## Modifications
- Added a minimal implementation of `turbomind_rms_norm`.
- Updated RMSNorm's `forward_cuda` and `forward_xpu` to use `turbomind_rms_norm` when `hidden_size <= 128` (see the dispatch sketch below).
- Added unit tests and benchmarks for `turbomind_rms_norm`.
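A rough sketch of the dispatch, assuming both kernels are exposed from `sgl_kernel` (the import path, class layout, and the flashinfer-backed `rmsnorm` fallback are assumptions; only the `hidden_size <= 128` condition and the `turbomind_rms_norm` name come from this PR):

```python
import torch

from sgl_kernel import rmsnorm, turbomind_rms_norm  # assumed exports

class RMSNorm(torch.nn.Module):
    # Hypothetical reduction of sglang's RMSNorm layer to the relevant path.
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.hidden_size = hidden_size
        self.variance_epsilon = eps
        self.weight = torch.nn.Parameter(torch.ones(hidden_size))

    def forward_cuda(self, x: torch.Tensor) -> torch.Tensor:
        if self.hidden_size <= 128:
            # Small hidden sizes (e.g. QK norm with head_dim <= 128):
            # the TurboMind-derived kernel is faster here.
            return turbomind_rms_norm(x, self.weight.data, self.variance_epsilon)
        # Larger hidden sizes keep the existing flashinfer path.
        return rmsnorm(x, self.weight.data, self.variance_epsilon)
```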
## Accuracy Tests
Unit tests in `sgl-kernel/tests/test_norm.py` pass.
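As a hedged sketch of what such a test checks (the exact signatures of both kernels here are assumptions): the TurboMind path should agree with the existing flashinfer kernel within bf16 tolerance on the small-hidden-size shapes it now serves.

```python
import torch
from sgl_kernel import rmsnorm, turbomind_rms_norm  # assumed exports

torch.manual_seed(0)
x = torch.randn(4096 * 48, 128, dtype=torch.bfloat16, device="cuda")
w = torch.randn(128, dtype=torch.bfloat16, device="cuda")

# Both kernels should produce matching results for head_dim = 128 inputs.
torch.testing.assert_close(
    turbomind_rms_norm(x, w, 1e-6),
    rmsnorm(x, w, 1e-6),
    rtol=2e-2,
    atol=2e-2,
)
```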
On H20 with model Qwen/Qwen3-8B-FP8:

```bash
python3 -m sglang.test.few_shot_gsm8k --num-questions 1000
```
Before:

```
Accuracy: 0.907
Invalid: 0.000
Latency: 40.152 s
Output throughput: 2997.586 token/s
```

After:

```
Accuracy: 0.909
Invalid: 0.000
Latency: 39.234 s
Output throughput: 3067.587 token/s
```
## Benchmarking and Profiling
Kernel benchmark (on H20):

```bash
python /sgl-workspace/sglang/sgl-kernel/benchmark/bench_turbomind_rmsnorm.py
```

Results (latency in µs):

```
rmsnorm-performance(head_dim=128):
head_num token_num SGLang Turbomind
0 16.0 1.0 2.455256 2.400793
1 16.0 2.0 2.902393 2.554233
2 16.0 4.0 2.936181 2.553624
3 16.0 8.0 2.900588 2.606192
4 16.0 16.0 3.081954 2.623438
5 16.0 32.0 3.331822 2.670131
6 16.0 64.0 3.920498 2.767891
7 16.0 128.0 5.263943 2.877911
8 16.0 256.0 7.914170 3.171015
9 16.0 512.0 13.160407 4.080311
10 16.0 1024.0 23.700711 5.538099
11 16.0 2048.0 44.501812 8.726369
12 16.0 4096.0 90.898067 15.671547
13 32.0 1.0 2.907672 2.552743
14 32.0 2.0 2.936283 2.553682
15 32.0 4.0 2.900070 2.608614
16 32.0 8.0 3.081353 2.623271
17 32.0 16.0 3.332297 2.670330
18 32.0 32.0 3.920480 2.767877
19 32.0 64.0 5.264174 2.861957
20 32.0 128.0 7.914641 3.171004
21 32.0 256.0 13.178706 4.079834
22 32.0 512.0 23.686496 5.537208
23 32.0 1024.0 44.558687 8.731087
24 32.0 2048.0 90.815924 15.645469
25 32.0 4096.0 180.847845 44.130245
26 48.0 1.0 2.920537 2.548013
27 48.0 2.0 2.955938 2.590711
28 48.0 4.0 3.074632 2.611818
29 48.0 8.0 3.220033 2.651170
30 48.0 16.0 3.599001 2.723399
31 48.0 32.0 4.594748 2.812742
32 48.0 64.0 6.809815 3.288367
33 48.0 128.0 10.555641 3.487169
34 48.0 256.0 18.607111 4.872875
35 48.0 512.0 34.088300 7.114875
36 48.0 1024.0 66.516190 11.957884
37 48.0 2048.0 136.611495 28.148620
38 48.0 4096.0 269.632875 69.378489
```
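For context, a comparison like the one in `bench_turbomind_rmsnorm.py` can be timed with triton's `do_bench`; this is a hedged sketch under the same assumed kernel exports and signatures as above, not the actual benchmark script.

```python
import torch
import triton

from sgl_kernel import rmsnorm, turbomind_rms_norm  # assumed exports

def bench(head_num: int, token_num: int, head_dim: int = 128) -> None:
    x = torch.randn(token_num, head_num * head_dim, dtype=torch.bfloat16, device="cuda")
    w = torch.randn(head_dim, dtype=torch.bfloat16, device="cuda")
    xs = x.view(-1, head_dim)  # (token_num * head_num, head_dim)
    # do_bench returns milliseconds; convert to µs to match the table above.
    t_sgl = triton.testing.do_bench(lambda: rmsnorm(xs, w, 1e-6)) * 1e3
    t_tm = triton.testing.do_bench(lambda: turbomind_rms_norm(xs, w, 1e-6)) * 1e3
    print(f"head_num={head_num} token_num={token_num}: "
          f"SGLang {t_sgl:.2f} µs, Turbomind {t_tm:.2f} µs")

bench(head_num=48, token_num=4096)
```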
E2E benchmark (on H20, model: Qwen/Qwen3-0.6B-FP8):

```bash
python3 -m sglang.bench_serving --backend sglang \
    --model $MODEL_PATH \
    --dataset-name random \
    --random-input-len 4096 \
    --random-output-len 32 \
    --random-range-ratio 1 \
    --request-rate 64 \
    --max-concurrency 64 \
    --num-prompts 256 \
    --host $SERVER_IP --port $SERVER_PORT
```
Results (baseline, flashinfer rmsnorm):

```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: 64.0
Max request concurrency: 64
Successful requests: 256
Benchmark duration (s): 13.76
Total input tokens: 1048576
Total generated tokens: 8192
Total generated tokens (retokenized): 8190
Request throughput (req/s): 18.60
Input token throughput (tok/s): 76194.82
Output token throughput (tok/s): 595.27
Total token throughput (tok/s): 76790.10
Concurrency: 61.76
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 3319.85
Median E2E Latency (ms): 3432.59
---------------Time to First Token----------------
Mean TTFT (ms): 1375.97
Median TTFT (ms): 1393.31
P99 TTFT (ms): 2432.91
--------------Time Per Output Token---------------
Mean TPOT (ms): 62.73
Median TPOT (ms): 61.61
P99 TPOT (ms): 103.67
---------------Inter-Token Latency----------------
Mean ITL (ms): 62.71
Median ITL (ms): 31.86
P95 ITL (ms): 58.45
P99 ITL (ms): 1374.35
Max ITL (ms): 2338.90
==================================================
```
Results (`turbomind_rms_norm`):

```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: 64.0
Max request concurrency: 64
Successful requests: 256
Benchmark duration (s): 13.15
Total input tokens: 1048576
Total generated tokens: 8192
Total generated tokens (retokenized): 8190
Request throughput (req/s): 19.47
Input token throughput (tok/s): 79747.11
Output token throughput (tok/s): 623.02
Total token throughput (tok/s): 80370.13
Concurrency: 61.68
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 3167.85
Median E2E Latency (ms): 3282.13
---------------Time to First Token----------------
Mean TTFT (ms): 1276.11
Median TTFT (ms): 1334.07
P99 TTFT (ms): 2270.03
--------------Time Per Output Token---------------
Mean TPOT (ms): 61.04
Median TPOT (ms): 57.76
P99 TPOT (ms): 101.03
---------------Inter-Token Latency----------------
Mean ITL (ms): 61.02
Median ITL (ms): 32.03
P95 ITL (ms): 56.19
P99 ITL (ms): 1324.25
Max ITL (ms): 2259.54
==================================================
```
## Checklist
- [x] Format your code according to the *Format code with pre-commit* guide.
- [x] Add unit tests according to the *Run and add unit tests* guide.
- [ ] Update documentation according to the *Write documentation* guide.
- [x] Provide accuracy and speed benchmark results according to the *Test the accuracy* and *Benchmark the speed* guides.
Looks good. Have you tested it on H100/B200? If not, I can help you add some H100/B200 tests.
Thanks for reviewing! I don’t have access to H100/B200, so I couldn't test on those. If you could help run the tests, that would be great. Happy to update anything if needed.
H100 results (latency in µs), looks pretty good:

```
rmsnorm-performance(head_dim=128):
head_num token_num SGLang Turbomind
0 16.0 1.0 2.690041 2.643974
1 16.0 2.0 3.132268 2.748590
2 16.0 4.0 3.181508 2.758456
3 16.0 8.0 3.231008 2.785226
4 16.0 16.0 3.379054 2.902990
5 16.0 32.0 3.557637 3.018197
6 16.0 64.0 4.157426 3.067163
7 16.0 128.0 5.527277 3.304978
8 16.0 256.0 8.192670 3.480653
9 16.0 512.0 13.416918 4.106387
10 16.0 1024.0 23.871537 5.459238
11 16.0 2048.0 45.091851 8.214365
12 16.0 4096.0 92.935882 16.025131
13 32.0 1.0 3.148964 2.759750
14 32.0 2.0 3.171174 2.736667
15 32.0 4.0 3.230939 2.813160
16 32.0 8.0 3.362471 2.902649
17 32.0 16.0 3.553130 3.020820
18 32.0 32.0 4.143711 3.062767
19 32.0 64.0 5.547237 3.285565
20 32.0 128.0 8.196443 3.475744
21 32.0 256.0 13.435698 4.111618
22 32.0 512.0 23.900416 5.468766
23 32.0 1024.0 45.054145 8.206142
24 32.0 2048.0 92.890488 16.015399
25 32.0 4096.0 181.953198 46.956430
26 48.0 1.0 3.100425 2.732236
27 48.0 2.0 3.175247 2.762727
28 48.0 4.0 3.273962 2.830548
29 48.0 8.0 3.426947 2.940473
30 48.0 16.0 3.788936 2.997664
31 48.0 32.0 4.828032 3.143320
32 48.0 64.0 6.853608 3.355092
33 48.0 128.0 10.787238 3.841854
34 48.0 256.0 18.648622 4.792144
35 48.0 512.0 34.520374 6.965173
36 48.0 1024.0 69.091418 11.216033
37 48.0 2048.0 137.166762 29.693131
38 48.0 4096.0 271.742450 73.303484
```