feat: support flashinfer kernel autotune
Motivation
FlashInfer MoE requires autotuning to select the most performant kernel. This should be done before CUDA graph capture.
Modifications
This PR adds a kernel warmup stage that runs before CUDA graph capture. In the warmup stage, the model runs a dummy forward pass under the FlashInfer autotune context.
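A minimal sketch of the flow, assuming FlashInfer exposes an `autotune` context manager (the import path and the simplified `ModelRunner` internals below are assumptions; the method names match the traceback quoted later in this thread):

```python
# Sketch only: shows where the warmup stage sits relative to CUDA graph capture.
# Method names (kernel_warmup, _flashinfer_autotune, _dummy_run) follow the
# traceback later in this thread; the `autotune` import path is an assumption
# about FlashInfer's API, and the ModelRunner internals are heavily simplified.
import logging

import torch

logger = logging.getLogger(__name__)


class ModelRunner:
    def initialize(self, min_per_gpu_memory):
        # ... memory pools, attention backend, etc. (defined elsewhere) ...
        # Autotune must finish before graphs are captured so the captured
        # graphs replay the already-selected (fastest) kernels.
        self.kernel_warmup()
        self.init_cuda_graphs()

    def kernel_warmup(self):
        self._flashinfer_autotune()

    def _flashinfer_autotune(self):
        # Assumed API: FlashInfer provides a context manager that profiles and
        # caches the best configuration for every FlashInfer op called inside.
        from flashinfer import autotune

        logger.info("Running FlashInfer autotune...")
        with torch.inference_mode(), autotune():
            # One dummy forward at the maximum batch size, so the tuned shapes
            # cover everything the server will see at runtime.
            self._dummy_run(batch_size=self.req_to_token_pool.size)
        logger.info("FlashInfer autotune completed.")
```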
Accuracy Tests
lm_eval --model local-completions --tasks gsm8k --model_args model=openai/gpt-oss-120b,base_url=http://127.0.0.1:18000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192 --batch_size 2048 --trust_remote_code --limit 0.5
PR:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8394|± |0.0143|
| | |strict-match | 5|exact_match|↑ |0.6318|± |0.0188|
main:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8394|± |0.0143|
| | |strict-match | 5|exact_match|↑ |0.6318|± |0.0188|
Benchmarking and Profiling
python3 -m sglang.bench_serving --model openai/gpt-oss-120b --host 127.0.0.1 --port 18000 --backend sglang-oai --dataset-name random --random-range-ratio 1 --random-input-len 1024 --random-output-len 1024 --max-concurrency 512 --num-prompts 2560
PR (16% throughput improvement):
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 512
Successful requests: 2560
Benchmark duration (s): 138.33
Total input tokens: 2621440
Total input text tokens: 2621440
Total input vision tokens: 0
Total generated tokens: 2621440
Total generated tokens (retokenized): 2539762
Request throughput (req/s): 18.51
Input token throughput (tok/s): 18950.11
Output token throughput (tok/s): 18950.11
Total token throughput (tok/s): 37900.23
Concurrency: 510.27
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 27573.10
Median E2E Latency (ms): 27586.49
---------------Time to First Token----------------
Mean TTFT (ms): 2658.05
Median TTFT (ms): 2622.65
P99 TTFT (ms): 4843.49
---------------Inter-Token Latency----------------
Mean ITL (ms): 24.43
Median ITL (ms): 21.66
P95 ITL (ms): 24.23
P99 ITL (ms): 179.70
Max ITL (ms): 4049.93
==================================================
main:
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 512
Successful requests: 2560
Benchmark duration (s): 161.57
Total input tokens: 2621440
Total input text tokens: 2621440
Total input vision tokens: 0
Total generated tokens: 2621440
Total generated tokens (retokenized): 2539762
Request throughput (req/s): 15.84
Input token throughput (tok/s): 16225.25
Output token throughput (tok/s): 16225.25
Total token throughput (tok/s): 32450.49
Concurrency: 510.02
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 32188.29
Median E2E Latency (ms): 32219.19
---------------Time to First Token----------------
Mean TTFT (ms): 2632.67
Median TTFT (ms): 2608.20
P99 TTFT (ms): 4823.79
---------------Inter-Token Latency----------------
Mean ITL (ms): 28.98
Median ITL (ms): 26.49
P95 ITL (ms): 28.89
P99 ITL (ms): 154.62
Max ITL (ms): 4009.70
==================================================
Checklist
- [ ] Format your code according to the Format code with pre-commit guide.
- [ ] Add unit tests according to the Run and add unit tests guide.
- [ ] Update documentation according to the Write documentations guide.
- [ ] Provide accuracy and speed benchmark results according to the Test the accuracy and Benchmark the speed guides.
Great work! May I ask how long a single tuning run takes now? Is there a switch to control whether the kernel is tuned?
@FlamingoPg
> May I ask how long a single tuning run takes now?
For B200 + gpt-oss-120b, it takes about 1 minute in my local test:
[2025-10-29 02:42:01] Running FlashInfer autotune...
2025-10-29 02:42:01,952 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-10-29 02:42:57,165 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
[2025-10-29 02:42:57] FlashInfer autotune completed.
> Is there a switch to control whether the kernel is tuned?
The current design from FlashInfer just runs one dummy forward iteration at the max batch size inside the FlashInfer autotune context. If any FlashInfer APIs are called during that iteration (e.g., trtllm_fp4_block_scale_moe), FlashInfer autotunes those APIs to find the best kernel.
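For illustration, the usage pattern looks roughly like the sketch below; `autotune` is an assumed FlashInfer context-manager name, and `run_moe_forward` is a hypothetical placeholder for any forward code that calls FlashInfer kernels.

```python
# Illustrative pattern only. `autotune` is assumed to be FlashInfer's tuning
# context manager; `run_moe_forward` is a placeholder for model code that
# internally calls FlashInfer kernels such as trtllm_fp4_block_scale_moe.
import torch
from flashinfer import autotune


def tune_then_run(run_moe_forward, max_batch_inputs):
    # Tuning pass: every FlashInfer op invoked inside the context is profiled
    # against its candidate kernel configurations and the fastest is cached.
    # Running at max batch size covers the largest shapes the server will see.
    with torch.inference_mode(), autotune():
        run_moe_forward(max_batch_inputs)

    # Subsequent calls (including the CUDA graph capture that follows warmup)
    # reuse the cached best kernels instead of default heuristic choices.
    with torch.inference_mode():
        return run_moe_forward(max_batch_inputs)
```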
@FlamingoPg Could you review this PR? Thanks!
Looks fine, let's wait for the CI
@FlamingoPg There are quite a few CI failures. Are all of them caused by this PR, or are some of them known failures?
I will help you rerun failed jobs
Hi @FlamingoPg, it seems the CI failures are not related to my PR. Could you help confirm? Thanks!
Sure
@FlamingoPg Could you check the remaining failures? Thanks!
Sure
[2025-11-18 13:28:40 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/public_sglang_ci/runner-l1a-gpu-4567/_work/sglang/sglang/python/sglang/srt/managers/scheduler.py", line 2712, in run_scheduler_process
scheduler = Scheduler(
File "/public_sglang_ci/runner-l1a-gpu-4567/_work/sglang/sglang/python/sglang/srt/managers/scheduler.py", line 312, in __init__
self.tp_worker = TpModelWorker(
File "/public_sglang_ci/runner-l1a-gpu-4567/_work/sglang/sglang/python/sglang/srt/managers/tp_worker.py", line 237, in __init__
self._model_runner = ModelRunner(
File "/public_sglang_ci/runner-l1a-gpu-4567/_work/sglang/sglang/python/sglang/srt/model_executor/model_runner.py", line 340, in __init__
self.initialize(min_per_gpu_memory)
File "/public_sglang_ci/runner-l1a-gpu-4567/_work/sglang/sglang/python/sglang/srt/model_executor/model_runner.py", line 507, in initialize
self.kernel_warmup()
File "/public_sglang_ci/runner-l1a-gpu-4567/_work/sglang/sglang/python/sglang/srt/model_executor/model_runner.py", line 2033, in kernel_warmup
self._flashinfer_autotune()
File "/public_sglang_ci/runner-l1a-gpu-4567/_work/sglang/sglang/python/sglang/srt/model_executor/model_runner.py", line 2061, in _flashinfer_autotune
self._dummy_run(batch_size=self.req_to_token_pool.size)
File "/public_sglang_ci/runner-l1a-gpu-4567/_work/sglang/sglang/python/sglang/srt/model_executor/model_runner.py", line 2298, in _dummy_run
self.attn_backend.init_forward_metadata(forward_batch)
File "/public_sglang_ci/runner-l1a-gpu-4567/_work/sglang/sglang/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py", line 838, in init_forward_metadata
attn_backend.init_forward_metadata(forward_batch)
File "/public_sglang_ci/runner-l1a-gpu-4567/_work/sglang/sglang/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py", line 123, in init_forward_metadata
self.forward_metadata = self._forward_metadata(forward_batch)
File "/public_sglang_ci/runner-l1a-gpu-4567/_work/sglang/sglang/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py", line 99, in _forward_metadata
retrieve_parent_token = torch.empty_like(retrieve_next_token)
TypeError: empty_like(): argument 'input' (position 1) must be Tensor, not NoneType
[2025-11-18 13:28:40] Received sigquit from a child process. It usually means the child failed.
https://github.com/sgl-project/sglang/actions/runs/19458190143/job/55706006309?pr=12306
@elvischenv do you think this failure is related to this PR?
This should be related to a PR merged two weeks ago: #11133. I pushed a fix; let's see if it works.
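For context, a minimal sketch of the kind of None guard that could address the `empty_like` error above; the helper name is hypothetical and this is not necessarily the fix that was actually pushed.

```python
# Sketch only: one way to guard against retrieve_next_token being None during
# the warmup dummy forward. Names follow the traceback above; the actual
# pushed fix may differ.
from typing import Optional

import torch


def build_retrieve_parent_token(
    retrieve_next_token: Optional[torch.Tensor],
) -> Optional[torch.Tensor]:
    # Dummy/warmup batches may not carry speculative-decoding metadata, so
    # return None instead of calling torch.empty_like() on a NoneType.
    if retrieve_next_token is None:
        return None
    return torch.empty_like(retrieve_next_token)
```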
The two GPU pipeline failures seem to be caused by OOM and are not related to this PR. @FlamingoPg, could you re-run these two pipelines? Thanks!
Does auto-tuning also work well for low-latency cases? Or could we control this feature using server parameters?
What does "work well" mean? At least it should not cause performance regression, right?
But I agree that we can add a server flag to allow users to disable autotuning if they want.
Could you provide your launch commands? I will try to reproduce the results later.
Waiting for the CI to pass.
Hi @Qiaolin-Yu @FlamingoPg @Fridge003, could you help us to merge this PR? The CI failures are all unrelated. Thanks!
@Kangyan-Zhou will help merge this after all the Nvidia CI jobs pass.
Test failures:
- models/test_qwen3_next_models.py: accuracy is 0.0
- test_gpt_oss_4gpu.py: looks like an infra issue:
  FileNotFoundError: No such file or directory: /hf_home/hub/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a/model-00000-of-00014.safetensors
- test_deepseek_v3_fp4_4gpu.py: accuracy is 0.0
  AssertionError: np.float64(0.0) not greater than 0.94
- quant/test_w4a8_deepseek_v3.py: timeout with
  ImportError: DeepEP is not installed. Please install DeepEP package from https://github.com/deepseek-ai/deepep.
  Looks like an infra issue?
- ep/test_deepep_large.py: timeout after 1200 secs. ~10 mins are spent in:
  Capturing batches (bs=256 avail_mem=17.16 GB): 0%| | 0/1 [00:26<?, ?it/s]
  [2025-11-27 06:51:25 DP0 TP0 EP0] Registering 0 cuda graph addresses
  E[2025-11-27 06:59:56] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
  [2025-11-27 06:59:56] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
  and timeout at:
  Capturing batches (bs=64 avail_mem=15.67 GB): 0%| | 0/1 [00:26<?, ?it/s]
  Why does capturing cuda graphs take that long?
- test_quantization.py: Mixtral-8x7B-Instruct-v0.1-AWQ-INT4 has slightly lower accuracy (0.612) than the target 0.62
We've fixed a few test failures on main that were due to a number of recent issues. Sorry for the delay; retrying now.