vllm
[Performance][Fix] update nvfp4 code to support renorm routing
Purpose
Fixes https://github.com/vllm-project/vllm/pull/28007
- Add support for multiple routing methods to the FlashInfer FP4 TRT-LLM MoE path so that models such as Qwen3 are supported.
- Add the FlashInfer TRT-LLM MoE backend to the `global_sf` list, where it had been left out.
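
For readers unfamiliar with the term, here is a minimal pure-Python sketch of what "renorm" routing is commonly taken to mean: pick the top-k experts per token, then normalize the routing weights over only the selected experts. This is an illustration under that assumption, not the kernel code touched by this PR. The function names (`renormalize_routing`, `renormalize_naive_routing`) and the sample logits are hypothetical. The sketch also shows that applying softmax after top-k gives the same weights as applying softmax first, selecting top-k, and dividing by the selected sum.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize_routing(logits, top_k):
    """Hypothetical helper: top-k on raw logits, then softmax over the selected logits only."""
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    weights = softmax([logits[i] for i in top_ids])
    return top_ids, weights


def renormalize_naive_routing(logits, top_k):
    """Hypothetical helper: softmax over all experts, top-k, then divide by the selected sum."""
    probs = softmax(logits)
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in top_ids)
    return top_ids, [probs[i] / total for i in top_ids]


# Made-up router logits for one token over four experts.
logits = [0.1, 2.0, -1.0, 1.5]
ids, weights = renormalize_routing(logits, top_k=2)
naive_ids, naive_weights = renormalize_naive_routing(logits, top_k=2)
print(ids, weights)
```

Both orderings select the same experts and produce weights that sum to 1, which is why a fused kernel can use whichever ordering is cheaper.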
Test Plan
VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_FLASHINFER_MOE_BACKEND=latency vllm serve nvidia/Qwen3-235B-A22B-FP4 --max-num-batched-tokens 8192 --max-model-len 16384 --no-enable-prefix-caching --cuda_graph_sizes 1024 --async-scheduling -tp 2 --enable-expert-parallel
lm_eval --model local-completions --tasks gsm8k --model_args model=nvidia/Qwen3-235B-A22B-FP4,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192 --batch_size 2048 --trust_remote_code --limit 0.5
Test Result
[2025-11-12 20:50:33] INFO evaluation_tracker.py:280: Output path not provided, skipping saving results aggregated
local-completions (model=nvidia/Qwen3-235B-A22B-FP4,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192,trust_remote_code=True), gen_kwargs: (None), limit: 0.5, num_fewshot: None, batch_size: 2048
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9348|± |0.0096|
| | |strict-match | 5|exact_match|↑ |0.9348|± |0.0096|
Essential Elements of an Effective PR Description Checklist
- [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- [ ] The test plan, such as providing test command.
- [ ] The test results, such as pasting the results comparison before and after, or e2e results
- [ ] (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.
- [ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.