
[Performance][Fix] update nvfp4 code to support renorm routing

Open jiahanc opened this issue 1 month ago • 0 comments

Purpose

Fixes https://github.com/vllm-project/vllm/pull/28007

  • Add support for multiple routing methods to the flashinfer fp4 trtllm moe path, so that models such as Qwen3 are supported
  • Add flashinfer trtllm moe to the global_sf list, from which it was previously missing
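
To illustrate why the routing method matters, here is a minimal pure-Python sketch of three common MoE routing variants. It assumes that "renorm routing" in the title means renormalizing the selected top-k expert weights so they sum to 1. All function names are hypothetical and chosen for illustration; this is not the vLLM or FlashInfer implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def topk_indices(xs, k):
    """Indices of the k largest values, largest first (ties: lower index wins)."""
    return sorted(range(len(xs)), key=lambda i: (-xs[i], i))[:k]


def route_softmax_then_topk(logits, k):
    """Softmax over all experts, keep the top-k probabilities as-is.

    The returned weights generally sum to less than 1.
    """
    probs = softmax(logits)
    idx = topk_indices(probs, k)
    return idx, [probs[i] for i in idx]


def route_renormalize(logits, k):
    """Top-k over the raw logits, then softmax over the selected experts only.

    The returned weights sum to 1.
    """
    idx = topk_indices(logits, k)
    return idx, softmax([logits[i] for i in idx])


def route_renormalize_naive(logits, k):
    """Softmax over all experts, top-k, then divide by the sum of the kept weights.

    Mathematically equivalent to route_renormalize, with one extra pass.
    """
    idx, weights = route_softmax_then_topk(logits, k)
    total = sum(weights)
    return idx, [w / total for w in weights]


if __name__ == "__main__":
    logits = [2.0, 0.5, 1.0, -1.0]  # hypothetical router logits for 4 experts
    for fn in (route_softmax_then_topk, route_renormalize, route_renormalize_naive):
        idx, weights = fn(logits, k=2)
        print(fn.__name__, idx, [round(w, 4) for w in weights])
```

All three variants pick the same experts and differ only in how the weights are scaled. A kernel that hardcodes one variant therefore produces differently scaled expert outputs for a model trained with another, which is why the routing method needs to be selectable per model.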

Test Plan

VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_FLASHINFER_MOE_BACKEND=latency vllm serve nvidia/Qwen3-235B-A22B-FP4 \
  --max-num-batched-tokens 8192 \
  --max-model-len 16384 \
  --no-enable-prefix-caching \
  --cuda_graph_sizes 1024 \
  --async-scheduling \
  -tp 2 \
  --enable-expert-parallel

lm_eval --model local-completions --tasks gsm8k \
  --model_args model=nvidia/Qwen3-235B-A22B-FP4,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192 \
  --batch_size 2048 --trust_remote_code --limit 0.5

Test Result

[2025-11-12 20:50:33] INFO evaluation_tracker.py:280: Output path not provided, skipping saving results aggregated
local-completions (model=nvidia/Qwen3-235B-A22B-FP4,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192,trust_remote_code=True), gen_kwargs: (None), limit: 0.5, num_fewshot: None, batch_size: 2048
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9348|±  |0.0096|
|     |       |strict-match    |     5|exact_match|↑  |0.9348|±  |0.0096|

Essential Elements of an Effective PR Description Checklist
  • [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [ ] The test plan, such as providing test command.
  • [ ] The test results, such as pasting the results comparison before and after, or e2e results
  • [ ] (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • [ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

jiahanc • Nov 12 '25 17:11