FastDeploy [Feature] Support nvfp4 moe

Motivation

This pr supports modelopt format NVFP4 inference (currently, only Qwen/Ernie) by introducing Flashinfer as a backend.

It requires GPU with sm>=100 and Flashinfer installtion.

Modifications

With paddle compatible api, this pr introdudaces Flashinfer as a backend. There may be coexistence issues with some third-party including pytorch code (e.g. xgrammar, triton). , Currenly we cannot use them at same time and we are working on resolving this.

Usage or Command

new environment variables

FD_FLASHINFER_MOE_BACKEND : FP4 MoE backend, could be flashinfer-cutlass, flashinfer-trtllm or None (default is None, will use flashinfer-cutass). Currently we only support flashinfer-cutlass.
FD_NVFP4_GEMM_BACKEND: FP4 dense GEMM backend, could be flashinfer-cutlass, flashinfer-trtllm, flashinfer-cudnn or None (default is None, will use flashinfer-cutlass). Currently we only support flashinfer-cutlass.
PADDLE_COMPATIBLE_API: This is an environment variable for Flashinfer with Paddle, set it totrue to use paddle compatible api, default is false.

start the server

export PADDLE_COMPATIBLE_API=true
python -m fastdeploy.entrypoints.openai.api_server \
    --model nv-community/Qwen3-30B-A3B-FP4 \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --cache-queue-port 8183 \
    --tensor-parallel-size 1 \
    --max-model-len  32768 \
    --max-num-seqs 128

Accuracy Tests

Checklist

[x] Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
[x] Format your code, run pre-commit before commit.
[x] Add unit tests. Please write the reason in this PR if no unit tests.
[ ] Provide accuracy results.
[x] If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Oct 31 '25 08:10 zoooo0820

Thanks for your contribution!

Oct 31 '25 08:10 paddle-bot[bot]

All committers have signed the CLA.

Nov 06 '25 10:11 CLAassistant

Codecov Report

:x: Patch coverage is 16.33987% with 256 lines in your changes missing coverage. Please review. :warning: Please upload report for BASE (develop@a4bb3e9). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
...deploy/model_executor/layers/quantization/nvfp4.py	12.89%	222 Missing and 1 partial :warning:
fastdeploy/model_executor/layers/moe/moe.py	20.00%	19 Missing and 1 partial :warning:
fastdeploy/flashinfer.py	62.50%	5 Missing and 1 partial :warning:
...loy/model_executor/layers/quantization/__init__.py	20.00%	4 Missing :warning:
fastdeploy/model_executor/utils.py	25.00%	3 Missing :warning:

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #4733   +/-   ##
==========================================
  Coverage           ?   59.01%           
==========================================
  Files              ?      326           
  Lines              ?    40361           
  Branches           ?     6091           
==========================================
  Hits               ?    23819           
  Misses             ?    14652           
  Partials           ?     1890

Flag	Coverage Δ
GPU	`59.01% <16.33%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:

:snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Dec 03 '25 12:12 codecov-commenter