[Feature] Support nvfp4 moe
Motivation
This pr supports modelopt format NVFP4 inference (currently, only Qwen/Ernie) by introducing Flashinfer as a backend.
It requires GPU with sm>=100 and Flashinfer installtion.
Modifications
With paddle compatible api, this pr introdudaces Flashinfer as a backend. There may be coexistence issues with some third-party including pytorch code (e.g. xgrammar, triton). , Currenly we cannot use them at same time and we are working on resolving this.
Usage or Command
new environment variables
-
FD_FLASHINFER_MOE_BACKEND: FP4 MoE backend, could beflashinfer-cutlass,flashinfer-trtllmorNone(default is None, will use flashinfer-cutass). Currently we only supportflashinfer-cutlass. -
FD_NVFP4_GEMM_BACKEND: FP4 dense GEMM backend, could be flashinfer-cutlass, flashinfer-trtllm, flashinfer-cudnn or None (default is None, will use flashinfer-cutlass). Currently we only supportflashinfer-cutlass. -
PADDLE_COMPATIBLE_API: This is an environment variable for Flashinfer with Paddle, set it totrueto use paddle compatible api, default is false.
start the server
export PADDLE_COMPATIBLE_API=true
python -m fastdeploy.entrypoints.openai.api_server \
--model nv-community/Qwen3-30B-A3B-FP4 \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--cache-queue-port 8183 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--max-num-seqs 128
Accuracy Tests
Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]] - You can add new tags based on the PR content, but the semantics must be clear.
- Tag list: [
- [x] Format your code, run
pre-commitbefore commit. - [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the
releasebranch, make sure the PR has been submitted to thedevelopbranch, then cherry-pick it to thereleasebranch with the[Cherry-Pick]PR tag.
Thanks for your contribution!
Codecov Report
:x: Patch coverage is 16.33987% with 256 lines in your changes missing coverage. Please review.
:warning: Please upload report for BASE (develop@a4bb3e9). Learn more about missing BASE report.
Additional details and impacted files
@@ Coverage Diff @@
## develop #4733 +/- ##
==========================================
Coverage ? 59.01%
==========================================
Files ? 326
Lines ? 40361
Branches ? 6091
==========================================
Hits ? 23819
Misses ? 14652
Partials ? 1890
| Flag | Coverage Δ | |
|---|---|---|
| GPU | 59.01% <16.33%> (?) |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.