FastDeploy icon indicating copy to clipboard operation
FastDeploy copied to clipboard

[Feature] Support nvfp4 moe

Open zoooo0820 opened this issue 2 months ago • 3 comments

Motivation

This pr supports modelopt format NVFP4 inference (currently, only Qwen/Ernie) by introducing Flashinfer as a backend.

It requires GPU with sm>=100 and Flashinfer installtion.

Modifications

With paddle compatible api, this pr introdudaces Flashinfer as a backend. There may be coexistence issues with some third-party including pytorch code (e.g. xgrammar, triton). , Currenly we cannot use them at same time and we are working on resolving this.

Usage or Command

new environment variables

  • FD_FLASHINFER_MOE_BACKEND : FP4 MoE backend, could be flashinfer-cutlass, flashinfer-trtllm or None (default is None, will use flashinfer-cutass). Currently we only support flashinfer-cutlass.

  • FD_NVFP4_GEMM_BACKEND: FP4 dense GEMM backend, could be flashinfer-cutlass, flashinfer-trtllm, flashinfer-cudnn or None (default is None, will use flashinfer-cutlass). Currently we only support flashinfer-cutlass.

  • PADDLE_COMPATIBLE_API: This is an environment variable for Flashinfer with Paddle, set it totrue to use paddle compatible api, default is false.

start the server

export PADDLE_COMPATIBLE_API=true
python -m fastdeploy.entrypoints.openai.api_server \
    --model nv-community/Qwen3-30B-A3B-FP4 \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --cache-queue-port 8183 \
    --tensor-parallel-size 1 \
    --max-model-len  32768 \
    --max-num-seqs 128

Accuracy Tests

Checklist

  • [x] Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • [x] Format your code, run pre-commit before commit.
  • [x] Add unit tests. Please write the reason in this PR if no unit tests.
  • [ ] Provide accuracy results.
  • [x] If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

zoooo0820 avatar Oct 31 '25 08:10 zoooo0820

Thanks for your contribution!

paddle-bot[bot] avatar Oct 31 '25 08:10 paddle-bot[bot]

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Nov 06 '25 10:11 CLAassistant

Codecov Report

:x: Patch coverage is 16.33987% with 256 lines in your changes missing coverage. Please review. :warning: Please upload report for BASE (develop@a4bb3e9). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...deploy/model_executor/layers/quantization/nvfp4.py 12.89% 222 Missing and 1 partial :warning:
fastdeploy/model_executor/layers/moe/moe.py 20.00% 19 Missing and 1 partial :warning:
fastdeploy/flashinfer.py 62.50% 5 Missing and 1 partial :warning:
...loy/model_executor/layers/quantization/__init__.py 20.00% 4 Missing :warning:
fastdeploy/model_executor/utils.py 25.00% 3 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #4733   +/-   ##
==========================================
  Coverage           ?   59.01%           
==========================================
  Files              ?      326           
  Lines              ?    40361           
  Branches           ?     6091           
==========================================
  Hits               ?    23819           
  Misses             ?    14652           
  Partials           ?     1890           
Flag Coverage Δ
GPU 59.01% <16.33%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov-commenter avatar Dec 03 '25 12:12 codecov-commenter