[rollout, vllm] feat: support blockwise fp8 rollout
What does this PR do?
This PR introduces FP8 rollout with the vLLM inference backend in verl.
We monkey-patch several vLLM functions to enable FP8 rollout for reinforcement learning:
- `load_weights`: a custom `load_weights` function that quantizes model weights on the fly from a higher-precision format to FP8.
- `process_weights_after_loading`: replaces the `vllm.model_executor.layers.quantization.fp8.Fp8LinearMethod.process_weights_after_loading` function to handle model weight loading after quantization.
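Roughly, the second patch can be pictured as follows. This is a simplified sketch, not the exact logic in this PR: only the two vLLM symbols above are real, and the FP8-dtype shortcut plus the fallback are assumptions.

```python
import torch
from vllm.model_executor.layers.quantization.fp8 import Fp8LinearMethod

# Keep a handle to the original implementation so the patch can fall back to it.
_orig_process_weights_after_loading = Fp8LinearMethod.process_weights_after_loading

def patched_process_weights_after_loading(self, layer):
    # Weights synced from the trainer have already been quantized into FP8
    # blocks by the custom load_weights, so skip the original requantization
    # path and keep the FP8 tensors plus their per-block scales as-is.
    weight = getattr(layer, "weight", None)
    if weight is not None and weight.dtype == torch.float8_e4m3fn:
        return
    _orig_process_weights_after_loading(self, layer)

Fp8LinearMethod.process_weights_after_loading = patched_process_weights_after_loading
```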
Support Matrix
- FP8 blockwise quantization for rollout
- The scheme used in DeepSeek: 1x128 quantization for activations and 128x128 quantization for model weights (see the sketch after this list)
- Dense models and MoE models
- Async rollout interfaces
- vLLM 0.10.x & vLLM 0.11
- FSDP and Megatron training backends
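For reference, a minimal sketch of the blockwise scheme in plain PyTorch. The helper names and the unfused loops are illustrative only; vLLM/DeepGEMM use fused kernels for this, and the activation helper assumes the hidden size is divisible by 128.

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max

def quantize_weight_128x128(w: torch.Tensor, block: int = 128):
    """One scale per 128x128 weight block (DeepSeek-style blockwise FP8)."""
    rows, cols = w.shape
    n_row, n_col = -(-rows // block), -(-cols // block)  # ceil division
    w_fp8 = torch.empty_like(w, dtype=FP8_DTYPE)
    scales = torch.empty(n_row, n_col, dtype=torch.float32, device=w.device)
    for i in range(n_row):
        for j in range(n_col):
            blk = w[i * block:(i + 1) * block, j * block:(j + 1) * block].float()
            s = blk.abs().amax().clamp(min=1e-12) / FP8_MAX
            w_fp8[i * block:(i + 1) * block, j * block:(j + 1) * block] = (blk / s).to(FP8_DTYPE)
            scales[i, j] = s
    return w_fp8, scales

def quantize_activation_1x128(x: torch.Tensor, group: int = 128):
    """One scale per 1x128 group along the hidden dim (assumes hidden % group == 0)."""
    tokens, hidden = x.shape
    xg = x.float().view(tokens, hidden // group, group)
    s = xg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    x_fp8 = (xg / s).to(FP8_DTYPE).view(tokens, hidden)
    return x_fp8, s.squeeze(-1)
```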
Experiments and Outcomes
Qwen3-8B-Base Dense Model
Configuration
- DAPO recipe. AIME24 online validation.
- vLLM (FP8 SPMD rollout) + FSDP
- Note that SPMD rollout has since been deprecated, so the FP8 SPMD rollout has been removed.
- Prompt batch size 32, n=16
- Rollout batch size: 32*3*16
- train_batch_size & ppo_mini_batch_size: 32
- Max response length 20K
- Token-level TIS, C=2 (see the sketch after this list)
- 8*H100
- vLLM 0.10.0+CUDA 12.6 vs vLLM 0.11.0+CUDA 12.9
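Token-level TIS here means truncated importance sampling: each token's loss is reweighted by the probability ratio between the training policy and the FP8 rollout policy, clipped at C. A minimal sketch, assuming per-token log-probs from both sides are available (the function name is illustrative, not verl's API):

```python
import torch

def tis_weights(logp_train: torch.Tensor,
                logp_rollout: torch.Tensor,
                clip_c: float = 2.0) -> torch.Tensor:
    """Per-token truncated importance weights min(pi_train / pi_rollout, C).

    logp_train:   per-token log-probs from the training policy (e.g. the BF16 FSDP actor)
    logp_rollout: per-token log-probs returned by the FP8 vLLM rollout
    """
    ratio = torch.exp(logp_train - logp_rollout)
    return ratio.clamp(max=clip_c).detach()

# Usage: loss = (tis_weights(logp_train, logp_rollout, 2.0) * per_token_loss).mean()
```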
Accuracy
dark green: BF16, orange: FP8 rollout + token-level TIS, light green: FP8 rollout without TIS
Results and observations:
- With TIS, FP8 rollout accuracy aligns with BF16
- Noticeable accuracy drop when TIS is not enabled
- Mismatch KL is higher, but stays within an acceptable range throughout training
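The mismatch KL above refers to the divergence between the FP8 rollout policy and the training policy measured on the sampled tokens. A minimal sketch of one common estimator; the exact metric logged by verl may differ:

```python
import torch

def rollout_mismatch_kl(logp_train: torch.Tensor, logp_rollout: torch.Tensor) -> torch.Tensor:
    """k3 estimator of KL(rollout || train), averaged over sampled tokens."""
    log_ratio = logp_train - logp_rollout
    return (torch.exp(log_ratio) - log_ratio - 1.0).mean()
```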
Performance
green: BF16, orange: FP8 rollout + CUDA 12.6 + DeepGEMM, purple: FP8 rollout + CUDA 12.9 + DeepGEMM
Results and observations:
- FP8 rollout yields a ~12% rollout speedup with CUDA 12.6 + DeepGEMM
- When upgrading to CUDA 12.9, the speedup increases to ~18%
Qwen3-30B-A3B-Base MoE Model
Configuration
- DAPO recipe. AIME24 online validation.
- FP8 async rollout, vLLM+FSDP
- Prompt batch size 32
- Rollout batch size: 32*3*16
- train_batch_size & ppo_mini_batch_size: 32
- Max response length 20K
- Token-level TIS, C=2
- 2*8*H100
- vLLM 0.10.0+CUDA 12.6
Accuracy
grey: BF16 + token-level TIS, red: FP8 rollout + token-level TIS
Results and observations:
- Rollout & training distribution mismatch is generally higher for MoE models
- Rollout correction is required even for BF16 rollout
- FP8 rollout with token-level TIS aligns with BF16
Performance
grey: BF16 + token-level TIS, red: FP8 rollout + token-level TIS
Results and observations:
- FP8 rollout: over 35% rollout speedup
- More performance gain is expected with CUDA 12.9
Usage
FP8 can be enabled in the config file `verl/trainer/config/ppo_megatron_trainer.yaml`:

```yaml
rollout:
  quantization: True
  use_block_quant_rollout: True
```
Or it can be enabled by command line:

```bash
actor_rollout_ref.rollout.quantization=True \
actor_rollout_ref.rollout.use_block_quant_rollout=True
```
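For example, appended to a typical verl launch command; the entrypoint and the remaining overrides depend on your recipe, and only the two quantization flags are specific to this PR:

```bash
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.quantization=True \
    actor_rollout_ref.rollout.use_block_quant_rollout=True
    # ... plus the usual data/model/trainer overrides for your recipe
```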
Plans
- Will open another PR to support FP8 rollout in SGLang
- Further work: enable FP8 training in Megatron
Checklist Before Starting
- [x] Search for similar PRs. Paste at least one query link here: WIP: FP8 train
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
Design & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- [x] Read the Contribute Guide.
- [x] Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update the documentation.
- [ ] Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)