
[rollout, vllm] feat: support blockwise fp8 rollout


What does this PR do?

This PR introduces FP8 rollout with the vLLM inference backend in verl.

We monkey patch several vLLM functions to enable FP8 rollout for reinforcement learning.

  1. load_weights: A custom load_weights function that quantizes model weights on the fly from a higher-precision format (e.g. BF16) to FP8.
  2. process_weights_after_loading: Replace the vllm.model_executor.layers.quantization.fp8.Fp8LinearMethod.process_weights_after_loading function so that weights already quantized during loading are handled correctly.
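
The patch itself is plain attribute assignment on the vLLM class. A minimal sketch, assuming a hypothetical layer flag weight_already_fp8 set by the custom load_weights (verl's actual bookkeeping differs in the details):

  from vllm.model_executor.layers.quantization.fp8 import Fp8LinearMethod

  _orig_process = Fp8LinearMethod.process_weights_after_loading

  def _patched_process_weights_after_loading(self, layer):
      # `weight_already_fp8` is a hypothetical marker that the custom
      # load_weights would set after quantizing incoming BF16 weights.
      if getattr(layer, "weight_already_fp8", False):
          return  # keep the FP8 weight and scales produced during loading
      _orig_process(self, layer)  # otherwise fall back to vLLM's original path

  Fp8LinearMethod.process_weights_after_loading = _patched_process_weights_after_loading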

Support Matrix

  • FP8 blockwise quantization for rollout
    • The scheme used by DeepSeek: 1x128 quantization for activations and 128x128 quantization for model weights (see the PyTorch sketch after this list)
  • Dense models and MoE models
  • Async rollout interfaces
  • vLLM 0.10.x & vLLM 0.11
  • FSDP and Megatron training backends
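
For reference, the 128x128 weight blocking noted above can be sketched in plain PyTorch. This is a toy absmax-scaling reference, not verl's kernel path (the rollout itself relies on fused FP8 kernels such as DeepGemm):

  import torch
  import torch.nn.functional as F

  FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3
  BLOCK = 128

  def quantize_weight_blockwise_fp8(w: torch.Tensor):
      """Quantize a 2-D [out, in] weight with one scale per 128x128 block.

      Returns an FP8 weight plus [out/128, in/128] scales such that
      w ~= q.float() * scale (broadcast back over the blocks).
      """
      out_dim, in_dim = w.shape
      # Pad each dimension up to a multiple of the block size.
      wp = F.pad(w.float(), (0, -in_dim % BLOCK, 0, -out_dim % BLOCK))
      blocks = wp.view(wp.shape[0] // BLOCK, BLOCK, wp.shape[1] // BLOCK, BLOCK)
      # Per-block absmax scaling into the representable FP8 range.
      amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp_(min=1e-12)
      scale = amax / FP8_MAX
      q = (blocks / scale).to(torch.float8_e4m3fn)
      q = q.view_as(wp)[:out_dim, :in_dim]  # undo the padding
      return q, scale.squeeze(-1).squeeze(1)

Activations get the analogous 1x128 treatment (one scale per 128 contiguous elements along the hidden dimension), computed at runtime inside the fused kernels.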

Experiments and Outcomes

Qwen3-8B-Base Dense Model

Configuration

  • DAPO recipe. AIME24 online validation.
  • vLLM (FP8 SPMD rollout) + FSDP
    • Note that SPMD rollout has since been deprecated, so the FP8 SPMD rollout has been removed.
  • Prompt batch size 32, n=16.
  • Rollout batch size: 32*3*16
  • train_batch_size & ppo_mini_batch_size: 32
  • Max response length 20K
  • Token-level TIS, C=2 (recapped after this list)
  • 8*H100
  • vLLM 0.10.0+CUDA 12.6 vs vLLM 0.11.0+CUDA 12.9
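
As a quick recap of token-level TIS (the standard truncated importance sampling formulation; verl's implementation may differ in details): each token's policy-gradient term is scaled by a clipped ratio between the training policy and the rollout policy,

  $$ w_t = \min\!\left(\frac{\pi_\theta(a_t \mid s_{<t})}{\pi_{\text{rollout}}(a_t \mid s_{<t})},\; C\right), \qquad C = 2 $$

Clipping at C bounds the correction, so occasional large mismatches between the FP8 rollout distribution and the BF16 training distribution cannot blow up the gradient.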

Accuracy

Figure Qwen3-8b-base_fp8_acc. Dark green: BF16; orange: FP8 rollout + token-level TIS; light green: FP8 rollout without TIS.

Results and observations:

  • With TIS, FP8 rollout accuracy aligns with BF16
  • Clear accuracy drop when TIS is not enabled
  • The rollout-training mismatch KL is higher, but stays within an acceptable range throughout training

Performance

Figure Qwen3-8b-base_fp8_rollout_perf. Green: BF16; orange: FP8 rollout + CUDA 12.6 + DeepGemm; purple: FP8 rollout + CUDA 12.9 + DeepGemm.

Results and observations:

  • FP8 rollout yields a ~12% rollout speedup with CUDA 12.6 + DeepGemm
  • Upgrading to CUDA 12.9 raises the speedup to ~18%

Qwen3-30B-A3B-Base MoE Model

Configuration

  • DAPO recipe. AIME24 online validation.
  • FP8 async rollout, vLLM+FSDP
  • Prompt batch size 32
  • Rollout batch size: 32*3*16
  • train_batch_size & ppo_mini_batch_size: 32
  • Max response length 20K
  • Token-level TIS, C=2
  • 2*8*H100
  • vLLM 0.10.0+CUDA 12.6

Accuracy

Figure Qwen3-30b-a3b_fp8_acc. Grey: BF16 + token-level TIS; red: FP8 rollout + token-level TIS.

Results and observations:

  • The rollout-training distribution mismatch is generally higher for MoE models
  • Rollout correction is required even for BF16
  • FP8 rollout with token-level TIS aligns with BF16

Performance

Figure Qwen3-30b-a3b_fp8_perf. Grey: BF16 + token-level TIS; red: FP8 rollout + token-level TIS.

Results and observations:

  • FP8 rollout: over 35% rollout speedup
  • We expect further performance gains with CUDA 12.9

Usage

FP8 can be enabled in the config file verl/trainer/config/ppo_megatron_trainer.yaml:

  rollout:
    quantization: True
    use_block_quant_rollout: True

Or it can be enabled by command line:

  • actor_rollout_ref.rollout.quantization=True
  • actor_rollout_ref.rollout.use_block_quant_rollout=True

Plans

  • Will open another PR to support FP8 rollout in SGLang
  • Will further enable FP8 training in Megatron

Checklist Before Starting

  • [x] Search for similar PRs. Paste at least one query link here: WIP: FP8 train
  • [x] Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
