[rollout, vllm] feat: support blockwise fp8 rollout
What does this PR do?
This PR introduces FP8 rollout with the vLLM inference backend in verl.
We monkey-patch several vLLM functions to enable FP8 rollout for reinforcement learning:
- `load_weights`: a custom `load_weights` function that quantizes model weights on the fly from a higher-precision format to FP8.
- `process_weights_after_loading`: replaces the `vllm.model_executor.layers.quantization.fp8.Fp8LinearMethod.process_weights_after_loading` function to handle model weight loading after quantization.
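Roughly, the second patch can be pictured as follows. This is a simplified sketch, not the exact logic in this PR: only the two vLLM symbols above are real, and the FP8-dtype shortcut plus the fallback are assumptions.

```python
import torch
from vllm.model_executor.layers.quantization.fp8 import Fp8LinearMethod

# Keep a handle to the original implementation so the patch can fall back to it.
_orig_process_weights_after_loading = Fp8LinearMethod.process_weights_after_loading

def patched_process_weights_after_loading(self, layer):
    # Weights synced from the trainer have already been quantized into FP8
    # blocks by the custom load_weights, so skip the original requantization
    # path and keep the FP8 tensors plus their per-block scales as-is.
    weight = getattr(layer, "weight", None)
    if weight is not None and weight.dtype == torch.float8_e4m3fn:
        return
    _orig_process_weights_after_loading(self, layer)

Fp8LinearMethod.process_weights_after_loading = patched_process_weights_after_loading
```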
Support Matrix
- FP8 blockwise quantization for rollout
- The scheme used in DeepSeek: 1x128 quantization for activations and 128x128 quantization for model weights (see the sketch after this list)
- Dense models and MoE models
- Async rollout interfaces
- vLLM 0.10.x & vLLM 0.11
- FSDP and Megatron training backends
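For reference, a minimal sketch of the blockwise scheme in plain PyTorch. The helper names and the unfused loops are illustrative only; vLLM/DeepGEMM use fused kernels for this, and the activation helper assumes the hidden size is divisible by 128.

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max

def quantize_weight_128x128(w: torch.Tensor, block: int = 128):
    """One scale per 128x128 weight block (DeepSeek-style blockwise FP8)."""
    rows, cols = w.shape
    n_row, n_col = -(-rows // block), -(-cols // block)  # ceil division
    w_fp8 = torch.empty_like(w, dtype=FP8_DTYPE)
    scales = torch.empty(n_row, n_col, dtype=torch.float32, device=w.device)
    for i in range(n_row):
        for j in range(n_col):
            blk = w[i * block:(i + 1) * block, j * block:(j + 1) * block].float()
            s = blk.abs().amax().clamp(min=1e-12) / FP8_MAX
            w_fp8[i * block:(i + 1) * block, j * block:(j + 1) * block] = (blk / s).to(FP8_DTYPE)
            scales[i, j] = s
    return w_fp8, scales

def quantize_activation_1x128(x: torch.Tensor, group: int = 128):
    """One scale per 1x128 group along the hidden dim (assumes hidden % group == 0)."""
    tokens, hidden = x.shape
    xg = x.float().view(tokens, hidden // group, group)
    s = xg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    x_fp8 = (xg / s).to(FP8_DTYPE).view(tokens, hidden)
    return x_fp8, s.squeeze(-1)
```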
Experiments and Outcomes
Qwen3-8B-Base Dense Model
Configuration
- DAPO recipe. AIME24 online validation.
- vLLM (FP8 SPMD rollout) + FSDP
- Note that SPMD rollout has since been deprecated, so the FP8 SPMD rollout has been removed.
- Prompt batch size 32, n=16
- Rollout batch size: 32*3*16
- train_batch_size & ppo_mini_batch_size: 32
- Max response length 20K
- Token-level TIS, C=2 (see the sketch after this list)
- 8*H100
- vLLM 0.10.0+CUDA 12.6 vs vLLM 0.11.0+CUDA 12.9
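Token-level TIS here means truncated importance sampling: each token's loss is reweighted by the probability ratio between the training policy and the FP8 rollout policy, clipped at C. A minimal sketch, assuming per-token log-probs from both sides are available (the function name is illustrative, not verl's API):

```python
import torch

def tis_weights(logp_train: torch.Tensor,
                logp_rollout: torch.Tensor,
                clip_c: float = 2.0) -> torch.Tensor:
    """Per-token truncated importance weights min(pi_train / pi_rollout, C).

    logp_train:   per-token log-probs from the training policy (e.g. the BF16 FSDP actor)
    logp_rollout: per-token log-probs returned by the FP8 vLLM rollout
    """
    ratio = torch.exp(logp_train - logp_rollout)
    return ratio.clamp(max=clip_c).detach()

# Usage: loss = (tis_weights(logp_train, logp_rollout, 2.0) * per_token_loss).mean()
```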
Accuracy
dark green: BF16, orange: FP8 rollout + token-level TIS, light green: FP8 rollout without TIS
Results and observations:
- With TIS, FP8 rollout accuracy aligns with BF16
- Noticeable accuracy drop when TIS is not enabled
- Mismatch KL is higher, but stays within an acceptable range throughout training
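The mismatch KL above refers to the divergence between the FP8 rollout policy and the training policy measured on the sampled tokens. A minimal sketch of one common estimator; the exact metric logged by verl may differ:

```python
import torch

def rollout_mismatch_kl(logp_train: torch.Tensor, logp_rollout: torch.Tensor) -> torch.Tensor:
    """k3 estimator of KL(rollout || train), averaged over sampled tokens."""
    log_ratio = logp_train - logp_rollout
    return (torch.exp(log_ratio) - log_ratio - 1.0).mean()
```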
Performance
green: BF16, orange: FP8 rollout + CUDA 12.6 + DeepGEMM, purple: FP8 rollout + CUDA 12.9 + DeepGEMM
Results and observations:
- FP8 rollout yields a ~12% rollout speedup with CUDA 12.6 + DeepGEMM
- When upgrading to CUDA 12.9, the speedup increases to ~18%
Qwen3-30B-A3B-Base MoE Model
Configuration
- DAPO recipe. AIME24 online validation.
- FP8 async rollout, vLLM+FSDP
- Prompt batch size 32
- Rollout batch size: 32*3*16
- train_batch_size & ppo_mini_batch_size: 32
- Max response length 20K
- Token-level TIS, C=2
- 2*8*H100
- vLLM 0.10.0+CUDA 12.6
Accuracy
grey: BF16 + token-level TIS, red: FP8 rollout + token-level TIS
Results and observations:
- Rollout & training distribution mismatch is generally higher for MoE models
- Rollout correction is required even for BF16 rollout
- FP8 rollout with token-level TIS aligns with BF16
Performance
grey: BF16 + token-level TIS, red: FP8 rollout + token-level TIS
Results and observations:
- FP8 rollout: over 35% rollout speedup
- More performance gain is expected with CUDA 12.9
Usage
FP8 can be enabled in the config file `verl/trainer/config/ppo_megatron_trainer.yaml`:

```yaml
rollout:
  quantization: True
  use_block_quant_rollout: True
```
Or it can be enabled by command line:

```bash
actor_rollout_ref.rollout.quantization=True \
actor_rollout_ref.rollout.use_block_quant_rollout=True
```
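For example, appended to a typical verl launch command; the entrypoint and the remaining overrides depend on your recipe, and only the two quantization flags are specific to this PR:

```bash
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.quantization=True \
    actor_rollout_ref.rollout.use_block_quant_rollout=True
    # ... plus the usual data/model/trainer overrides for your recipe
```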
Plans
- Will open another PR to support FP8 rollout in SGLang
- Further work: enable FP8 training in Megatron
Checklist Before Starting
- [x] Search for similar PRs. Paste at least one query link here: WIP: FP8 train
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
Design & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- [x] Read the Contribute Guide.
- [x] Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update the documentation.
- [ ] Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)