[WIP][FSDP,Training] feat: Support FSDP2 FP8 training
What does this PR do?
This PR enables FP8 training support using FSDP2 (integrated with torchao) on the latest codebase.
This work builds on the earlier attempt in PR https://github.com/volcengine/verl/pull/1490 by @horsebridge. verl's architecture has changed significantly since then, so the original PR conflicts with the current main branch. This PR ports that implementation to the current architecture and re-enables FP8 training.
TODO List:
- [ ] We are currently conducting experiments on FSDP2 FP8 training combined with FP8 rollout (based on SGLang). The experimental results and verification details will be updated here once available.
Checklist Before Starting
- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
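The title rules above can be checked locally before pushing. The following is only a sketch under stated assumptions: `check_title`, `TITLE_RE`, and the exact regular expression are hypothetical and are not taken from the CI job, whose real behavior may differ (for example in how it treats extra prefixes or letter case).

```python
import re

# Module and type names copied from the checklist above.
ALLOWED_MODULES = {
    "fsdp", "megatron", "sglang", "vllm", "rollout", "trainer", "ci",
    "training_utils", "recipe", "hardware", "deployment", "ray", "worker",
    "single_controller", "misc", "perf", "model", "algo", "env", "tool",
    "ckpt", "doc", "data",
}
ALLOWED_TYPES = {"feat", "fix", "refactor", "chore", "test"}

# Hypothetical pattern (an assumption, not the CI's own):
# optional [BREAKING], then [modules], then "type: description".
TITLE_RE = re.compile(r"^(\[BREAKING\])?\[([^\]]+)\]\s+(\w+):\s+(\S.*)$")


def check_title(title: str) -> list[str]:
    """Return a list of problems; an empty list means the title looks valid."""
    match = TITLE_RE.match(title)
    if match is None:
        return ["title does not match [{modules}] {type}: {description}"]

    _, modules, type_, _ = match.groups()
    problems = []
    # Modules are comma-separated; surrounding spaces are ignored.
    for module in (m.strip() for m in modules.split(",")):
        if module not in ALLOWED_MODULES:
            problems.append(f"unknown module: {module!r}")
    if type_ not in ALLOWED_TYPES:
        problems.append(f"unknown type: {type_!r}")
    return problems


if __name__ == "__main__":
    print(check_title("[BREAKING][fsdp, megatron] feat: dynamic batching"))
```

The function returns a list of problems rather than raising, so every issue in a title is reported in one pass.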
Test
For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots and evaluation results.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
    actor_rollout_ref:
      actor:
        # Note: FP8 training currently requires fsdp2 strategy
        strategy: fsdp2
        fsdp_config:
          fp8: True
    critic:
      # Note: FP8 training currently requires fsdp2 strategy
      strategy: fsdp2
      model:
        fsdp_config:
          fp8: True
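The constraint noted in the config comments (FP8 currently requires the `fsdp2` strategy) could be enforced with an early check. This is a minimal sketch, not the code in this PR: `validate_fp8` is a hypothetical helper, the config is modeled as plain nested dicts, and it assumes `critic` is a top-level section with its `fsdp_config` under `model`, as the example config suggests.

```python
def validate_fp8(config: dict) -> None:
    """Raise ValueError if fp8 is enabled for a role whose strategy is not fsdp2."""
    actor = config.get("actor_rollout_ref", {}).get("actor", {})
    critic = config.get("critic", {})

    # Map each role name to (strategy, fsdp_config); key paths are assumptions
    # based on the example config, not confirmed against the verl schema.
    roles = {
        "actor_rollout_ref.actor": (
            actor.get("strategy"),
            actor.get("fsdp_config", {}),
        ),
        "critic": (
            critic.get("strategy"),
            critic.get("model", {}).get("fsdp_config", {}),
        ),
    }

    for name, (strategy, fsdp_config) in roles.items():
        if fsdp_config.get("fp8", False) and strategy != "fsdp2":
            raise ValueError(
                f"{name}: fp8=True requires strategy 'fsdp2', got {strategy!r}"
            )


if __name__ == "__main__":
    config = {
        "actor_rollout_ref": {
            "actor": {"strategy": "fsdp2", "fsdp_config": {"fp8": True}},
        },
        "critic": {
            "strategy": "fsdp2",
            "model": {"fsdp_config": {"fp8": True}},
        },
    }
    validate_fp8(config)  # passes silently for the example config
```

Failing fast at config-load time avoids discovering the mismatch only after workers have been launched.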
Design & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- [ ] Read the Contribute Guide.
- [ ] Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update the documentation.
- [ ] Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)