Hopper FP8 GRPO Recipe Productization
**Is your feature request related to a problem? Please describe.**
Release code for FP8 rollout + FP8 training (blockwise scaling), with:
- A convergence test and downstream evaluation compared with BF16 for llama3-8B
- Good performance (+20% over BF16 for llama3-8B)
**Describe the solution you'd like**
- FP8 rollout via vLLM
- FP8 training via the Megatron Transformer Engine (TE) backend

A minimal sketch of these two pieces at the library level is given after this list.
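The sketch below uses vLLM and Transformer Engine directly rather than the NeMo-RL wrappers; the model name, layer sizes, and toy forward/backward are illustrative assumptions, not the recipe's actual configuration.

```python
# Rollout side: vLLM supports online FP8 quantization via the
# quantization="fp8" engine argument (model name here is just an example).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")
outputs = llm.generate(["Rollout prompt goes here"], SamplingParams(max_tokens=64))

# Training side: Transformer Engine runs its linear layers in FP8 inside an
# fp8_autocast region. DelayedScaling is the classic per-tensor recipe; newer
# TE releases also provide a blockwise-scaling recipe, which is what the
# "block-quant" rows below refer to.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)
layer = te.Linear(4096, 4096, bias=False).cuda()
x = torch.randn(8, 4096, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()  # backward GEMMs also run in FP8 under the hybrid format
```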
**Describe alternatives you've considered**
N/A
**Additional context**
N/A
@guyueh1 to perform FP8 GRPO convergence test and downstream evaluation
@guyueh1 Can we also add a large model like 70B? @joyang-nv We also need an FP8 policy in the DTensor path; we should enable this after we move to Automodel, which already has FP8 support.
Great to know that Automodel already supports FP8.
Here is the latest status of FP8 GRPO support in nemo-rl:
| | GRPO: llama3 | GRPO: Qwen3MoE | SFT: llama3 | SFT: Qwen3MoE |
|---|---|---|---|---|
| FP8 block-quant | https://github.com/NVIDIA-NeMo/RL/pull/971 | https://github.com/NVIDIA-NeMo/RL/pull/1175 | https://github.com/NVIDIA-NeMo/RL/pull/971 | https://github.com/NVIDIA-NeMo/RL/pull/971 |
| FP8 per-tensor | todo | todo | https://github.com/NVIDIA-NeMo/RL/pull/971 | https://github.com/NVIDIA-NeMo/RL/pull/971 |
Performance optimization for FP8 block-quant GRPO is still ongoing.
The short-term goal is to let users run end-to-end FP8 GRPO with block-quant scaling, which is the recommended recipe for H100; after that we will move on to a Blackwell recipe.
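For readers unfamiliar with the block-quant vs. per-tensor distinction in the table above, here is a minimal sketch (not the NeMo-RL or TE implementation) contrasting the two scaling schemes, using PyTorch's `float8_e4m3fn` dtype; the 128x128 block size, tensor shapes, and the injected outlier are illustrative assumptions.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quantize_per_tensor(x: torch.Tensor):
    """One scale for the whole tensor: a single outlier coarsens everything."""
    scale = x.abs().amax().clamp(min=1e-12) / FP8_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

def quantize_blockwise(x: torch.Tensor, block: int = 128):
    """One scale per (block x block) tile: an outlier only affects its own tile."""
    rows, cols = x.shape
    assert rows % block == 0 and cols % block == 0, "pad to a multiple of the block size"
    tiles = x.reshape(rows // block, block, cols // block, block)
    scales = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (tiles / scales).to(torch.float8_e4m3fn)
    return q.reshape(rows, cols), scales.reshape(rows // block, cols // block)

# Small weights plus one large outlier: the outlier forces a coarse global
# scale that pushes the small values toward FP8 underflow, while blockwise
# scaling confines the damage to the outlier's tile.
w = torch.randn(256, 256) * 0.01
w[0, 0] = 2000.0

q_t, s_t = quantize_per_tensor(w)
err_t = (q_t.float() * s_t - w).abs().mean()

q_b, s_b = quantize_blockwise(w)
deq_b = (q_b.float().reshape(2, 128, 2, 128) * s_b.reshape(2, 1, 2, 1)).reshape(256, 256)
err_b = (deq_b - w).abs().mean()

print(f"mean abs error  per-tensor: {err_t:.2e}  block-wise: {err_b:.2e}")
```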