
Hopper FP8 GRPO Recipe Productization

Open · guyueh1 opened this issue 4 months ago · 4 comments

Is your feature request related to a problem? Please describe.

Release code for FP8 rollout + FP8 training (blockwise scaling)

  • Convergence test and downstream evaluation compared with BF16 for llama3-8B
  • Good performance (+20% throughput over BF16 for llama3-8B)

Describe the solution you'd like

  • FP8 rollout via vLLM (see the sketch below)
  • FP8 training via the Megatron Transformer Engine (TE) backend
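
As a rough illustration of the rollout half, here is a minimal vLLM sketch. The model name, prompt, and sampling settings are placeholders, and nemo-rl drives vLLM internally rather than through this user-level API; this only shows the underlying FP8 knob the recipe relies on.

```python
# Minimal sketch of FP8 rollout with vLLM's user-level API.
# Model name and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder checkpoint
    quantization="fp8",  # dynamic FP8 quantization on Hopper GPUs
)
params = SamplingParams(temperature=0.8, max_tokens=256)
outputs = llm.generate(["Write a short proof that 2 is prime."], params)
print(outputs[0].outputs[0].text)
```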

Describe alternatives you've considered

N/A

Additional context

N/A

guyueh1 avatar Aug 20 '25 02:08 guyueh1

@guyueh1 to perform the FP8 GRPO convergence test and downstream evaluation

guyueh1 avatar Aug 20 '25 02:08 guyueh1

@guyueh1 Can we also add a large model like 70B? @joyang-nv We also need an FP8 policy in the DTensor path; we should enable this after we move to Automodel, which already has FP8 support.

euronymous-aithal avatar Aug 20 '25 18:08 euronymous-aithal

Great to know that Automodel already supports FP8.

joyang-nv avatar Aug 21 '25 07:08 joyang-nv

Here is the latest status of FP8 GRPO support in nemo-rl:

|                 | GRPO: llama3 | GRPO: Qwen3MoE | SFT: llama3 | SFT: Qwen3MoE |
|-----------------|--------------|----------------|-------------|---------------|
| FP8 block-quant | https://github.com/NVIDIA-NeMo/RL/pull/971 | https://github.com/NVIDIA-NeMo/RL/pull/1175 | https://github.com/NVIDIA-NeMo/RL/pull/971 | https://github.com/NVIDIA-NeMo/RL/pull/971 |
| FP8 per-tensor  | todo | todo | https://github.com/NVIDIA-NeMo/RL/pull/971 | https://github.com/NVIDIA-NeMo/RL/pull/971 |
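
For reference, here is a minimal numpy sketch (not the Megatron/TE implementation) contrasting the two scaling schemes in the table: per-tensor uses one scale for the whole weight, while block-quant assigns one scale per tile (128 is a commonly used tile size), so an outlier only degrades its own tile.

```python
# Illustrative contrast of per-tensor vs. blockwise FP8 scaling.
# Not the Megatron/TE code path; a coarse stand-in for the FP8 cast.
import numpy as np

FP8_MAX = 448.0  # largest finite value in FP8 E4M3

def quant_dequant(x: np.ndarray, scale: float) -> np.ndarray:
    """Scale, round, clip, rescale: a crude model of quantization error."""
    q = np.clip(np.round(x * scale), -FP8_MAX, FP8_MAX)
    return q / scale

def per_tensor(w: np.ndarray) -> np.ndarray:
    # One scale for the whole tensor: a single outlier shrinks the
    # effective precision of every other element.
    return quant_dequant(w, FP8_MAX / np.abs(w).max())

def block_quant(w: np.ndarray, block: int = 128) -> np.ndarray:
    # One scale per (block x block) tile: an outlier only degrades
    # its own tile, which is why block-quant tracks BF16 more closely.
    out = np.empty_like(w)
    for i in range(0, w.shape[0], block):
        for j in range(0, w.shape[1], block):
            tile = w[i:i + block, j:j + block]
            out[i:i + block, j:j + block] = quant_dequant(
                tile, FP8_MAX / np.abs(tile).max())
    return out

# A weight matrix with one large outlier: per-tensor error blows up,
# block-quant error stays small outside the outlier's tile.
w = np.random.randn(256, 256).astype(np.float32)
w[0, 0] = 100.0
print("per-tensor err :", np.abs(per_tensor(w) - w).mean())
print("block-quant err:", np.abs(block_quant(w) - w).mean())
```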

Performance optimization for FP8 block-quant GRPO is still ongoing.

The short-term goal is to enable users to run end-to-end FP8 GRPO with block-quant, which is the recommended recipe for H100; after that we will move on to implementing the Blackwell recipe.

guyueh1 avatar Sep 22 '25 18:09 guyueh1