Hopper FP8 GRPO Recipe Productization
**Is your feature request related to a problem? Please describe.**
Release code for FP8 rollout + FP8 training (blockwise scaling), with:
- A convergence test and downstream evaluation compared with BF16 for llama3-8B
- Good performance (+20% over BF16 for llama3-8B)
**Describe the solution you'd like**
- FP8 rollout via vLLM
- FP8 training via the Megatron Transformer Engine (TE) backend

A minimal sketch of these two pieces at the library level is given after this list.
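The sketch below uses vLLM and Transformer Engine directly rather than the NeMo-RL wrappers; the model name, layer sizes, and toy forward/backward are illustrative assumptions, not the recipe's actual configuration.

```python
# Rollout side: vLLM supports online FP8 quantization via the
# quantization="fp8" engine argument (model name here is just an example).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")
outputs = llm.generate(["Rollout prompt goes here"], SamplingParams(max_tokens=64))

# Training side: Transformer Engine runs its linear layers in FP8 inside an
# fp8_autocast region. DelayedScaling is the classic per-tensor recipe; newer
# TE releases also provide a blockwise-scaling recipe, which is what the
# "block-quant" rows below refer to.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)
layer = te.Linear(4096, 4096, bias=False).cuda()
x = torch.randn(8, 4096, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()  # backward GEMMs also run in FP8 under the hybrid format
```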
**Describe alternatives you've considered**
N/A
**Additional context**
N/A
@guyueh1 to perform FP8 GRPO convergence test and downstream evaluation
@guyueh1 Can we also add a large model like 70B? @joyang-nv We also need an FP8 policy in the DTensor path; we should enable this after we move to Automodel, which already has FP8 support.
Great to know that Automodel already supports FP8.
Here is the latest status of FP8 GRPO support in nemo-rl:
| | GRPO: llama3 | GRPO: Qwen3MoE | SFT: llama3 | SFT: Qwen3MoE |
|---|---|---|---|---|
| FP8 block-quant | https://github.com/NVIDIA-NeMo/RL/pull/971 | https://github.com/NVIDIA-NeMo/RL/pull/1175 | https://github.com/NVIDIA-NeMo/RL/pull/971 | https://github.com/NVIDIA-NeMo/RL/pull/971 |
| FP8 per-tensor | todo | todo | https://github.com/NVIDIA-NeMo/RL/pull/971 | https://github.com/NVIDIA-NeMo/RL/pull/971 |
Performance optimization for FP8 block-quant GRPO is still ongoing.
The short-term goal is to let users run end-to-end FP8 GRPO with block-quant scaling, which is the recommended recipe for H100; after that we will move on to a Blackwell recipe.
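For readers unfamiliar with the block-quant vs. per-tensor distinction in the table above, here is a minimal sketch (not the NeMo-RL or TE implementation) contrasting the two scaling schemes, using PyTorch's `float8_e4m3fn` dtype; the 128x128 block size, tensor shapes, and the injected outlier are illustrative assumptions.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quantize_per_tensor(x: torch.Tensor):
    """One scale for the whole tensor: a single outlier coarsens everything."""
    scale = x.abs().amax().clamp(min=1e-12) / FP8_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

def quantize_blockwise(x: torch.Tensor, block: int = 128):
    """One scale per (block x block) tile: an outlier only affects its own tile."""
    rows, cols = x.shape
    assert rows % block == 0 and cols % block == 0, "pad to a multiple of the block size"
    tiles = x.reshape(rows // block, block, cols // block, block)
    scales = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (tiles / scales).to(torch.float8_e4m3fn)
    return q.reshape(rows, cols), scales.reshape(rows // block, cols // block)

# Small weights plus one large outlier: the outlier forces a coarse global
# scale that pushes the small values toward FP8 underflow, while blockwise
# scaling confines the damage to the outlier's tile.
w = torch.randn(256, 256) * 0.01
w[0, 0] = 2000.0

q_t, s_t = quantize_per_tensor(w)
err_t = (q_t.float() * s_t - w).abs().mean()

q_b, s_b = quantize_blockwise(w)
deq_b = (q_b.float().reshape(2, 128, 2, 128) * s_b.reshape(2, 1, 2, 1)).reshape(256, 256)
err_b = (deq_b - w).abs().mean()

print(f"mean abs error  per-tensor: {err_t:.2e}  block-wise: {err_b:.2e}")
```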