
[megatron] fix: BF16 mode should use PAO as well

Open • ashvinnihalani opened this pull request 1 month ago • 4 comments

What does this PR do?

BF16 training should use PAO (the precision-aware optimizer) as well, to reduce the memory usage of the trainer nodes. This is a lossless operation since, by default, the master params are reconstructed in fp32. https://github.com/NVIDIA/Megatron-LM/blob/0634924d1724d52536129128cd85d28b92baf72e/megatron/core/optimizer/optimizer_config.py#L63

Checklist Before Starting

  • [X] Search for similar PRs. Paste at least one query link here: https://github.com/volcengine/verl/pulls?q=is%3Apr+is%3Aopen+optimizer
  • [X] Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this
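As a rough illustration only (not the literal diff of this PR), the change amounts to also turning on Megatron-core's precision-aware optimizer when the optimizer config is built for BF16. Below is a minimal sketch assuming the `OptimizerConfig` fields `bf16`, `use_distributed_optimizer`, `use_precision_aware_optimizer`, and (in newer Megatron-core releases) `store_param_remainders`; the concrete values are placeholders.

```python
# Sketch only: enabling Megatron-core's precision-aware optimizer (PAO)
# alongside BF16 training. Field names follow the OptimizerConfig linked
# above (megatron/core/optimizer/optimizer_config.py); availability of
# individual fields depends on the Megatron-LM version in use.
from megatron.core.optimizer import OptimizerConfig

optim_config = OptimizerConfig(
    optimizer="adam",
    lr=1e-6,
    bf16=True,                           # BF16 mixed-precision training
    use_distributed_optimizer=True,      # PAO is used together with the distributed optimizer
    use_precision_aware_optimizer=True,  # keep optimizer states in lower precision
    # store_param_remainders=True,       # newer Megatron-core only: keep bf16 params plus
                                         # 16-bit remainders instead of fp32 main params;
                                         # main params are reconstructed in fp32 on use,
                                         # so the change is lossless (saves 2 bytes/param)
)
```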

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

ashvinnihalani • Nov 21 '25 04:11

@vermouth1992 and @ISEEKYAN, could you take a look? I can't seem to add you as required reviewers.

ashvinnihalani • Nov 21 '25 06:11

(Screenshot attached: 2025-11-20, 10:50 PM)

ashvinnihalani • Nov 21 '25 06:11

We see around a 3% memory utilization reduction (around 4 GB per GPU) on the examples/grpo_trainer/run_qwen2-7b_sgl_megatron.sh example. It reduces the optimizer state in BF16 training from 16 bytes per parameter down to 14 bytes per parameter, which aligns with the theoretical savings (14 GB per DP replica, 28 GB total, ~3.5 GB per GPU).
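As a back-of-the-envelope check of the arithmetic above (a sketch: the ~7B parameter count, DP size of 2, and 8 GPUs are inferred from the numbers quoted in this comment, not read from the run config):

```python
# Rough sanity check of the reported savings; all inputs are illustrative.
params = 7e9        # ~7B parameters (Qwen2-7B actor)
bytes_saved = 2     # fp32 main params (4 B/param) -> bf16 + remainder scheme (2 B/param)
dp_replicas = 2     # inferred from "28 GB total" vs "14 GB per DP replica"
num_gpus = 8        # inferred from "~3.5 GB per GPU"

per_replica_gb = params * bytes_saved / 1e9   # ~14 GB per DP replica
total_gb = per_replica_gb * dp_replicas       # ~28 GB total
per_gpu_gb = total_gb / num_gpus              # ~3.5 GB per GPU
print(per_replica_gb, total_gb, per_gpu_gb)   # 14.0 28.0 3.5
```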

ashvinnihalani • Nov 21 '25 07:11

Hmm, this doesn't seem to be working correctly:

  1. actor/ppo_kl is always zero; it looks like the actor update isn't working, as the val loss also remains constant
  2. Toggling store_param_remainders off doesn't seem to help. Only by disabling PAO completely does it return to normal operation
  3. However, in #4086 PAO seems to work for fp16. There, though, the kl coeff seems to be turned off and the optimizer is offloaded.

ashvinnihalani • Nov 21 '25 08:11

> Hmm, this doesn't seem to be working correctly:
>
>   1. actor/ppo_kl is always zero; it looks like the actor update isn't working, as the val loss also remains constant
>   2. Toggling store_param_remainders off doesn't seem to help. Only by disabling PAO completely does it return to normal operation
>   3. However, in [megatron] feat: fp16 training (dense and MoE supported) #4086 PAO seems to work for fp16. There, though, the kl coeff seems to be turned off and the optimizer is offloaded.

Actually, in many examples PAO is already adopted as a required option for the CPU optimizer, e.g. https://github.com/volcengine/verl/blob/fc7df6f7f99bad09b394463c75bb64ef6a21191b/examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh#L116-L118, which is documented in MCore: https://github.com/NVIDIA/Megatron-LM/blob/e35495d8bed8399ad083af06239f59ac465de7af/megatron/core/optimizer/cpu_offloading/README.md?plain=1#L8

I think the issue of ppo_kl always being zero is not related to PAO.

ISEEKYAN • Nov 22 '25 06:11

> Hmm, this doesn't seem to be working correctly:
>
>   1. actor/ppo_kl is always zero; it looks like the actor update isn't working, as the val loss also remains constant
>   2. Toggling store_param_remainders off doesn't seem to help. Only by disabling PAO completely does it return to normal operation
>   3. However, in [megatron] feat: fp16 training (dense and MoE supported) #4086 PAO seems to work for fp16. There, though, the kl coeff seems to be turned off and the optimizer is offloaded.

As far as I know, ppo_kl=0 is a good signal! It means there is no difference between old_log_probs and log_probs, i.e. the training is fully on-policy. And I'd guess the pg clip fraction must be zero as well?
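For context on why this is expected: ppo_kl is an approximate KL between the policy that generated the rollouts (old_log_probs) and the current policy (log_probs), roughly a masked mean of their difference. A simplified sketch of such a metric follows; the function name and masking details are illustrative, not verl's exact implementation.

```python
import torch

def approx_ppo_kl(old_log_prob: torch.Tensor, log_prob: torch.Tensor,
                  response_mask: torch.Tensor) -> torch.Tensor:
    """Masked mean of (old_log_prob - log_prob) over response tokens."""
    kl = (old_log_prob - log_prob) * response_mask
    return kl.sum() / response_mask.sum()

# On a fully on-policy update, log_prob equals old_log_prob token-for-token,
# so this metric is exactly zero; the PPO ratio is then 1 everywhere, which
# also makes the clip fraction zero.
```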

Yangruipis • Nov 22 '25 09:11

@ISEEKYAN Did this get reverted because of the KL issue? Just wondering if I should reopen it or open another issue.

ashvinnihalani • Nov 24 '25 19:11