
[megatron] fix: BF16 mode should use PAO as well

Open • ashvinnihalani opened this pull request 1 month ago • 4 comments

What does this PR do?

BF16 training should use PAO (the precision-aware optimizer) as well, to reduce the memory usage of the trainer nodes. This is a lossless operation since, by default, the master params are reconstructed in fp32. https://github.com/NVIDIA/Megatron-LM/blob/0634924d1724d52536129128cd85d28b92baf72e/megatron/core/optimizer/optimizer_config.py#L63

Checklist Before Starting

  • [X] Search for similar PRs. Paste at least one query link here: https://github.com/volcengine/verl/pulls?q=is%3Apr+is%3Aopen+optimizer
  • [X] Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this
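As a rough illustration only (not the literal diff of this PR), the change amounts to also turning on Megatron-core's precision-aware optimizer when the optimizer config is built for BF16. Below is a minimal sketch assuming the `OptimizerConfig` fields `bf16`, `use_distributed_optimizer`, `use_precision_aware_optimizer`, and (in newer Megatron-core releases) `store_param_remainders`; the concrete values are placeholders.

```python
# Sketch only: enabling Megatron-core's precision-aware optimizer (PAO)
# alongside BF16 training. Field names follow the OptimizerConfig linked
# above (megatron/core/optimizer/optimizer_config.py); availability of
# individual fields depends on the Megatron-LM version in use.
from megatron.core.optimizer import OptimizerConfig

optim_config = OptimizerConfig(
    optimizer="adam",
    lr=1e-6,
    bf16=True,                           # BF16 mixed-precision training
    use_distributed_optimizer=True,      # PAO is used together with the distributed optimizer
    use_precision_aware_optimizer=True,  # keep optimizer states in lower precision
    # store_param_remainders=True,       # newer Megatron-core only: keep bf16 params plus
                                         # 16-bit remainders instead of fp32 main params;
                                         # main params are reconstructed in fp32 on use,
                                         # so the change is lossless (saves 2 bytes/param)
)
```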

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

ashvinnihalani • Nov 21 '25 04:11

@vermouth1992 and @ISEEKYAN, could you take a look? I can't seem to add you as required reviewers.

ashvinnihalani • Nov 21 '25 06:11

(Screenshot attached: 2025-11-20, 10:50 PM)

ashvinnihalani • Nov 21 '25 06:11

We see around a 3% memory utilization reduction (around 4 GB per GPU) on the examples/grpo_trainer/run_qwen2-7b_sgl_megatron.sh example. It reduces the optimizer state in BF16 training from 16 bytes per parameter down to 14 bytes per parameter, which aligns with the theoretical savings (14 GB per DP replica, 28 GB total, ~3.5 GB per GPU).
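As a back-of-the-envelope check of the arithmetic above (a sketch: the ~7B parameter count, DP size of 2, and 8 GPUs are inferred from the numbers quoted in this comment, not read from the run config):

```python
# Rough sanity check of the reported savings; all inputs are illustrative.
params = 7e9        # ~7B parameters (Qwen2-7B actor)
bytes_saved = 2     # fp32 main params (4 B/param) -> bf16 + remainder scheme (2 B/param)
dp_replicas = 2     # inferred from "28 GB total" vs "14 GB per DP replica"
num_gpus = 8        # inferred from "~3.5 GB per GPU"

per_replica_gb = params * bytes_saved / 1e9   # ~14 GB per DP replica
total_gb = per_replica_gb * dp_replicas       # ~28 GB total
per_gpu_gb = total_gb / num_gpus              # ~3.5 GB per GPU
print(per_replica_gb, total_gb, per_gpu_gb)   # 14.0 28.0 3.5
```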

ashvinnihalani • Nov 21 '25 07:11

Hmm, this doesn't seem to be working correctly:

  1. actor/ppo_kl is always zero; it looks like the actor update isn't working, as the val loss also remains constant
  2. Toggling store_param_remainders off doesn't seem to help. Only by disabling PAO completely does it return to normal operation
  3. However, in #4086 PAO seems to work for fp16. There, though, the kl coeff seems to be turned off and the optimizer is offloaded.

ashvinnihalani • Nov 21 '25 08:11

> Hmm, this doesn't seem to be working correctly:
>
>   1. actor/ppo_kl is always zero; it looks like the actor update isn't working, as the val loss also remains constant
>   2. Toggling store_param_remainders off doesn't seem to help. Only by disabling PAO completely does it return to normal operation
>   3. However, in [megatron] feat: fp16 training (dense and MoE supported) #4086 PAO seems to work for fp16. There, though, the kl coeff seems to be turned off and the optimizer is offloaded.

Actually, in many examples PAO is already adopted as a required option for the CPU optimizer, e.g. https://github.com/volcengine/verl/blob/fc7df6f7f99bad09b394463c75bb64ef6a21191b/examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh#L116-L118, which is documented in MCore: https://github.com/NVIDIA/Megatron-LM/blob/e35495d8bed8399ad083af06239f59ac465de7af/megatron/core/optimizer/cpu_offloading/README.md?plain=1#L8

I think the issue of ppo_kl always being zero is not related to PAO.

ISEEKYAN • Nov 22 '25 06:11

> Hmm, this doesn't seem to be working correctly:
>
>   1. actor/ppo_kl is always zero; it looks like the actor update isn't working, as the val loss also remains constant
>   2. Toggling store_param_remainders off doesn't seem to help. Only by disabling PAO completely does it return to normal operation
>   3. However, in [megatron] feat: fp16 training (dense and MoE supported) #4086 PAO seems to work for fp16. There, though, the kl coeff seems to be turned off and the optimizer is offloaded.

As far as I know, ppo_kl=0 is a good signal! It means there is no difference between old_log_probs and log_probs, i.e. the training is fully on-policy. And I'd guess the pg clip fraction must be zero as well?
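For context on why this is expected: ppo_kl is an approximate KL between the policy that generated the rollouts (old_log_probs) and the current policy (log_probs), roughly a masked mean of their difference. A simplified sketch of such a metric follows; the function name and masking details are illustrative, not verl's exact implementation.

```python
import torch

def approx_ppo_kl(old_log_prob: torch.Tensor, log_prob: torch.Tensor,
                  response_mask: torch.Tensor) -> torch.Tensor:
    """Masked mean of (old_log_prob - log_prob) over response tokens."""
    kl = (old_log_prob - log_prob) * response_mask
    return kl.sum() / response_mask.sum()

# On a fully on-policy update, log_prob equals old_log_prob token-for-token,
# so this metric is exactly zero; the PPO ratio is then 1 everywhere, which
# also makes the clip fraction zero.
```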

Yangruipis • Nov 22 '25 09:11

@ISEEKYAN Did this get reverted because of the KL issue? Just wondering if I should reopen it or open another issue.

ashvinnihalani • Nov 24 '25 19:11