
Async checkpoint saving in GRPO loop

Open guyueh1 opened this issue 1 month ago • 1 comment

Is your feature request related to a problem? Please describe. Currently, async_save is disabled in the mcore checkpoint path, so checkpoint serialization takes a long time while training is paused. We should test async_save and resolve any issues that come up.
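
To make the problem concrete, here is a minimal, generic sketch of overlapping checkpoint I/O with training. It is not the mcore or NeMo RL implementation: the AsyncCheckpointer class, its methods, and the pickle-based writer are all hypothetical names chosen for illustration. The idea it shows is that the caller only pays for a state snapshot, the write happens on a background thread, and any pending write is finalized before the next save starts.

```python
import copy
import os
import pickle
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Optional


def _write_checkpoint(state: dict, path: str) -> str:
    # Write to a temporary file and rename it afterwards, so a crash
    # never leaves a partially written checkpoint at `path`.
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)
    return path


class AsyncCheckpointer:
    """Hypothetical helper: one background writer, at most one pending save."""

    def __init__(self) -> None:
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None

    def finalize(self) -> None:
        # Block until the previous save has finished; re-raises its errors.
        if self._pending is not None:
            self._pending.result()
            self._pending = None

    def save(self, state: dict, path: str) -> None:
        # Finalize the previous save first so writes never overlap.
        self.finalize()
        # Snapshot the state so later training updates cannot leak
        # into the checkpoint that is still being written.
        snapshot = copy.deepcopy(state)
        self._pending = self._executor.submit(_write_checkpoint, snapshot, path)

    def close(self) -> None:
        self.finalize()
        self._executor.shutdown(wait=True)
```

In a training loop, save() would be called at each checkpoint step and close() once at the end. Training continues while the write is in flight, at the cost of holding one extra copy of the state in memory.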


guyueh1 Nov 04 '25 19:11

A summary of the changes CodeRabbit can apply:

  • Implement opt-in async checkpoint saving: in nemo_rl/models/policy/megatron_policy_worker.py, replace the hardcoded async_save=False with async_save=self.cfg.get("async_checkpoint_save", False), and add safety logging and in-code documentation. Also add an example config (examples/configs/experimental/async_checkpoint_test.yaml), detailed user and testing documentation (docs/guides/async-checkpoint-saving.md), and unit tests (tests/unit/test_async_checkpoint_config.py) that validate the default, off, and on behavior and guide a safe rollout.

  • Add experimental async checkpoint saving: add a new guide (docs/guides/async-checkpoint-saving.md) and a new example config (examples/configs/experimental/async_checkpoint_test.yaml); make async_save configurable in nemo_rl/models/policy/megatron_policy_worker.py by setting it from cfg and adding finalization calls before saves; and add new unit tests (tests/unit/test_async_checkpoint_config.py) that validate the async_checkpoint_save flag and guard against regressions. The goal is to overlap checkpoint I/O with training and reduce pause times, which requires careful finalization and testing.

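
The opt-in behavior described in the summary could be checked roughly as follows. This is a sketch, not the proposed test file: the key name async_checkpoint_save comes from the summary above, while resolve_async_save is a hypothetical helper standing in for the cfg.get call inside the worker, and cfg is assumed to behave like a plain mapping.

```python
from typing import Any, Mapping


def resolve_async_save(cfg: Mapping[str, Any]) -> bool:
    """Hypothetical helper: read the opt-in flag, defaulting to synchronous saves."""
    return bool(cfg.get("async_checkpoint_save", False))


def test_async_checkpoint_flag() -> None:
    # Default: key absent, so async saving stays off.
    assert resolve_async_save({}) is False
    # Explicitly off.
    assert resolve_async_save({"async_checkpoint_save": False}) is False
    # Explicitly on.
    assert resolve_async_save({"async_checkpoint_save": True}) is True
```

Defaulting to False keeps existing configs on the synchronous path, so the feature only takes effect when a user sets the flag.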

coderabbitai[bot] Nov 04 '25 19:11