Async checkpoint saving in GRPO loop

Open • guyueh1 opened this issue 1 month ago • 1 comment

Is your feature request related to a problem? Please describe. Currently, async_save is disabled in the mcore checkpoint path, so checkpoint serialization takes a long time while training is paused. We should test async_save and resolve any issues that come up.

Nov 04 '25 19:11 guyueh1

A summary of the changes CodeRabbit can apply:

Implement opt-in async checkpoint saving: in nemo_rl/models/policy/megatron_policy_worker.py, replace the hardcoded async_save=False with async_save=self.cfg.get("async_checkpoint_save", False), and add safety logging and in-code documentation. Also add an example config (examples/configs/experimental/async_checkpoint_test.yaml), detailed user and testing documentation (docs/guides/async-checkpoint-saving.md), and unit tests (tests/unit/test_async_checkpoint_config.py) that validate the default, off, and on behavior and guide a safe rollout.
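The opt-in lookup described above can be sketched as follows. This is only an illustration under assumptions: cfg is treated as a plain dict-like mapping, and resolve_async_save is a hypothetical helper name, not something taken from NeMo RL.

```python
from typing import Any, Mapping


def resolve_async_save(cfg: Mapping[str, Any]) -> bool:
    """Hypothetical helper: read the opt-in flag, defaulting to synchronous saves."""
    # A missing key means "off", so existing configs keep their current behavior.
    return bool(cfg.get("async_checkpoint_save", False))


# The three cases the proposed unit tests are meant to cover.
assert resolve_async_save({}) is False                                # default
assert resolve_async_save({"async_checkpoint_save": False}) is False  # explicit off
assert resolve_async_save({"async_checkpoint_save": True}) is True    # explicit on
```

Defaulting to False keeps the change opt-in: nothing changes for users who never set the key.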

Add experimental async checkpoint saving: add a new guide (docs/guides/async-checkpoint-saving.md) and a new example config (examples/configs/experimental/async_checkpoint_test.yaml); make async_save configurable in nemo_rl/models/policy/megatron_policy_worker.py by reading it from cfg and adding finalization calls before saves; and add new unit tests (tests/unit/test_async_checkpoint_config.py) that validate the async_checkpoint_save flag and guard against regressions. The goal is to overlap checkpoint I/O with training to reduce pause times, which requires careful finalization and testing.
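To illustrate why a finalization call before each save matters, here is a toy sketch that uses only the Python standard library. It is not the Megatron or NeMo RL implementation; ToyAsyncCheckpointer and its methods are hypothetical names, and a background thread writing JSON stands in for the real distributed checkpoint writer.

```python
import json
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor
from pathlib import Path
from typing import Optional


class ToyAsyncCheckpointer:
    """Hypothetical stand-in for a checkpoint writer with an async_save switch."""

    def __init__(self, async_save: bool = False) -> None:
        self.async_save = async_save
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None

    def finalize(self) -> None:
        # Block until the previous background write (if any) has finished.
        if self._pending is not None:
            self._pending.result()  # re-raises any error from the writer thread
            self._pending = None

    def save(self, state: dict, path: Path) -> None:
        self.finalize()  # never start a new save while one is still in flight
        snapshot = json.dumps(state)  # copy the state before training mutates it
        if self.async_save:
            # The write overlaps with whatever the caller does next.
            self._pending = self._executor.submit(path.write_text, snapshot)
        else:
            path.write_text(snapshot)  # synchronous: the caller waits for I/O

    def close(self) -> None:
        self.finalize()
        self._executor.shutdown()


with tempfile.TemporaryDirectory() as tmp:
    ckpt = ToyAsyncCheckpointer(async_save=True)
    for step in range(3):
        # In a real loop, training for the next step would run after save() returns.
        ckpt.save({"step": step}, Path(tmp) / f"step_{step}.json")
    ckpt.close()  # final finalization before the process exits
    saved = sorted(p.name for p in Path(tmp).iterdir())
    last = json.loads((Path(tmp) / "step_2.json").read_text())
```

The pattern to note is that both save() and close() call finalize(), so a pending write is never silently dropped or overlapped by the next one.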

Nov 04 '25 19:11 coderabbitai[bot]