Async checkpoint saving in GRPO loop
Is your feature request related to a problem? Please describe. Currently, async_save is disabled in the mcore checkpoint path, so serialization takes a long time while training is paused. We should test async_save and resolve any issues that come up.
A summary of the changes CodeRabbit can apply:
Implement opt-in async checkpoint saving by changing nemo_rl/models/policy/megatron_policy_worker.py to replace the hardcoded async_save=False with async_save=self.cfg.get("async_checkpoint_save", False), with safety logging and in-code documentation. Also add an example config (examples/configs/experimental/async_checkpoint_test.yaml), detailed user and testing documentation (docs/guides/async-checkpoint-saving.md), and unit tests (tests/unit/test_async_checkpoint_config.py) that validate the default, off, and on behavior and guide a safe rollout.
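The opt-in lookup described above can be sketched in isolation. This is a minimal illustration, assuming the worker config behaves like a plain mapping; the helper name resolve_async_save is hypothetical and is not part of nemo_rl.

```python
from typing import Any, Mapping


def resolve_async_save(cfg: Mapping[str, Any]) -> bool:
    """Return the async-save setting from a config mapping.

    Hypothetical helper for illustration only. It mirrors the proposed
    expression cfg.get("async_checkpoint_save", False), so a missing key
    keeps the current synchronous behavior.
    """
    return bool(cfg.get("async_checkpoint_save", False))


# Default (key absent): async saving stays off.
default_setting = resolve_async_save({})
# Explicit opt-in.
enabled_setting = resolve_async_save({"async_checkpoint_save": True})
# Explicit opt-out.
disabled_setting = resolve_async_save({"async_checkpoint_save": False})
```

Defaulting to False means existing configs behave exactly as before, and only runs that set the flag exercise the async path.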
Add experimental async checkpoint saving. Introduce a new guide (docs/guides/async-checkpoint-saving.md) and a new example config (examples/configs/experimental/async_checkpoint_test.yaml). Make async_save configurable in nemo_rl/models/policy/megatron_policy_worker.py by setting async_save from cfg and adding finalization calls before saves. Add new unit tests (tests/unit/test_async_checkpoint_config.py) to validate the async_checkpoint_save flag and guard against regressions. The goal is to overlap checkpoint I/O with training to reduce pause times, which requires careful finalization and testing.
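The "finalize before the next save" requirement can be illustrated with a generic sketch. It uses Python's standard concurrent.futures rather than the Megatron-Core checkpointing API, and the AsyncSaver class and its method names are assumptions made for illustration, not the actual implementation.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Optional


class AsyncSaver:
    """Illustrative stand-in: run one checkpoint write in the background."""

    def __init__(self) -> None:
        # A single worker keeps writes ordered.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None

    def finalize(self) -> None:
        """Block until the in-flight write (if any) finishes.

        Future.result() re-raises any exception from the write, so a
        failed save surfaces here instead of being silently dropped.
        """
        if self._pending is not None:
            self._pending.result()
            self._pending = None

    def save(self, write_fn: Callable[..., Any], *args: Any) -> None:
        """Finalize the previous write, then start a new one without blocking."""
        self.finalize()
        self._pending = self._pool.submit(write_fn, *args)

    def close(self) -> None:
        """Finalize the last write and release the worker thread."""
        self.finalize()
        self._pool.shutdown()


# Usage: each "write" here just records the step number.
written: list = []
saver = AsyncSaver()
saver.save(written.append, 1)  # training could continue while this runs
saver.save(written.append, 2)  # waits for step 1 before starting step 2
saver.close()                  # final save is finalized before exit
```

Finalizing at the start of each save and again at shutdown covers the two cases where an unfinished write would otherwise be lost or overlap with the next one.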