[DCP] Add DefaultStager example to distributed async checkpoint recipe
Fixes #3710
Description
This PR updates the distributed_async_checkpoint_recipe to include the DefaultStager functionality introduced in PyTorch 2.9.
Motivation:
In large-scale training, even with the standard `async_save`, the initial memory copy (the staging phase, GPU -> CPU) still happens on the main thread and blocks the training loop. This PR adds coverage of `DefaultStager`, which offloads this copy to a background thread, enabling full computation-communication overlap.
Key Changes:
- New Section: Added "Fully Asynchronous Staging with DefaultStager" to the recipe.
- Version Note: Added `.. versionadded:: 2.9` to indicate the version requirement.
- Advanced Example: Provided a code example demonstrating how to overlap the D2H copy with the entire forward and backward pass (see the sketch after this list):
  - Check `staging_completion` after backward but before `optimizer.step()` to ensure data consistency while maximizing parallel execution.
  - Check `upload_completion` before the next save to manage memory backpressure.
- Authors: Added myself to the author list.
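For reference, here is a minimal sketch of the overlap pattern described above. It assumes the PyTorch 2.9 API surface referenced in this PR (a `DefaultStager` passed to `dcp.async_save` via `async_stager`, and a save response exposing `staging_completion` / `upload_completion` futures); the `model`, `optimizer`, `dataloader`, and `save_every` names are placeholders, and the import path and exact signatures may differ from the final recipe code.

```python
# Hedged sketch only: the import path, async_save signature, and the response
# fields (staging_completion / upload_completion) follow the recipe text and
# my reading of the PyTorch 2.9 DCP API; verify against the shipped recipe.
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.staging import DefaultStager  # assumed import path

stager = DefaultStager()  # owns the background thread that performs the GPU -> CPU copy
pending = None            # response from the previous async_save, if any

for step, batch in enumerate(dataloader):   # model/optimizer/dataloader are placeholders
    loss = model(batch).sum()
    loss.backward()                          # overlaps with the background D2H staging copy

    if pending is not None:
        # Data consistency: the staged copy of the parameters must finish
        # before optimizer.step() mutates them in place.
        pending.staging_completion.result()

    optimizer.step()
    optimizer.zero_grad()

    if step % save_every == 0:
        if pending is not None:
            # Backpressure: wait for the previous upload before issuing a new save.
            pending.upload_completion.result()
        pending = dcp.async_save(
            {"model": model.state_dict(), "optim": optimizer.state_dict()},
            checkpoint_id=f"checkpoint_step_{step}",  # placeholder path
            async_stager=stager,
        )
```

The two futures separate the fast local staging copy from the slower storage upload, which is what allows the training loop to keep running while the upload proceeds in the background.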
Checklist
- [x] The issue that is being fixed is referenced in the description (see above "Fixes #ISSUE_NUMBER")
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included in this pull request.
cc @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn @ekr0 @haochengsong @Saiteja64