[DCP] Add DefaultStager example to distributed async checkpoint recipe
Fixes #3710
Description
This PR updates the distributed_async_checkpoint_recipe to include the DefaultStager functionality introduced in PyTorch 2.9.
Motivation:
In large-scale training, even with the standard `async_save`, the initial memory copy (the staging phase, GPU -> CPU) still happens on the main thread and blocks the training loop. This PR adds coverage of `DefaultStager`, which offloads this copy to a background thread, enabling full computation-communication overlap.
Key Changes:
- New Section: Added "Fully Asynchronous Staging with DefaultStager" to the recipe.
- Version Note: Added `.. versionadded:: 2.9` to indicate the version requirement.
- Advanced Example: Provided a code example demonstrating how to overlap the D2H copy with the entire forward and backward pass (see the sketch after this list):
  - Check `staging_completion` after backward but before `optimizer.step()` to ensure data consistency while maximizing parallel execution.
  - Check `upload_completion` before the next save to manage memory backpressure.
- Authors: Added myself to the author list.
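For reference, here is a minimal sketch of the overlap pattern described above. It assumes the PyTorch 2.9 API surface referenced in this PR (a `DefaultStager` passed to `dcp.async_save` via `async_stager`, and a save response exposing `staging_completion` / `upload_completion` futures); the `model`, `optimizer`, `dataloader`, and `save_every` names are placeholders, and the import path and exact signatures may differ from the final recipe code.

```python
# Hedged sketch only: the import path, async_save signature, and the response
# fields (staging_completion / upload_completion) follow the recipe text and
# my reading of the PyTorch 2.9 DCP API; verify against the shipped recipe.
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.staging import DefaultStager  # assumed import path

stager = DefaultStager()  # owns the background thread that performs the GPU -> CPU copy
pending = None            # response from the previous async_save, if any

for step, batch in enumerate(dataloader):   # model/optimizer/dataloader are placeholders
    loss = model(batch).sum()
    loss.backward()                          # overlaps with the background D2H staging copy

    if pending is not None:
        # Data consistency: the staged copy of the parameters must finish
        # before optimizer.step() mutates them in place.
        pending.staging_completion.result()

    optimizer.step()
    optimizer.zero_grad()

    if step % save_every == 0:
        if pending is not None:
            # Backpressure: wait for the previous upload before issuing a new save.
            pending.upload_completion.result()
        pending = dcp.async_save(
            {"model": model.state_dict(), "optim": optimizer.state_dict()},
            checkpoint_id=f"checkpoint_step_{step}",  # placeholder path
            async_stager=stager,
        )
```

The two futures separate the fast local staging copy from the slower storage upload, which is what allows the training loop to keep running while the upload proceeds in the background.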
Checklist
- [x] The issue that is being fixed is referenced in the description (see above "Fixes #ISSUE_NUMBER")
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included in this pull request.
cc @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn @ekr0 @haochengsong @Saiteja64