
[DCP] Add DefaultStager example to distributed async checkpoint recipe

Open · niyunsheng opened this pull request 3 weeks ago · 7 comments

Fixes #3710

Description

This PR updates the distributed_async_checkpoint_recipe to include the DefaultStager functionality introduced in PyTorch 2.9.

Motivation: In large-scale training, even with the standard async_save, the initial memory copy (the staging phase, GPU -> CPU) runs on the main thread and blocks the training loop. This PR introduces DefaultStager, which offloads that copy to a background thread, enabling full computation-communication overlap.
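To make the motivation concrete, here is a minimal stdlib-only sketch (no real DCP calls; BackgroundStager and stage are hypothetical stand-ins, not the torch.distributed.checkpoint API) of the idea behind DefaultStager: the staging copy is submitted to a worker thread, so the main thread returns immediately and training can continue while the copy runs.

```python
# Stand-in sketch (stdlib only): why moving the staging copy off the main
# thread unblocks the training loop. In real DCP the copy is GPU -> CPU;
# here we simulate it with a deep copy of a state dict.
import copy
from concurrent.futures import Future, ThreadPoolExecutor


def stage(state_dict: dict) -> dict:
    """Simulate the staging copy: take a snapshot of the state."""
    return copy.deepcopy(state_dict)


class BackgroundStager:
    """Offload the staging copy to a worker thread (the DefaultStager idea)."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=1)

    def stage_async(self, state_dict: dict) -> Future:
        # Returns immediately; the copy runs on the background thread.
        return self._pool.submit(stage, state_dict)

    def close(self) -> None:
        self._pool.shutdown(wait=True)


stager = BackgroundStager()
state = {"step": 0, "weights": [0.0, 0.0]}

fut = stager.stage_async(state)              # main thread is NOT blocked here
loss = sum(w * w for w in state["weights"])  # read-only work can overlap the copy
snapshot = fut.result()                      # wait for staging to finish...
state["step"] += 1                           # ...before mutating the state
stager.close()
```

The key point mirrored from the PR: work that only reads the parameters may overlap staging, but any mutation must wait until the staging copy has completed.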

Key Changes:

  1. New Section: Added "Fully Asynchronous Staging with DefaultStager" to the recipe.
  2. Version Note: Added .. versionadded:: 2.9 to indicate version requirements.
  3. Advanced Example: Provided a code example demonstrating how to overlap the D2H copy with the entire forward and backward pass:
    • Check staging_completion after backward but before optimizer.step() to ensure data consistency while maximizing parallel execution.
    • Check upload_completion before the next save to manage memory backpressure.
  4. Authors: Added myself to the author list.
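The two checks described in the advanced example can be sketched as follows. This is a hypothetical stdlib-only stand-in: FakeCheckpointer, save_async, and SaveResponse are invented for illustration, though the staging_completion and upload_completion names mirror those used in the recipe.

```python
# Stand-in sketch of the overlap pattern: wait on staging_completion before
# optimizer.step(), and on upload_completion before starting the next save.
import copy
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass
class SaveResponse:
    staging_completion: Future  # copy finished; safe to mutate parameters
    upload_completion: Future   # bytes persisted; safe to start the next save


class FakeCheckpointer:
    """Simulated async checkpointer: stage in one thread, upload in another."""

    def __init__(self) -> None:
        self._stage_pool = ThreadPoolExecutor(max_workers=1)
        self._upload_pool = ThreadPoolExecutor(max_workers=1)
        self.store: list = []  # stands in for persistent storage

    def save_async(self, state: dict) -> SaveResponse:
        staging = self._stage_pool.submit(copy.deepcopy, state)
        upload = self._upload_pool.submit(lambda: self.store.append(staging.result()))
        return SaveResponse(staging, upload)


ckpt = FakeCheckpointer()
state = {"step": 0}
prev: Optional[SaveResponse] = None

for step in range(3):
    # forward + backward would run here (read-only w.r.t. staged state)
    if prev is not None:
        prev.staging_completion.result()  # after backward, before optimizer.step()
    state["step"] = step + 1              # optimizer.step() analog: mutates params
    if prev is not None:
        prev.upload_completion.result()   # backpressure: one in-flight save at a time
    prev = ckpt.save_async(state)

prev.upload_completion.result()           # drain the last save before exiting
```

Waiting on the previous save's staging_completion only just before optimizer.step() lets the whole forward and backward pass overlap the copy; waiting on upload_completion before the next save bounds memory held by in-flight checkpoints.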

Checklist

  • [x] The issue that is being fixed is referenced in the description (see above "Fixes #ISSUE_NUMBER")
  • [x] Only one issue is addressed in this pull request
  • [x] Labels from the issue that this PR is fixing are added to this pull request
  • [x] No unnecessary issues are included in this pull request.

cc @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn @ekr0 @haochengsong @Saiteja64

niyunsheng · Dec 29 '25 13:12