
Fix async checkpoint timing in DCP recipe

Open • patrocinio opened this issue 1 month ago • 1 comment

Move checkpoint_future.result() before optimizer.step() so that the previous checkpoint completes before the weights are modified in place. This lets checkpointing overlap better with the forward and backward passes.
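
The ordering described above can be sketched with the standard library alone. This is a minimal illustration, not the recipe's code: `fake_async_save` and `train` are hypothetical names, a `ThreadPoolExecutor` stands in for the asynchronous checkpoint call that returns `checkpoint_future`, and a plain list of integers stands in for the model weights.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Optional, Tuple


def fake_async_save(
    executor: ThreadPoolExecutor, weights: List[int], sink: List[List[int]]
) -> Future:
    """Hypothetical stand-in for an async checkpoint call.

    The background task reads the live `weights` list, so it must finish
    before the caller mutates that list again.
    """
    def _write() -> None:
        sink.append(list(weights))

    return executor.submit(_write)


def train(num_steps: int) -> Tuple[List[int], List[List[int]]]:
    weights = [0, 0]               # toy "parameters"
    saved: List[List[int]] = []    # toy "checkpoint storage"
    checkpoint_future: Optional[Future] = None

    with ThreadPoolExecutor(max_workers=1) as executor:
        for _ in range(num_steps):
            # Forward/backward: computes gradients without touching the
            # weights, so it can run while the previous save is in flight.
            grads = [1 for _ in weights]

            # Wait for the previous checkpoint right before the in-place update.
            if checkpoint_future is not None:
                checkpoint_future.result()

            # Stand-in for optimizer.step(): modifies the weights in place.
            for i, g in enumerate(grads):
                weights[i] -= g

            # Kick off the next checkpoint; it overlaps with the next iteration.
            checkpoint_future = fake_async_save(executor, weights, saved)

        # Make sure the final checkpoint has finished before returning.
        if checkpoint_future is not None:
            checkpoint_future.result()

    return weights, saved
```

Because each save is awaited before the next in-place update, every entry in `saved` is a consistent snapshot of the weights after one step. Whether the real checkpoint call copies state before returning is not covered by this sketch.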

Fixes #3584

Description

Checklist

  • [ ] The issue that is being fixed is referred to in the description (see above "Fixes #ISSUE_NUMBER")
  • [ ] Only one issue is addressed in this pull request
  • [ ] Labels from the issue that this PR is fixing are added to this pull request
  • [ ] No unnecessary issues are included in this pull request.

cc @wconstab @osalpekar @H-Huang @kwen2501

patrocinio • Dec 08 '25 23:12