Fix async checkpoint timing in DCP recipe
Move `checkpoint_future.result()` to just before `optimizer.step()` so that the previous async checkpoint has completed before the weights are modified in-place. Placing the wait there also allows better overlap of checkpointing with the forward/backward passes.
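A minimal sketch of the intended ordering, for reviewers. It does not use the recipe's code or the DCP API: `fake_async_save` and `train` are hypothetical stand-ins built on `concurrent.futures`, and the sketch assumes a checkpoint is started every step and that the background save reads the live weights.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_async_save(executor, weights, saved):
    """Hypothetical stand-in for an async checkpoint call.

    Reads the live `weights` list from a background thread and
    appends a snapshot to `saved`.
    """
    def _write():
        time.sleep(0.01)             # pretend serialization takes a while
        saved.append(list(weights))  # reads the live buffer
    return executor.submit(_write)


def train(num_steps=3):
    weights = [0, 0]
    saved = []
    checkpoint_future = None
    with ThreadPoolExecutor(max_workers=1) as executor:
        for _ in range(num_steps):
            # Forward/backward only read the weights, so they can
            # overlap with a checkpoint that is still in flight.
            grads = [1 for _ in weights]

            # Wait for the previous checkpoint before touching weights.
            if checkpoint_future is not None:
                checkpoint_future.result()

            # Stand-in for optimizer.step(): in-place update.
            for i, g in enumerate(grads):
                weights[i] -= g

            # Start the next checkpoint; it runs in the background.
            checkpoint_future = fake_async_save(executor, weights, saved)

        # Make sure the last checkpoint has finished before returning.
        if checkpoint_future is not None:
            checkpoint_future.result()
    return saved
```

Because the wait sits immediately before the in-place update, each snapshot in this sketch reflects the weights from exactly one step, while the gradient computation for the next step proceeds without blocking.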
Fixes #3584
Description
Checklist
- [ ] The issue that is being fixed is referenced in the description (see above "Fixes #ISSUE_NUMBER")
- [ ] Only one issue is addressed in this pull request
- [ ] Labels from the issue that this PR is fixing are added to this pull request
- [ ] No unnecessary issues are included in this pull request.
cc @wconstab @osalpekar @H-Huang @kwen2501