Add torchsnapshot checkpoint saver integration
Summary: Add a callback to TorchTNT which saves checkpoints using torchsnapshot. The callback relies on the app state mixin, which lets users declare which module/optimizer/etc. states they would like to save: https://github.com/pytorch/tnt/blob/151b2b7bf025353919c1e7d6a93b79abd778ba2d/torchtnt/runner/unit.py#L54-L66
By default, we also include the training progress state and, if applicable, the dataloader and evaluation progress states, depending on whether we are saving in the middle of a training epoch.
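To illustrate the app-state pattern described above, here is a minimal self-contained sketch. The class names (`AppStateMixin`, `FakeModule`, `MyUnit`) mirror the idea in the PR but the bodies are simplified stand-ins, not the actual TorchTNT implementation:

```python
class AppStateMixin:
    """Lets a unit declare which stateful objects should be checkpointed."""

    def app_state(self):
        # Simplified default: expose every attribute that has a state_dict().
        return {
            name: obj
            for name, obj in vars(self).items()
            if hasattr(obj, "state_dict")
        }


class FakeModule:
    """Stand-in for an nn.Module/optimizer with a state_dict() method."""

    def __init__(self, weight):
        self.weight = weight

    def state_dict(self):
        return {"weight": self.weight}


class MyUnit(AppStateMixin):
    def __init__(self):
        self.module = FakeModule(weight=1.0)
        self.note = "plain attribute without state_dict, so not saved"


unit = MyUnit()
# A saver callback would collect the declared states like this:
snapshot = {name: obj.state_dict() for name, obj in unit.app_state().items()}
print(snapshot)  # {'module': {'weight': 1.0}}
```

In the real callback, the collected state dict is handed to torchsnapshot for persistence rather than kept in memory.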
Future TODOs
- use async checkpointing
- add a utility function for restoring states
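The restore utility mentioned in the TODOs does not exist yet; as a hypothetical sketch of what it might look like, the helper below (`restore` and its signature are assumptions, not an API in this PR) loads a saved snapshot back into the declared stateful objects:

```python
def restore(snapshot, app_state):
    """Hypothetical helper: load saved states back into declared objects."""
    for name, obj in app_state.items():
        if name in snapshot:
            obj.load_state_dict(snapshot[name])


class FakeModule:
    """Stand-in for a stateful object with state_dict/load_state_dict."""

    def __init__(self, weight):
        self.weight = weight

    def state_dict(self):
        return {"weight": self.weight}

    def load_state_dict(self, sd):
        self.weight = sd["weight"]


module = FakeModule(weight=0.0)
restore({"module": {"weight": 3.0}}, {"module": module})
print(module.weight)  # 3.0
```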
Differential Revision: D39978072
This pull request was exported from Phabricator. Differential Revision: D39978072
Codecov Report
Merging #243 (f7c6bd6) into master (735067d) will increase coverage by 0.07%. The diff coverage is 91.91%.
```diff
@@            Coverage Diff             @@
##           master     #243      +/-   ##
==========================================
+ Coverage   88.93%   89.01%   +0.07%
==========================================
  Files          83       85       +2
  Lines        5126     5262     +136
==========================================
+ Hits         4559     4684     +125
- Misses        567      578      +11
```
Impacted Files | Coverage Δ |
---|---|
tests/runner/test_auto_unit.py | 68.18% <ø> (ø) |
torchtnt/runner/auto_unit.py | 86.23% <ø> (ø) |
torchtnt/runner/state.py | 98.88% <ø> (ø) |
torchtnt/runner/callbacks/torchsnapshot_saver.py | 86.90% <86.90%> (ø) |
tests/runner/callbacks/test_torchsnapshot_saver.py | 100.00% <100.00%> (ø) |
torchtnt/runner/callbacks/__init__.py | 100.00% <100.00%> (ø) |