
Add torchsnapshot checkpoint saver integration

Open ananthsub opened this issue 2 years ago • 2 comments

Summary: Add a callback to TorchTNT that saves checkpoints using torchsnapshot. This relies on the app state mixin defined here:

This means users can declare which module, optimizer, and other states they'd like to save: https://github.com/pytorch/tnt/blob/151b2b7bf025353919c1e7d6a93b79abd778ba2d/torchtnt/runner/unit.py#L54-L66

By default, we also include the training progress and, if applicable, the dataloader and evaluation progress states, depending on whether we're saving in the middle of a training epoch.
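To illustrate the idea, here is a minimal, dependency-free sketch of how a checkpointable app state could be assembled. It is not the code in this PR: `Stateful`, `Progress`, `AppStateSketch`, and `build_app_state` are hypothetical names, and the rule that the dataloader and evaluation progress are added only when training is mid-epoch is an assumption based on the description above.

```python
from typing import Any, Dict, Optional, Protocol, runtime_checkable


@runtime_checkable
class Stateful(Protocol):
    """Anything exposing state_dict/load_state_dict (modules, optimizers, ...)."""

    def state_dict(self) -> Dict[str, Any]: ...

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None: ...


class Progress:
    """Hypothetical counter standing in for training/evaluation progress."""

    def __init__(self) -> None:
        self.num_epochs_completed = 0
        self.num_steps_completed_in_epoch = 0

    def state_dict(self) -> Dict[str, Any]:
        return dict(vars(self))

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        vars(self).update(state_dict)


class AppStateSketch:
    """Hypothetical mixin: tracks every stateful attribute assigned on the unit."""

    def __setattr__(self, name: str, value: Any) -> None:
        tracked = self.__dict__.setdefault("_tracked", {})
        if isinstance(value, Stateful):
            tracked[name] = value
        else:
            # Reassigning a plain value drops any stale tracked entry.
            tracked.pop(name, None)
        super().__setattr__(name, value)

    def app_state(self) -> Dict[str, Any]:
        return dict(self.__dict__.get("_tracked", {}))


def build_app_state(
    unit: AppStateSketch,
    train_progress: Progress,
    eval_progress: Optional[Progress] = None,
    dataloader: Optional[Any] = None,
) -> Dict[str, Any]:
    """Collect the user-declared states plus progress (assumed policy)."""
    app_state = unit.app_state()
    app_state["train_progress"] = train_progress
    # Assumption: mid-epoch means some steps of the current epoch are done.
    if train_progress.num_steps_completed_in_epoch > 0:
        if isinstance(dataloader, Stateful):
            app_state["train_dataloader"] = dataloader
        if eval_progress is not None:
            app_state["eval_progress"] = eval_progress
    return app_state
```

A dictionary of this shape, mapping names to objects with `state_dict`/`load_state_dict`, is what torchsnapshot's `Snapshot.take(path=..., app_state=...)` expects as its `app_state` argument.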

Future TODOs

  • use async checkpointing
  • add a utility function for restoring states

Differential Revision: D39978072

ananthsub · Oct 19 '22 09:10

This pull request was exported from Phabricator. Differential Revision: D39978072

facebook-github-bot · Oct 19 '22 09:10

Codecov Report

Merging #243 (f7c6bd6) into master (735067d) will increase coverage by 0.07%. The diff coverage is 91.91%.

@@            Coverage Diff             @@
##           master     #243      +/-   ##
==========================================
+ Coverage   88.93%   89.01%   +0.07%     
==========================================
  Files          83       85       +2     
  Lines        5126     5262     +136     
==========================================
+ Hits         4559     4684     +125     
- Misses        567      578      +11     

| Impacted Files | Coverage Δ |
|---|---|
| tests/runner/test_auto_unit.py | 68.18% <ø> (ø) |
| torchtnt/runner/auto_unit.py | 86.23% <ø> (ø) |
| torchtnt/runner/state.py | 98.88% <ø> (ø) |
| torchtnt/runner/callbacks/torchsnapshot_saver.py | 86.90% <86.90%> (ø) |
| tests/runner/callbacks/test_torchsnapshot_saver.py | 100.00% <100.00%> (ø) |
| torchtnt/runner/callbacks/__init__.py | 100.00% <100.00%> (ø) |


codecov[bot] · Oct 19 '22 09:10

This pull request was exported from Phabricator. Differential Revision: D39978072

facebook-github-bot · Oct 21 '22 22:10