checkpoint migration
🚀 Feature
Add upgrade functions to utilities for internal use. Checkpoints get upgraded automatically (when possible) when a user loads the checkpoint via `Trainer(resume_from_checkpoint=...)` or `Model.load_from_checkpoint`.
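For illustration, a minimal sketch of the user-facing entry points at which the automatic upgrade would kick in (paths and the module name are placeholders):

```python
from pytorch_lightning import Trainer

from my_project.models import MyLightningModule  # hypothetical module

# Resuming training: the checkpoint would be migrated in memory before the
# Trainer restores its state from it.
trainer = Trainer(resume_from_checkpoint="path/to/old.ckpt")
trainer.fit(MyLightningModule())

# Loading weights directly: the same migration would run before the state
# dict is applied to the module.
model = MyLightningModule.load_from_checkpoint("path/to/old.ckpt")
```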
Motivation
Lightning changes over time with removals and additions, and this includes the contents and structure of checkpoints. When such changes happen, we bake the upgrade logic into the code base at the appropriate place, but the danger is that the information about why and when these changes were made gets lost over time.
Pitch
For each backward-compatibility (BC) change, we create an upgrade function that gets applied at the appropriate place:
```python
def upgrade_xyz_v1_2_0(checkpoint):
    # upgrades the checkpoint from a previous version to 1.2.0
    return checkpoint


def upgrade_abc_v1_3_8(checkpoint):
    # upgrades the checkpoint from a previous version to 1.3.8
    return checkpoint


def upgrade(checkpoint):
    checkpoint = upgrade_xyz_v1_2_0(checkpoint)
    checkpoint = upgrade_abc_v1_3_8(checkpoint)
    return checkpoint


# in Lightning:
ckpt = upgrade(pl_load(path))
```
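To keep each upgrade idempotent, the dispatcher could additionally consult the version recorded in the checkpoint and only apply the migrations that are newer than it. A minimal sketch, assuming the producing version is stored under a key such as `"pytorch-lightning_version"` (the registry and stamping logic below are illustrative, not Lightning's actual internals):

```python
from packaging.version import Version

import pytorch_lightning as pl

# Ordered registry mapping the version an upgrade targets to the function
# performing it (the functions are the ones from the pitch above).
_UPGRADES = [
    ("1.2.0", upgrade_xyz_v1_2_0),
    ("1.3.8", upgrade_abc_v1_3_8),
]


def upgrade(checkpoint: dict) -> dict:
    # Assumption: the checkpoint records the version that wrote it; fall back
    # to a very old version if the key is missing.
    written_with = Version(checkpoint.get("pytorch-lightning_version", "0.0.0"))
    for target, fn in _UPGRADES:
        if written_with < Version(target):
            checkpoint = fn(checkpoint)
    # Stamp the checkpoint with the version it has now been migrated to.
    checkpoint["pytorch-lightning_version"] = pl.__version__
    return checkpoint
```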
Benefits of this approach:
- each upgrade is documented individually
- central location for all upgrades; the order in which they are applied is fully transparent
- each upgrade can be unit tested individually (see the sketch below)!
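As a hypothetical example of such a unit test (the upgrade function, checkpoint keys, and values are invented purely for illustration):

```python
def upgrade_callbacks_key_v1_6_0(checkpoint):
    # Hypothetical upgrade: rename a legacy top-level key.
    if "callback_states" in checkpoint:
        checkpoint["callbacks"] = checkpoint.pop("callback_states")
    return checkpoint


def test_upgrade_callbacks_key_v1_6_0():
    old = {"callback_states": {"EarlyStopping": {"wait_count": 3}}}
    new = upgrade_callbacks_key_v1_6_0(old)
    assert "callback_states" not in new
    assert new["callbacks"] == {"EarlyStopping": {"wait_count": 3}}
```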
Alternatives
keep as is
Additional context
PRs that started this work:
- #9166: legacy load context manager to patch Lightning for unpickling
- #8558: upgrade functions
PRs that added checkpoint back-compatibility logic that can be avoided by this proposal:
- #11638
If you enjoy Lightning, check out our other projects! ⚡
- Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
- Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning.
- Bolts: Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch.
- Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.
cc @borda @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj
cc @kandluis @aazolini @yifuwang who were also curious if there's a serialization format we stick to for the state dict, or if the contents of the state dict are considered "internal state"
@akihironitta, we shall address it in the upcoming weeks:
- add testing for loading legacy checkpoints (we have been missing checkpoints from 1.4)
- if possible, add a script for automatic conversion (a sketch follows below)
In addition, we can add a docs page for upgrading: we have in-code hints on how to update from one version to the next (e.g. from 1.4 to 1.5 or from 1.5 to 1.6), but we are missing any larger steps, for example from 1.2 to 1.7.
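A minimal sketch of what such a conversion script could look like (the file name, the `upgrade` import, and the CLI are illustrative, not the actual Lightning utility):

```python
# upgrade_checkpoint.py -- illustrative standalone conversion script
import argparse
import shutil

import torch

from my_project.migration import upgrade  # hypothetical: the upgrade() from the pitch


def main():
    parser = argparse.ArgumentParser(description="Upgrade an old checkpoint in place.")
    parser.add_argument("file", help="path to the checkpoint to upgrade")
    args = parser.parse_args()

    # Keep a backup so the original checkpoint is never lost.
    shutil.copy(args.file, args.file + ".bak")
    checkpoint = torch.load(args.file, map_location="cpu")
    torch.save(upgrade(checkpoint), args.file)


if __name__ == "__main__":
    main()
```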
@Borda This issue/proposal is less about upgrading code or legacy testing, and more about a mechanism to load old checkpoints into a new, updated Lightning code base. Of course, retrospectively testing against old checkpoints can be done, but ideally we would like to have a migration mechanism built in.
Can we close this? Anything left?
This is done 🎉