
checkpoint migration


🚀 Feature

Add upgrade functions to the utilities for internal use. Checkpoints get upgraded automatically (when possible) when a user loads a checkpoint via Trainer(resume_from_checkpoint=...) or Model.load_from_checkpoint.

Motivation

Lightning changes over time with removals and additions, and this includes the contents and structure of checkpoints. When such changes happen, we bake the upgrade logic into the code base at the appropriate place, but the danger is that the information about why and when these changes were made gets lost over time.

Pitch

For each backward-compatibility-breaking change, we create an upgrade function that gets applied at the appropriate place.

def upgrade_xyz_v1_2_0(checkpoint):
    # upgrades the checkpoint from a previous version to 1.2.0
    return checkpoint

def upgrade_abc_v1_3_8(checkpoint):
    # upgrades the checkpoint from a previous version to 1.3.8
    return checkpoint


def upgrade(checkpoint):
    # applies all upgrade functions in order, oldest version first
    checkpoint = upgrade_xyz_v1_2_0(checkpoint)
    checkpoint = upgrade_abc_v1_3_8(checkpoint)
    return checkpoint

# in Lightning:
ckpt = upgrade(pl_load(path))
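
To make the upgrade automatic and applied only "when possible", the chain above could be gated on the version recorded inside the checkpoint. A minimal sketch, assuming the checkpoint dict carries a "pytorch-lightning_version" entry; the maybe_upgrade helper name is hypothetical:

from packaging.version import Version

def maybe_upgrade(checkpoint, current_version="1.6.0"):
    # version the checkpoint was written with; very old checkpoints may not carry it
    written_with = checkpoint.get("pytorch-lightning_version", "0.0.0")
    if Version(written_with) < Version(current_version):
        checkpoint = upgrade(checkpoint)  # run the full upgrade chain from above
        checkpoint["pytorch-lightning_version"] = current_version
    return checkpoint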

The benefits of this approach are:

  • each upgrade is documented individually
  • there is a central location for all upgrades; the order in which they are applied is fully transparent
  • each upgrade can be unit tested individually (see the sketch below)
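
As an illustration of the last point, a test could feed a hand-written "old" checkpoint dict through a single upgrade function and assert on the result. The key names below are invented for the example and do not correspond to a real Lightning change:

def upgrade_xyz_v1_2_0(checkpoint):
    # hypothetical example: rename a key that was changed in 1.2.0
    if "early_stop_callback_wait" in checkpoint:
        checkpoint["early_stopping_wait_epochs"] = checkpoint.pop("early_stop_callback_wait")
    return checkpoint

def test_upgrade_xyz_v1_2_0():
    old = {"early_stop_callback_wait": 3}
    new = upgrade_xyz_v1_2_0(old)
    assert "early_stop_callback_wait" not in new
    assert new["early_stopping_wait_epochs"] == 3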

Alternatives

keep as is

Additional context

PRs that started this work:

  • #9166: legacy load context manager to patch Lightning for unpickling
  • #8558: upgrade functions

PRs that added checkpoint back-compatibility logic that can be avoided by this proposal:

  • #11638

If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning

  • Bolts: Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch

  • Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers, leveraging PyTorch Lightning, Transformers, and Hydra.

cc @borda @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj

awaelchli avatar Sep 09 '21 10:09 awaelchli

cc @kandluis @aazolini @yifuwang, who were also curious whether there's a serialization format we stick to for the state dict, or whether the contents of the state dict are considered "internal state"

ananthsub avatar Sep 09 '21 18:09 ananthsub

@akihironitta, we shall address it in the upcoming weeks:

  1. add tests for loading legacy checkpoints (we have been missing checkpoints from 1.4 onwards)
  2. if possible, add a script for automatic conversion (see the sketch at the end of this comment)

In addition, we can add a docs page on updating the API. We have in-code hints on how to update from one version to the next (e.g. from 1.4 to 1.5 or from 1.5 to 1.6), but we are missing guidance for larger jumps, for example from 1.2 to 1.7.
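
A minimal sketch of what such a conversion script (item 2 above) could look like, assuming an upgrade() chain like the one in the pitch; this is not the shipped utility:

import argparse
import torch

def main():
    parser = argparse.ArgumentParser(description="Upgrade a Lightning checkpoint in place.")
    parser.add_argument("path", help="path to the checkpoint file to upgrade")
    args = parser.parse_args()

    checkpoint = torch.load(args.path, map_location="cpu")
    checkpoint = upgrade(checkpoint)  # apply all registered upgrade functions in order
    torch.save(checkpoint, args.path)

if __name__ == "__main__":
    main()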

Borda avatar Sep 20 '22 17:09 Borda

@Borda This issue/proposal is less about upgrading code or legacy testing and more about a mechanism for loading old checkpoints into a newer Lightning code base. Of course, we can retrospectively add tests for old checkpoints, but ideally we would have a migration mechanism built in.

awaelchli avatar Sep 20 '22 20:09 awaelchli

Can we close this? Anything left?

carmocca avatar Nov 12 '22 18:11 carmocca

This is done 🎉

awaelchli avatar Nov 12 '22 23:11 awaelchli