Remote training recovery from interruptions
Related:
- https://github.com/iterative/dvc/issues/9221
- #140
- #191
If you are training remotely and the machine shuts down, there's often no way to recover the last saved checkpoint on the new remote machine.
We have the tools to make it possible to recover in that scenario (without using DVC checkpoints) if we do something like:
- Each time the model is saved, DVCLive pushes the model to the remote and the metadata about it to Studio as part of live metrics updates. If the training is interrupted, all this info has been saved.
- When resuming training using `Live(resume=True)`, DVCLive can fetch the model using the info saved in step 1 if there is no model in the workspace (a rough sketch follows below).
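A minimal user-level sketch of that flow, assuming the proposed push/fetch behavior; `build_model`, `train_one_epoch`, `save_model`, `load_model`, and `NUM_EPOCHS` are placeholders:

```python
# Hypothetical sketch of the proposal above; today `Live(resume=True)` resumes
# from files already present in the workspace, while the remote fetch/push is
# the behavior being proposed. Helper functions and NUM_EPOCHS are placeholders.
import os

from dvclive import Live

with Live(resume=True) as live:  # proposed: fetch the last pushed model if missing
    model = load_model("model.pt") if os.path.exists("model.pt") else build_model()
    for epoch in range(live.step, NUM_EPOCHS):
        train_one_epoch(model)
        save_model(model, "model.pt")
        live.log_artifact("model.pt")  # proposed: also push the model and its
                                       # metadata to the remote / Studio
        live.next_step()
```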
We need some mechanism to tie the resumed experiment to the interrupted experiment. Is the experiment revision consistent between them? Should we require an experiment name be passed to tie them together?
See also https://docs.wandb.ai/guides/runs/resuming for ideas/comparison.
> We have the tools to make it possible to recover in that scenario (without using DVC checkpoints) if we do something like:
To clarify, you mean that we have all the pieces to implement it, right?
> Should we require an experiment name be passed to tie them together? See also https://docs.wandb.ai/guides/runs/resuming for ideas/comparison.
I think we could just:
- `resume=True` == Try to resume from the workspace.
- `resume="{exp_name}"` == Try to resume from remote model / Studio info (see the sketch below).
Since this is still open, let me share how I handle this problem at the moment, as an idea/comparison.
Training is done on AWS EC2 instances using Keras/TensorFlow.
Backup and restoration of the models is handled by `keras.callbacks.BackupAndRestore`. The backups are not tracked by DVC; instead they are saved on an EFS volume attached to the instance, since EFS persists when the instance gets terminated (BTW, a separate EFS volume is used as the DVC cache). One also needs a way to back up and restore the DVCLive progress. This requires some minor hacking, as DVCLive does not communicate with the `BackupAndRestore` callback. How this is done:
- Assuming the DVCLive progress backup exists (in EFS!), copy that backup to the training repo's DVCLive location, and only then (this order is important as of the current implementation of `DVCLiveCallback`) declare an instance of `DVCLiveCallback` (for Keras) and append it to the list of callbacks (`DVCLiveCallback` looks for existing progress upon declaration, so the backup needs to be restored before).
- Write an additional callback (say `DVCLiveCheckpoint`) that will, in `on_epoch_end`, copy the DVCLive files to the EFS as a backup (by the order of the callbacks in the list I make sure this is done after the model backup). See the sketch below.
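A minimal sketch of that setup, assuming EFS is mounted at `/mnt/efs`; `DVCLiveCheckpoint` and the backup paths are the commenter's own, not part of DVCLive:

```python
import shutil
from pathlib import Path

import keras
from dvclive.keras import DVCLiveCallback

EFS_BACKUP = Path("/mnt/efs/backups/my-run")  # illustrative backup location on EFS
LIVE_DIR = Path("dvclive")

# 1. Restore the DVCLive progress *before* constructing DVCLiveCallback,
#    which looks for existing progress when it is created.
if (EFS_BACKUP / "dvclive").exists():
    shutil.copytree(EFS_BACKUP / "dvclive", LIVE_DIR, dirs_exist_ok=True)

# 2. Back up the DVCLive files at the end of every epoch.
class DVCLiveCheckpoint(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        shutil.copytree(LIVE_DIR, EFS_BACKUP / "dvclive", dirs_exist_ok=True)

callbacks = [
    keras.callbacks.BackupAndRestore(backup_dir=str(EFS_BACKUP / "model")),
    DVCLiveCallback(),     # declared only after the progress was restored
    DVCLiveCheckpoint(),   # listed last, so it runs after the model backup
]
```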
Now the final issue: assuming the training is triggered by a GitHub Action which uses CML to deploy the EC2 instance, ... what is the way to find the correct backup on EFS? Easy: use the commit SHA, so for instance, store the backups on EFS inside a `{COMMIT_SHA}` dir.
And a final polish: `BackupAndRestore` deletes any backups after training completes successfully. So for `DVCLiveCheckpoint`, implement an `on_train_end` method that will also delete the DVCLive backup.
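Extending the `DVCLiveCheckpoint` sketch above, the per-commit backup directory and the end-of-training cleanup could look like this (assuming the runner exposes the commit via the standard `GITHUB_SHA` environment variable; the directory layout is illustrative):

```python
import os
import shutil
from pathlib import Path

import keras

# One backup directory per commit, so a fresh instance resuming the same
# commit finds the right files on EFS.
EFS_BACKUP = Path("/mnt/efs/backups") / os.environ["GITHUB_SHA"]

class DVCLiveCheckpoint(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        shutil.copytree("dvclive", EFS_BACKUP / "dvclive", dirs_exist_ok=True)

    def on_train_end(self, logs=None):
        # Mirror BackupAndRestore, which deletes its backup after a successful
        # run: remove the DVCLive backup as well.
        shutil.rmtree(EFS_BACKUP / "dvclive", ignore_errors=True)
```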