How to save a checkpoint on Google Drive at the end of every epoch by using pl.Trainer

Open LuigiSimeone opened this issue 2 years ago • 0 comments

PyTorch-Forecasting version: 0.10.2
PyTorch version:
Python version: 3.8.5
Operating System: Windows

I am using Colab Pro+ to train my model for about 100 epochs, but every time the session gets halted after about 80. Unfortunately, I have no other cloud environment to run on.

hidden_size and hidden_continuous_size obviously affect the training, but I cannot change them. So my idea is to save a checkpoint to Google Drive at the end of each epoch, so that if I get interrupted I have the latest checkpoint and can simply load it and continue training for the remaining epochs on another Colab VM.

This is quite easy in plain PyTorch (I simply save at the end of each pass through the training loop), but with pytorch-forecasting the training loop is embedded in pl.Trainer(), so I do not see how to do it.

Otherwise, my only option is to modify the internal code of pl.Trainer directly to add this new "saving" line, which I would normally prefer not to do.

Can anybody help please?

LuigiSimeone, Aug 04 '22 13:08
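
One possible approach, sketched here under several assumptions, is to avoid touching pl.Trainer internals and instead pass it Lightning's ModelCheckpoint callback with a Google Drive folder as dirpath. The folder path, the helper names latest_checkpoint and build_trainer, and the variable names in the usage comment are hypothetical. The every_n_epochs argument and the ckpt_path argument of trainer.fit are assumed to be available in the installed pytorch-lightning release (older releases used a resume_from_checkpoint argument on the Trainer instead), so this should be checked against the installed version.

```python
from pathlib import Path
from typing import Optional

# Hypothetical folder on Google Drive. In Colab, Drive would first be mounted with:
#   from google.colab import drive; drive.mount("/content/drive")
CKPT_DIR = Path("/content/drive/MyDrive/tft_checkpoints")


def latest_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Return the most recently modified .ckpt file in ckpt_dir, or None."""
    if not ckpt_dir.is_dir():
        return None
    candidates = sorted(ckpt_dir.glob("*.ckpt"), key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None


def build_trainer(ckpt_dir: Path, max_epochs: int = 100):
    """Create a pl.Trainer that writes a checkpoint to ckpt_dir after every epoch."""
    # Imported here so the path helper above also works without Lightning installed.
    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint

    ckpt_dir.mkdir(parents=True, exist_ok=True)
    checkpoint_callback = ModelCheckpoint(
        dirpath=str(ckpt_dir),
        filename="{epoch:03d}",
        every_n_epochs=1,  # assumed available in the installed version
        save_top_k=-1,     # keep every epoch's file (uses more Drive space)
        save_last=True,    # additionally maintain a "last" checkpoint
    )
    return pl.Trainer(max_epochs=max_epochs, callbacks=[checkpoint_callback])


# Usage sketch (tft, train_dataloader and val_dataloader are placeholders):
#   trainer = build_trainer(CKPT_DIR)
#   resume = latest_checkpoint(CKPT_DIR)
#   trainer.fit(
#       tft,
#       train_dataloaders=train_dataloader,
#       val_dataloaders=val_dataloader,
#       ckpt_path=str(resume) if resume else None,
#   )
```

On a fresh Colab VM the same code would be rerun after mounting Drive: latest_checkpoint then finds the newest file and training should continue from the epoch stored in it rather than from scratch.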