
restart learning from most recent checkpoint?

tdhock opened this issue 4 months ago · 4 comments

Can you please add some mention of how to use the .pt files shown in the example at https://mlr3torch.mlr-org.com/reference/mlr_callback_set.checkpoint.html? @sebffischer is it possible to resume learning in a new R session from one of these files? (I believe it should be possible, because that is the goal of checkpointing, so I expected this to be documented.)

Thanks!!

tdhock · Aug 01 '25

related to https://github.com/mlr-org/mlr3torch/issues/142#issuecomment-2609319544

tdhock · Aug 01 '25

Yeah, this should definitely be possible. In principle we have a mechanism for this in mlr3 called "hotstarting", so we were also discussing implementing this in a "nice" way.

The idea would be to make something like learner$train(task); learner$configure(epochs = 5); learner$continue(task) work as well. But at least for now we can solve it via the callback, which is easier.

sebffischer · Aug 01 '25

I am a bit busy until next Friday, so if you need it before then, you would have to go at it yourself. It should not be too difficult, though. https://mlr3torch.mlr-org.com/articles/callbacks.html contains a short tutorial on adding a custom callback.

In principle, you need to (see the sketch after this list):

  1. call self$ctx$network$load_state_dict(network_params), where network_params is the $state_dict() of the network, and
  2. call self$ctx$optimizer$load_state_dict(optim_state), where optim_state is the $state_dict() of the optimizer.
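
A minimal sketch of such a resume callback, assuming the checkpoint callback saved the network and optimizer state dicts with torch_save() (the directory and file names below are made up for illustration; check what cb.checkpoint actually writes):

``` r
library(mlr3torch)
library(torch)

# Illustrative paths -- adapt to the files the checkpoint callback produced.
ckpt_dir = "checkpoints"

cb_resume = torch_callback("resume",
  on_begin = function() {
    # 1. restore the network parameters from the saved state dict
    network_params = torch_load(file.path(ckpt_dir, "network.pt"))
    self$ctx$network$load_state_dict(network_params)
    # 2. restore the optimizer state (momentum buffers, step counts, ...)
    optim_state = torch_load(file.path(ckpt_dir, "optimizer.pt"))
    self$ctx$optimizer$load_state_dict(optim_state)
  }
)
```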

You might also want to handle loading the state dicts of the callbacks, but I guess this is not absolutely necessary (e.g. if you don't handle this, the history from the previous training run would be "forgotten"). There is already some infrastructure in place for the callbacks as well ($state_dict() and $load_state_dict()), so it should also be possible.
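
Putting it together, a hedged usage sketch: attach the resume callback alongside the checkpoint callback, so later interruptions are covered too. The learner id, task, and parameter values here are illustrative, not a recommendation:

``` r
learner = lrn("classif.mlp",
  epochs = 10000, batch_size = 32,
  callbacks = list(cb_resume, t_clbk("checkpoint")),
  cb.checkpoint.path = "checkpoints", cb.checkpoint.freq = 1000
)
# In a fresh R session after an interrupted job, training then starts
# from the restored network and optimizer state instead of from epoch 0.
learner$train(tsk("sonar"))
```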

sebffischer · Aug 01 '25

Thanks for the info. I probably don't have time to work on this for a few weeks. For context, I run some experiments on the cluster with a lot of epochs (10000), and some jobs are interrupted before they finish due to time limits. It would be nice to be able to restart these jobs at, say, 5000 epochs instead of 0.

tdhock · Aug 04 '25