torchmd-net
Don't overwrite logs when resuming training
I often want to resume training from a checkpoint with --load-model. When I do, I don't want to lose the information in the existing log and metrics.csv files. The obvious approach is to create a new log directory for the continuation and use --log-dir and --redirect to tell it to put all new files there. But that doesn't work: it ignores those options and uses the same log directory as the original training run, deleting and overwriting the existing logs in the process. To prevent that, you first have to copy your existing log directory to a new location. I've lost work several times by forgetting to do that.
How about making it so that --load-model does not override --log-dir and --redirect? That option should just tell it which model to load; it shouldn't prevent you from saving logs to a different directory.
I'm not sure why overriding log_dir doesn't work properly when load_model is set. The arguments are parsed in the order in which they are defined in train.py. Since --load-model is defined first there, the loaded model's hparams should always be overwritten by any further arguments specified via config file or CLI.
Is it because of this line in LoadFromCheckpoint?
https://github.com/torchmd/torchmd-net/blob/df7c906dd3c72dcc0c2fb9b6148fe624494b37b1/torchmdnet/utils.py#L181
namespace contains the options that were passed in. That line overwrites them with the ones from the checkpoint before they've had a chance to be processed.
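Here's a minimal, self-contained sketch of the pattern (the class below is a toy stand-in for the real LoadFromCheckpoint, and the hard-coded hparams dict stands in for reading the checkpoint):

```python
import argparse

class LoadFromCheckpoint(argparse.Action):
    # Toy stand-in for the action in torchmdnet/utils.py.
    def __call__(self, parser, namespace, values, option_string=None):
        hparams = {"log_dir": "/logs/original_run"}  # pretend this came from the checkpoint
        for key, value in hparams.items():
            # The problematic pattern: this clobbers values the user
            # already passed on the command line.
            setattr(namespace, key, value)
        setattr(namespace, self.dest, values)

parser = argparse.ArgumentParser()
parser.add_argument("--load-model", action=LoadFromCheckpoint)
parser.add_argument("--log-dir")

args = parser.parse_args(["--log-dir", "/logs/new_run", "--load-model", "ckpt.pt"])
print(args.log_dir)  # prints /logs/original_run, not /logs/new_run
```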
I somehow thought that arguments were parsed in the order in which they are defined in the code, but a quick test showed that this is clearly not true (argparse processes them in the order they appear on the command line). So yes, that line is the problem. We should probably only update the namespace with arguments that were not specified by the user.
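Something along these lines might work — a rough sketch, not the actual implementation, where the hard-coded hparams dict again stands in for reading the checkpoint and the usual dest-to-flag name mapping is assumed:

```python
import argparse
import sys

class LoadFromCheckpoint(argparse.Action):
    def __call__(self, parser, namespace, values, option_string=None):
        hparams = {"log_dir": "/logs/original_run"}  # stand-in for the checkpoint's hparams
        for key, value in hparams.items():
            flag = "--" + key.replace("_", "-")
            # Skip anything the user passed explicitly, either as
            # "--flag value" or "--flag=value".
            if any(a == flag or a.startswith(flag + "=") for a in sys.argv):
                continue
            setattr(namespace, key, value)
        setattr(namespace, self.dest, values)

parser = argparse.ArgumentParser()
parser.add_argument("--load-model", action=LoadFromCheckpoint)
parser.add_argument("--log-dir")

sys.argv = ["train.py", "--log-dir", "/logs/new_run", "--load-model", "ckpt.pt"]
args = parser.parse_args()
print(args.log_dir)  # /logs/new_run — the user's value survives
```

Checking sys.argv is a blunt instrument, though: it won't see options that came in through a config file rather than the CLI, and it won't catch abbreviated flags. Comparing against the parser's defaults might be a better fit for train.py.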