
Error resuming training

peastman opened this issue 2 years ago · 1 comment

I just encountered an error I've never seen before. I used the --load-model command line argument to resume training from a checkpoint. At first everything seemed to be working correctly, but after completing four epochs it exited with this error.

Traceback (most recent call last):
  File "/global/homes/p/peastman/torchmd-net/scripts/train.py", line 164, in <module>
    main()
  File "/global/homes/p/peastman/torchmd-net/scripts/train.py", line 160, in main
    trainer.test(model, data)
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 936, in test
    return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 983, in _test_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1222, in _run
    self._log_hyperparams()
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1277, in _log_hyperparams
    raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: Error while merging hparams: the keys ['load_model'] are present in both the LightningModule's and LightningDataModule's hparams but have different values.

peastman · Nov 18 '22 18:11

It looks like training stopped after 4 epochs, since the error occurred while calling trainer.test in the training script. After fit we load the best model checkpoint and then evaluate it on the test set. The best checkpoint that was loaded was probably saved during the previous training run, so its stored value of load_model is probably None, while the current DataModule contains a different value for load_model.
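As a rough sketch of why the merge fails (hypothetical class names, not the actual torchmd-net code): both the LightningModule and the LightningDataModule call save_hyperparameters(), so each carries its own copy of load_model, and Lightning refuses to log them together when the values disagree.

```python
import pytorch_lightning as pl

class Model(pl.LightningModule):
    def __init__(self, load_model=None):
        super().__init__()
        # The checkpoint from the previous run stored hparams with load_model=None.
        self.save_hyperparameters()

class Data(pl.LightningDataModule):
    def __init__(self, load_model="checkpoint-from-previous-run.ckpt"):
        super().__init__()
        # The current run's DataModule stores the checkpoint path passed via --load-model.
        self.save_hyperparameters()

# When trainer.test(model, datamodule=data) logs hyperparameters, Lightning merges
# both hparam dicts and raises MisconfigurationException because the two
# 'load_model' values differ.
```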

The problem in this specific case is probably something else, since stopping training after 4 epochs was presumably not intended. Either way, this error message is not very intuitive.

A potential fix would be to just pass the test dataloader instead of the full DataModule.
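A minimal sketch of that workaround, assuming the DataModule exposes the standard test_dataloader() hook (this is not the actual train.py code):

```python
# Evaluate the best checkpoint on the test dataloader directly, so Lightning
# never tries to merge the model's and the DataModule's hyperparameters.
# Depending on how the DataModule is written, data.setup(stage="test") may
# need to be called first.
trainer.test(model, dataloaders=data.test_dataloader())
```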

dav0dea · Nov 18 '22 20:11