
[release/air] `lightning_gpu_tune_3x16_3x1.aws` is flaky due to `LightningTrainer` not working with `PBT`

Open · justinvyu opened this issue 2 years ago · 0 comments

Latest job link

This test is flaky because we only run 2 trials: most of the time both trials sample the same `feature_dim`, so PBT's checkpoint exploitation happens to load cleanly. When the sampled architectures differ, restoring the checkpoint fails with the state-dict size mismatch shown below.
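For reference, a minimal sketch of the search space implied by the logs below; the `param_space` layout mirrors the logged `lightning_config` but is an assumption, not the actual test code:

```python
# Hypothetical sketch: feature_dim is sampled per trial, so with 2 trials and
# 2 candidate values the architectures collide (and the test passes) roughly
# half the time; when they differ, PBT's checkpoint exploitation fails.
from ray import tune

param_space = {
    "lightning_config": {
        "_module_init_config": {
            "feature_dim": tune.choice([64, 128]),  # per-trial architecture
            "lr": tune.loguniform(1e-3, 1e-1),      # perturbed by PBT
        },
    },
}
```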

Error logs:

```
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=43066, ip=10.0.58.87, actor_id=e04fe9b5203f0fbd2f74423f03000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7fd1f4679d90>)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/_internal/worker_group.py", line 32, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/lightning/lightning_trainer.py", line 551, in _lightning_train_loop_per_worker
    trainer.fit(lightning_module, **trainer_fit_params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1179, in _run
    self._restore_modules_and_callbacks(ckpt_path)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1141, in _restore_modules_and_callbacks
    self._checkpoint_connector.restore_model()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 179, in restore_model
    self.trainer.strategy.load_model_state_dict(self._loaded_checkpoint)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 319, in load_model_state_dict
    self.lightning_module.load_state_dict(checkpoint["state_dict"])
  File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1668, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for MNISTClassifier:
        size mismatch for fc1.weight: copying a param with shape torch.Size([64, 784]) from checkpoint, the shape in current model is torch.Size([128, 784]).
        size mismatch for fc1.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([128]).
        size mismatch for fc2.weight: copying a param with shape torch.Size([10, 64]) from checkpoint, the shape in current model is torch.Size([10, 128]).
The trial LightningTrainer_d4100_00000 errored with parameters={'lightning_config': {'_module_class': <class 'lightning_test_utils.MNISTClassifier'>, '_module_init_config': {'feature_dim': 128, 'lr': 0.001}, '_trainer_init_config': {'max_epochs': 5, 'accelerator': 'gpu', 'logger': <pytorch_lightning.loggers.csv_logs.CSVLogger object at 0x7f1654294090>}, '_trainer_fit_params': {'datamodule': <lightning_test_utils.MNISTDataModule object at 0x7f167fd64b50>}, '_strategy_config': {}, '_model_checkpoint_config': {'monitor': 'val_accuracy', 'mode': 'max'}}}. Error file: /home/ray/ray_results/release-tuner-test/LightningTrainer_d4100_00000_0_feature_dim=64,lr=0.0100_2023-06-13_06-25-56/error.txt
```
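The shape mismatch is reproducible outside Ray and Lightning with a stand-in for the test's `MNISTClassifier`; the class body below is an assumption inferred from the logged tensor shapes, not the actual test code:

```python
import torch.nn as nn

# Hypothetical stand-in for lightning_test_utils.MNISTClassifier, inferred
# from the logged shapes ([feature_dim, 784] for fc1, [10, feature_dim] for fc2).
class MNISTClassifier(nn.Module):
    def __init__(self, feature_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(784, feature_dim)
        self.fc2 = nn.Linear(feature_dim, 10)

ckpt = MNISTClassifier(feature_dim=64).state_dict()  # checkpoint from the exploited trial
model = MNISTClassifier(feature_dim=128)             # exploiting trial's architecture
model.load_state_dict(ckpt)  # RuntimeError: size mismatch for fc1.weight, fc1.bias, fc2.weight
```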

Other relevant logs:

```
(RayTrainWorker pid=44474) Restoring states from the checkpoint path at /tmp/checkpoint_tmp_fce439b6245b44dd84005ffd26d07bc4/model
(RayTrainWorker pid=44474) /home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:343: UserWarning: The dirpath has changed from 'logs/my_exp_name/version_3/checkpoints' to 'logs/my_exp_name/version_4/checkpoints', therefore `best_model_score`, `kth_best_model_path`, `kth_value`, `last_model_path` and `best_k_models` won't be reloaded. Only `best_model_path` will be reloaded.
(RayTrainWorker pid=44474)   f"The dirpath has changed from {dirpath_from_ckpt!r} to {self.dirpath!r},"

Lightning loads back the full training state, including state from the previous checkpoint, so the AIR checkpoint may not always be respected, since we are assigning a checkpoint from a different trial. TODO: this needs more investigation into what actually happens.
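To make "full training state" concrete, a hedged illustration of what a Lightning checkpoint typically bundles (placeholder path; the exact keys vary by PL version):

```python
import torch

# Illustrative only: a PyTorch Lightning checkpoint carries much more than
# the model weights, so restoring it replays the source trial's full state.
ckpt = torch.load("checkpoint.ckpt", map_location="cpu")  # placeholder path
print(sorted(ckpt.keys()))
# Typically includes: 'callbacks', 'epoch', 'global_step', 'loops',
# 'lr_schedulers', 'optimizer_states', 'pytorch-lightning_version', 'state_dict'
```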

The perturbed learning rate set by PBT is also not used, since Lightning's optimizer state is loaded back in full. Generally, we should remove the examples and tests of LightningTrainer with PBT, as the combination doesn't really do anything at the moment.
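A small, self-contained sketch (plain PyTorch, independent of the test) of why the perturbation is lost:

```python
import torch

model = torch.nn.Linear(784, 10)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
ckpt_state = opt.state_dict()  # "checkpointed" with lr=1e-3

# PBT perturbs the learning rate for the exploiting trial...
new_opt = torch.optim.SGD(model.parameters(), lr=2e-3)
# ...but loading the checkpointed optimizer state restores param_groups,
# including the learning rate, so the perturbation is silently discarded.
new_opt.load_state_dict(ckpt_state)
print(new_opt.param_groups[0]["lr"])  # 0.001, not the perturbed 0.002
```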

justinvyu · Jun 14 '23 01:06