[release/air] `lightning_gpu_tune_3x16_3x1.aws` is flaky due to `LightningTrainer` not working with `PBT`
This test is flaky because we only run 2 trials: most of the time both trials happen to sample the same network architecture, so PBT can copy checkpoints between them without erroring. When the sampled `feature_dim` values differ, loading the other trial's checkpoint fails with the size-mismatch error below.
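For context, here is a minimal sketch of how the test wires PBT into the Tuner, reconstructed from the config dump in the error below. The `LightningConfigBuilder` calls, the mutation layout, and the `MNISTClassifier`/`MNISTDataModule` wiring are assumptions based on the Ray 2.5 API, not the test's literal code:

```python
from ray import tune
from ray.air import ScalingConfig
from ray.train.lightning import LightningConfigBuilder, LightningTrainer
from ray.tune.schedulers import PopulationBasedTraining
from lightning_test_utils import MNISTClassifier, MNISTDataModule  # test utils, per the config dump

lightning_config = (
    LightningConfigBuilder()
    .module(cls=MNISTClassifier, feature_dim=64, lr=1e-3)  # feature_dim is overridden by the search space below
    .trainer(max_epochs=5, accelerator="gpu")
    .fit_params(datamodule=MNISTDataModule())
    .checkpointing(monitor="val_accuracy", mode="max")
    .build()
)

trainer = LightningTrainer(
    lightning_config=lightning_config,
    scaling_config=ScalingConfig(num_workers=3, use_gpu=True),
)

scheduler = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=1,
    # On exploit, PBT copies the better trial's checkpoint into the worse trial
    # and perturbs that trial's hyperparameters (here: lr).
    hyperparam_mutations={
        "lightning_config": {"_module_init_config": {"lr": tune.loguniform(1e-4, 1e-2)}}
    },
)

tuner = tune.Tuner(
    trainer,
    param_space={
        "lightning_config": {
            "_module_init_config": {"feature_dim": tune.choice([64, 128])}
        }
    },
    tune_config=tune.TuneConfig(
        metric="val_accuracy",
        mode="max",
        scheduler=scheduler,
        # With only 2 trials, both frequently sample the same feature_dim, so the
        # cross-trial checkpoint happens to load; when they differ, fit() crashes
        # with the size-mismatch error below.
        num_samples=2,
    ),
)
tuner.fit()
```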
Error logs:
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=43066, ip=10.0.58.87, actor_id=e04fe9b5203f0fbd2f74423f03000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7fd1f4679d90>)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/_internal/worker_group.py", line 32, in __execute
raise skipped from exception_cause(skipped)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
train_func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/lightning/lightning_trainer.py", line 551, in _lightning_train_loop_per_worker
trainer.fit(lightning_module, **trainer_fit_params)
File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1179, in _run
self._restore_modules_and_callbacks(ckpt_path)
File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1141, in _restore_modules_and_callbacks
self._checkpoint_connector.restore_model()
File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 179, in restore_model
self.trainer.strategy.load_model_state_dict(self._loaded_checkpoint)
File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 319, in load_model_state_dict
self.lightning_module.load_state_dict(checkpoint["state_dict"])
File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1668, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for MNISTClassifier:
size mismatch for fc1.weight: copying a param with shape torch.Size([64, 784]) from checkpoint, the shape in current model is torch.Size([128, 784]).
size mismatch for fc1.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for fc2.weight: copying a param with shape torch.Size([10, 64]) from checkpoint, the shape in current model is torch.Size([10, 128]).
The trial LightningTrainer_d4100_00000 errored with parameters={'lightning_config': {'_module_class': <class 'lightning_test_utils.MNISTClassifier'>, '_module_init_config': {'feature_dim': 128, 'lr': 0.001}, '_trainer_init_config': {'max_epochs': 5, 'accelerator': 'gpu', 'logger': <pytorch_lightning.loggers.csv_logs.CSVLogger object at 0x7f1654294090>}, '_trainer_fit_params': {'datamodule': <lightning_test_utils.MNISTDataModule object at 0x7f167fd64b50>}, '_strategy_config': {}, '_model_checkpoint_config': {'monitor': 'val_accuracy', 'mode': 'max'}}}. Error file: /home/ray/ray_results/release-tuner-test/LightningTrainer_d4100_00000_0_feature_dim=64,lr=0.0100_2023-06-13_06-25-56/error.txt
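The size mismatch itself has nothing to do with Ray or Lightning specifically; it is `load_state_dict` refusing to copy weights between models built with different `feature_dim`. A stand-alone sketch reproducing it (the two-layer classifier here is a stand-in for the test's `MNISTClassifier`, assuming a 784 -> feature_dim -> 10 architecture, which matches the shapes in the traceback):

```python
import torch.nn as nn

class MNISTClassifier(nn.Module):
    def __init__(self, feature_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, feature_dim)
        self.fc2 = nn.Linear(feature_dim, 10)

# Checkpoint written by a trial built with feature_dim=64 ...
ckpt = MNISTClassifier(feature_dim=64).state_dict()

# ... handed by PBT's exploit step to a trial built with feature_dim=128:
MNISTClassifier(feature_dim=128).load_state_dict(ckpt)
# RuntimeError: Error(s) in loading state_dict for MNISTClassifier:
#   size mismatch for fc1.weight: copying a param with shape torch.Size([64, 784])
#   from checkpoint, the shape in current model is torch.Size([128, 784]). ...
```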
Other relevant logs:
(RayTrainWorker pid=44474) Restoring states from the checkpoint path at /tmp/checkpoint_tmp_fce439b6245b44dd84005ffd26d07bc4/model
(RayTrainWorker pid=44474) /home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:343: UserWarning: The dirpath has changed from 'logs/my_exp_name/version_3/checkpoints' to 'logs/my_exp_name/version_4/checkpoints', therefore `best_model_score`, `kth_best_model_path`, `kth_value`, `last_model_path` and `best_k_models` won't be reloaded. Only `best_model_path` will be reloaded.
(RayTrainWorker pid=44474) f"The dirpath has changed from {dirpath_from_ckpt!r} to {self.dirpath!r},"
Lightning loads back all of its training state, including the previous checkpoint, so the AIR checkpoint may not always be respected: with PBT we are trying to assign a checkpoint that came from a different trial. TODO: this needs more investigation into what actually happens.
The perturbed learning rate set by PBT is also not used, since Lightning fully loads back the optimizer state (including the old learning rate). Overall, we should remove the examples and tests of `LightningTrainer` with PBT, since it doesn't really do anything at the moment.
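A minimal illustration of that mechanism with plain `torch.optim` (not Lightning's actual restore path): loading an optimizer `state_dict` also restores the saved param groups, so a learning rate set by PBT's perturbation before the restore is simply overwritten.

```python
import torch

model = torch.nn.Linear(784, 10)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
saved = opt.state_dict()  # optimizer state checkpointed while training with lr=1e-3

# PBT perturbs the lr, and the exploiting trial builds its optimizer with lr=1e-2 ...
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
# ... but restoring the checkpoint also restores the old param groups:
opt.load_state_dict(saved)
print(opt.param_groups[0]["lr"])  # 0.001, the perturbed value is gone
```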