
[BUG]: stable diffusion add --resume run error

Open yufengyao-lingoace opened this issue 2 years ago • 4 comments

🐛 Describe the bug

When I add the --resume parameter to load a last.ckpt file (stable diffusion), training fails with the following error:

/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/loggers/tensorboard.py:277: UserWarning: Could not log computational graph to TensorBoard: The model.example_input_array attribute is not set or input_array was not given.
  rank_zero_warn(
'ZeroOptimizer' object has no attribute 'state'

Traceback (most recent call last):
  File "/data/stablediffusionv2_finetune/main.py", line 920, in
    trainer.fit(model, data)
  File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 602, in fit
    call._call_and_handle_interrupt(
  File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
    return function(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 644, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1093, in _run
    self._checkpoint_connector.restore_training_state()
  File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 286, in restore_training_state
    self.restore_optimizers_and_schedulers()
  File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 382, in restore_optimizers_and_schedulers
    self.restore_optimizers()
  File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 397, in restore_optimizers
    self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
  File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 369, in load_optimizer_state_dict
    _optimizer_to_device(optimizer, self.root_device)
  File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/lightning_lite/utilities/optimizer.py", line 33, in _optimizer_to_device
    for p, v in optimizer.state.items():
AttributeError: 'ZeroOptimizer' object has no attribute 'state'

Environment

No response

yufengyao-lingoace avatar Jan 12 '23 04:01 yufengyao-lingoace

Were you using the zero optimizer in your last run? The ZeroOptimizer class does not have a state attribute (it has optim_state), so there is a mismatch when the checkpoint is restored.
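To see the mismatch concretely, here is a small sketch (illustration only, not an official fix). Lightning's `_optimizer_to_device` walks a torch.optim-style `state` dict, which the ZeRO wrapper does not expose; the `.optim` attribute and the `expose_state` helper below are assumptions about where the wrapper keeps the inner optimizer, so check your ColossalAI version before trying anything like this.

```python
import torch

# What lightning_lite.utilities.optimizer._optimizer_to_device relies on:
# a plain torch optimizer exposes a per-parameter `state` dict it can walk.
plain = torch.optim.Adam([torch.nn.Parameter(torch.zeros(1))])
assert hasattr(plain, "state")

# The ZeRO wrapper in this report exposes `optim_state` instead, so the same
# helper raises AttributeError. Purely as an illustration (ZeRO shards its
# optimizer state, so simply forwarding the attribute may not be semantically
# correct), a shim could forward `state` to the wrapped optimizer, assuming
# the wrapper keeps the inner torch optimizer in an attribute like `.optim`:
def expose_state(zero_opt_cls):
    if not hasattr(zero_opt_cls, "state"):
        zero_opt_cls.state = property(lambda self: self.optim.state)  # `.optim` is an assumption
    return zero_opt_cls
```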

JThh avatar Jan 26 '23 12:01 JThh

Hi @JThh , I am hitting the same issue. Would you give me an example of saving and loading a ZeroOptimizer? Thanks

shileims avatar Apr 17 '23 03:04 shileims

Hi @JThh , quick question: in the code, the optimizer and model are loaded from the saved checkpoint when local_rank == 0. If distributed training is used to train the model, I think all the processes should load the saved checkpoint, is that correct? Thanks

shileims avatar Apr 17 '23 22:04 shileims

Based on your second question, it sounds like you have already found our checkpoint saving and loading utilities.

This line has already gathered the tensors and wiped out inter-device differences before we save at rank 0.
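Schematically, the save side of that pattern looks like the sketch below. This is plain PyTorch, not the actual ColossalAI code: `save_on_rank_zero` and the path are made-up names, and `state_dict()` here stands in for the gathering step the linked line performs.

```python
import torch
import torch.distributed as dist

def save_on_rank_zero(model, optimizer, path="last.ckpt"):
    # Every rank participates in building the state dicts, because gathering
    # ZeRO-sharded state is a collective; only rank 0 then touches the disk.
    ckpt = {
        "state_dict": model.state_dict(),
        "optimizer_states": [optimizer.state_dict()],  # assumes a torch-style state_dict()
    }
    if not dist.is_initialized() or dist.get_rank() == 0:
        torch.save(ckpt, path)
    if dist.is_initialized():
        dist.barrier()  # keep the other ranks from racing past the write
```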

And this line broadcasts model weights across devices after loading from rank 0. The same goes for optimizers.
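And the resume side, again only as a rough illustration of the load-on-rank-0-then-broadcast idea (`load_and_broadcast` is a made-up helper; the real logic lives in the utilities linked above, and optimizer state is handled analogously):

```python
import torch
import torch.distributed as dist

def load_and_broadcast(model, path="last.ckpt"):
    # Rank 0 reads the file; every parameter (and buffer) is then broadcast so
    # all ranks end up with identical weights without each opening the file.
    if not dist.is_initialized() or dist.get_rank() == 0:
        state = torch.load(path, map_location="cpu")["state_dict"]
        model.load_state_dict(state)
    if dist.is_initialized():
        for t in list(model.parameters()) + list(model.buffers()):
            dist.broadcast(t.data, src=0)
```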

Hope this helps!

JThh avatar Apr 18 '23 08:04 JThh