ColossalAI
[BUG]: stable diffusion add --resume run error
🐛 Describe the bug
When I add the --resume parameter to load a last.ckpt file (Stable Diffusion), training fails with the following error:
/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/loggers/tensorboard.py:277: UserWarning: Could not log computational graph to TensorBoard: The model.example_input_array attribute is not set or input_array was not given.
rank_zero_warn(
'ZeroOptimizer' object has no attribute 'state'
Traceback (most recent call last):
File "/data/stablediffusionv2_finetune/main.py", line 920, in
Environment
No response
Were you using the zero optimizer in your last run? The ZeroOptimizer class does not have a state attribute (it has optim_state), so there is a mismatch with the checkpoint.
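For illustration, here is a minimal sketch (plain PyTorch, not ColossalAI's actual ZeroOptimizer API) of why resuming breaks: a resume path that reaches into `optimizer.state` directly fails on a wrapper that keeps its state under another attribute, whereas going through `state_dict()` / `load_state_dict()` keeps saving and loading symmetric.

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Saving: serialize through state_dict() instead of touching .state directly.
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}
torch.save(checkpoint, "last.ckpt")

# Resuming: restore through load_state_dict().
checkpoint = torch.load("last.ckpt", map_location="cpu")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])

# A wrapper that stores its internal state under a different name (the reply
# above mentions optim_state) raises AttributeError if the resume path does
# optimizer.state[...] directly -- which matches the traceback in this issue.
```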
Hi @JThh , I'm hitting the same issue. Would you give me an example of saving and loading a ZeroOptimizer? Thanks
Hi @JThh , quick question: in the code, the optimizer and model are loaded from the saved checkpoint only when local_rank == 0. If distributed training is used, shouldn't all processes load the saved checkpoint? Is that correct? Thanks
Regarding your second question: you should have found our checkpoint saving and loading utilities.
This line gathers the tensors and eliminates inter-device differences before we save on rank 0.
And this line broadcasts the model weights across devices after loading on rank 0; the same goes for the optimizer.
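In case it helps to see the shape of that pattern, here is a hedged sketch using plain torch.distributed (not the actual ColossalAI utilities linked above): save a single consolidated checkpoint on rank 0, then broadcast the loaded weights from rank 0 so every process ends up with identical parameters. It assumes the process group is already initialized (e.g. via torchrun) and that tensors live on a device the backend can broadcast (GPU for NCCL, CPU for gloo).

```python
import torch
import torch.distributed as dist

def save_on_rank0(model: torch.nn.Module, path: str) -> None:
    # Only rank 0 writes the checkpoint; the barrier makes sure the file
    # exists before any rank tries to resume from it.
    if dist.get_rank() == 0:
        torch.save(model.state_dict(), path)
    dist.barrier()

def load_and_broadcast(model: torch.nn.Module, path: str) -> None:
    # Rank 0 reads the checkpoint into its own parameters ...
    if dist.get_rank() == 0:
        model.load_state_dict(torch.load(path, map_location="cpu"))
    # ... then every parameter/buffer is broadcast in place from rank 0,
    # so the other ranks do not keep stale or randomly initialized weights.
    for tensor in model.state_dict().values():
        dist.broadcast(tensor, src=0)
```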
Hope this helps!