ColossalAI
[BUG]: Stable diffusion v1 killed
🐛 Describe the bug
When fine-tuning Stable Diffusion v1, the process gets killed. The symptom is similar to this issue: https://github.com/hpcaitech/ColossalAI/issues/2398
Later on, the following error sometimes appears as well:
...
  File "/workspace/ColossalAI-0.1.10/ColossalAI/examples/images/diffusion/lightning-1.8.1/lightning/src/lightning/pytorch/strategies/strategy.py", line 183, in optimizer_state
    return optimizer.state_dict()
  File "/workspace/ColossalAI-0.1.10/ColossalAI/colossalai/zero/zero_optimizer.py", line 251, in state_dict
    dist.gather_object(local_state,
  File "/opt/conda/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1758, in gather_object
    all_gather(object_size_list, local_size, group=group)
  File "/opt/conda/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2075, in all_gather
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [127.0.0.1]:33451
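The report does not say why the process was killed. On Linux, a bare `Killed` message is often the kernel's out-of-memory killer, and the traceback above ends while optimizer state is being gathered, so one way to narrow this down is to log host memory around that step. The following is a minimal sketch using only the Python standard library; the helper names `peak_rss_mib` and `log_peak_rss` are hypothetical and not part of ColossalAI or Lightning, and the OOM explanation is an assumption, not something the report confirms.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_peak_rss(tag: str) -> str:
    """Print and return a one-line memory report; `tag` is a free-form label."""
    line = f"[mem] {tag}: peak RSS = {peak_rss_mib():.1f} MiB"
    print(line, flush=True)
    return line
```

Calling `log_peak_rss("before checkpoint")` on each rank just before the checkpoint is saved would show whether host memory is close to the container limit when the process dies. If it is, the `Connection closed by peer` error on the other ranks may simply be a consequence of one rank having been killed.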
Environment
I deployed Stable Diffusion in a Docker environment.
cuda 12.0
nvcc 11.3
pytorch 1.12
colossalai 0.1.10cu113
lightning: 1.8.1, built from source
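Because the environment above mixes several version numbers, it can help to collect them programmatically inside the container. This is a rough sketch using only the standard library. The distribution names passed to `package_versions` (`torch`, `colossalai`, `lightning`) are assumptions about how the packages are registered in this environment, and both helper names are hypothetical.

```python
import shutil
import subprocess
from importlib import metadata
from typing import Dict, Sequence


def package_versions(names: Sequence[str]) -> Dict[str, str]:
    """Look up installed distribution versions without importing the packages."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def nvcc_version() -> str:
    """Return the last line of `nvcc --version`, or a note if nvcc is not on PATH."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return "nvcc not found"
    result = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.strip().splitlines()
    return lines[-1] if lines else "unknown"


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to match `pip list`.
    for pkg, ver in package_versions(["torch", "colossalai", "lightning"]).items():
        print(f"{pkg}: {ver}")
    print(f"nvcc: {nvcc_version()}")
```

Pasting this output into the report makes it easier to compare the toolkit reported by `nvcc` with the build tag in the colossalai version string.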