ColossalAI has too many code bugs [BUG]:

🐛 Describe the bug

There are too many problems in the code. I suggest reviewing and maintaining it again.

Environment

No response

Jun 16 '23 02:06 wangmiaowei

Could you please provide more details about the errors?

Jun 16 '23 02:06 flybird11111

First, in examples/images/diffusion/configs/Teyvat/train_colossalai_teyvat.yaml, if I set use_ema: True, I get an error like this:

/opt/conda/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Epoch 0:   1%| | 1/105 [00:12<21:38, 12.49s/it, loss=0.852, v_num=0, train/loss_simple_step=0.852, train/loss_v/opt/conda/lib/python3.7/site-packages/lightning/pytorch/strategies/ddp.py:437: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
  rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
Traceback (most recent call last):
  File "/root/programs_wmw/sd_train/ColossalAI-main/examples/images/diffusion/main.py", line 847, in <module>
    trainer.fit(model, data)
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 609, in fit
    self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 1103, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 1182, in _run_stage
Summoning checkpoint.
    self._run_train()
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 1205, in _run_train
    self.fit_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/loops/epoch/training_epoch_loop.py", line 230, in advance
    self.trainer._call_lightning_module_hook("on_train_batch_end", batch_end_outputs, batch, batch_idx)
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 1347, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/root/programs_wmw/sd_train/ColossalAI-main/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 500, in on_train_batch_end
    self.model_ema(self.model)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/programs_wmw/sd_train/ColossalAI-main/examples/images/diffusion/ldm/modules/ema.py", line 46, in forward
    shadow_params[sname].sub_(one_minus_decay * (shadow_params[sname] - m_param[key]))
AttributeError: 'NoneType' object has no attribute 'sub_'

Jun 16 '23 03:06 wangmiaowei
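The traceback above fails in ema.py because shadow_params[sname] is None when sub_ is called. The thread does not establish why the shadow entry is None, so the following is only a minimal sketch of the same update rule with a guard, written with plain Python floats instead of tensors. The function name ema_update and the choice to seed a missing entry from the current parameter are assumptions for illustration, not the repository's fix.

```python
from typing import Dict, Optional


def ema_update(
    shadow: Dict[str, Optional[float]],
    params: Dict[str, float],
    decay: float = 0.9999,
) -> Dict[str, float]:
    """Return new EMA values; shadow entries that are missing or None are seeded from params."""
    one_minus_decay = 1.0 - decay
    updated: Dict[str, float] = {}
    for name, value in params.items():
        previous = shadow.get(name)
        if previous is None:
            # No usable shadow copy yet: start from the current parameter value
            updated[name] = value
        else:
            # Same rule as the failing line: shadow -= (1 - decay) * (shadow - param)
            updated[name] = previous - one_minus_decay * (previous - value)
    return updated
```

With real tensors the same guard would go just before the in-place sub_ call, so that a missing shadow entry is reported or initialized rather than dereferenced.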

Besides, your README says I need to use xformers 0.0.12. However, look at the code at https://github.com/hpcaitech/ColossalAI/blob/d4fb7bfda7a2da5480e1187e8d3e40884b42ba11/applications/Chat/coati/kernels/opt_attn.py#L72

It passes attn_bias, which does not exist in xformers 0.0.12:

        attn_output = xops.memory_efficient_attention(query_states,
                                                      key_states,
                                                      value_states,
                                                      attn_bias=xops.LowerTriangularMask(),
                                                      p=self.dropout if self.training else 0.0,
                                                      scale=self.scaling)

Jun 16 '23 03:06 wangmiaowei
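One way to surface this kind of version mismatch early is to check whether the installed function accepts a keyword before passing it. The sketch below uses only the standard library. The two attention functions are hypothetical stand-ins rather than real xformers signatures, and accepts_keyword is an illustrative helper name.

```python
import inspect
from typing import Callable


def accepts_keyword(fn: Callable, name: str) -> bool:
    """Return True if fn can be called with a keyword argument called name."""
    try:
        parameters = inspect.signature(fn).parameters
    except (TypeError, ValueError):
        # Some builtins and C extensions expose no inspectable signature
        return False
    if name in parameters:
        return parameters[name].kind is not inspect.Parameter.POSITIONAL_ONLY
    # A **kwargs parameter accepts any keyword
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in parameters.values())


# Hypothetical stand-ins for an older and a newer attention function
def old_attention(query, key, value, p=0.0):
    return "old"


def new_attention(query, key, value, attn_bias=None, p=0.0, scale=None):
    return "new"
```

A caller could raise a clear error that names the required library version when accepts_keyword returns False, instead of failing deep inside the forward pass.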

In your reference README file https://github.com/hpcaitech/ColossalAI/tree/d4fb7bfda7a2da5480e1187e8d3e40884b42ba11/examples/images/diffusion#:~:text=Teyvat/train_colossalai_teyvat.yaml-,Inference,logdir/checkpoints/last.ckpt%20%5C%0A%20%20%20%20%2D%2Dconfig%20/path/to/logdir/configs/project.yaml%20%20%5C,-usage%3A%20txt2img.py

The generated project file, 2023-06-15T16-43-07-project.yaml, does not contain the "target" key that txt2img requires, so this error occurs:

raise KeyError("Expected key `target` to instantiate.")
KeyError: 'Expected key `target` to instantiate.'

Jun 16 '23 04:06 wangmiaowei
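The KeyError above is typical of config loaders that build an object from a "target" import path plus optional "params". Below is a simplified sketch of that pattern, with a pre-check that lists sections lacking "target". The section names "model" and "data" and the helper sections_missing_target are assumptions for illustration. The actual loader in the repository may handle more cases.

```python
import importlib
from typing import Any, Dict, List, Sequence


def instantiate_from_config(config: Dict[str, Any]) -> Any:
    """Import the class named by config['target'] and call it with config['params']."""
    if "target" not in config:
        raise KeyError("Expected key `target` to instantiate.")
    module_name, class_name = config["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**config.get("params", {}))


def sections_missing_target(
    config: Dict[str, Any], sections: Sequence[str] = ("model", "data")
) -> List[str]:
    """Return the names of the listed sections that have no 'target' key."""
    return [name for name in sections if "target" not in (config.get(name) or {})]
```

Running the pre-check on the saved project file before inference would show which section was written without "target".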

Another question: have you tried fine-tuning Stable Diffusion V1.5? In this fine-tuning code I only see you use the SD v2.0 checkpoint. If I use v1.5-pruned.ckpt, a mismatch error occurs as well.

Jun 16 '23 04:06 wangmiaowei
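A checkpoint mismatch like the one reported for v1.5-pruned.ckpt can be narrowed down by comparing parameter names and shapes before loading. The sketch below works on plain dictionaries of name to shape. The parameter names and the helper diff_shapes are hypothetical. The 768 and 1024 widths are illustrative of the different text-encoder embedding sizes of the v1 and v2 model families.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def diff_shapes(model: Dict[str, Shape], checkpoint: Dict[str, Shape]) -> Dict[str, List[str]]:
    """Compare expected (model) and stored (checkpoint) parameter shapes by name."""
    return {
        # Parameters the model expects but the checkpoint lacks
        "missing": sorted(k for k in model if k not in checkpoint),
        # Parameters present only in the checkpoint
        "unexpected": sorted(k for k in checkpoint if k not in model),
        # Parameters present in both but with different shapes
        "mismatched": sorted(k for k in model if k in checkpoint and model[k] != checkpoint[k]),
    }


# Hypothetical parameter names, for illustration only
model_shapes = {"attn.to_k.weight": (320, 1024), "conv_in.weight": (320, 4, 3, 3)}
ckpt_shapes = {"attn.to_k.weight": (320, 768), "conv_in.weight": (320, 4, 3, 3), "extra.bias": (4,)}
report = diff_shapes(model_shapes, ckpt_shapes)
```

With PyTorch, the two dictionaries could be built from state_dict() entries by mapping each tensor to tuple(tensor.shape).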

In fact, this repo has a lot of conflicts and errors. I hope your team will check all of it carefully.

Jun 16 '23 04:06 wangmiaowei

Thank you. We will address these issues.

Jun 16 '23 05:06 flybird11111

If you want to run the chat application, you can upgrade xformers to a newer version.

Jun 16 '23 05:06 flybird11111

Apart from the xformers issue and the missing 1.5 config file, multi-node multi-GPU training with the examples/images/diffusion code has basically never succeeded for me. It always fails with strange errors such as socket timeout, core dump, or illegal instruction. Sometimes even single-node multi-GPU runs report a socket timeout. Judging from the test results you published, you must have tested successfully on multiple nodes with multiple GPUs, so why does the code in the repository have so many bugs? The GPU memory savings and speedup are very attractive, and I really want to try training SD on multiple nodes with ColossalAI, but these bugs are a real deterrent. I suggest retesting the code before uploading it again. @jiangmingyan

Jun 28 '23 06:06 zhangvia

@zhangvia I'm now looking at MosaicML's code: https://github.com/mosaicml/diffusion/blob/main/diffusion/train.py At least it works!

Jun 28 '23 06:06 wangmiaowei

The diffusers code also works, but it uses too much GPU memory: with 512 images the batch size can only be set to 1... How is the GPU memory usage of MosaicML's code? Can it run on multiple nodes with multiple GPUs?

Jun 28 '23 06:06 zhangvia

@zhangvia It runs on a 3090, bro... I've run it on a single node with multiple GPUs.

Jun 28 '23 07:06 wangmiaowei

@wangmiaowei OK, I'll keep tuning ColossalAI for a bit, and switch to MosaicML if that really doesn't work. I don't know why, but ColossalAI really does cut GPU memory use a lot and it is fast too; it just has too many bugs. Being able to run 512 images with batch size 16 on a 4090 is genuinely impressive. It's a pity about the bugs.

Jun 28 '23 08:06 zhangvia

@wangmiaowei We updated the training process to use the new Booster API: https://github.com/hpcaitech/ColossalAI/tree/feature/stable-diffusion/applications/stable-diffusion/text_img2img. You can check this new branch for the stable-diffusion update. We retrained Stable Diffusion V1.4, but I think training a V1.5 version is quite similar. You only need to change the model name and update the fine-tuning dataset in the bash script; after that, you can train your model automatically.

Jul 02 '23 06:07 tiandiao123

Would the current docker container support this new branch? @tiandiao123

Jul 02 '23 18:07 Thomas2419

Not yet, but we can make one!

Jul 03 '23 03:07 tiandiao123

Where is the environment.yaml in the new branch? Is it the same as examples/images/diffusion/environment.yaml in the main repo? I didn't see the diffusers library in that environment.yaml, yet the new training script in the new branch uses diffusers. What exact version of diffusers do your new training scripts use? Or could you please share the new environment.yaml file?

Jul 03 '23 07:07 zhangvia

@zhangvia Have you tried it? How was it?

Jul 04 '23 11:07 wangmiaowei

@tiandiao123 I have tried this new branch, but you removed the EMA feature outright. Did you hit a bug that has not been fixed yet?

Jul 05 '23 11:07 wangmiaowei

I tried it and it runs. Using the ColossalAI version from that branch together with diffusers 0.17.1 works, and the results are about the same as with the previous training code. As far as I can see, EMA was not removed; only the line from diffusers.training_utils import EMAModel was deleted, so adding it back should probably be enough, though I haven't tried. The code in this branch also has some problems and looks half-finished.

Jul 05 '23 11:07 zhangvia

@zhangvia Indeed, it's half-finished; the moving average feature was simply cut.

Jul 07 '23 03:07 wangmiaowei