ColossalAI
[BUG]: There are too many bugs in the code
🐛 Describe the bug
The code has too many problems. I suggest reviewing and maintaining it again.
Environment
No response
Could you please provide more details about the errors?
First, in examples/images/diffusion/configs/Teyvat/train_colossalai_teyvat.yaml, if I set use_ema: True, I get an error like this:
/opt/conda/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Epoch 0: 1%| | 1/105 [00:12<21:38, 12.49s/it, loss=0.852, v_num=0, train/loss_simple_step=0.852, train/loss_v/opt/conda/lib/python3.7/site-packages/lightning/pytorch/strategies/ddp.py:437: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
Traceback (most recent call last):
File "/root/programs_wmw/sd_train/ColossalAI-main/examples/images/diffusion/main.py", line 847, in <module>
trainer.fit(model, data)
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 609, in fit
self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 1103, in _run
results = self._run_stage()
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 1182, in _run_stage
Summoning checkpoint.
self._run_train()
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 1205, in _run_train
self.fit_loop.run()
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/loops/epoch/training_epoch_loop.py", line 230, in advance
self.trainer._call_lightning_module_hook("on_train_batch_end", batch_end_outputs, batch, batch_idx)
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 1347, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/root/programs_wmw/sd_train/ColossalAI-main/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 500, in on_train_batch_end
self.model_ema(self.model)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/root/programs_wmw/sd_train/ColossalAI-main/examples/images/diffusion/ldm/modules/ema.py", line 46, in forward
shadow_params[sname].sub_(one_minus_decay * (shadow_params[sname] - m_param[key]))
AttributeError: 'NoneType' object has no attribute 'sub_'
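The failing line in the traceback applies the usual exponential moving average update, shadow <- shadow - (1 - decay) * (shadow - param), and it fails because the looked-up shadow entry is None. As a rough illustration only, here is that update in plain Python with floats standing in for tensors. The function name `ema_update` and the explicit guard are assumptions made for this sketch, not code from the repository, and the sketch does not explain why the shadow entry is missing in the first place.

```python
from typing import Dict, Optional


def ema_update(shadow: Dict[str, Optional[float]],
               params: Dict[str, float],
               decay: float = 0.9999) -> None:
    """Update `shadow` in place: shadow <- shadow - (1 - decay) * (shadow - param)."""
    one_minus_decay = 1.0 - decay
    for name, value in params.items():
        current = shadow.get(name)
        if current is None:
            # With tensors, this is the situation that surfaces as
            # "'NoneType' object has no attribute 'sub_'".
            raise KeyError(f"no EMA shadow value registered for parameter {name!r}")
        shadow[name] = current - one_minus_decay * (current - value)


# Hypothetical usage: one parameter "w" moving from 1.0 towards 0.0.
shadow = {"w": 1.0}
ema_update(shadow, {"w": 0.0}, decay=0.9)
print(shadow["w"])
```

A guard like this does not fix the underlying problem, but it names the parameter whose shadow copy is missing, which is usually the first thing to find out.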
Also, your README says I need to use xformers 0.0.12. However, the code at https://github.com/hpcaitech/ColossalAI/blob/d4fb7bfda7a2da5480e1187e8d3e40884b42ba11/applications/Chat/coati/kernels/opt_attn.py#L72
uses attn_bias, which does not exist in xformers 0.0.12:
attn_output = xops.memory_efficient_attention(query_states,
                                              key_states,
                                              value_states,
                                              attn_bias=xops.LowerTriangularMask(),
                                              p=self.dropout if self.training else 0.0,
                                              scale=self.scaling)
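One way to check whether an installed xformers build accepts the attn_bias keyword, without relying on version numbers, is to inspect the function signature. This is a minimal sketch: the helper name `accepts_keyword` is made up for illustration, and it assumes `memory_efficient_attention` is an ordinary Python function whose signature can be inspected.

```python
import inspect
from typing import Callable


def accepts_keyword(fn: Callable, name: str) -> bool:
    """Return True if `fn` can be called with the keyword argument `name`."""
    try:
        params = inspect.signature(fn).parameters
    except (TypeError, ValueError):
        # Some builtin or extension functions expose no signature.
        return False
    keyword_kinds = (inspect.Parameter.POSITIONAL_OR_KEYWORD,
                     inspect.Parameter.KEYWORD_ONLY)
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


if __name__ == "__main__":
    try:
        import xformers.ops as xops
    except ImportError:
        print("xformers is not installed")
    else:
        supported = accepts_keyword(xops.memory_efficient_attention, "attn_bias")
        print("attn_bias supported:", supported)
```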
In the README you reference, https://github.com/hpcaitech/ColossalAI/tree/d4fb7bfda7a2da5480e1187e8d3e40884b42ba11/examples/images/diffusion#:~:text=Teyvat/train_colossalai_teyvat.yaml-,Inference,logdir/checkpoints/last.ckpt%20%5C%0A%20%20%20%20%2D%2Dconfig%20/path/to/logdir/configs/project.yaml%20%20%5C,-usage%3A%20txt2img.py
the inference command takes a project.yaml. The project file written by training, 2023-06-15T16-43-07-project.yaml, does not contain any "target" key, which txt2img requires, so this error occurs:
raise KeyError("Expected key `target` to instantiate.")
KeyError: 'Expected key `target` to instantiate.'
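For context, the message reads like it comes from a helper that builds an object from a config mapping with a `target` import path and optional `params`. The following is a simplified sketch of that pattern, not the repository's exact code, and the `datetime.timedelta` target is only a stand-in so the example runs without any model code.

```python
import importlib
from typing import Any, Mapping


def get_obj_from_str(path: str) -> Any:
    """Resolve a dotted path such as 'package.module.ClassName' to the object."""
    module_name, attr = path.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), attr)


def instantiate_from_config(config: Mapping[str, Any]) -> Any:
    """Build config['target'](**config['params']); fail if 'target' is absent."""
    if "target" not in config:
        raise KeyError("Expected key `target` to instantiate.")
    cls = get_obj_from_str(config["target"])
    return cls(**config.get("params", {}))


# A config section with a target instantiates normally.
delta = instantiate_from_config(
    {"target": "datetime.timedelta", "params": {"seconds": 90}}
)
print(delta.total_seconds())

# A section without a target reproduces the error from the report.
try:
    instantiate_from_config({"params": {"seconds": 90}})
except KeyError as err:
    print("failed:", err)
```

If the project.yaml passed to txt2img has no section with a `target` entry where the script expects one, a helper of this shape would fail in exactly this way.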
Another question: have you tried fine-tuning Stable Diffusion v1.5? In this fine-tuning code I only see sd v2.0.ckpt being used. If I use v1.5-pruned.ckpt, a mismatch error occurs as well.
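A mismatch of this kind can be narrowed down by comparing the parameter names and shapes the model expects against those stored in the checkpoint. The sketch below works on plain name-to-shape dictionaries. The function name `diff_shapes` and the key name in the example are hypothetical, and the two shapes only illustrate the known difference in text-encoder width between v1.x (768) and v2.x (1024).

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def diff_shapes(expected: Dict[str, Shape],
                loaded: Dict[str, Shape]) -> Tuple[List[str], List[str], List[str]]:
    """Return (missing, unexpected, mismatched) parameter names."""
    missing = sorted(set(expected) - set(loaded))
    unexpected = sorted(set(loaded) - set(expected))
    mismatched = sorted(
        name for name in set(expected) & set(loaded)
        if expected[name] != loaded[name]
    )
    return missing, unexpected, mismatched


# Hypothetical key name; shapes are illustrative only.
model_shapes = {"attn2.to_k.weight": (320, 1024)}   # model built from a v2-style config
ckpt_shapes = {"attn2.to_k.weight": (320, 768)}     # weights from a v1-style checkpoint

missing, unexpected, mismatched = diff_shapes(model_shapes, ckpt_shapes)
print("missing:", missing)
print("unexpected:", unexpected)
print("shape mismatch:", mismatched)
```

With PyTorch, the two dictionaries can be built as `{k: tuple(v.shape) for k, v in state_dict.items()}` from the model's and the checkpoint's state dicts.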
In fact, this repo has lots of conflicts and errors. I hope your team carefully checks the whole thing.
Thank you. We will address these issues.
If you want to run the chat application, you can upgrade xformers.
Apart from the xformers problem and the missing 1.5 config file, multi-node multi-GPU training with the examples/images/diffusion code has basically never succeeded for me. It always reports all sorts of strange errors, such as socket timeout, core dump, and illegal instruction. Sometimes even single-node multi-GPU runs report socket timeout. Judging from the test results you published, you must have tested successfully on multiple nodes and GPUs, so why does the code in the repository have so many bugs? That said, the GPU memory savings and the speedup are really attractive, and I do want to try multi-node multi-GPU SD training with colossalai, but these bugs are really off-putting. I suggest retesting the code before uploading it again. @jiangmingyan
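For the socket timeout reports, one basic check before launching a multi-node job is whether each worker can open a TCP connection to the rendezvous address at all. This is only a diagnostic sketch, not a fix. It assumes the job uses the MASTER_ADDR and MASTER_PORT environment variables that torch.distributed reads for env-style initialization, and the fallback values below are placeholders.

```python
import os
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    # Placeholder defaults; use the values your launcher exports.
    host = os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = int(os.environ.get("MASTER_PORT", "29500"))
    print(f"{host}:{port} reachable: {can_connect(host, port)}")
```

The check is only meaningful while something is listening on that port, for example while the rank 0 process is already waiting for the other workers.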
@zhangvia I'm looking at MosaicML now: https://github.com/mosaicml/diffusion/blob/main/diffusion/train.py At least it works!
The diffusers code also works, but its GPU memory consumption is too high: with 512 images the batch size can only be set to 1... How is MosaicML's GPU memory consumption, and can it run multi-node multi-GPU?
@zhangvia Bro, it runs on a 3090... I have run it single-node multi-GPU.
@wangmiaowei OK, I'll keep tuning colossalai for a bit, and if it really doesn't work I'll switch to MosaicML. For some reason colossalai really does cut GPU memory usage a lot, and it is fast too; there are just too many bugs. On a 4090 it can run 512 with a batch size of 16, which is really impressive. Too bad about the bugs.
Another question: have you tried fine-tuning Stable Diffusion v1.5? In this fine-tuning code I only see sd v2.0.ckpt being used. If I use v1.5-pruned.ckpt, a mismatch error occurs as well.
@wangmiaowei We updated the training process to use the new Booster API: https://github.com/hpcaitech/ColossalAI/tree/feature/stable-diffusion/applications/stable-diffusion/text_img2img. You can check this new branch for the Stable Diffusion update. We retrained Stable Diffusion v1.4, but I think training a v1.5 version is quite similar. You only need to change the model name and the fine-tuning dataset in the bash script; then you can train your model automatically.
Would the current docker container support this new branch? @tiandiao123
Not yet, but we can make one!
Where is the environment.yaml in the new branch? Is it the same as examples/images/diffusion/environment.yaml in the main repo? I don't see the diffusers library in that environment.yaml, yet the new training script in the new branch uses diffusers. Which exact version of diffusers do your new training scripts use? Or could you please share the new environment.yaml file?
@zhangvia Bro, have you tried it? How was it?
@tiandiao123 I have tried this new branch, but you removed the EMA feature outright. Did you run into a bug that has not been resolved yet?
I tried it and it runs. Using the colossalai version from that branch plus diffusers 0.17.1 it works, and the results are about the same as with the previous training code. As far as I can see, EMA has not been removed; only the line from diffusers.training_utils import EMAModel was deleted, so adding it back should probably be enough, though I have not tried that. The code on this branch also has some problems and looks half-finished.
@zhangvia Indeed, it is half-finished; the moving average feature was simply cut.