🐛 Describe the bug
Hello, thanks for the remarkable work on the diffusion example. I ran into some problems when running the diffusion code and would appreciate your help.
I run:

```
sudo pip3 install torch==1.12.0+cu102 torchvision==0.13.0+cu102 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu102
sudo pip3 install transformers==4.19.2 diffusers invisible-watermark
sudo pip3 install -e .
```

then install lightning from https://github.com/Lightning-AI/lightning.git, and install Colossal-AI with:

```
sudo pip3 install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org
```
Then I run train_colossalai.sh but hit an error from the optimizer:
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/colossalai/nn/optimizer/hybrid_adam.py", line 81, in __init__
    import colossalai._C.fused_optim
ImportError: /usr/local/lib/python3.8/dist-packages/colossalai/_C/fused_optim.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops19empty_memory_format4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEENS5_INS2_12MemoryFormatEEE

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 805, in <module>
    trainer.fit(model, data)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1079, in _run
    self.strategy.setup(self)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/colossalai.py", line 332, in setup
    self.setup_optimizers(trainer)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/strategy.py", line 142, in setup_optimizers
    self.optimizers, self.lr_scheduler_configs, self.optimizer_frequencies = _init_optimizers_and_lr_schedulers(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/optimizer.py", line 180, in _init_optimizers_and_lr_schedulers
    optim_conf = model.trainer._call_lightning_module_hook("configure_optimizers", pl_module=model)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1342, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/zhangshen/colossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1377, in configure_optimizers
    opt = HybridAdam(params, lr=lr)
  File "/usr/local/lib/python3.8/dist-packages/colossalai/nn/optimizer/hybrid_adam.py", line 83, in __init__
    raise ImportError('Please install colossalai from source code to use HybridAdam')
ImportError: Please install colossalai from source code to use HybridAdam

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 807, in <module>
    melk()
  File "main.py", line 790, in melk
    trainer.save_checkpoint(ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1941, in save_checkpoint
    self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 539, in save_checkpoint
    _checkpoint = self.dump_checkpoint(weights_only)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 471, in dump_checkpoint
    "state_dict": self._get_lightning_module_state_dict(),
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 553, in _get_lightning_module_state_dict
    state_dict = self.trainer.strategy.lightning_module_state_dict()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/colossalai.py", line 383, in lightning_module_state_dict
    assert isinstance(self.model, ZeroDDP)
AssertionError
```
But I did in fact install colossalai with `sudo pip3 install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org`.
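(For context: an undefined-symbol error like the one above usually means the prebuilt `fused_optim` extension was compiled against a different PyTorch/CUDA build than the one installed. Here the wheel tag asks for torch 1.12 with CUDA 11.3, while the installed torch is `1.12.0+cu102`. A minimal sketch that makes the mismatch explicit; the helper functions are hypothetical, not part of ColossalAI:)

```python
import re

def parse_build_tag(colossalai_version: str):
    """Extract the (torch, cuda) versions a ColossalAI wheel tag names.

    Hypothetical helper: wheel tags look like '0.1.12+torch1.12cu11.3'.
    """
    m = re.search(r"\+torch([\d.]+)cu([\d.]+)", colossalai_version)
    return (m.group(1), m.group(2)) if m else None

def torch_cuda(torch_version: str):
    """Extract the CUDA tag from a torch version: '1.12.0+cu102' -> '10.2'."""
    m = re.search(r"\+cu(\d+)", torch_version)
    if not m:
        return None
    digits = m.group(1)
    return f"{digits[:-1]}.{digits[-1]}"  # '102' -> '10.2', '113' -> '11.3'

# The versions from this report:
built_for = parse_build_tag("0.1.12+torch1.12cu11.3")
installed = torch_cuda("1.12.0+cu102")
print(built_for, installed)  # ('1.12', '11.3') vs '10.2': CUDA ABI mismatch
```

When the CUDA version baked into the wheel differs from the one torch was built with, the compiled `.so` references symbols that do not exist in the installed libtorch, which surfaces exactly as an `undefined symbol` ImportError.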
Environment
No response
You can try installing from PyPI via `pip install colossalai` directly. We have made the PyTorch extensions optional and will only build an extension when it is actually required by the program.
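(A sketch of that reinstall, assuming `pip3` points at the same Python environment used for training:)

```shell
# Remove the prebuilt wheel whose CUDA kernels do not match the installed torch
sudo pip3 uninstall -y colossalai
# Install the plain PyPI build; extensions are then compiled only when needed
sudo pip3 install colossalai
```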
Hi, I changed the versions of CUDA, PyTorch and colossalai and it now works. Here are my versions:
CUDA : 11.2
CUDNN: 8.1.0
pytorch: 1.11.0+cu113
colossalai: 0.1.10+torch1.11cu11.3
pytorch-lightning: 1.8.6
But I hit another problem when running train_colossalai_cifar10.yaml: I can train on a single GPU, but an error occurs when training with multiple GPUs. By the way, I can run train_ddp.yaml with multiple GPUs. I have filed this as #2505. Please help me.