🐛 Describe the bug
Hello, thanks for the remarkable work on the diffusion example. I ran into some problems when running the diffusion code and would appreciate your help.
I run:

```
sudo pip3 install torch==1.12.0+cu102 torchvision==0.13.0+cu102 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu102
sudo pip3 install transformers==4.19.2 diffusers invisible-watermark
sudo pip3 install -e .
```

then install lightning from https://github.com/Lightning-AI/lightning.git, and install Colossal-AI with:

```
sudo pip3 install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org
```
Then I run train_colossalai.sh but hit an error from the optimizer:
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/colossalai/nn/optimizer/hybrid_adam.py", line 81, in __init__
    import colossalai._C.fused_optim
ImportError: /usr/local/lib/python3.8/dist-packages/colossalai/_C/fused_optim.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops19empty_memory_format4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEENS5_INS2_12MemoryFormatEEE

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 805, in <module>
    trainer.fit(model, data)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1079, in _run
    self.strategy.setup(self)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/colossalai.py", line 332, in setup
    self.setup_optimizers(trainer)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/strategy.py", line 142, in setup_optimizers
    self.optimizers, self.lr_scheduler_configs, self.optimizer_frequencies = _init_optimizers_and_lr_schedulers(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/optimizer.py", line 180, in _init_optimizers_and_lr_schedulers
    optim_conf = model.trainer._call_lightning_module_hook("configure_optimizers", pl_module=model)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1342, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/zhangshen/colossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1377, in configure_optimizers
    opt = HybridAdam(params, lr=lr)
  File "/usr/local/lib/python3.8/dist-packages/colossalai/nn/optimizer/hybrid_adam.py", line 83, in __init__
    raise ImportError('Please install colossalai from source code to use HybridAdam')
ImportError: Please install colossalai from source code to use HybridAdam

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 807, in <module>
    melk()
  File "main.py", line 790, in melk
    trainer.save_checkpoint(ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1941, in save_checkpoint
    self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 539, in save_checkpoint
    _checkpoint = self.dump_checkpoint(weights_only)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 471, in dump_checkpoint
    "state_dict": self._get_lightning_module_state_dict(),
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 553, in _get_lightning_module_state_dict
    state_dict = self.trainer.strategy.lightning_module_state_dict()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/colossalai.py", line 383, in lightning_module_state_dict
    assert isinstance(self.model, ZeroDDP)
AssertionError
```
But I did in fact install colossalai with `sudo pip3 install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org`.
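(For context: an undefined-symbol error like the one above usually means the prebuilt `fused_optim` extension was compiled against a different PyTorch/CUDA build than the one installed. Here the wheel tag asks for torch 1.12 with CUDA 11.3, while the installed torch is `1.12.0+cu102`. A minimal sketch that makes the mismatch explicit; the helper functions are hypothetical, not part of ColossalAI:)

```python
import re

def parse_build_tag(colossalai_version: str):
    """Extract the (torch, cuda) versions a ColossalAI wheel tag names.

    Hypothetical helper: wheel tags look like '0.1.12+torch1.12cu11.3'.
    """
    m = re.search(r"\+torch([\d.]+)cu([\d.]+)", colossalai_version)
    return (m.group(1), m.group(2)) if m else None

def torch_cuda(torch_version: str):
    """Extract the CUDA tag from a torch version: '1.12.0+cu102' -> '10.2'."""
    m = re.search(r"\+cu(\d+)", torch_version)
    if not m:
        return None
    digits = m.group(1)
    return f"{digits[:-1]}.{digits[-1]}"  # '102' -> '10.2', '113' -> '11.3'

# The versions from this report:
built_for = parse_build_tag("0.1.12+torch1.12cu11.3")
installed = torch_cuda("1.12.0+cu102")
print(built_for, installed)  # ('1.12', '11.3') vs '10.2': CUDA ABI mismatch
```

When the CUDA version baked into the wheel differs from the one torch was built with, the compiled `.so` references symbols that do not exist in the installed libtorch, which surfaces exactly as an `undefined symbol` ImportError.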
Environment
No response
You can try installing from PyPI via `pip install colossalai` directly. We have made the PyTorch extensions optional and will only build an extension when it is actually required by the program.
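(A sketch of that reinstall, assuming `pip3` points at the same Python environment used for training:)

```shell
# Remove the prebuilt wheel whose CUDA kernels do not match the installed torch
sudo pip3 uninstall -y colossalai
# Install the plain PyPI build; extensions are then compiled only when needed
sudo pip3 install colossalai
```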
Hi, I changed the versions of CUDA, PyTorch and colossalai and it now works. Here are my versions:
CUDA : 11.2
CUDNN: 8.1.0
pytorch: 1.11.0+cu113
colossalai: 0.1.10+torch1.11cu11.3
pytorch-lightning: 1.8.6
But I hit another problem when running train_colossalai_cifar10.yaml: I can train on a single GPU, but an error occurs when training with multiple GPUs. By the way, I can run train_ddp.yaml with multiple GPUs. I have filed this as #2505. Please help me.