ColossalAI
[BUG]: Diffusion Fine Tuning
🐛 Describe the bug
Hi Colossal AI team,
I have been trying to get the fine-tuning example on the CIFAR-10 dataset from the diffusion example to work. I have followed the instructions in the README for the diffusion example, and the training seems to start, but it always fails because of an issue with 'HybridAdam' not being available.
Here is what I have tried:
- Using different PyTorch versions, such as 1.13.1
- Using the latest Colossal-AI version, 0.1.12, besides of course the 0.1.10 version the README specified
This is the error I get when running the CIFAR-10 example, i.e. with train_colossalai_cifar10.yaml:
Error log

```
....'vision_model.encoder.layers.2.layer_norm1.bias', 'vision_model.encoder.layers.2.layer_norm2.weight', 'vision_model.encoder.layers.23.mlp.fc2.weight', 'vision_model.encoder.layers.19.mlp.fc1.weight']
- This IS expected if you are initializing CLIPTextModelZero from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CLIPTextModelZero from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Summoning checkpoint.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/colossalai/nn/optimizer/hybrid_adam.py", line 80, in __init__
    import colossalai._C.cpu_optim
ImportError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 808, in <module>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 810, in <module>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 830, in <module>
```
Environment
I am on Colab, which uses CUDA 11.2. I tried upgrading to 11.3, but unfortunately with no luck.
```
!nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
```

```python
import torch
print(torch.__version__)   # 1.11.0+cu102
torch.cuda.is_available()  # True
```
Thanks for your issue. The HybridAdam problem means you should try installing colossalai from source. Meanwhile, because ColossalAI is a rapidly updating project, some versions may conflict with Lightning; we will provide stable versions of ColossalAI and Lightning in the near future to ensure the code works properly.
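A quick way to check whether a source build actually produced the fused optimizer kernels is to import the compiled module that the traceback above references (a minimal sketch; `colossalai._C.cpu_optim` is the 0.1.x-era private module name taken from the traceback, not a public API):

```python
# Sanity check for a source install of ColossalAI 0.1.x: HybridAdam's
# __init__ imports this compiled module (per the traceback above), so
# if the import fails here it will also fail inside HybridAdam.
try:
    import colossalai._C.cpu_optim  # noqa: F401
    print("compiled optimizer extension found; HybridAdam should load")
except ImportError as exc:
    print(f"extension missing, rebuild from source: {exc}")
```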
@Fazziekey
Hi, I have tried many ways to get ColossalAI running as a Lightning plugin, but I could not succeed and always ended up with the error: ImportError: Please install colossalai from source code to use HybridAdam
I list my attempts below for your reference. Hope it is useful.
Method 1: Directly install
Use conda env
- Cuda version: 11.7
- Pytorch: 1.13.0
- Lightning: current version (1.8.0)
- ColossalAI: 0.1.12
Method 2: Directly install
Use conda env
- Cuda version: 11.3
- Pytorch: 1.12.0
- Lightning: current version (1.8.0)
- ColossalAI: 0.1.12
Method 3: Directly install
Use conda env
- Cuda version: 11.3
- Pytorch: 1.12.0
- Lightning: current version (1.8.0)
- ColossalAI: 0.1.10
Method 4: Compiled install
Use conda env
- Cuda version: 11.3
- Pytorch: 1.12.0
- Lightning: Latest (1.8.0)
- apex: current version
- ColossalAI (From source): 0.1.12
Method 5: Compiled install
Use conda env
- Cuda version: 11.3
- Pytorch: 1.12.0
- Lightning: Latest (1.8.0)
- apex: current version
- ColossalAI (From source): 0.1.10
Method 6: Compiled install
Use container: hpcaitech/cuda-conda:11.3 (default container Python env)
- Cuda version: 11.3
- Pytorch: 1.12.0
- Lightning: Latest (1.8.0)
- apex: current version
- ColossalAI (From source): 0.1.10
I guess this error is caused by the Python module management system, not by PyTorch Lightning. I will test my network with ColossalAI's trainer to check whether the problem still exists.
And by the way, when I changed the ColossalAI version from 0.1.10 to 0.1.12 (the newest), I got the error: #1872
@Fazziekey
Following the last step, I re-did Method 6 and successfully got ColossalAI and HybridAdam working. But I got a weird error, shown below:
```
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Global seed set to 432
[2022-12-13 03:12:14,760][ProcessGroup][INFO] - ~/.local/lib/python3.9/site-packages/colossalai/tensor/process_group.py:24 get
[2022-12-13 03:12:14,760][ProcessGroup][INFO] - NCCL initialize ProcessGroup on [0]
Error executing job with overrides: ['trainer=colossalai2g']
Traceback (most recent call last):
  File "USERFILE", line 94, in run
    trainer.fit(model, task)
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
    return function(*args, **kwargs)
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1083, in _run
    self._call_callback_hooks("on_fit_start")
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1380, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_summary.py", line 59, in on_fit_start
    model_summary = self._summary(trainer, pl_module)
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_summary.py", line 73, in _summary
    return summarize(pl_module, max_depth=self._max_depth)
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/model_summary/model_summary.py", line 431, in summarize
    return ModelSummary(lightning_module, max_depth=max_depth)
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/model_summary/model_summary.py", line 189, in __init__
    self._layer_summary = self.summarize()
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/model_summary/model_summary.py", line 246, in summarize
    self._forward_example_input()
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/model_summary/model_summary.py", line 278, in _forward_example_input
    model(input_)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "~/Documents/novozymes_enzyme/src/model/DeepET.py", line 150, in forward
    return self.model(x)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "~/Documents/novozymes_enzyme/src/model/DeepET.py", line 92, in forward
    x = self.conv1(x)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 307, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 303, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
  File "~/.local/lib/python3.9/site-packages/colossalai/tensor/colo_parameter.py", line 74, in __torch_function__
    return super().__torch_function__(func, types, args, kwargs)
  File "~/.local/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 170, in __torch_function__
    ret = func(*args, **kwargs)
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.HalfTensor) should be the same
```
colossalai check

```
Apptainer> colossalai check -i
CUDA Version: 11.3
PyTorch Version: 1.12.1
CUDA Version in PyTorch Build: 11.3
PyTorch CUDA Version Match: ✓
CUDA Extension: ✓
```
Do you have any advice about this problem?
https://github.com/hpcaitech/ColossalAI/issues/2114#issuecomment-1347702320
Setting `placement_policy="cuda"` seems to work:

```python
ColossalAIStrategy(enable_distributed_storage=False, placement_policy="cuda")
```
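For context, here is a minimal sketch of how this workaround slots into a Lightning Trainer. The strategy arguments are the ones shown above; the `precision` setting and the `model`/`dm` names are assumptions for illustration (the 1.8-era `ColossalAIStrategy` runs modules in fp16):

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import ColossalAIStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,  # ColossalAI's ZeRO path trains modules in fp16
    strategy=ColossalAIStrategy(
        enable_distributed_storage=False,
        placement_policy="cuda",  # keep parameters resident on the GPU
    ),
)
# trainer.fit(model, datamodule=dm)  # model/dm are your own objects
```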
Hi, I met the same problem. If I use colossalai 0.1.12, the error is "takes from 2 to 3 positional arguments but 5 were given". If I use colossalai 0.1.10, the error is "Please install colossalai from source code to use HybridAdam". How did you solve this? Thanks! @wurining
Run `pip install colossalai==0.1.11rc5+torch1.12cu11.3 -f https://release.colossalai.org`
I tried this and it gives the same error as 0.1.12: "takes from 2 to 3 positional arguments but 5 were given". @Fazziekey
I meet the same problem when running `pip install colossalai==0.1.10+torch1.12cu11.3 -f https://release.colossalai.org`:
"takes from 2 to 3 positional arguments but 5 were given"
It seems you are passing too many arguments to a colossal function.
@flynnamy
In my experience, there are four things you need to check (see the environment check sketch after this list):
- Install the correct version of PyTorch;
- Use `nvcc -V` to ensure your CUDA and cudatoolkit versions match the ones your PyTorch build relies on;
- Install apex;
- Do a compiled installation of ColossalAI 0.1.10 from source.

Hope that is useful :)
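The sketch mentioned above: a quick environment check covering the first two points (plain PyTorch, nothing ColossalAI-specific):

```python
import torch

print(torch.__version__)          # the installed PyTorch build
print(torch.version.cuda)         # CUDA version PyTorch was compiled
                                  # with; compare against `nvcc -V`
print(torch.cuda.is_available())  # must be True before building extensions
```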
@Fazziekey Could you tell me which branch you are developing on?
My env: CUDA version: 11.3, PyTorch: 1.12.0, pytorch-lightning: 1.9.0.dev0 (from 1SAA's git), colossalai: 0.1.10+torch1.12cu11.3 (compiled from source; 0.1.11, 0.1.12 and 0.1.13 all hit the same error)
```
│   269 │   │   │   .reshape(B, out.shape[1], C)
│   270 │   │   )
│   271 │   │   out = rearrange(out, 'b (h w) c -> b c h w', b=B, h=H, w=W, c=C)
│ ❱ 272 │   │   out = self.proj_out(out)
│   273 │   │   return x+out
│
│   167 │   │   │   func = _COLOSSAL_OPS[func]
│   168 │   │
│   169 │   │   with torch._C.DisableTorchFunction():
│ ❱ 170 │   │   │   ret = func(*args, **kwargs)
│   171 │   │   │   if func in _get_my_nowrap_functions():
│   172 │   │   │   │   return ret
│   173 │   │   │   else:
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same
```
It occurs at /diffusion/ldm/modules/diffusionmodules/model.py:272: before entering `self.proj_out`, `out.dtype` is float32, but `self.proj_out.weight.data.dtype` is float16.
Is the problem in the handling of mixed precision?
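For what the error means in isolation, here is a tiny illustration (not the project's code): a module whose weights were cast to half while the input stays float32 raises exactly this RuntimeError, and casting the input to the module's dtype resolves it:

```python
import torch

conv = torch.nn.Conv2d(4, 4, 3).cuda().half()  # fp16 weights
x = torch.randn(1, 4, 8, 8, device="cuda")     # fp32 input

try:
    conv(x)  # fp32 input vs fp16 weights: dtype mismatch
except RuntimeError as exc:
    print(exc)  # Input type (torch.cuda.FloatTensor) and weight type
                # (torch.cuda.HalfTensor) should be the same

out = conv(x.half())  # casting the input to fp16 fixes the mismatch
```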
I found out that the problem was caused by xformers; when I turned it off, the fine-tuning code could run.
We have updated a lot. This issue was closed due to inactivity. Thanks.