
[BUG]: Diffusion Fine Tuning

Open vazkir opened this issue 2 years ago • 12 comments

πŸ› Describe the bug

Hi Colossal AI team,

I have been trying to get the fine-tuning example on the CIFAR-10 dataset to work from the diffusion example. I followed the instructions in the diffusion example's README and training seems to start, but it always fails because 'HybridAdam' is not available.

So for what I have tried:

  • Using different PyTorch versions, such as 1.13.1
  • Using the latest Colossal-AI version 0.1.12, besides of course the 0.1.10 version the README states

This is the error I get when running the CIFAR-10 example, i.e. with train_colossalai_cifar10.yaml:

Error log:

...'vision_model.encoder.layers.2.layer_norm1.bias', 'vision_model.encoder.layers.2.layer_norm2.weight', 'vision_model.encoder.layers.23.mlp.fc2.weight', 'vision_model.encoder.layers.19.mlp.fc1.weight']
- This IS expected if you are initializing CLIPTextModelZero from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CLIPTextModelZero from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Summoning checkpoint.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/colossalai/nn/optimizer/hybrid_adam.py", line 80, in __init__
    import colossalai._C.cpu_optim
ImportError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 808
    trainer.fit(model, data)
  File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/trainer.py", line 582, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 90, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/trainer.py", line 624, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/trainer.py", line 1042, in _run
    self.strategy.setup(self)
  File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/strategies/colossalai.py", line 332, in setup
    self.setup_optimizers(trainer)
  File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/strategies/strategy.py", line 142, in setup_optimizers
    self.optimizers, self.lr_scheduler_configs, self.optimizer_frequencies = _init_optimizers_and_lr_schedulers(
  File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/core/optimizer.py", line 180, in _init_optimizers_and_lr_schedulers
    optim_conf = model.trainer._call_lightning_module_hook("configure_optimizers", pl_module=model)
  File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/trainer.py", line 1305, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/content/drive/MyDrive/Colab Notebooks/CS82/Final Project/fine_tune/colossal-diffusion/ldm/models/diffusion/ddpm.py", line 1452, in configure_optimizers
    opt = HybridAdam(params, lr=lr)
  File "/usr/local/lib/python3.8/dist-packages/colossalai/nn/optimizer/hybrid_adam.py", line 83, in __init__
    raise ImportError('Please install colossalai from source code to use HybridAdam')
ImportError: Please install colossalai from source code to use HybridAdam

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 810
    melk()
  File "main.py", line 789, in melk
    trainer.save_checkpoint(ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/trainer.py", line 1904, in save_checkpoint
    self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
  File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 539, in save_checkpoint
    _checkpoint = self.dump_checkpoint(weights_only)
  File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 471, in dump_checkpoint
    "state_dict": self._get_lightning_module_state_dict(),
  File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 553, in _get_lightning_module_state_dict
    state_dict = self.trainer.strategy.lightning_module_state_dict()
  File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/strategies/colossalai.py", line 383, in lightning_module_state_dict
    assert isinstance(self.model, ZeroDDP)
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 830
    print(trainer.profiler.summary())
  File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/profilers/pytorch.py", line 465, in summary
    self._delete_profilers()
  File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/profilers/pytorch.py", line 514, in _delete_profilers
    self._cache_functions_events()
  File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/profilers/pytorch.py", line 506, in _cache_functions_events
    self.function_events = self.profiler.events()
  File "/usr/local/lib/python3.8/dist-packages/torch/profiler/profiler.py", line 156, in events
    assert self.profiler
AssertionError

Environment

I am on Colab, which uses CUDA 11.2. I tried upgrading to 11.3, but unfortunately with no luck.

!nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0

import torch
print(torch.__version__)   # 1.11.0+cu102
torch.cuda.is_available()  # True
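A small extra check that may help here (a minimal sketch, not part of the original report): printing the CUDA version this PyTorch wheel was built against, to compare with the nvcc output above.

```python
# Minimal sketch: compare PyTorch's own CUDA build with the system toolkit above.
import torch

print(torch.__version__)          # e.g. 1.11.0+cu102
print(torch.version.cuda)         # CUDA version of the PyTorch build, e.g. 10.2
print(torch.cuda.is_available())  # True on a GPU runtime
```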

vazkir avatar Dec 11 '22 13:12 vazkir

Thanks for your issue. The HybridAdam problem means you should try installing ColossalAI from source. Meanwhile, because ColossalAI is a rapidly updated project, some versions may conflict with Lightning; we will provide a stable pairing of ColossalAI and Lightning in the near future to ensure the code works properly.
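For what it's worth, a hedged sketch of what "install from source" usually looks like for ColossalAI (exact behaviour and flags differ between releases; around the 0.1.x releases the fused CPU/CUDA optimizer kernels that HybridAdam needs are compiled during the pip install step, provided nvcc and a CUDA-matched PyTorch are available):

```bash
# Hedged sketch of a from-source install; branch/flag names may differ between releases.
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
pip install -r requirements/requirements.txt
pip install .   # compiles the fused CPU/CUDA optimizer kernels used by HybridAdam
```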

Fazziekey avatar Dec 12 '22 01:12 Fazziekey

@Fazziekey

Hi, I have tried many ways to get ColossalAI running as a Lightning plugin, but I could not succeed and always end up with the error: ImportError: Please install colossalai from source code to use HybridAdam

I list my attempts below for reference. Hope it is useful.

Method 1: Directly install

Use conda env

  1. Cuda version: 11.7
  2. Pytorch: 1.13.0
  3. Lightning: current version (1.8.0)
  4. ColossalAI: 0.1.12

Method 2: Directly install

Use conda env

  1. Cuda version: 11.3
  2. Pytorch: 1.12.0
  3. Lightning: current version (1.8.0)
  4. ColossalAI: 0.1.12

Method 3: Directly install

Use conda env

  1. Cuda version: 11.3
  2. Pytorch: 1.12.0
  3. Lightning: current version (1.8.0)
  4. ColossalAI: 0.1.10

Method 4: Compiled install

Use conda env

  1. Cuda version: 11.3
  2. Pytorch: 1.12.0
  3. Lightning: Latest (1.8.0)
  4. apex: current version
  5. ColossalAI (From source): 0.1.12

Method 5: Compiled install

Use conda env

  1. Cuda version: 11.3
  2. Pytorch: 1.12.0
  3. Lightning: Latest (1.8.0)
  4. apex: current version
  5. ColossalAI (From source): 0.1.10

Method 6: Compiled install

Use container: from hpcaitech/cuda-conda:11.3, with the default container Python env; a setup sketch for this compiled-install path follows after the list.

  1. Cuda version: 11.3
  2. Pytorch: 1.12.0
  3. Lightning: Latest (1.8.0)
  4. apex: current version
  5. ColossalAI (From source): 0.1.10
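For concreteness, a hedged sketch of the "compiled install" path used in Methods 4-6; the versions are the ones listed above, but the exact commands (conda packages, apex build flags, repo tags) are my assumptions rather than something verified in this thread:

```bash
# Hedged environment sketch for the compiled-install methods above; version pins
# and build flags are assumptions, not verified commands from this thread.
conda create -n colossal python=3.9 -y
conda activate colossal
conda install pytorch==1.12.0 torchvision cudatoolkit=11.3 -c pytorch -y
pip install pytorch-lightning==1.8.0

# apex built from source (build flags follow the apex README and may change over time)
git clone https://github.com/NVIDIA/apex && cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ..

# ColossalAI built from source so the HybridAdam CUDA kernels get compiled
git clone -b v0.1.10 https://github.com/hpcaitech/ColossalAI.git && cd ColossalAI
pip install -r requirements/requirements.txt
pip install .
```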

I guess this error is caused by the Python package management system rather than by PyTorch Lightning. I will test my network with ColossalAI's own trainer to check whether the problem still exists.

And by the way, when I change the ColossalAI version from 0.1.10 to 0.1.12 (the newest), I get the error described in #1872.

wurining avatar Dec 12 '22 18:12 wurining

@Fazziekey

Following the last step, I re-did Method 6 and successfully got ColossalAI and HybridAdam working. But I got a weird error, as below:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Global seed set to 432
[2022-12-13 03:12:14,760][ProcessGroup][INFO] - ~/.local/lib/python3.9/site-packages/colossalai/tensor/process_group.py:24 get
[2022-12-13 03:12:14,760][ProcessGroup][INFO] - NCCL initialize ProcessGroup on [0]
Error executing job with overrides: ['trainer=colossalai2g']
Traceback (most recent call last):
  File "USERFILE", line 94, in run
    trainer.fit(model, task)
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
    return function(*args, **kwargs)
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1083, in _run
    self._call_callback_hooks("on_fit_start")
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1380, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_summary.py", line 59, in on_fit_start
    model_summary = self._summary(trainer, pl_module)
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_summary.py", line 73, in _summary
    return summarize(pl_module, max_depth=self._max_depth)
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/model_summary/model_summary.py", line 431, in summarize
    return ModelSummary(lightning_module, max_depth=max_depth)
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/model_summary/model_summary.py", line 189, in __init__
    self._layer_summary = self.summarize()
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/model_summary/model_summary.py", line 246, in summarize
    self._forward_example_input()
  File "~/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/model_summary/model_summary.py", line 278, in _forward_example_input
    model(input_)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "~/Documents/novozymes_enzyme/src/model/DeepET.py", line 150, in forward
    return self.model(x)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "~/Documents/novozymes_enzyme/src/model/DeepET.py", line 92, in forward
    x = self.conv1(x)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 307, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 303, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
  File "~/.local/lib/python3.9/site-packages/colossalai/tensor/colo_parameter.py", line 74, in __torch_function__
    return super().__torch_function__(func, types, args, kwargs)
  File "~/.local/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 170, in __torch_function__
    ret = func(*args, **kwargs)
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.HalfTensor) should be the same

colossalai check

Apptainer> colossalai check -i
CUDA Version: 11.3
PyTorch Version: 1.12.1
CUDA Version in PyTorch Build: 11.3
PyTorch CUDA Version Match: ✓
CUDA Extension: ✓

Do you have any advice about this problem?

wurining avatar Dec 13 '22 03:12 wurining

https://github.com/hpcaitech/ColossalAI/issues/2114#issuecomment-1347702320

Setting placement_policy="cuda" seems to work.

ColossalAIStrategy(enable_distributed_storage=False, placement_policy="cuda")
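For context, a minimal usage sketch of that setting (assuming PyTorch Lightning 1.8.x, where the strategy ships as pytorch_lightning.strategies.ColossalAIStrategy; with the newer lightning package the import path is lightning.pytorch.strategies instead). Keeping parameters on the GPU avoids the half-precision-weights-on-CPU vs. fp32-CUDA-input mismatch seen in the traceback above:

```python
# Minimal usage sketch; `model` and `datamodule` stand in for your own objects.
import pytorch_lightning as pl
from pytorch_lightning.strategies import ColossalAIStrategy

strategy = ColossalAIStrategy(
    enable_distributed_storage=False,  # do not shard model storage across processes
    placement_policy="cuda",           # keep parameters on the GPU
)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,        # the ColossalAI strategy is used with 16-bit precision
    strategy=strategy,
)
# trainer.fit(model, datamodule)
```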

wurining avatar Dec 15 '22 06:12 wurining

Hi, I meet the same problem. If I use colossalai 0.1.12, the error is "takes from 2 to 3 positional arguments but 5 were given". If I use colossalai 0.1.10, the error is "Please install colossalai from source code to use HybridAdam". How did you solve this? Thanks! @wurining

flynnamy avatar Dec 15 '22 08:12 flynnamy

run pip install colossalai==0.1.11rc5+torch1.12cu11.3 -f https://release.colossalai.org

Fazziekey avatar Dec 15 '22 09:12 Fazziekey

I tried this and it gives the same error as 0.1.12: "takes from 2 to 3 positional arguments but 5 were given". @Fazziekey

flynnamy avatar Dec 15 '22 09:12 flynnamy

I try this and it has the same error like 0.1.12:"takes from 2 to 3 positional arguments but 5 were given". @Fazziekey

I meet the same problem when running pip install colossalai==0.1.10+torch1.12cu11.3 -f https://release.colossalai.org

lcwLcw123 avatar Dec 15 '22 16:12 lcwLcw123

"takes from 2 to 3 positional arguments but 5 were given"

It seems you passed too many arguments to a ColossalAI function.

Fazziekey avatar Dec 16 '22 01:12 Fazziekey

@flynnamy

In my experience, there are four things you need to check:

  1. install the correct version of PyTorch;
  2. use nvcc -V to make sure your CUDA / cudatoolkit version matches the one PyTorch was built against;
  3. install apex;
  4. do a compiled (from-source) install of ColossalAI 0.1.10 (a quick sanity-check sketch follows this list).
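A quick, hedged sanity-check sketch for those points (the commands are standard ones already used in this thread; the expected values are just the versions discussed here):

```bash
# Hedged sanity-check sketch for the list above.
nvcc -V                                                                  # system CUDA toolkit, e.g. 11.3
python -c "import torch; print(torch.__version__, torch.version.cuda)"   # e.g. 1.12.0 11.3
python -c "import torch; print(torch.cuda.is_available())"               # should print True
colossalai check -i                                                      # reports whether the CUDA extension was built
```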

hope that is useful :)

wurining avatar Dec 16 '22 16:12 wurining

@Fazziekey Could you tell me which branch you are developing on?

My env:

  • CUDA version: 11.3
  • PyTorch: 1.12.0
  • pytorch-lightning: 1.9.0.dev0 (from 1SAA's git)
  • colossalai: 0.1.10+torch1.12cu11.3 (compiled from source; 0.1.11, 0.1.12 and 0.1.13 all hit the same error)

(excerpt from diffusion/ldm/modules/diffusionmodules/model.py)

    269 │   │   │   .reshape(B, out.shape[1], C)
    270 │   │   )
    271 │   │   out = rearrange(out, 'b (h w) c -> b c h w', b=B, h=H, w=W, c=C)
  ❱ 272 │   │   out = self.proj_out(out)
    273 │   │   return x+out

(excerpt from colossalai/tensor/colo_tensor.py)

    167 │   │   │   func = _COLOSSAL_OPS[func]
    168 │   │
    169 │   │   with torch._C.DisableTorchFunction():
  ❱ 170 │   │   │   ret = func(*args, **kwargs)
    171 │   │   │   if func in _get_my_nowrap_functions():
    172 │   │   │   │   return ret
    173 │   │   │   else:
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same

It is located at diffusion/ldm/modules/diffusionmodules/model.py:272: before entering self.proj_out, out.dtype is float32, but self.proj_out.weight.data.dtype is float16.

Is the problem in the control of mixed precision?
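As a purely hypothetical illustration (not taken from this thread, and not the eventual fix): the same RuntimeError can be reproduced and silenced by matching the activation dtype to the weight dtype, which is what a correct mixed-precision setup should be doing automatically.

```python
# Hypothetical, self-contained illustration of the mismatch above; assumes a CUDA
# device, matching the setup in this thread. Only illustrates the symptom.
import torch

proj = torch.nn.Conv2d(4, 4, kernel_size=1).cuda().half()   # fp16 weights, like proj_out
out = torch.randn(1, 4, 8, 8, device="cuda")                 # fp32 activation

try:
    proj(out)  # RuntimeError: Input type (...FloatTensor) and weight type (...HalfTensor) should be the same
except RuntimeError as e:
    print(e)

print(proj(out.to(proj.weight.dtype)).dtype)  # torch.float16 once the input is cast
```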

ray0809 avatar Dec 21 '22 17:12 ray0809

https://github.com/hpcaitech/ColossalAI/issues/2114#issuecomment-1361721810

I found out that the problem was caused by xformers; when I turned it off, the fine-tuning code could run.
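For anyone hitting the same thing, a hedged sketch of one way to turn xformers off (this assumes the usual Stable Diffusion import guard in ldm/modules/attention.py, which falls back to standard attention when xformers cannot be imported; the exact mechanism may differ in the Colossal-AI example):

```bash
# Hedged sketch: with the xformers package not importable, the ldm attention code
# typically takes the standard attention path instead of the memory-efficient one.
pip uninstall -y xformers
```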

ray0809 avatar Dec 22 '22 03:12 ray0809

We have updated a lot. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 14 '23 08:04 binmakeswell