ColossalAI
[BUG]: Diffusion training bug RuntimeError: CUDA error: no kernel image is available for execution on the device
🐛 Describe the bug
I have no idea why I get "RuntimeError: CUDA error: no kernel image is available for execution on the device" while training the latent diffusion model on a super-resolution task. I would really appreciate it if you could help me out.
Lightning config:

trainer:
  accelerator: gpu
  devices: 1
  log_gpu_memory: all
  max_epochs: 3
  precision: 16
  auto_select_gpus: false
  strategy:
    target: strategies.ColossalAIStrategy
    params:
      use_chunk: true
      enable_distributed_storage: true
      placement_policy: cuda
      force_outputs_fp32: true
  log_every_n_steps: 3
  logger: true
  default_root_dir: /tmp/diff_log/
logger_config:
  wandb:
    target: loggers.WandbLogger
    params:
      name: nowname
      save_dir: /tmp/diff_log/
      offline: opt.debug
      id: nowname
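For context, the trainer/strategy keys above correspond roughly to building the Trainer directly in code as below. This is only a sketch reconstructed from the config shown, not the reporter's actual main_ISP.py; `strategies.ColossalAIStrategy` is assumed to resolve to `pytorch_lightning.strategies.ColossalAIStrategy` in PL 1.8.

```python
# Rough in-code equivalent of the trainer/strategy config above (a sketch
# based only on the keys shown in the issue, not the reporter's script).
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import ColossalAIStrategy

strategy = ColossalAIStrategy(
    use_chunk=True,
    enable_distributed_storage=True,
    placement_policy="cuda",       # keep parameters/gradients on the GPU
    force_outputs_fp32=True,
)
trainer = Trainer(
    accelerator="gpu",
    devices=1,
    max_epochs=3,
    precision=16,                  # fp16 mixed precision, as in the config
    strategy=strategy,
    log_every_n_steps=3,
    default_root_dir="/tmp/diff_log/",
)
```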
/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loggers/tensorboard.py:248: UserWarning: Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given
  rank_zero_warn(
/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 12 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 0:   0%| | 0/42156 [00:00<?, ?it/s]/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:233: UserWarning: You called `self.log('global_step', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
  warning_cache.warn(
Summoning checkpoint.
[12/17/22 18:52:57] INFO colossalai - ProcessGroup - INFO:
/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/colossalai/tensor/process_group.py:24
get
INFO colossalai - ProcessGroup - INFO: NCCL initialize ProcessGroup on [0]
Traceback (most recent call last):
File "/home/liuchaowei/ColossalAI/examples/images/diffusion/main_ISP.py", line 805, in
Environment
pytorch:1.12.1 cuda:11.3 pytorch-lightning:1.8.0
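For what it's worth, "no kernel image is available for execution on the device" usually means a compiled binary (PyTorch itself or the `colossal_C` extension) does not ship kernels for the GPU's compute capability. A quick check, sketched with standard `torch.cuda` APIs (not part of the original report):

```python
# Diagnostic sketch: compare the GPU's compute capability with the
# architectures the installed PyTorch build ships kernels for.
import torch

print(torch.version.cuda)                   # CUDA runtime PyTorch was built with, e.g. '11.3'
print(torch.cuda.get_device_capability(0))  # device compute capability, e.g. (8, 6)
print(torch.cuda.get_arch_list())           # e.g. ['sm_37', 'sm_50', ..., 'sm_86']
# If the device's sm_XY is missing from get_arch_list(), PyTorch lacks kernels
# for this GPU; otherwise the mismatch likely lies in the separately compiled
# colossal_C extension, which would need rebuilding for this architecture.
```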
It works if I `pip install pytorch==1.11.0+cu113`, but then another problem appears!
/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loggers/tensorboard.py:248: UserWarning: Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given
rank_zero_warn(
/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 12 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Epoch 0: 0%| | 0/42156 [00:00<?, ?it/s]/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:233: UserWarning: You called `self.log('global_step', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
warning_cache.warn(
[12/18/22 13:03:17] INFO colossalai - colossalai - INFO:
/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossalai/zero/zero_optimizer.py:137
step
INFO colossalai - colossalai - INFO: Found overflow. Skip step
/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Epoch 0: 0%| | 1/42156 [00:02<24:24:45, 2.08s/it, loss=1.47, v_num=0, train/loss_simple_step=1.470, train/loss_vlb_step=1.470, [12/18/22 13:03:19] INFO colossalai - colossalai - INFO:
/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossalai/zero/zero_optimizer.py:137
step
INFO colossalai - colossalai - INFO: Found overflow. Skip step
Epoch 0: 0%| | 2/42156 [00:03<22:15:38, 1.90s/it, loss=2.81, v_num=0, train/loss_simple_step=4.140, train/loss_vlb_step=4.140, Summoning checkpoint.
[12/18/22 13:03:22] INFO colossalai - ProcessGroup - INFO:
/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossalai/tensor/process_group.py:24
get
INFO colossalai - ProcessGroup - INFO: NCCL initialize ProcessGroup on [0]
Traceback (most recent call last):
File "/home/liuchaowei/ColossalAI/examples/images/diffusion/main_ISP.py", line 805, in <module>
trainer.fit(model, data)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in fit
call._call_and_handle_interrupt(
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
return function(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
results = self._run_stage()
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
self._run_train()
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
self.fit_loop.run()
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
batch_output = self.batch_loop.run(kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1302, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 368, in optimizer_step
return self.precision_plugin.optimizer_step(
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/colossalai.py", line 81, in optimizer_step
optimizer.step()
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossalai/zero/zero_optimizer.py", line 142, in step
ret = self.optim.step(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/torch/optim/optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 143, in step
multi_tensor_applier(self.gpu_adam_op, self._dummy_overflow_buf, [g_l, p_l, m_l, v_l], group['lr'],
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossalai/utils/multi_tensor_apply/multi_tensor_apply.py", line 35, in __call__
return op(self.chunk_size,
RuntimeError: Cannot access data pointer of Tensor that doesn't have storage
Exception raised from data at /opt/conda/envs/3.9/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h:1178 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fe2a01e27d2 in /home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x5f (0x7fe2a01def3f in /home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x21d1b (0x7fe2418fad1b in /home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossal_C.cpython-39-x86_64-linux-gnu.so)
frame #3: multi_tensor_adam_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float, float, float, float, int, int, int, float) + 0x2e9 (0x7fe2418fb569 in /home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossal_C.cpython-39-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x1c211 (0x7fe2418f5211 in /home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossal_C.cpython-39-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x1819c (0x7fe2418f119c in /home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossal_C.cpython-39-x86_64-linux-gnu.so)
<omitting python frames>
It seems your CUDA driver is not set up correctly.
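One way to sanity-check the driver/runtime combination (an illustrative snippet, not from the original thread):

```python
# Illustrative check: compare the NVIDIA driver version with the CUDA runtime
# that PyTorch (and the colossal_C extension) was compiled against.
import subprocess
import torch

driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip()
print("NVIDIA driver version:", driver)
print("PyTorch CUDA runtime:", torch.version.cuda)
# A driver that does not support the CUDA runtime the wheels were built for
# can also surface as kernel/launch errors at runtime.
```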
We have updated a lot. This issue was closed due to inactivity. Thanks.