
[BUG]: Dreambooth example failed when training with batchsize>=2. RuntimeError: CUDA error: invalid argument

Open • aaab8b opened this issue • 3 comments

🐛 Describe the bug

First of all, thanks for releasing the code for boosting and memory saving. When I try to run the Dreambooth training code from the example, something strange happens. With batchsize=1 and without prior_preservation_loss, the whole fine-tuning works perfectly fine, but it breaks when I use batchsize=2 or enable prior_preservation_loss. The traceback is as follows:

    Traceback (most recent call last):
      File "/data/diffusion_training/ColossalAI/examples/images/dreambooth/train_dreambooth_colossalai.py", line 688, in <module>
        main(args)
      File "/data/diffusion_training/ColossalAI/examples/images/dreambooth/train_dreambooth_colossalai.py", line 640, in main
        optimizer.backward(loss)
      File "/home/bingzhaodong/miniconda3/envs/diffusion/lib/python3.8/site-packages/colossalai/nn/optimizer/zero_optimizer.py", line 241, in backward
        self.module.backward(loss)
      File "/home/bingzhaodong/miniconda3/envs/diffusion/lib/python3.8/site-packages/colossalai/nn/parallel/data_parallel.py", line 323, in backward
        loss.backward()
      File "/home/bingzhaodong/miniconda3/envs/diffusion/lib/python3.8/site-packages/torch/_tensor.py", line 355, in backward
        return handle_torch_function(
      File "/home/bingzhaodong/miniconda3/envs/diffusion/lib/python3.8/site-packages/torch/overrides.py", line 1394, in handle_torch_function
        result = torch_func_method(public_api, types, args, kwargs)
      File "/home/bingzhaodong/miniconda3/envs/diffusion/lib/python3.8/site-packages/colossalai/tensor/colo_tensor.py", line 193, in __torch_function__
        ret = func(*args, **kwargs)
      File "/home/bingzhaodong/miniconda3/envs/diffusion/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/home/bingzhaodong/miniconda3/envs/diffusion/lib/python3.8/site-packages/torch/autograd/__init__.py", line 180, in backward
        Variable._execution_engine.run_backward(
    RuntimeError: CUDA error: invalid argument

I guess this bug may come from some kind of colo_tensor behavior. Can you look at it? Thanks.
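Because CUDA errors are reported asynchronously, the frame shown in the traceback above is not necessarily the call that actually failed. A minimal sketch for localizing it (not part of the example script) is to force synchronous kernel launches before the first CUDA call:

    # Force synchronous kernel launches so the Python traceback points at the
    # kernel that actually raised "CUDA error: invalid argument".
    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call

    import torch  # imported afterwards so the CUDA context picks up the setting
    assert torch.cuda.is_available()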

Environment

Python 3.8 + PyTorch 1.11 + CUDA 11.3

aaab8b, Apr 17 '23

The error message suggests a CUDA error with an invalid argument. This could be caused by a few different things, but it's possible that the issue is related to the batch size or the use of prior preservation loss.

You mentioned that the issue occurs when you use batch size 2 or when you include prior preservation loss. It's possible that the larger batch size is causing memory issues on your GPU, leading to the CUDA error. You could try reducing the batch size to see if that resolves the issue.

Another possibility is that the prior preservation loss is causing numerical instability or overflow issues. You could try reducing the weight of the prior preservation loss or removing it altogether to see if that resolves the issue.
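For context on the second suggestion, here is a sketch of how the prior preservation loss is typically combined in a diffusers-style Dreambooth training step; the function and variable names below are illustrative rather than copied from train_dreambooth_colossalai.py. When prior preservation is enabled, the instance and class samples are concatenated, so the batch reaching the model is effectively doubled and the class half contributes a weighted MSE term:

    # Illustrative sketch of a diffusers-style prior preservation loss; the names
    # here are assumptions, not the exact code from the ColossalAI example.
    import torch
    import torch.nn.functional as F

    def dreambooth_loss(model_pred, target, prior_loss_weight=1.0, with_prior_preservation=True):
        if with_prior_preservation:
            # The batch holds instance images followed by class images, so the
            # prediction and target are split in half along the batch dimension.
            model_pred, model_pred_prior = torch.chunk(model_pred, 2, dim=0)
            target, target_prior = torch.chunk(target, 2, dim=0)
            instance_loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
            prior_loss = F.mse_loss(model_pred_prior.float(), target_prior.float(), reduction="mean")
            # Lowering prior_loss_weight only scales the second term; it does not
            # shrink the doubled batch that reaches the model.
            return instance_loss + prior_loss_weight * prior_loss
        return F.mse_loss(model_pred.float(), target.float(), reduction="mean")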

NatalieC323, Apr 18 '23


@NatalieC323

Both of the possible solutions you mentioned have been tried, and neither of them worked. I've also tried reducing the image size and checking my GPU's peak memory; the error stays the same. Besides, if the error were caused by the numerical stability of prior_preservation, then batch_size=2 without prior_preservation should work fine, but it doesn't. Could you take another look? Thanks.
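For reference, the peak-memory check mentioned above can be done with PyTorch's built-in counters; a minimal sketch, with the training step left as a placeholder:

    import torch

    torch.cuda.reset_peak_memory_stats()
    # ... run one forward/backward step of the Dreambooth example here ...
    peak = torch.cuda.max_memory_allocated() / 1024 ** 3
    total = torch.cuda.get_device_properties(0).total_memory / 1024 ** 3
    print(f"peak allocated: {peak:.2f} GiB of {total:.2f} GiB")
    # If the peak stays well below the device total, the "invalid argument" error
    # is unlikely to be a plain out-of-memory problem.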

aaab8b, Apr 19 '23


@NatalieC323 After days of debugging, it turns out that the minimum diffusers version this script actually requires is not 0.5.0. Anyone facing the same situation can simply upgrade diffusers to the latest version.
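A small sketch of a version guard that would have caught this; the 0.5.0 threshold is just the version discussed above, and the exact minimum working release is not pinned down in this thread:

    # Check the installed diffusers version before launching training.
    import diffusers
    from packaging import version

    print("diffusers version:", diffusers.__version__)
    if version.parse(diffusers.__version__) <= version.parse("0.5.0"):
        # Versions this old hit the "CUDA error: invalid argument" failure
        # described in this thread; upgrading to the latest release fixed it.
        raise RuntimeError("Please upgrade diffusers: pip install -U diffusers")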

aaab8b, Apr 22 '23