
[BUG]: RuntimeError: CUDA error: invalid argument

Open · ShenZhang-Shin opened this issue 2 years ago · 2 comments

🐛 Describe the bug

When running train_colossalai_cifar10.yaml, I can train on a single GPU, but when training with multiple GPUs (devices: 2) the following error occurs.

Setting up LambdaLR scheduler...
Setting up LambdaLR scheduler...
Traceback (most recent call last):
  File "/home/zhangshen/colossalAI/examples/images/diffusion/main.py", line 805, in <module>
    trainer.fit(model, data)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1079, in _run
    self.strategy.setup(self)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/colossalai.py", line 334, in setup
    self.model_to_device()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/colossalai.py", line 341, in model_to_device
    child.to(self.root_device)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 908, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 906, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/usr/local/lib/python3.8/dist-packages/colossalai/tensor/colo_parameter.py", line 74, in __torch_function__
    return super().__torch_function__(func, types, args, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/colossalai/tensor/colo_tensor.py", line 170, in __torch_function__
    ret = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/colossalai/nn/_ops/element_wise.py", line 21, in elementwise_op
    output = op(input_tensor, *args, **kwargs)
RuntimeError: CUDA error: invalid argument
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/ddp.py:438: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
  rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")

By the way, I can run train_ddp.yaml with multiple GPUs. I would appreciate your help, thanks!
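For reference, I believe the multi-GPU part of the yaml roughly maps to a trainer setup like the sketch below; the strategy name, precision, and arguments are my assumptions based on the traceback and the pytorch-lightning 1.8 documentation, not copied from the example config.

```python
# Minimal sketch of the multi-GPU setup that triggers the error (assumed mapping
# of train_colossalai_cifar10.yaml onto the pytorch-lightning Trainer API).
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,              # devices=1 trains fine; devices=2 fails during setup
    strategy="colossalai",  # ColossalAIStrategy shipped with pytorch-lightning 1.8
    precision=16,           # the ColossalAI strategy runs in fp16 (assumption)
    max_epochs=1,
)

# model and data are the LatentDiffusion module and data module built in main.py;
# the crash happens inside strategy.setup() -> model_to_device() during fit():
# trainer.fit(model, data)
```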

Environment

python: 3.8.10
CUDA: 11.2
CUDNN: 8.1.0
pytorch: 1.11.0+cu113
colossalai: 0.1.10+torch1.11cu11.3
pytorch-lightning: 1.8.6

ShenZhang-Shin · Jan 19 '23 17:01

Thank you for your feedback; we will deal with it as soon as possible after the holidays. @Fazziekey will help you.

binmakeswell · Jan 28 '23 06:01

In this case, the error is raised from an element-wise operation on a CUDA tensor (colossalai/nn/_ops/element_wise.py). It is possible that the input tensor passed to the operation is problematic, or that the operation itself is at fault. The warning about an uninitialized error-handling mechanism for deadlock detection may or may not be related to the main error. It may help to check the shape, data type, and device of the input tensors and confirm that they are compatible with the operations being performed; a debugging sketch follows.
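As a starting point, here is a minimal debugging sketch in plain PyTorch (not the ColossalAI API); inspect_parameters is a hypothetical helper, and CUDA_LAUNCH_BLOCKING only makes the failing call site in the traceback more precise.

```python
# Debugging sketch: log every parameter's metadata before the .to() call that
# fails in model_to_device(), so an incompatible tensor stands out.
import os
import torch

# Synchronous kernel launches make the reported call site accurate
# (must be set before CUDA is initialized).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

def inspect_parameters(module: torch.nn.Module) -> None:
    """Print shape, dtype, device, and tensor class of each parameter."""
    for name, param in module.named_parameters():
        print(f"{name}: shape={tuple(param.shape)}, dtype={param.dtype}, "
              f"device={param.device}, class={type(param).__name__}")

# Usage (hypothetical): call inspect_parameters(model) right before trainer.fit(model, data).
```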

Note that the recommended environment for training is colossalai 0.2.5+torch1.11cu11.3 with pytorch-lightning 1.9.x. Please follow the updated README to set up the environment; multi-GPU training for the target yaml file works in our tests with this setup. Should you still have questions after trying it, feel free to reach out.
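A quick way to confirm the environment matches those versions is the small sketch below; the expected version strings come from the recommendation above, and the check itself is plain importlib.metadata.

```python
# Sketch: verify installed package versions against the recommended ones.
from importlib.metadata import version, PackageNotFoundError

expected = {"colossalai": "0.2.5", "pytorch-lightning": "1.9"}

for pkg, prefix in expected.items():
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
        continue
    status = "OK" if installed.startswith(prefix) else f"expected {prefix}.x"
    print(f"{pkg}: {installed} ({status})")
```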

NatalieC323 · Apr 18 '23 08:04