🐛 Describe the bug
When running train_colossalai_cifar10.yaml, I can train on a single GPU, but when training with multiple GPUs (devices: 2) the following error occurs:
Setting up LambdaLR scheduler...
Setting up LambdaLR scheduler...
Traceback (most recent call last):
File "/home/zhangshen/colossalAI/examples/images/diffusion/main.py", line 805, in
trainer.fit(model, data)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1079, in _run
self.strategy.setup(self)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/colossalai.py", line 334, in setup
self.model_to_device()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/colossalai.py", line 341, in model_to_device
child.to(self.root_device)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 908, in to
return self._apply(convert)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 601, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 906, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "/usr/local/lib/python3.8/dist-packages/colossalai/tensor/colo_parameter.py", line 74, in torch_function
return super().torch_function(func, types, args, kwargs)
File "/usr/local/lib/python3.8/dist-packages/colossalai/tensor/colo_tensor.py", line 170, in torch_function
ret = func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/colossalai/nn/_ops/element_wise.py", line 21, in elementwise_op
output = op(input_tensor, *args, **kwargs)
RuntimeError: CUDA error: invalid argument
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/ddp.py:438: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
By the way, I can run train_ddp.yaml with multiple GPUs without any problem. I would appreciate your help, thanks!
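For context, the only multi-GPU-related change I made is in the lightning trainer section of the config. The sketch below shows how my settings look; the key names follow the stock example config, so treat it as illustrative rather than a verbatim copy of my yaml:

```yaml
# Sketch of the trainer section in train_colossalai_cifar10.yaml
# (key names follow the stock example config; other sections omitted)
lightning:
  trainer:
    accelerator: gpu
    devices: 2    # training works with devices: 1, fails with devices: 2
```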
Environment
python: 3.8.10
CUDA: 11.2
CUDNN: 8.1.0
pytorch: 1.11.0+cu113
colossalai: 0.1.10+torch1.11cu11.3
pytorch-lightning: 1.8.6
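The versions above were collected with a quick check along these lines (illustrative; any equivalent pip listing works, and I am assuming colossalai exposes __version__ the way torch and lightning do):

```python
# Quick environment dump used to collect the versions above
import sys
import torch
import pytorch_lightning as pl
import colossalai

print("python:", sys.version.split()[0])
print("pytorch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("CUDNN:", torch.backends.cudnn.version())
print("pytorch-lightning:", pl.__version__)
print("colossalai:", colossalai.__version__)
```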
Thank you for your feedback; we will deal with it as soon as possible after the holidays. @Fazziekey will help you.
In this case, the error seems to be related to an element-wise operation performed on a CUDA tensor. There may be an issue with the input tensor being passed to the operation, or a problem within the operation itself. The warning about an uninitialized error-handling mechanism for deadlock detection may or may not be related to the main error. It may help to check the shape, data type, and device of the input tensors and make sure they are compatible with the operations being performed; a quick check along the lines of the sketch below can help.
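This is a minimal debugging helper, not part of the example code: it prints each parameter's shape, dtype, and device before the model is moved, so an unexpected value often stands out before the .to() call fails.

```python
# Illustrative debugging helper (not part of the example code): inspect
# every parameter before the model is moved, so an unexpected dtype,
# device, or empty shape stands out before the .to() call fails.
import torch

def audit_parameters(model: torch.nn.Module) -> None:
    for name, param in model.named_parameters():
        print(f"{name}: shape={tuple(param.shape)} "
              f"dtype={param.dtype} device={param.device}")

# e.g. call audit_parameters(model) just before trainer.fit(model, data)
```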
Note that the recommended versions for training are:
colossalai: 0.2.5+torch1.11cu11.3
pytorch-lightning: 1.9.x
Please follow the updated README to set up the environment; multi-GPU training with the target yaml file worked in our tests under this environment. Should you still have questions after trying it, feel free to reach out.
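For example, something along these lines should work (the wheel-index URL follows the pattern used in the ColossalAI README; verify the exact version tag against the README for your torch/CUDA build):

```bash
# Upgrade to the recommended versions; the +torch/cu suffix must match
# your local torch build, so double-check it against the README.
pip install colossalai==0.2.5+torch1.11cu11.3 -f https://release.colossalai.org
pip install "pytorch-lightning>=1.9,<1.10"
```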