
CUDA problems during training on custom data using single GPU

Open Wmelon0325 opened this issue 2 years ago • 5 comments

My development environment is:

- single GPU: A6000 48 GB
- CUDA 11.6
- cuDNN 8.2.4
- Python 3.7.16
- torch 1.12.0
- torch-geometric 1.7.2
- torchaudio 0.12.0
- torchvision 0.13.0

A problem occurred during training, shown below.

```
Traceback (most recent call last):
  File "train.py", line 255, in <module>
    main()
  File "train.py", line 252, in main
    meta=meta)
  File "/home/user/anaconda3/envs/Intern/lib/python3.7/site-packages/mmdet/apis/train.py", line 247, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/user/anaconda3/envs/Intern/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/user/anaconda3/envs/Intern/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/home/user/anaconda3/envs/Intern/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/user/anaconda3/envs/Intern/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 56, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/user/anaconda3/envs/Intern/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/user/anaconda3/envs/Intern/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

Is this an out-of-memory problem? Could you give me some advice, please? Thank you.

Wmelon0325 avatar Apr 17 '23 00:04 Wmelon0325
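The traceback itself suggests the first debugging step: because CUDA kernels launch asynchronously, the Python stack trace can point far away from the kernel that actually faulted. A minimal sketch (not from this thread) of forcing synchronous launches so the error surfaces at its real call site; the variable must be set before CUDA is initialized, e.g. at the very top of train.py:

```python
# Sketch: make CUDA kernel launches synchronous so the stack trace points at
# the kernel that actually failed. Must run before torch initializes CUDA.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
# Optional: also report which autograd op produced the failing backward kernel.
torch.autograd.set_detect_anomaly(True)
```

Equivalently, the variable can be set in the shell when launching training, e.g. `CUDA_LAUNCH_BLOCKING=1 python train.py ...`.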

Thanks for your feedback. This error may be due to insufficient GPU memory. How much GPU memory do you have?


Oh, I see, you have a single A6000 GPU with 48 GB of memory. May I ask which config file you use for training?

czczup avatar Apr 17 '23 13:04 czczup
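To separate a genuine out-of-memory condition from an illegal memory access, it can help to log the allocator's view of the card right before the crash. A small sketch using standard `torch.cuda` calls (these names are PyTorch's own, nothing InternImage-specific):

```python
import torch

# Print total, currently allocated, reserved, and peak memory on GPU 0.
props = torch.cuda.get_device_properties(0)
gib = 1024 ** 3
print(f"device:    {props.name} ({props.total_memory / gib:.1f} GiB total)")
print(f"allocated: {torch.cuda.memory_allocated(0) / gib:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / gib:.2f} GiB")
print(f"peak:      {torch.cuda.max_memory_allocated(0) / gib:.2f} GiB")
```

Note that a true OOM normally raises `CUDA out of memory` rather than `an illegal memory access was encountered`, so if the numbers above stay well below 48 GB, memory is probably not the culprit.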

Same problem. Even if I try with the smallest backbone, internimage_l_22k_192to384, the error still occurs.

caozhenxiang-kouji avatar Jun 27 '23 07:06 caozhenxiang-kouji

Hi! I have the same problem. Have you solved it?

ghost avatar Jun 27 '23 07:06 ghost

> Hi! I have the same problem. Have you solved it?

I've figured it out. It seems that your GPU is too small to run the backward pass. Try a larger one. I managed to train InternImage-L with a batch size of 2 on a V100.

caozhenxiang-kouji avatar Jun 28 '23 08:06 caozhenxiang-kouji
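For reference, the per-GPU batch size in InternImage's detection configs is set the mmdetection 2.x way, via `samples_per_gpu`. A hedged sketch of the relevant config fragment (field names follow mmdet conventions; check your own config file):

```python
# mmdet 2.x style config fragment: lower the per-GPU batch size.
data = dict(
    samples_per_gpu=2,  # batch size per GPU; 2 reportedly fits InternImage-L on a V100
    workers_per_gpu=2,  # dataloader workers per GPU
)
```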

I encountered the same error. I only have one V100S card, and I have already reduced the batch size to 1, but this error is still raised. Moreover, on my end it also prints `error in dcnv3_col2im_cuda: an illegal memory access was encountered`. Has anyone seen the same behavior? How can I solve it?

vvwomen avatar Jan 03 '24 03:01 vvwomen
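Since that message comes from `dcnv3_col2im_cuda`, the failure is inside the custom DCNv3 backward kernel rather than a stock PyTorch op, so it is worth exercising the compiled extension in isolation; a mismatch between the CUDA toolkit used to build the op and the runtime can produce exactly this error. A rough sketch, assuming the import path `ops_dcnv3.modules.DCNv3` from the InternImage repo and its channels-last input layout; adjust names to your checkout:

```python
# Sketch: run the DCNv3 forward and backward kernels on a tiny input,
# outside the full training loop, with synchronous launches enabled.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA init

import torch
from ops_dcnv3.modules import DCNv3  # assumption: module path in the repo

m = DCNv3(channels=64, group=4).cuda()  # small, arbitrary configuration
x = torch.randn(1, 32, 32, 64, device="cuda",  # DCNv3 takes (N, H, W, C)
                requires_grad=True)
y = m(x)
y.sum().backward()  # the dcnv3_col2im_cuda kernel runs in this backward pass
torch.cuda.synchronize()
print("DCNv3 forward/backward OK")
```

If even this tiny case crashes, rebuilding the op against the installed CUDA toolkit (using the build script shipped in the repo's `ops_dcnv3` directory) would be the first thing to try.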