My development environment is:
single gpu A6000 48GB
CUDA 11.6
cuDNN 8.2.4
python 3.7.16
torch 1.12.0
torch-geometric 1.7.2
torchaudio 0.12.0
torchvision 0.13.0
I have a problem that occurred during training, as shown below.
Traceback (most recent call last):
File "train.py", line 255, in
main()
File "train.py", line 252, in main
meta=meta)
File "/home/user/anaconda3/envs/Intern/lib/python3.7/site-packages/mmdet/apis/train.py", line 247, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/user/anaconda3/envs/Intern/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/user/anaconda3/envs/Intern/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
self.call_hook('after_train_iter')
File "/home/user/anaconda3/envs/Intern/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
getattr(hook, fn_name)(self)
File "/home/user/anaconda3/envs/Intern/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 56, in after_train_iter
runner.outputs['loss'].backward()
File "/home/user/anaconda3/envs/Intern/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/user/anaconda3/envs/Intern/lib/python3.7/site-packages/torch/autograd/init.py", line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Is this an out-of-memory problem?
Could you give me some advice, please?
Thank you
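Edit: following the hint in the error message, here is a minimal sketch of what I plan to run to get a synchronous stack trace (assuming the standard behavior that CUDA_LAUNCH_BLOCKING is read when CUDA is initialized):

```python
# Sketch: make CUDA kernel launches synchronous so the Python traceback
# points at the kernel that actually faulted. The env var must be set
# before CUDA is initialized (i.e. before the first CUDA call).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the env var on purpose

# ...then launch training as usual, e.g. from the shell:
#   CUDA_LAUNCH_BLOCKING=1 python train.py <config>
```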
Thanks for your feedback. This error may be due to insufficient GPU memory. How much GPU memory do you have?
Oh, I see, you have a single A6000 GPU with 48GB of memory. May I ask which config file you are using for training?
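If you want to double-check whether memory is really the bottleneck, you could print the free/total GPU memory just before the failing iteration. A minimal sketch (torch.cuda.mem_get_info is available in recent PyTorch releases):

```python
# Sketch: report free vs. total GPU memory to rule a plain OOM in or out.
import torch

free_b, total_b = torch.cuda.mem_get_info()
print(f"free : {free_b / 1024**3:.1f} GiB")
print(f"total: {total_b / 1024**3:.1f} GiB")

# Allocator statistics for the current process.
print(torch.cuda.memory_summary(abbreviated=True))
```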
Same problem.
Even when I try the smallest backbone, internimage_l_22k_192to384, the error still occurs.
Hi! I have the same problem. Have you solved it?
I've figured it out. It seems that your GPU is too small to run the backward pass. Try a larger one. I managed to train InternImage-L with a batch size of 2 on a V100.
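For reference, the per-GPU batch size is controlled by the data settings in the config; a sketch, assuming the usual mmdet 2.x key names that InternImage's detection configs follow:

```python
# Sketch of the data settings that control per-GPU batch size in an
# mmdet 2.x style config; everything else stays as in the original file.
data = dict(
    samples_per_gpu=2,   # per-GPU batch size; try 1 if memory is tight
    workers_per_gpu=2,
    # train=..., val=..., test=...  as in the original config
)
```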
I encountered the same error. I only have one V100S card and I have already reduced the batch size to 1, but this error is still reported. Moreover, on my end it also prints "error in dcnv3_col2im_cuda: an illegal memory access was encountered". Has anyone seen something similar? May I ask how to solve it?
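In case it helps to compare environments, here is a small, generic sanity check I would run first (a sketch; it only prints versions and does one trivial GPU op, nothing InternImage-specific):

```python
# Sketch: print the runtime versions that the compiled DCNv3 extension
# must match, then run one trivial GPU op. If the trivial op works but
# dcnv3_col2im_cuda crashes, a build/runtime mismatch of the custom op
# is worth ruling out.
import torch

print("torch              :", torch.__version__)
print("built with CUDA    :", torch.version.cuda)
print("cuDNN              :", torch.backends.cudnn.version())
print("device             :", torch.cuda.get_device_name(0))
print("compute capability :", torch.cuda.get_device_capability(0))

x = torch.randn(8, 8, device="cuda")
print("trivial GPU matmul :", (x @ x).sum().item())
```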