RuntimeError: CUDA error: an illegal memory access was encountered

Open MichaelMano3 opened this issue 4 years ago • 1 comments

Hi, when I run train.py to train, I get this error and can't solve it. I found suggestions such as adding CUDA_LAUNCH_BLOCKING=1, but that doesn't work. I also tried reinstalling CUDA 10.0 with PyTorch 1.01 and still have the same problem. I thought it might be a batch size problem, so I changed the batch size to 2 and max_size to 100 in config.py, but that didn't help.

Ubuntu: 18.04, Python: 3.6, CUDA: 10.2, PyTorch: 1.6

python3 train.py --config=yolact_base_config
Scaling parameters by 0.25 to account for a batch size of 2.
Per-GPU batch size is less than the recommended limit for batch norm. Disabling batch norm.
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
/home/yang/.local/lib/python3.6/site-packages/torch/jit/_recursive.py:152: UserWarning: 'lat_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
  " but it is a non-constant {}. Consider removing it.".format(name, hint))
/home/yang/.local/lib/python3.6/site-packages/torch/jit/_recursive.py:152: UserWarning: 'pred_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
  " but it is a non-constant {}. Consider removing it.".format(name, hint))
/home/yang/.local/lib/python3.6/site-packages/torch/jit/_recursive.py:152: UserWarning: 'downsample_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
  " but it is a non-constant {}. Consider removing it.".format(name, hint))
Initializing weights...
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=19 error=700 : an illegal memory access was encountered
Traceback (most recent call last):
  File "train.py", line 505, in <module>
    train()
  File "train.py", line 235, in train
    yolact_net(torch.zeros(1, 3, cfg.max_size, cfg.max_size).cuda())
  File "/home/yang/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yang/yolact/yolact.py", line 571, in forward
    outs = self.backbone(x)
  File "/home/yang/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yang/yolact/backbone.py", line 136, in forward
    x = layer(x)
  File "/home/yang/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yang/.local/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/yang/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yang/yolact/backbone.py", line 44, in forward
    out = self.conv2(out)
  File "/home/yang/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yang/.local/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 349, in forward
    return self._conv_forward(input, self.weight)
  File "/home/yang/.local/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 346, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/generic/THCTensorMath.cu:19

When I add CUDA_LAUNCH_BLOCKING=1, I get a different error:

CUDA_LAUNCH_BLOCKING=1 python3 train.py --config=yolact_base_config
Scaling parameters by 0.25 to account for a batch size of 2.
Per-GPU batch size is less than the recommended limit for batch norm. Disabling batch norm.
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
/home/yang/.local/lib/python3.6/site-packages/torch/jit/_recursive.py:152: UserWarning: 'lat_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
  " but it is a non-constant {}. Consider removing it.".format(name, hint))
/home/yang/.local/lib/python3.6/site-packages/torch/jit/_recursive.py:152: UserWarning: 'pred_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
  " but it is a non-constant {}. Consider removing it.".format(name, hint))
/home/yang/.local/lib/python3.6/site-packages/torch/jit/_recursive.py:152: UserWarning: 'downsample_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
  " but it is a non-constant {}. Consider removing it.".format(name, hint))
Initializing weights...
Traceback (most recent call last):
  File "train.py", line 505, in <module>
    train()
  File "train.py", line 235, in train
    yolact_net(torch.zeros(1, 3, cfg.max_size, cfg.max_size).cuda())
  File "/home/yang/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yang/yolact/yolact.py", line 571, in forward
    outs = self.backbone(x)
  File "/home/yang/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yang/yolact/backbone.py", line 129, in forward
    x = self.conv1(x)
  File "/home/yang/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yang/.local/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 349, in forward
    return self._conv_forward(input, self.weight)
  File "/home/yang/.local/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 346, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
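
Both tracebacks fail inside a plain convolution on an all-zeros input, before any training data is used. One way to narrow this down might be to run a single small convolution on the GPU outside of YOLACT. The sketch below is only an illustration: the function name check_gpu_conv and the layer sizes are made up for this example and are not part of YOLACT. If this check fails as well, that would suggest looking at the PyTorch/CUDA/driver installation rather than at train.py, though it does not prove it.

```python
import os

# Set before CUDA is initialised so kernel launches run synchronously
# and the error is reported at the call that caused it.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def check_gpu_conv(size=64):
    """Run one small convolution on the GPU, independent of YOLACT.

    Hypothetical helper for illustration; returns a short status string.
    """
    try:
        import torch
    except ImportError:
        return "torch not installed"

    if not torch.cuda.is_available():
        return "CUDA not available"

    # Report which CUDA version this PyTorch build targets and which GPU is used.
    print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
    print("device:", torch.cuda.get_device_name(0))

    # Same kind of input as in train.py: an all-zeros image batch.
    conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).cuda()
    x = torch.zeros(1, 3, size, size).cuda()
    try:
        y = conv(x)
        torch.cuda.synchronize()  # force any pending CUDA error to surface here
    except RuntimeError as err:
        return "conv failed: {}".format(err)
    return "conv ok, output shape {}".format(tuple(y.shape))


if __name__ == "__main__":
    print(check_gpu_conv())
```

The printed CUDA version is the one the PyTorch build was compiled against, which can differ from the system-wide CUDA toolkit, so it may be worth comparing the two.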

If anyone knows how to solve this problem, please help me!

MichaelMano3, Oct 06 '20 04:10

Hi, did you fix this problem? I ran into the same problem. Can you help me? Thank you so much!

kizoooh, Oct 25 '21 08:10