yolact
RuntimeError: CUDA error: an illegal memory access was encountered
Hi, when I run train.py to train, I hit this error and can't solve it. I found suggestions like adding CUDA_LAUNCH_BLOCKING=1, but it doesn't help. I also tried reinstalling CUDA 10.0 with PyTorch 1.0.1 and still got the same problem. I thought it might be a batch size issue, so I changed the batch size to 2 and max_size to 100 in config.py, but that didn't help either.
Environment: Ubuntu 18.04, Python 3.6, CUDA 10.2, PyTorch 1.6
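Since the description mentions two different CUDA/PyTorch combinations (10.0 with 1.0.1 vs. 10.2 with 1.6), it may be worth confirming what is actually installed: a PyTorch wheel built against a different CUDA version than the driver supports is a common cause of low-level CUDA errors. A minimal check (pure standard library apart from the optional torch import, so it runs even if PyTorch is missing):

```python
import importlib.util

def report_torch_env():
    """Return a dict describing the installed PyTorch/CUDA combination,
    or None if PyTorch is not importable."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    return {
        "torch": torch.__version__,
        # CUDA version the wheel was *built* against, not the system toolkit
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }

info = report_torch_env()
print(info if info is not None else "PyTorch is not installed")
```

If `built_for_cuda` disagrees with the CUDA runtime your driver provides, reinstalling a matching PyTorch wheel is the first thing to try.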
python3 train.py --config=yolact_base_config
Scaling parameters by 0.25 to account for a batch size of 2.
Per-GPU batch size is less than the recommended limit for batch norm. Disabling batch norm.
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
/home/yang/.local/lib/python3.6/site-packages/torch/jit/_recursive.py:152: UserWarning: 'lat_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
" but it is a non-constant {}. Consider removing it.".format(name, hint))
/home/yang/.local/lib/python3.6/site-packages/torch/jit/_recursive.py:152: UserWarning: 'pred_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
" but it is a non-constant {}. Consider removing it.".format(name, hint))
/home/yang/.local/lib/python3.6/site-packages/torch/jit/_recursive.py:152: UserWarning: 'downsample_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
" but it is a non-constant {}. Consider removing it.".format(name, hint))
Initializing weights...
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=19 error=700 : an illegal memory access was encountered
Traceback (most recent call last):
File "train.py", line 505, in
When I add CUDA_LAUNCH_BLOCKING=1, I get a different error:
CUDA_LAUNCH_BLOCKING=1 python3 train.py --config=yolact_base_config
Scaling parameters by 0.25 to account for a batch size of 2.
Per-GPU batch size is less than the recommended limit for batch norm. Disabling batch norm.
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
/home/yang/.local/lib/python3.6/site-packages/torch/jit/_recursive.py:152: UserWarning: 'lat_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
" but it is a non-constant {}. Consider removing it.".format(name, hint))
/home/yang/.local/lib/python3.6/site-packages/torch/jit/_recursive.py:152: UserWarning: 'pred_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
" but it is a non-constant {}. Consider removing it.".format(name, hint))
/home/yang/.local/lib/python3.6/site-packages/torch/jit/_recursive.py:152: UserWarning: 'downsample_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
" but it is a non-constant {}. Consider removing it.".format(name, hint))
Initializing weights...
Traceback (most recent call last):
File "train.py", line 505, in
cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
If anyone knows how to solve this problem, please help me!
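One frequent cause of a device-side illegal memory access during detection training is a label index that is out of range for the model's number of classes (a CUDA kernel then indexes past the end of a buffer, and the error surfaces far from the real bug). A hedged pre-flight sketch, assuming a COCO-style annotation dict with contiguous category ids and a hypothetical NUM_CLASSES taken from your config.py (raw COCO ids are not contiguous; YOLACT remaps them via its label_map, so adapt the range check to your dataset):

```python
# NUM_CLASSES is an assumption -- use the value from your config.py
# (yolact_base_config uses 80 COCO classes + 1 background = 81).
NUM_CLASSES = 81

def find_bad_category_ids(annotations, num_classes):
    """Return category ids outside the valid range [1, num_classes - 1]
    for a COCO-style annotation dict with contiguous ids."""
    bad = set()
    for ann in annotations.get("annotations", []):
        cid = ann["category_id"]
        if not (1 <= cid <= num_classes - 1):
            bad.add(cid)
    return sorted(bad)

# Tiny in-memory example instead of loading instances_train.json:
sample = {"annotations": [{"category_id": 1}, {"category_id": 300}]}
print(find_bad_category_ids(sample, NUM_CLASSES))  # -> [300]
```

Running a check like this over the real annotation JSON before training costs seconds and rules out one of the usual suspects for this crash.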
Hi, did you fix this problem? I ran into the same issue. Can you help me? Thank you so much!