
RuntimeError: cuda runtime error (8) in nms

Open wtliao opened this issue 7 years ago • 11 comments

Hi, I have encountered the following error when I run the code on two Titan XPs:

    THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1518238409320/work/torch/lib/THC/generic/THCTensorMathPairwise.cu line=21 error=8 : invalid device function
    Traceback (most recent call last):
      File "/home/wtliao/.pycharm_helpers/pydev/pydevd.py", line 1668, in <module>
        main()
      File "/home/wtliao/.pycharm_helpers/pydev/pydevd.py", line 1662, in main
        globals = debugger.run(setup['file'], None, None, is_module)
      File "/home/wtliao/.pycharm_helpers/pydev/pydevd.py", line 1072, in run
        pydev_imports.execfile(file, globals, locals)  # execute the script
      File "/home/wtliao/work_space/neural-motifs-master/models/train_detector.py", line 214, in <module>
        rez = train_epoch(epoch)
      File "/home/wtliao/work_space/neural-motifs-master/models/train_detector.py", line 70, in train_epoch
        tr.append(train_batch(batch))
      File "/home/wtliao/work_space/neural-motifs-master/models/train_detector.py", line 103, in train_batch
        result = detector[b]
      File "/home/wtliao/work_space/neural-motifs-master/lib/object_detector.py", line 418, in __getitem__
        outputs = nn.parallel.parallel_apply(replicas, [batch[i] for i in range(self.num_gpus)])
      File "/home/wtliao/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
        raise output

After debugging line by line, I find that this error arises in the operation keep.append(keep_im + s), line 24 in nms.py.

Any idea how to solve it? Thanks!
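(A side note on localizing errors like this: CUDA kernel launches are asynchronous, so the failure is often reported on a later, unrelated tensor operation. A minimal sketch for making the traceback point at the real culprit, reusing the training entry point from the traceback above; any other flags you normally pass are omitted here:)

    # Force synchronous kernel launches so "invalid device function" is raised at the
    # offending call (e.g. inside the compiled NMS extension) rather than at a later
    # line such as keep.append(keep_im + s).
    CUDA_LAUNCH_BLOCKING=1 python models/train_detector.py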

wtliao avatar Oct 18 '18 05:10 wtliao

Even though I try to use a single GPU, I have the same issue.

wtliao avatar Oct 18 '18 06:10 wtliao

I'm not entirely sure what's going on here, but it seems like you're using Python 2, which I don't support with this repo. Have you tried using Python 3?

rowanz avatar Oct 19 '18 04:10 rowanz

@rowanz When using Python 3.6, the error is: RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device at /opt/conda/conda-bld/pytorch_1512387374934/work/torch/lib/THC/generic/THCTensorMathPairwise.cu:21

I solved it in a strange way:

    try:
        keep.append(keep_im + s)
    except BaseException:
        keep.append(keep_im + s)

which means performing the operation twice, and then it works... I have no idea why.

wtliao avatar Oct 22 '18 11:10 wtliao

Now I have solved this problem by recompiling the nms files using

#!/usr/bin/env bash
# CUDA_PATH=/usr/local/cuda/
cd src/cuda
echo "Compiling stnn kernels by nvcc..."
nvcc -c -o nms.cu.o nms_kernel.cu -x cu -Xcompiler -fPIC -arch=sm_52
cd ../../
python build.py
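(If you are unsure which -arch value matches your card, a quick sketch of a check via the PyTorch install itself:)

    # Print GPU 0's compute capability, e.g. "sm_61" for a Titan Xp or "sm_37" for a K80,
    # so the -arch flag above can be chosen to match the hardware.
    python -c "import torch; print('sm_%d%d' % torch.cuda.get_device_capability(0))"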

But a new problem arises in

feature_pool = RoIAlignFunction(self.pooling_size, self.pooling_size, spatial_scale=1 / 16)(
            self.compress(features) if self.use_resnet else features, rois)

with error information

cudaCheckError() failed : no kernel image is available for execution on the device

Process finished with exit code 255

I can't figure out where the problem is. Do you have any idea about it? I fixed it by replacing the original roi_align with this roi_align (https://github.com/jwyang/faster-rcnn.pytorch/tree/master/lib/model/roi_align), and now the code can run through. If you can fix the original roi_align, that would be much better. Thanks again for sharing your impressive work!

wtliao avatar Oct 22 '18 15:10 wtliao

My guess is that you have a newer version of CUDA than I did last year. Possibly you'd need to compile it with -arch=sm_61? Sorry for the difficulty anyway; I really wish PyTorch had native RoI pooling (which they're working on for v1).
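(For reference, a multi-architecture variant of the make.sh from the previous comment. This is only a sketch, and the -gencode list is an assumption: trim it to the cards you actually use, so one build covers e.g. K80 (sm_37), Maxwell (sm_52), P100 (sm_60) and Titan Xp (sm_61).)

    #!/usr/bin/env bash
    cd src/cuda
    echo "Compiling nms kernels by nvcc for several architectures..."
    nvcc -c -o nms.cu.o nms_kernel.cu -x cu -Xcompiler -fPIC \
        -gencode arch=compute_37,code=sm_37 \
        -gencode arch=compute_52,code=sm_52 \
        -gencode arch=compute_60,code=sm_60 \
        -gencode arch=compute_61,code=sm_61
    cd ../../
    python build.py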


rowanz avatar Oct 22 '18 17:10 rowanz

@rowanz I'm facing the exact same issue. But for me, doing the same steps as @wtliao suggested didn't get rid of the error. I'm using CUDA 8.0 and Tesla K80s, so I even tried compiling with sm_37 in nms, roi_align and the highway LSTM. What would you advise me to do?

@wtliao could you find a fix for the error?

ritwickchaudhry avatar Nov 20 '18 05:11 ritwickchaudhry

Hi, I solved the issues as I described above. I have tried the code on CUDA 9.0 + K40, CUDA 9.0 + P100, and CUDA 8.0 + Titan XP, and they all work now. So I guess you could try updating to CUDA 9.0. I couldn't fix the roi_align issue in the author's code, so I replaced it with mine. BTW, I didn't compile the code using the Makefile provided by the author; I compiled each part of the code one by one using my make.sh under the corresponding directory.
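(A sketch of what compiling each part one by one can look like. The directory names below are assumptions about the repo layout, so check your checkout for the real paths, and remember that each make.sh hardcodes its own -arch value, which needs to match your GPU.)

    #!/usr/bin/env bash
    # Rebuild each CUDA extension in its own directory (paths are assumptions).
    for d in lib/fpn/nms lib/fpn/roi_align lib/lstm/highway_lstm_cuda; do
        echo "Building CUDA extension in $d"
        (cd "$d" && bash make.sh)
    done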


wtliao avatar Nov 20 '18 08:11 wtliao

Thanks a lot @wtliao, I got it running now. And thanks a lot @rowanz for sharing your amazing work. One small doubt: can you please tell me the interpretation of the pred_rel_inds part of the output? I believe it's a [NUM_PRED_RELS, 51] array, with each pair having scores for the 50 types of relationships, and the 0th index is "no relationship", right? (Because the total number of relationships is 50.)

ritwickchaudhry avatar Nov 27 '18 11:11 ritwickchaudhry

@wtliao, I always get an error that bbox_overlaps can't be found when running the code, even though I have generated the .so file. Do you have any suggestions to help me? Thank you!

L6-hong avatar Oct 12 '20 08:10 L6-hong


You should run the command "export PYTHONPATH=/where/is/your/project/folder" before running.
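(Concretely, something like the following, where the path is only an example and the entry point is taken from the traceback earlier in this thread:)

    export PYTHONPATH=/path/to/your/neural-motifs   # the folder that contains lib/ and models/
    python models/train_detector.py                 # plus whatever flags you normally pass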

wtliao avatar Oct 12 '20 12:10 wtliao

Hello, thank you very much for your reply. I have already set this, but the same problem still occurs. Could it be that the .so file cannot be found? In addition, I always get: RuntimeError: cuda runtime error (2): out of memory. I have changed batch_size to 1, but the same problem still occurs. Do you have any good suggestions?


L6-hong avatar Oct 12 '20 12:10 L6-hong