
torchvision.ops.nms fails on GPU data inside the container nvcr.io/nvidia/l4t-pytorch:r32.4.2-pth1.3-py3, but works as expected on the host OS

Open astekardis opened this issue 4 years ago • 11 comments

Hey @dusty-nv

I've been having an issue running my object detection model within the container nvcr.io/nvidia/l4t-pytorch:r32.4.2-pth1.3-py3. During inference, I call torchvision.ops.nms to perform non-maximum suppression on the objects detected by the network. Inside the container, that call gives the following error:

File "/usr/local/lib/python3.6/dist-packages/torchvision-0.4.2-py3.6-linux-aarch64.egg/torchvision/ops/boxes.py", line 33, in nms
RuntimeError: CUDA error: no kernel image is available for execution on the device (nms_cuda at /torchvision/torchvision/csrc/cuda/nms_cuda.cu:127)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x78 (0x7f5e7378d8 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: nms_cuda(at::Tensor const&, at::Tensor const&, float) + 0x710 (0x7f3e0eb51c in /root/.cache/Python-Eggs/torchvision-0.4.2-py3.6-linux-aarch64.egg-tmp/torchvision/_C.so)
frame #2: nms(at::Tensor const&, at::Tensor const&, float) + 0x114 (0x7f3e08ae7c in /root/.cache/Python-Eggs/torchvision-0.4.2-py3.6-linux-aarch64.egg-tmp/torchvision/_C.so)
frame #3: <unknown function> + 0x73b70 (0x7f3e0bab70 in /root/.cache/Python-Eggs/torchvision-0.4.2-py3.6-linux-aarch64.egg-tmp/torchvision/_C.so)
frame #4: <unknown function> + 0x70248 (0x7f3e0b7248 in /root/.cache/Python-Eggs/torchvision-0.4.2-py3.6-linux-aarch64.egg-tmp/torchvision/_C.so)
frame #5: <unknown function> + 0x69718 (0x7f3e0b0718 in /root/.cache/Python-Eggs/torchvision-0.4.2-py3.6-linux-aarch64.egg-tmp/torchvision/_C.so)
frame #6: <unknown function> + 0x699e4 (0x7f3e0b09e4 in /root/.cache/Python-Eggs/torchvision-0.4.2-py3.6-linux-aarch64.egg-tmp/torchvision/_C.so)
frame #7: <unknown function> + 0x534a4 (0x7f3e09a4a4 in /root/.cache/Python-Eggs/torchvision-0.4.2-py3.6-linux-aarch64.egg-tmp/torchvision/_C.so)
<omitting python frames>
frame #9: python3() [0x529958]
frame #11: python3() [0x527860]
frame #12: python3() [0x5297dc]
frame #14: python3() [0x528ff0]
frame #17: python3() [0x5f2bcc]
frame #20: python3() [0x528ff0]
frame #23: python3() [0x5f2bcc]
frame #25: python3() [0x595e5c]
frame #28: python3() [0x528ff0]
frame #31: python3() [0x5f2bcc]
frame #34: python3() [0x528ff0]
frame #37: python3() [0x5f2bcc]
frame #39: python3() [0x595e5c]
frame #41: python3() [0x529738]
frame #43: python3() [0x527860]
frame #44: python3() [0x5297dc]
frame #46: python3() [0x528ff0]
frame #51: __libc_start_main + 0xe0 (0x7f9d2256e0 in /lib/aarch64-linux-gnu/libc.so.6)
frame #52: python3() [0x420e94]

Segmentation fault (core dumped)

To simplify the debugging process, I've come up with a minimal program that gives the same error as above:

import torch
import torchvision
# Two overlapping boxes in (x1, y1, x2, y2) format, with confidence scores
bboxes = [[0.0, 0.0, 2.0, 2.0], [0.75, 0.75, 1.0, 1.0]]
scores = torch.tensor([1., 0.5]).cuda()
boxes = torch.tensor(bboxes).cuda()
# Run NMS on the GPU with an IoU threshold of 0.7
keep = torchvision.ops.nms(boxes, scores, 0.7)
print(keep)

When running this code from within the container, I get essentially the same error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torchvision-0.4.2-py3.6-linux-aarch64.egg/torchvision/ops/boxes.py", line 33, in nms
RuntimeError: CUDA error: no kernel image is available for execution on the device (nms_cuda at /torchvision/torchvision/csrc/cuda/nms_cuda.cu:127)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x78 (0x7f8adb98d8 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: nms_cuda(at::Tensor const&, at::Tensor const&, float) + 0x710 (0x7f6541151c in /root/.cache/Python-Eggs/torchvision-0.4.2-py3.6-linux-aarch64.egg-tmp/torchvision/_C.so)
frame #2: nms(at::Tensor const&, at::Tensor const&, float) + 0x114 (0x7f653b0e7c in /root/.cache/Python-Eggs/torchvision-0.4.2-py3.6-linux-aarch64.egg-tmp/torchvision/_C.so)
frame #3: <unknown function> + 0x73b70 (0x7f653e0b70 in /root/.cache/Python-Eggs/torchvision-0.4.2-py3.6-linux-aarch64.egg-tmp/torchvision/_C.so)
frame #4: <unknown function> + 0x70248 (0x7f653dd248 in /root/.cache/Python-Eggs/torchvision-0.4.2-py3.6-linux-aarch64.egg-tmp/torchvision/_C.so)
frame #5: <unknown function> + 0x69718 (0x7f653d6718 in /root/.cache/Python-Eggs/torchvision-0.4.2-py3.6-linux-aarch64.egg-tmp/torchvision/_C.so)
frame #6: <unknown function> + 0x699e4 (0x7f653d69e4 in /root/.cache/Python-Eggs/torchvision-0.4.2-py3.6-linux-aarch64.egg-tmp/torchvision/_C.so)
frame #7: <unknown function> + 0x534a4 (0x7f653c04a4 in /root/.cache/Python-Eggs/torchvision-0.4.2-py3.6-linux-aarch64.egg-tmp/torchvision/_C.so)
<omitting python frames>
frame #9: python3() [0x529958]
frame #11: python3() [0x527860]
frame #12: python3() [0x5297dc]
frame #14: python3() [0x528ff0]
frame #15: python3() [0x63075c]
frame #20: __libc_start_main + 0xe0 (0x7fb7aa66e0 in /lib/aarch64-linux-gnu/libc.so.6)
frame #21: python3() [0x420e94]

However, when I run this on the host OS, there are no errors. Here is the output of running jetson_release on that device (note that it has torch 1.3 and torchvision 0.4.2 installed as well):

 - NVIDIA Jetson Nano (Developer Kit Version)
   * Jetpack 4.4 DP [L4T 32.4.2]
   * NV Power Mode: MAXN - Type: 0
   * jetson_clocks service: inactive
 - Libraries:
   * CUDA: 10.2.89
   * cuDNN: 8.0.0.145
   * TensorRT: 7.1.0.16
   * Visionworks: 1.6.0.501
   * OpenCV: 4.1.1 compiled CUDA: NO
   * VPI: 0.2.0
   * Vulkan: 1.2.70

And the output of running the minimal program on the host is tensor([0, 1], device='cuda:0'). Do you know why this program fails to run from within the container?
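
(For reference, a one-liner like the following can check which GPU and compute capability PyTorch sees — a minimal sketch; on a Nano it should report capability (5, 3). The "no kernel image is available" error generally means the CUDA extension was not compiled for that architecture.)

# Print the GPU name and compute capability that PyTorch detects
python3 -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"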

astekardis avatar May 27 '20 21:05 astekardis

Hmm, it may be because torchvision was compiled and detected the GPU arch of the machine I built it on (Xavier), instead of the archs that I built PyTorch with (Nano, TX2, Xavier). I will have to investigate how to force other GPU archs in torchvision.

If you rebuild the pytorch container on your Jetson, I think it should work. You can comment out the builds other than pytorch in scripts/docker_build_all.sh and it will build faster.
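
One likely way to force the extra architectures (a sketch, assuming torchvision's setup.py builds its CUDA extension through torch.utils.cpp_extension, which honors the TORCH_CUDA_ARCH_LIST environment variable) would be to set the target archs before the setup.py step in the Dockerfile:

# Sketch: target Nano (5.3), TX2 (6.2), and Xavier (7.2) when building torchvision
export TORCH_CUDA_ARCH_LIST="5.3;6.2;7.2"
python3 setup.py install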



dusty-nv avatar May 28 '20 00:05 dusty-nv

Ah, interesting. I'll give that a try and post here when I'm done.

astekardis avatar May 28 '20 00:05 astekardis

Trying to build on my Nano (running L4T 32.4.2) gives this error:

Step 13/17 : RUN git clone -b ${TORCHVISION_VERSION} https://github.com/pytorch/vision torchvision &&     cd torchvision &&     python3 setup.py install &&     cd ../ &&     rm -rf torchvision &&     pip3 install "${PILLOW_VERSION}"
 ---> Running in bf6bfc8a272f
Cloning into 'torchvision'...
Traceback (most recent call last):
  File "setup.py", line 14, in <module>
    import torch
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 81, in <module>
    from torch._C import *
ImportError: libnvToolsExt.so.1: cannot open shared object file: No such file or directory
The command '/bin/sh -c git clone -b ${TORCHVISION_VERSION} https://github.com/pytorch/vision torchvision &&     cd torchvision &&     python3 setup.py install &&     cd ../ &&     rm -rf torchvision &&     pip3 install "${PILLOW_VERSION}"' returned a non-zero code: 1

Have you seen this issue before while building?

astekardis avatar May 28 '20 00:05 astekardis

Note that this is what I'm running inside docker_build_all.sh (the other builds are commented out):

# PyTorch v1.3.0
build_pytorch "https://nvidia.box.com/shared/static/017sci9z4a0xhtwrb4ps52frdfti9iw0.whl" \
			  "torch-1.3.0-cp36-cp36m-linux_aarch64.whl" \
			  "l4t-pytorch:r32.4.2-pth1.3-py3" \
			  "v0.4.2" \
			  "pillow<7" 

astekardis avatar May 28 '20 01:05 astekardis

@astekardis did you set your docker default-runtime to nvidia, as shown here - https://github.com/dusty-nv/jetson-containers#docker-default-runtime

That enables the nvidia runtime to be used during docker build operations.
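
For reference, that typically amounts to adding "default-runtime": "nvidia" to /etc/docker/daemon.json and restarting the Docker daemon, along these lines (a sketch — merge with whatever is already in your daemon.json):

# Sketch: make the nvidia runtime the default so it is also used during docker build
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
EOF
sudo systemctl restart docker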

dusty-nv avatar May 28 '20 01:05 dusty-nv

Ah, I was trying to build on a device with a fresh install of JetPack 4.4 and had forgotten to do that. Thanks. I'll post when it finishes building the docker image.

astekardis avatar May 28 '20 01:05 astekardis

Okay, the minimal program works in the image that I built locally. Should I leave this issue open while you look into the issue with the image(s) on nvcr.io?

astekardis avatar May 28 '20 02:05 astekardis

I am having the same issue while building on top of nvcr.io/nvidia/l4t-base:r32.6.1 with a custom-built torchvision (for a Jetson TX2 NX); I will keep this thread posted.

ntakouris avatar Oct 06 '22 10:10 ntakouris

Just want to report that I am also facing this issue on a Jetson Orin using the l4t-ml docker image [JetPack 5.0.2 (L4T R35.1.0)].

udit7395 avatar Aug 17 '23 21:08 udit7395

Just want to report that I am also facing this issue on a Jetson Orin using the l4t-ml docker image [JetPack 5.0.2 (L4T R35.1.0)].

@udit7395 I would recommend trying (or building) one of the updated l4t-ml or l4t-pytorch container images:

  • https://github.com/dusty-nv/jetson-containers/tree/master/packages/l4t/l4t-ml
  • https://github.com/dusty-nv/jetson-containers/tree/master/packages/l4t/l4t-pytorch

dusty-nv avatar Aug 17 '23 21:08 dusty-nv

@dusty-nv Thanks, I am no longer facing this issue.

udit7395 avatar Aug 18 '23 18:08 udit7395