[Support]: Docker (0.12.0-beta2-tensorrt) exception trying to load libnvrtc.so (not found)?

Open Codelica opened this issue 2 years ago • 9 comments

Describe the problem you are having

I'm at a loss and hoping for any suggestions. Basically I'm trying to get a TensorRT detector working with blakeblackshear/frigate:0.12.0-beta2-tensorrt (Docker compose config).

I feel like my general NVIDIA configuration is OK, given:

  • I was able to generate the trt-models using the tensorrt_models.sh script inside a nvcr.io/nvidia/tensorrt:22.07-py3 container
  • nvidia-smi works in the Frigate container, on the host, and in my other NVIDIA runtime containers.
  • ffmpeg hardware acceleration is working fine with the Frigate container using preset-nvidia-h264 and -c:v h264_cuvid
  • I'm running other containers which use CUDA, etc.

However, when trying to startup a TensorRT detector, I get the following:

Could not load library libcudnn_cnn_infer.so.8. Error: libnvrtc.so: cannot open shared object file: No such file or directory
Fatal Python error: Aborted

I see libnvrtc.so on both my host and inside the nvcr.io/nvidia/tensorrt:22.07-py3 and other containers, but not inside my Frigate container. So I'm perplexed as to how I can make libnvrtc.so (from CUDA?) available in the container short of bind mounting /usr/local/cuda-11.7/targets/x86_64-linux/lib/ from the host (having tried a variety of compose options).

Version

blakeblackshear/frigate:0.12.0-beta2-tensorrt

Frigate config file

# I'm using this simplified config to test, which runs fine when moved to CPU detector

mqtt:
  host: mqtt.mydomain.com
  port: 8883
  client_id: frigate
  topic_prefix: frigate
  user: myuser
  password: mypass
  tls_ca_certs: /etc/ssl/certs/ca-certificates.crt
  tls_insecure: false

cameras:
  Front-Door:
    ffmpeg:
      hwaccel_args: preset-nvidia-h264
      input_args:
        - -c:v
        - h264_cuvid
      inputs:
        - path: rtsp://myuser:[email protected]:10554/Streaming/Channels/202
          roles:
            - detect
            - restream
        - path: rtsp://myuser:[email protected]:10554/Streaming/Channels/201
          roles:
            - record
    snapshots:
      enabled: true
    motion:
      mask:
        - 142,28,241,33,241,0,142,0
    detect:
      width: 640
      height: 360

detectors:
  tensorrt:
    type: tensorrt

model:
  path: /trt-models/yolov7-tiny-416.trt
  input_tensor: nchw
  input_pixel_format: rgb
  width: 416
  height: 416

Relevant log output

s6-rc: info: service s6rc-oneshot-runner: starting
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
cont-init: info: running /etc/cont-init.d/prepare-logs.sh
cont-init: info: /etc/cont-init.d/prepare-logs.sh exited 0
s6-rc: info: service legacy-cont-init successfully started
s6-rc: info: service legacy-services: starting
services-up: info: copying legacy longrun frigate (no readiness notification)
services-up: info: copying legacy longrun go2rtc (no readiness notification)
services-up: info: copying legacy longrun nginx (no readiness notification)
s6-rc: info: service legacy-services successfully started
2023-01-11 00:46:53.496196078  07:46:53.496 INF go2rtc version 0.1-rc.6 linux/amd64
2023-01-11 00:46:53.496959381  07:46:53.496 INF [api] listen addr=:1984
2023-01-11 00:46:53.497028236  07:46:53.497 INF [rtsp] listen addr=:8554
2023-01-11 00:46:53.497228724  07:46:53.497 INF [webrtc] listen addr=:8555
2023-01-11 00:46:53.497280472  07:46:53.497 INF [srtp] listen addr=:8443
2023-01-11 00:46:54.639356794  [2023-01-11 00:46:54] frigate.app                    INFO    : Starting Frigate (0.12.0-0dbf909)
2023-01-11 00:46:54.661348602  [2023-01-11 00:46:54] peewee_migrate                 INFO    : Starting migrations
2023-01-11 00:46:54.666553629  [2023-01-11 00:46:54] peewee_migrate                 INFO    : There is nothing to migrate
2023-01-11 00:46:54.674083840  [2023-01-11 00:46:54] ws4py                          INFO    : Using epoll
2023-01-11 00:46:54.690982397  [2023-01-11 00:46:54] detector.tensorrt              INFO    : Starting detection process: 970
2023-01-11 00:46:54.691723240  [2023-01-11 00:46:54] frigate.app                    INFO    : Output process started: 972
2023-01-11 00:46:54.694029800  [2023-01-11 00:46:54] ws4py                          INFO    : Using epoll
2023-01-11 00:46:54.695904656  [2023-01-11 00:46:54] frigate.app                    INFO    : Camera processor started for Front-Door: 976
2023-01-11 00:46:54.699253070  [2023-01-11 00:46:54] frigate.app                    INFO    : Capture process started for Front-Door: 978
2023-01-11 00:46:55.148182652  [2023-01-11 00:46:55] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init CUDA: CPU +188, GPU +0, now: CPU 241, GPU 127 (MiB)
2023-01-11 00:46:55.166258368  [2023-01-11 00:46:55] frigate.detectors.plugins.tensorrt INFO    : Loaded engine size: 35 MiB
2023-01-11 00:46:55.512402191  [2023-01-11 00:46:55] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +192, GPU +74, now: CPU 496, GPU 241 (MiB)
2023-01-11 00:46:55.690972712  [2023-01-11 00:46:55] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuDNN: CPU +110, GPU +44, now: CPU 606, GPU 285 (MiB)
2023-01-11 00:46:55.705521956  Could not load library libcudnn_cnn_infer.so.8. Error: libnvrtc.so: cannot open shared object file: No such file or directory
2023-01-11 00:46:55.705531168  Fatal Python error: Aborted
2023-01-11 00:46:55.705543019
2023-01-11 00:46:55.705547155  Thread 0x00007f6348f9a6c0 (most recent call first):
2023-01-11 00:46:55.705553100    File "/usr/lib/python3.9/threading.py", line 312 in wait
2023-01-11 00:46:55.705558934    File "/usr/lib/python3.9/multiprocessing/queues.py", line 233 in _feed
2023-01-11 00:46:55.705603275    File "/usr/lib/python3.9/threading.py", line 892 in run
2023-01-11 00:46:55.705639906    File "/usr/lib/python3.9/threading.py", line 954 in _bootstrap_inner
2023-01-11 00:46:55.705644013    File "/usr/lib/python3.9/threading.py", line 912 in _bootstrap
2023-01-11 00:46:55.705647504
2023-01-11 00:46:55.705651546  Current thread 0x00007f634d256740 (most recent call first):
2023-01-11 00:46:55.705655880    File "/opt/frigate/frigate/detectors/plugins/tensorrt.py", line 229 in __init__
2023-01-11 00:46:55.705660139    File "/opt/frigate/frigate/detectors/__init__.py", line 24 in create_detector
2023-01-11 00:46:55.705664586    File "/opt/frigate/frigate/object_detection.py", line 52 in __init__
2023-01-11 00:46:55.705668786    File "/opt/frigate/frigate/object_detection.py", line 97 in run_detector
2023-01-11 00:46:55.705686380    File "/usr/lib/python3.9/multiprocessing/process.py", line 108 in run
2023-01-11 00:46:55.705690779    File "/usr/lib/python3.9/multiprocessing/process.py", line 315 in _bootstrap
2023-01-11 00:46:55.705695155    File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 71 in _launch
2023-01-11 00:46:55.705709406    File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 19 in __init__
2023-01-11 00:46:55.705730545    File "/usr/lib/python3.9/multiprocessing/context.py", line 277 in _Popen
2023-01-11 00:46:55.705754864    File "/usr/lib/python3.9/multiprocessing/context.py", line 224 in _Popen
2023-01-11 00:46:55.705792265    File "/usr/lib/python3.9/multiprocessing/process.py", line 121 in start
2023-01-11 00:46:55.705818600    File "/opt/frigate/frigate/object_detection.py", line 172 in start_or_restart
2023-01-11 00:46:55.705843911    File "/opt/frigate/frigate/object_detection.py", line 144 in __init__
2023-01-11 00:46:55.705868075    File "/opt/frigate/frigate/app.py", line 214 in start_detectors
2023-01-11 00:46:55.705889471    File "/opt/frigate/frigate/app.py", line 364 in start
2023-01-11 00:46:55.705908039    File "/opt/frigate/frigate/__main__.py", line 16 in <module>
2023-01-11 00:46:55.705937887    File "/usr/lib/python3.9/runpy.py", line 87 in _run_code
2023-01-11 00:46:55.705984158    File "/usr/lib/python3.9/runpy.py", line 197 in _run_module_as_main
2023-01-11 00:47:15.027433642  [2023-01-11 00:47:15] frigate.watchdog               INFO    : Detection appears to have stopped. Exiting frigate...
s6-rc: info: service legacy-services: stopping
2023-01-11 00:47:15.034035211  exit OK
2023-01-11 00:47:15.034394785  [2023-01-11 00:47:15] frigate.app                    INFO    : Stopping...
2023-01-11 00:47:15.035051550  [2023-01-11 00:47:15] ws4py                          INFO    : Closing all websockets with [1001] 'Server is shutting down'
2023-01-11 00:47:15.035056307  [2023-01-11 00:47:15] frigate.storage                INFO    : Exiting storage maintainer...
2023-01-11 00:47:15.037505849  [2023-01-11 00:47:15] frigate.events                 INFO    : Exiting event cleanup...
2023-01-11 00:47:15.038340104  [2023-01-11 00:47:15] frigate.record                 INFO    : Exiting recording cleanup...
2023-01-11 00:47:15.038345550  [2023-01-11 00:47:15] frigate.stats                  INFO    : Exiting watchdog...
2023-01-11 00:47:15.038360928  [2023-01-11 00:47:15] frigate.record                 INFO    : Exiting recording maintenance...
2023-01-11 00:47:15.038635641  [2023-01-11 00:47:15] frigate.watchdog               INFO    : Exiting watchdog...
2023-01-11 00:47:15.038826899  [2023-01-11 00:47:15] frigate.events                 INFO    : Exiting event processor...
s6-svwait: fatal: supervisor died
s6-rc: info: service legacy-services successfully stopped
s6-rc: info: service legacy-cont-init: stopping
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service s6rc-oneshot-runner successfully stopped

FFprobe output from your camera

N/A

Frigate stats

N/A

Operating system

Debian

Install method

Docker Compose

Coral version

Other

Network connection

Wired

Camera make and model

N/A

Any other information that may be helpful

nvidia-smi inside the container (the ffmpeg process doesn't show here, but it does show in nvidia-smi and nvtop on the host):

root@frigate:/opt/frigate# nvidia-smi
Wed Jan 11 00:53:49 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P2000        Off  | 00000000:51:00.0 Off |                  N/A |
| 52%   45C    P0    16W /  75W |     74MiB /  5120MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Looking for libs in Frigate container:

root@frigate:/opt/frigate# ldconfig -p |grep libcudnn_cnn_infer
<null>

root@frigate:/opt/frigate# ldconfig -p |grep libnvrtc
<null>

root@frigate:/opt/frigate# find / -name libcudnn_cnn_infer* -print
/usr/local/lib/python3.9/dist-packages/nvidia/cudnn/lib/libcudnn_cnn_infer.so.8

root@frigate:/opt/frigate# find / -name libnvrtc* -print
<null>

Looking for libs inside nvcr.io/nvidia/tensorrt:22.07-py3 used to generate /trt-models:

root@docker:/ # docker run -it --rm nvcr.io/nvidia/tensorrt:22.07-py3 sh -c 'ldconfig -p |grep libcudnn_cnn_infer'

	libcudnn_cnn_infer.so.8 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8
	libcudnn_cnn_infer.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudnn_cnn_infer.so

root@docker:/ # docker run -it --rm nvcr.io/nvidia/tensorrt:22.07-py3 sh -c 'ldconfig -p |grep libnvrtc'

	libnvrtc.so.11.2 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc.so.11.2
	libnvrtc.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc.so
	libnvrtc-builtins.so.11.7 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc-builtins.so.11.7
	libnvrtc-builtins.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc-builtins.so

Docker compose file (several other variations tried with same result):

version: "3.7"
services:
  frigate:
    container_name: frigate
    hostname: frigate
    image: blakeblackshear/frigate:0.12.0-beta2-tensorrt
    privileged: true
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: "256mb"    
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /storage/docker/frigate/config.yml:/config/config.yml:ro
      - /storage/docker/frigate/storage:/media/frigate
      - /storage/docker/frigate/trt-models:/trt-models
      - type: tmpfs
        target: /tmp/cache
        tmpfs:
          size: 1000000000
    ports:
      - "127.0.0.1:9049:5000"
    environment:
      FRIGATE_RTSP_PASSWORD: "somepassword"
      NVIDIA_VISIBLE_DEVICES: all
      NVIDIA_DRIVER_CAPABILITIES: compute,utility,video
    restart: unless-stopped

Thanks in advance for ANY ideas! 👍

Codelica avatar Jan 11 '23 08:01 Codelica

Hi, first: thanks for all this work! Looking forward to having GPU detection and restream. I very much appreciate your work!

I have the same issue as Codelica. Without GPU detection, it works out of the box. Using the GPU version of ffmpeg also works fine, and model generation completes without any issues. Only putting the whole thing together breaks with the error above.

damsport11 avatar Jan 11 '23 11:01 damsport11

Cc @NateMeyer

NickM-27 avatar Jan 11 '23 12:01 NickM-27

@damsport11 What GPU are you using?

NickM-27 avatar Jan 11 '23 14:01 NickM-27

Yes, @damsport11 can you clarify what GPU and Host OS you're running?

I think we saw some users on UnRaid have this issue, related to the drivers that were installed on the Host.

I'll look into what could be going on, but my first hunch is it's an issue between the container and the host driver.

NateMeyer avatar Jan 11 '23 14:01 NateMeyer

System is Ubuntu 20.04, GPU is an RTX 2060 12 GB, NVIDIA-SMI 525.60.11, CUDA V11.6.124, cuDNN 8.4.1. Thanks for the help ;-)

damsport11 avatar Jan 11 '23 15:01 damsport11

FWIW, I bind mounted /usr/local/cuda-11.7/targets/x86_64-linux/lib/libnvrtc.so.11.7.99 from the host side to /usr/local/lib/python3.9/dist-packages/nvidia/cudnn/lib/libnvrtc.so in the container and everything came to life, with detections working, etc. Just not sure if that should be magically getting passed in via some more official mechanism. :)
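
For anyone wanting to try the same thing, this is roughly what that bind mount looks like as a compose volume entry (the paths are the ones from my setup above, so adjust the CUDA version and location for your host):

services:
  frigate:
    volumes:
      # host libnvrtc mapped to where the cuDNN loader looks inside the container
      - /usr/local/cuda-11.7/targets/x86_64-linux/lib/libnvrtc.so.11.7.99:/usr/local/lib/python3.9/dist-packages/nvidia/cudnn/lib/libnvrtc.so:ro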

Codelica avatar Jan 11 '23 15:01 Codelica

Yes, I can confirm this is a workaround ;-)

damsport11 avatar Jan 11 '23 15:01 damsport11

Can you see if you are able to update the CUDA libraries on your host?

NateMeyer avatar Jan 12 '23 04:01 NateMeyer

I could give it a shot tonight, but what should I target? I'm at 11.7.0 currently, so there's 11.7.1, 11.8.0, or 12.0.0.

Codelica avatar Jan 12 '23 16:01 Codelica

I believe the image is pulling in 11.7.1 libraries, so I would expect 11.7.1 to work. The 11.x drivers are supposed to be backwards compatible, so installing 11.8 shouldn't hurt. I've done my testing with 12.0 installed, and it worked fine with the 11.7.1 runtime libraries in the image.

So I would expect any of them to work. Would you mind stepping through them? If 11.7.1 works, we could add that to the documentation as a minimum version needed. I'll see if I can do similar testing this weekend.
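
If it helps while stepping through versions, something like this should show what the host currently has (a sketch; the paths are the usual CUDA install defaults, so adjust as needed):

# what the dynamic linker sees, which CUDA toolkits are present, and the driver version
ldconfig -p | grep libnvrtc
ls /usr/local/cuda*/targets/x86_64-linux/lib/libnvrtc.so* 2>/dev/null
nvidia-smi --query-gpu=driver_version --format=csv,noheader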

NateMeyer avatar Jan 12 '23 18:01 NateMeyer

I have the same issue even with 12.0 installed on my host. Initially I didn't have the nvrtc libs installed on my host, but even after installing them, only the above-mentioned workaround of bind mounting libnvrtc.so.12 into the container worked for me.

dennispg avatar Jan 13 '23 18:01 dennispg

I can try some other versions this weekend, although I'm pretty doubtful the NVIDIA Docker runtime will pass libnvrtc.so from the host. I'm definitely not an expert, but that seems to go beyond what the NVIDIA toolkit passes in via the Docker runtime. I took a look at a couple of other containers I run which seem to make use of NVRTC, and both have it packaged in the image (via installation of a libnvrtc package).

It looks like Frigate is bringing in the NVIDIA resources via Python, so my wild-ass Friday guess is that adding nvidia-cuda-nvrtc-cu11 to requirements-tensorrt.txt may do the trick.
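
Something along these lines in requirements-tensorrt.txt, though the exact version pin is a guess on my part:

# hypothetical addition; the pin would presumably track the CUDA 11.7 runtime already in the image
nvidia-cuda-nvrtc-cu11 == 11.7.*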

Codelica avatar Jan 13 '23 19:01 Codelica

OK, thanks, that is helpful. I'll try to recreate this over the weekend. Have you also tried with the beta3 image?

NateMeyer avatar Jan 13 '23 22:01 NateMeyer

I am seeing the issue with beta3. I think @Codelica is right on.

dennispg avatar Jan 13 '23 22:01 dennispg

I'm on beta3 at this point also, and it acts the same. I was hoping it might resolve a couple of little items I've been trying to track down, but I'm not confident enough to call them bugs yet :)

Codelica avatar Jan 13 '23 23:01 Codelica

I am having the same issue with beta3 on Debian 11; installing libnvrtc11.2 and mounting the shared object mentioned above worked too.

- /usr/lib/x86_64-linux-gnu/libnvrtc.so.11.2:/usr/local/lib/python3.9/dist-packages/nvidia/cudnn/lib/libnvrtc.so

BBaoVanC avatar Jan 14 '23 05:01 BBaoVanC

OK. I've added the nvrtc and reworked the library loading a little bit. Can you please try this test image to see if it resolves the issue?

ghcr.io/natemeyer/frigate:0.12.0-0baae65-tensorrt

@BBaoVanC Which GPU are you using?

I'm wondering if there is something about how the model is generated, or certain GPU arch optimizations that rely on nvrtc when others don't. Or, more likely, libcudnn_cnn_infer.so is only used in certain scenarios and happens to depend on nvrtc.
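
One quick way to check that dependency (just a diagnostic sketch; cuDNN may dlopen nvrtc at runtime rather than link it, in which case ldd would come back empty):

ldd /usr/local/lib/python3.9/dist-packages/nvidia/cudnn/lib/libcudnn_cnn_infer.so.8 | grep -i nvrtc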

NateMeyer avatar Jan 14 '23 15:01 NateMeyer

My GPU is a GTX 1050, Driver Version: 525.60.13, CUDA Version: 12.0. I am using the yolov7-tiny-416 model.

dennispg avatar Jan 14 '23 15:01 dennispg

@NateMeyer I just tried with your image and still get the error:

2023-01-14 09:41:22.576908660  Could not load library libcudnn_cnn_infer.so.8. Error: libnvrtc.so: cannot open shared object file: No such file or directory
2023-01-14 09:41:22.576911440  Fatal Python error: Aborted

dennispg avatar Jan 14 '23 15:01 dennispg

Thanks @dennispg, I'll keep looking. I'm able to force loading of libcudnn_cnn_infer.so.8 on my side, and it doesn't complain about libnvrtc.

NateMeyer avatar Jan 14 '23 15:01 NateMeyer

Aha! I've recreated this issue by regenerating the models.

Running the yolov4-tiny-416 model instead of yolov7 does not complain.

NateMeyer avatar Jan 14 '23 16:01 NateMeyer

I included which model I am using because I had a feeling it might be something like that... glad you were able to pinpoint it!

dennispg avatar Jan 14 '23 16:01 dennispg

New test image with symlink: ghcr.io/natemeyer/frigate:0.12.0-9c641ec-tensorrt
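
For reference, the manual equivalent of that image change would be roughly a symlink so the loader can find the pip-installed nvrtc library under the unversioned name it asks for (the exact wheel path below is an assumption):

# hypothetical sketch: link the versioned lib shipped by the nvrtc wheel to the name the loader wants
NVRTC_DIR=/usr/local/lib/python3.9/dist-packages/nvidia/cuda_nvrtc/lib   # assumed install location
ln -s "$NVRTC_DIR/libnvrtc.so.11.2" "$NVRTC_DIR/libnvrtc.so"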

NateMeyer avatar Jan 14 '23 16:01 NateMeyer

ghcr.io/natemeyer/frigate:0.12.0-9c641ec-tensorrt is working fine for me without the external bind mount. 👍

Codelica avatar Jan 14 '23 18:01 Codelica

Awesome! Thanks so much for helping troubleshoot this.

NateMeyer avatar Jan 14 '23 18:01 NateMeyer