
CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

Open adefaria opened this issue 2 years ago • 58 comments

Now that I've finally converted all of my videos using Tdarr, I like how I can leave it running so that newly downloaded videos get compressed automatically. And this works... but then it dies with "CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected". I've configured my Docker container to run Tdarr with the Nvidia runtime, and that works. But then it breaks with that error. The workaround is simply to restart the Docker container and re-queue the videos, but why does it break in the first place?

QjALSDImR-log.txt

adefaria avatar Jul 25 '22 00:07 adefaria

Another data point: when a transcode fails in the docker container, I can replicate the ffmpeg command on the desktop (i.e., outside of the docker container) and it works fine.

Why does it break in the docker container?
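
A rough sketch of how to compare the two environments (the container name tdarr_node and the synthetic NVENC test are illustrative, not the exact Tdarr command):

ffmpeg -hide_banner -hwaccels                               # hardware accelerators on the host
docker exec -it tdarr_node ffmpeg -hide_banner -hwaccels    # hardware accelerators inside the container
ffmpeg -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -f null -    # quick NVENC smoke test, output discarded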

adefaria avatar Aug 02 '22 22:08 adefaria

Hmm, seems like a bug in the interaction between FFmpeg/Docker/hardware; perhaps an FFmpeg update will fix the issue. The dev container has an FFmpeg update; looking to get it out soon.

HaveAGitGat avatar Aug 08 '22 23:08 HaveAGitGat

I can try out the new version when it becomes available. I suspect it may be the Nvidia runtime for Docker containers (I have a foggy memory of installing that to get GPU transcoding to work in a Docker container) and something there breaks down, causing the CUDA_ERROR_NO_DEVICE. Lately it transcodes just a video or two and then suddenly there are no devices left to GPU transcode with. Restarting the Docker container always fixes the problem, but that's essentially babysitting Tdarr, which shouldn't be necessary.
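
When it gets into that state, a quick sanity check is whether the GPU is still visible inside the container versus on the host (a sketch; tdarr_node is my container name):

nvidia-smi -L                           # GPUs visible on the host
docker exec tdarr_node nvidia-smi -L    # GPUs visible inside the container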

adefaria avatar Aug 09 '22 03:08 adefaria

FWIW I am seeing this exact behavior, happy to supply logs/test if it helps at all, a restart of the docker fixes it every time.

Sc0th avatar Aug 18 '22 06:08 Sc0th

I had cron'ed a restart of the Docker container at midnight in an attempt to mitigate this issue but it remains.
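
For reference, the cron entry is just a nightly restart, roughly (container name is whatever yours is called):

0 0 * * * /usr/bin/docker restart tdarr_node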

It might be nice if Tdarr allowed access to its database so I could interrogate the state and restart the Docker container if need be, but then again, if the bug is fixed this would be unnecessary.

Out of curiosity, @Sc0th, what's your environment? Where are you running the Docker container? What OS/machine? And where's your server?

adefaria avatar Aug 18 '22 15:08 adefaria

Apologies, I could have made that post slightly more useful!

I also tried an automated restart, to no avail. I am running the container using podman on a VM running on Proxmox with PCI pass-through.

Some (maybe) useful detail:

GPU - Nvidia 1050 Ti
Proxmox 7.2-7

Linux infra01 4.18.0-372.19.1.el8_6.x86_64 #1 SMP Tue Aug 2 13:42:59 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
AlmaLinux 8.6 (Sky Tiger)

nvidia-container-toolkit-1.10.0-1.x86_64
kmod-nvidia-515.65.01-1.el8_6.elrepo.x86_64

podman run --name tdarr-node -v /appdata01/INFRA/tdarr-node/configs:/app/configs -v /appdata01/INFRA/tdarr-node/logs:/app/logs -v /MEDIA:/media -v /appdata01/PROD/tdarr/transcode:/transcode -e "nodeID=infra01-tdarr-node" -e "serverIP=x.x.x.x" -e "serverPort=8266" --net=host -e PUID=1000 -e PGID=1000 -e "NVIDIA_DRIVER_CAPABILITIES=all" -e "NVIDIA_VISIBLE_DEVICES=all" --gpus=all -d ghcr.io/haveagitgat/tdarr_node

Tdarr Server & Node 2.00.18

I am using the following line monitored by Zabbix to alert me when it gets stuck:

cat /appdata01/PROD/tdarr/conf/server/Tdarr/DB2/StatisticsJSONDB/*.json | jq . | grep -ic err

A result higher than 0 indicates it has got stuck. I did look at using this to 'self heal' by restarting the container on trigger; however, the jobs do not appear to re-queue automatically, so that did not quite go to plan.
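
The 'self heal' attempt was essentially this (a sketch; the paths match my layout above and tdarr-node is my container name):

#!/bin/sh
# Restart the node container if any error strings show up in the statistics DB
errs=$(cat /appdata01/PROD/tdarr/conf/server/Tdarr/DB2/StatisticsJSONDB/*.json | jq . | grep -ic err)
if [ "$errs" -gt 0 ]; then
    podman restart tdarr-node
fi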

I wait with grateful anticipation for the next release, in the hope of a fix!

Sc0th avatar Aug 19 '22 00:08 Sc0th

Yeah, I think it will end up being some interaction between Tdarr and the Nvidia runtime for the Docker container that causes Nvidia to lose track of available devices, thus reporting CUDA_ERROR_NO_DEVICE.

I'm running on my Thelio desktop running Ubuntu 22.04. I have the following:

Earth:apt list | grep nvidia-container

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnvidia-container-dev/bionic 1.10.0-1 amd64
libnvidia-container-tools/bionic,now 1.10.0-1 amd64 [installed,automatic]
libnvidia-container1-dbg/bionic 1.10.0-1 amd64
libnvidia-container1/bionic,now 1.10.0-1 amd64 [installed,automatic]
nvidia-container-runtime/bionic 3.10.0-1 all
nvidia-container-toolkit/bionic,now 1.10.0-1 amd64 [installed]
Earth:

adefaria avatar Aug 19 '22 03:08 adefaria

Still have this issue, however today I think I captured a log of Tdarr working on a video when the CUDA_ERROR_NO_DEVICE happened right in the middle. Maybe this log will help in debugging this bug: to7uvYRwN-log.txt

adefaria avatar Aug 23 '22 16:08 adefaria

I am trying a downgrade of nvidia-docker2 from 2.11.0-1 back to 2.10.0-1 to see if that makes any difference as a workaround. If not, I may try 2.9.1-1.

I am having this same issue on Ubuntu 22.04 LTS.

Lebo77 avatar Aug 27 '22 20:08 Lebo77

Downgrading to nvidia-docker2 version 2.9.1-1 seems to be a workaround for this issue, at least after a day of testing. If it STOPS working I will let you know.

Annoying to have my updates reporting a held-back package, but better than Tdarr-node breaking every hour or two.
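
In case it helps anyone, downgrading and holding the package looks roughly like this (version string as reported by apt; adjust for your distro):

sudo apt install nvidia-docker2=2.9.1-1 --allow-downgrades
sudo apt-mark hold nvidia-docker2    # stop apt upgrading it again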

Lebo77 avatar Aug 28 '22 23:08 Lebo77

Ok thanks for the update 👍

HaveAGitGat avatar Aug 28 '22 23:08 HaveAGitGat

Thanks. Downgraded to 2.9.1-1. Will report what happens.

adefaria avatar Aug 29 '22 00:08 adefaria

Have you seen this, from https://docs.nvidia.com/ai-enterprise/deployment-guide/dg-docker.html? 'With the release of Docker 19.03, usage of nvidia-docker2 packages is deprecated since NVIDIA GPUs are now natively supported as devices in the Docker runtime.'
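
If you are on docker >= 19.03, a quick smoke test of the native --gpus support without nvidia-docker2 would be something like (the CUDA image tag is just an example):

docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi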

I am using podman, with:

nvidia-container-toolkit-1.10.0-1.x86_64
nvidia-x11-drv-515.65.01-1.el8_6.elrepo.x86_64
libnvidia-container1-1.10.0-1.x86_64
nvidia-x11-drv-libs-515.65.01-1.el8_6.elrepo.x86_64
libnvidia-container-tools-1.10.0-1.x86_64
kmod-nvidia-515.65.01-1.el8_6.elrepo.x86_64

And am seeing the same issue.

Sc0th avatar Aug 29 '22 00:08 Sc0th

So are you saying I can just remove nvidia-docker2 and restart the docker container and it'll all work?

adefaria avatar Aug 29 '22 00:08 adefaria

That would depend on the version of docker you are running.

Sc0th avatar Aug 29 '22 00:08 Sc0th

Earth:apt list | grep ^docker

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

docker-clean/jammy,jammy 2.0.4-4 all
docker-compose/jammy,jammy 1.29.2-1 all
docker-doc/jammy,jammy 20.10.12-0ubuntu4 all
docker-registry/jammy 2.8.0+ds1-4 amd64
docker.io/jammy,now 20.10.12-0ubuntu4 amd64 [installed]
docker2aci/jammy 0.17.2+dfsg-2.1 amd64
docker/jammy,jammy 1.5-2 all
Earth:

adefaria avatar Aug 29 '22 00:08 adefaria

Not massively familiar with the 'apt' set of commands, but would that not show what is available rather than what is installed? apt list --installed | grep -i docker might work, or perhaps docker --version?

Edit - duh I just saw the [installed] - I guess that means version 20.x so in theory.....

Sc0th avatar Aug 29 '22 01:08 Sc0th

Earth:apt list --installed | grep -i docker

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

docker.io/jammy,now 20.10.12-0ubuntu4 amd64 [installed]
nvidia-docker2/bionic,now 2.9.1-1 all [installed,upgradable to: 2.11.0-1]
Earth:

The [installed] marks these as installed. Note the downgrade of nvidia-docker2 to 2.9.1-1 from 2.11.0-1.

adefaria avatar Aug 29 '22 01:08 adefaria

Apologies, I noticed that a little late. I have downgraded my installs by one release also; we'll see if it has any impact...

nvidia-x11-drv-515.65.01-1.el8_6.elrepo.x86_64
libnvidia-container-tools-1.9.0-1.x86_64
libnvidia-container1-1.9.0-1.x86_64
nvidia-x11-drv-libs-515.65.01-1.el8_6.elrepo.x86_64
kmod-nvidia-515.65.01-1.el8_6.elrepo.x86_64
nvidia-container-toolkit-1.9.0-1.x86_64
podman version 4.1.1

🤞

Sc0th avatar Aug 29 '22 01:08 Sc0th

OK, I think I have got it working WITHOUT the nvidia-docker2 package at all.

It requires a docker-compose update to 1.29 and version 20 of docker itself to use the built-in NVIDIA functionality.

The docker-compose file needs to be modified to tell it about the GPU(s):

 tdarr-node:
        container_name: tdarr-node
        image: haveagitgat/tdarr_node:latest
        restart: unless-stopped
        network_mode: service:tdarr
#        runtime: nvidia # Comment this out. Not needed with the built-in NVIDIA support
        deploy: # ADD this section
          resources:
            reservations:
              devices:
                - capabilities: [gpu]
        environment:
            - TZ=America/New_York
            - PUID=1000
            - PGID=1000
            - UMASK_SET=002
            - nodeID=Odin
            - nodeIP=0.0.0.0
            - nodePort=8267
            - serverIP=0.0.0.0
            - serverPort=8266
            - NVIDIA_VISIBLE_DEVICES=all # Not sure if these are still needed
            - NVIDIA_DRIVER_CAPABILITIES=all
        volumes:
            - <Your volumes here>
        depends_on:
            - tdarr

So far (just an hour and 3-4 transcodes) the GPU is still working in the node.

Thanks to Sc0th for pointing out that nvidia-docker2 is deprecated. I recommend that the examples and documentation be updated to match the new approach and eliminate the need for nvidia-docker2, at least on newer versions of docker.
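
For anyone using plain docker run instead of compose, the equivalent should be roughly the following (an untested sketch; names, IPs and volume paths are placeholders):

docker run -d --name tdarr_node --gpus all \
  -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e nodeID=MyNode -e serverIP=x.x.x.x -e serverPort=8266 \
  -e PUID=1000 -e PGID=1000 \
  -v /path/to/configs:/app/configs -v /path/to/media:/media \
  ghcr.io/haveagitgat/tdarr_node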

Lebo77 avatar Aug 29 '22 03:08 Lebo77

How are you verifying that the GPUs are being used? I ask because I had to compile up nvtop to monitor GPU usage.

Also, when I re-create my Tdarr_node docker container (I use docker run), it basically does a chown for about an hour, which I find pointless: it seems to be chowning the database to set the owner to 1000:1000, but with 31k entries in the database from transcoding my entire library it takes about an hour before this process stops. Until it finishes, the Tdarr_node is not registered and does not show up in the web interface.

Screenshot at 2022-08-29 07-06-18

Do you experience this too?

adefaria avatar Aug 29 '22 14:08 adefaria

> How are you verifying that the GPUs are being used? I ask because I had to compile up nvtop to monitor GPU usage.

I kicked off a transcode and ran nvidia-smi in another window. The ffmpeg process showed up in the process list. Plus the transcode plug-in I use only works with NVENC.

> Also, when I re-create my Tdarr_node docker container (I use docker run), it basically does a chown for about an hour, which I find pointless: it seems to be chowning the database to set the owner to 1000:1000, but with 31k entries in the database from transcoding my entire library it takes about an hour before this process stops. Until it finishes, the Tdarr_node is not registered and does not show up in the web interface.
>
> Screenshot at 2022-08-29 07-06-18
>
> Do you experience this too?

No, but all my files were already owned by 1000:1000. I just copied a chunk from my compose file; this would need to be modified to match your configuration. Your tdarr server process is likely not named Odin either.

Lebo77 avatar Aug 29 '22 17:08 Lebo77

> No, but all my files were already owned by 1000:1000. I just copied a chunk from my compose file; this would need to be modified to match your configuration. Your tdarr server process is likely not named Odin either.

Correction: the script I have that does docker run first stops the old container, removes it, then uses docker run to re-create the container. I believe this is what causes the long-running chown.
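
The script is basically (a sketch; the full docker run arguments are omitted here):

docker stop tdarr_node
docker rm tdarr_node
docker run -d --name tdarr_node ... (same run command as before)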

Not sure what you mean by tdarr server being named Odin...

adefaria avatar Aug 29 '22 17:08 adefaria

> No, but all my files were already owned by 1000:1000. I just copied a chunk from my compose file; this would need to be modified to match your configuration. Your tdarr server process is likely not named Odin either.
>
> Correction: the script I have that does docker run first stops the old container, removes it, then uses docker run to re-create the container. I believe this is what causes the long-running chown.
>
> Not sure what you mean by tdarr server being named Odin...

The "NodeID" in my chunk of compose file above. The server I run this on is named "Odin" and it says that in the file. You will likely want to change that name.

Also: really consider moving to docker-compose from docker run. There is a learning curve but MAN is it easier to manage once you get it working.

Lebo77 avatar Aug 29 '22 20:08 Lebo77

Not looking to move over to docker compose. I have a simple script. I run it once. Tdarr should then run in a container in the background with no more input from me.

There's also podman or something like that and lots of other Docker technologies. I'm not really that interested in having yet another learning curve. I have enough of them already.

On the plus side, Tdarr node seems to be holding up without that pesky nvidia-docker2 module...

adefaria avatar Aug 29 '22 20:08 adefaria

Alas, this seemed to be working but it failed again today. Note that I'm running without nvidia-docker2 installed or used by the Docker container. I did remove it, but apparently not completely:

Earth:apt list | grep nvidia-docker

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

nvidia-docker2/bionic,now 2.11.0-1 all [residual-config]
Earth:

Not sure why the Docker container would use nvidia-docker2 if it's not installed, but since there was residual configuration I did a complete removal of the nvidia-docker2 configuration and restarted the Docker container.

IDjtt46V_-log.txt

adefaria avatar Aug 31 '22 14:08 adefaria

Since dropping libnvidia-container-tools down to version 1.9.0-1 I have not seen this issue reoccur; I will report back if that changes.

Sc0th avatar Sep 01 '22 02:09 Sc0th

As I am still seeing CUDA_ERROR_NO_DEVICE, I've downgraded libnvidia-container-tools to version 1.9.0-1 too. Fingers crossed...

Spoke too soon:

Earth:docker start tdarr_node
Error response from daemon: exec: "nvidia-container-runtime-hook": executable file not found in $PATH
Error: failed to start containers: tdarr_node
Earth:sudo find / -mount -type f -name nvidia-container-runtime-hook
Earth:

Hmmm... Now docker won't start. It seems that apt purge nvidia-docker2 also removed nvidia-container-toolkit, which provided /bin/nvidia-container-runtime-hook, which is apparently still required.

More insights: if I downgrade libnvidia-container-tools to 1.9.0-1, then /bin/nvidia-container-runtime-hook goes away, as does nvidia-container-toolkit. If I then install nvidia-container-toolkit, it updates libnvidia-container-tools back to 1.10.0-1.

So, I have tdarr_node running now with nvidia-container-toolkit and libnvidia-container-tools at 1.10.0-1. We'll see what happens, but I suspect I'll still get the CUDA_ERROR_NO_DEVICE error after about a day.

@Sc0th, perhaps you could detail to me what versions/setup you have and if that's still working for you.

adefaria avatar Sep 01 '22 17:09 adefaria

[INF] infra01:~# find / -mount -type f -name nvidia-container-runtime-hook
[INF] infra01:~#

You should really consider moving away from docker, given that it is ~~dead~~ very much on its way out. cri-o & podman are painless.

Sc0th avatar Sep 02 '22 01:09 Sc0th

FWIW I run the tdarr server in a k8s (1.24.4) container (flannel/metallb/cri-o/podman)

[INF] infra01:~# kubectl get node -o wide
NAME      STATUS   ROLES           AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                    KERNEL-VERSION                 CONTAINER-RUNTIME
control   Ready    control-plane   10d   v1.24.4   192.168.60.130   <none>        AlmaLinux 8.6 (Sky Tiger)   4.18.0-372.19.1.el8_6.x86_64   cri-o://1.24.2
node01    Ready    <none>          10d   v1.24.4   192.168.60.131   <none>        AlmaLinux 8.6 (Sky Tiger)   4.18.0-372.19.1.el8_6.x86_64   cri-o://1.24.2
node02    Ready    <none>          10d   v1.24.4   192.168.60.132   <none>        AlmaLinux 8.6 (Sky Tiger)   4.18.0-372.19.1.el8_6.x86_64   cri-o://1.24.2
node03    Ready    <none>          10d   v1.24.4   192.168.60.133   <none>        AlmaLinux 8.6 (Sky Tiger)   4.18.0-372.19.1.el8_6.x86_64   cri-o://1.24.2
node04    Ready    <none>          10d   v1.24.4   192.168.60.134   <none>        AlmaLinux 8.6 (Sky Tiger)   4.18.0-372.19.1.el8_6.x86_64   cri-o://1.24.2
node05    Ready    <none>          10d   v1.24.4   192.168.60.135   <none>        AlmaLinux 8.6 (Sky Tiger)   4.18.0-372.19.1.el8_6.x86_64   cri-o://1.24.2
node06    Ready    <none>          10d   v1.24.4   192.168.60.136   <none>        AlmaLinux 8.6 (Sky Tiger)   4.18.0-372.19.1.el8_6.x86_64   cri-o://1.24.2
node07    Ready    <none>          10d   v1.24.4   192.168.60.138   <none>        AlmaLinux 8.6 (Sky Tiger)   4.18.0-372.19.1.el8_6.x86_64   cri-o://1.24.2

I run the tdarr node on a VM as a non-orchestrated container (cri-o/podman), all running on Alma8 on Proxmox. Restart of the tdarr node takes ~4 seconds.

[INF] infra01:~# time podman restart tdarr-node
41a3418756829d48a774d9760637bdb335543543f6a8edb10d13e9bb1b621291

real    0m4.038s
user    0m0.033s
sys     0m0.023s
[INF] infra01:~#

Server restart ~1.5 seconds

[INF] infra01:~# time admin restart tdarr
deployment.apps/tdarr scaled
deployment.apps/tdarr scaled

real    0m1.496s
user    0m0.217s
sys     0m0.022s

Sc0th avatar Sep 02 '22 01:09 Sc0th