Tdarr
CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Now that I've finally converted all of my videos using Tdarr, I like that I can leave it running and have it compress new videos as they are downloaded. And this works... but then it dies with "CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected". I've configured my Docker container to run Tdarr with the Nvidia runtime, and that part works. But then it breaks with that error. The workaround is simply to restart the Docker container and re-queue the videos, but why does it break in the first place?
Another data point: when a transcode fails in the Docker container, I can replicate the ffmpeg command on the desktop (in other words, outside of the Docker container) and it works fine.
Why does it break in the docker container?
Hmm, seems like a bug between FFmpeg/Docker/hardware; perhaps an FFmpeg update will fix the issue. The dev container has an FFmpeg update, looking to get it out soon.
I can try out the new version when it becomes available. I suspect it may be the Nvidia runtime for Docker containers (I have a foggy memory of installing that to get GPU transcoding to work in a Docker container) and something there breaks down, causing the CUDA_ERROR_NO_DEVICE. Lately it barely transcodes a video or two before suddenly there are no devices left for GPU transcoding. Restarting the Docker container always fixes the problem, but it's essentially babysitting Tdarr, which shouldn't have to be the case.
FWIW I am seeing this exact behavior, happy to supply logs/test if it helps at all; a restart of the Docker container fixes it every time.
I had cron'ed a restart of the Docker container at midnight in an attempt to mitigate this issue but it remains.
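For reference, the cron job is nothing fancy, just a nightly restart in root's crontab (container name is whatever yours is called; mine is tdarr_node):
# Restart the Tdarr node container every night at midnight
0 0 * * * docker restart tdarr_node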
It might be nice if Tdarr allowed access to its database so I could interrogate the state and restart the Docker container if need be, but then again, if the bug is fixed, that would be unnecessary.
Out of curiosity, @Sc0th, what's your environment? Where are you running the Docker container? What OS/machine? And where's your server?
Apologies, I could have made that post slightly more useful!
I also tried the automated reboot, likewise to no avail. I am running the container under podman on a VM on Proxmox with PCI pass-through.
Some (maybe) useful detail:
GPU - Nvidia 1050 Ti
Proxmox 7.2-7
Linux infra01 4.18.0-372.19.1.el8_6.x86_64 #1 SMP Tue Aug 2 13:42:59 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
AlmaLinux 8.6 (Sky Tiger)
nvidia-container-toolkit-1.10.0-1.x86_64
kmod-nvidia-515.65.01-1.el8_6.elrepo.x86_64
podman run --name tdarr-node \
  -v /appdata01/INFRA/tdarr-node/configs:/app/configs \
  -v /appdata01/INFRA/tdarr-node/logs:/app/logs \
  -v /MEDIA:/media \
  -v /appdata01/PROD/tdarr/transcode:/transcode \
  -e "nodeID=infra01-tdarr-node" \
  -e "serverIP=x.x.x.x" -e "serverPort=8266" \
  --net=host -e PUID=1000 -e PGID=1000 \
  -e "NVIDIA_DRIVER_CAPABILITIES=all" -e "NVIDIA_VISIBLE_DEVICES=all" \
  --gpus=all -d ghcr.io/haveagitgat/tdarr_node
Tdarr Server & Node 2.00.18
I am using the following line monitored by Zabbix to alert me when it gets stuck:
cat /appdata01/PROD/tdarr/conf/server/Tdarr/DB2/StatisticsJSONDB/*.json | jq . | grep -ic err
A result higher than 0 indicates it has got stuck. I did look at using this to 'self heal' by restarting the container on trigger; however, the jobs do not appear to requeue automatically, so that did not quite go to plan.
I wait with grateful anticipation of the next release in the hope of a fix!
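For completeness, the trigger action was roughly this (container name from my podman run above):
# Count error entries in the statistics DB; restart the node container if any are found.
# Note: the stuck jobs did not requeue automatically after the restart.
ERRS=$(cat /appdata01/PROD/tdarr/conf/server/Tdarr/DB2/StatisticsJSONDB/*.json | jq . | grep -ic err)
if [ "$ERRS" -gt 0 ]; then
    podman restart tdarr-node
fi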
Yeah, I think it will end up being some interaction between Tdarr and the Nvidia runtime for the Docker container that causes Nvidia to lose track of available devices, thus reporting CUDA_ERROR_NO_DEVICE.
I'm running on my Thelio desktop under Ubuntu 22.04. I have the following:
Earth:apt list | grep nvidia-container
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
libnvidia-container-dev/bionic 1.10.0-1 amd64
libnvidia-container-tools/bionic,now 1.10.0-1 amd64 [installed,automatic]
libnvidia-container1-dbg/bionic 1.10.0-1 amd64
libnvidia-container1/bionic,now 1.10.0-1 amd64 [installed,automatic]
nvidia-container-runtime/bionic 3.10.0-1 all
nvidia-container-toolkit/bionic,now 1.10.0-1 amd64 [installed]
Earth:
Still have this issue; however, today I think I captured a log of Tdarr working on a video when the CUDA_ERROR_NO_DEVICE happened right in the middle. Maybe this log will help in debugging this bug: to7uvYRwN-log.txt
I am trying a downgrade of nvidia-docker2 back to 2.10.0-1 from 2.11.0-1 to see if that makes any difference as a workaround. If not, I may try 2.9.1-1.
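For anyone else wanting to try the same, the downgrade is just an explicit version pin (apt may want to downgrade dependent packages as well):
# Pin nvidia-docker2 back to the earlier version
sudo apt install --allow-downgrades nvidia-docker2=2.10.0-1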
I am having this same issue on Ubuntu 22.04 LTS.
Downgrading nvidia-docker2 to version 2.9.1-1 seems to be a workaround for this issue, at least after a day of testing. If it stops working I will let you know.
Annoying to have my updates reporting a held-back package, but better than Tdarr-node breaking every hour or two.
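If the workaround holds, apt-mark hold at least stops a routine apt upgrade from pulling the newer version back in:
# Keep the downgraded package from being upgraded
sudo apt-mark hold nvidia-docker2
# undo later with: sudo apt-mark unhold nvidia-docker2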
Ok thanks for the update 👍
Thanks. Downgraded to 2.9.1-1. Will report what happens.
You have seen this: 'With the release of Docker 19.03, usage of nvidia-docker2 packages is deprecated since NVIDIA GPUs are now natively supported as devices in the Docker runtime.' (from https://docs.nvidia.com/ai-enterprise/deployment-guide/dg-docker.html)?
I am using podman, with:
nvidia-container-toolkit-1.10.0-1.x86_64
nvidia-x11-drv-515.65.01-1.el8_6.elrepo.x86_64
libnvidia-container1-1.10.0-1.x86_64
nvidia-x11-drv-libs-515.65.01-1.el8_6.elrepo.x86_64
libnvidia-container-tools-1.10.0-1.x86_64
kmod-nvidia-515.65.01-1.el8_6.elrepo.x86_64
And am seeing the same issue.
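If you want a quick sanity check that the native --gpus path works before touching the Tdarr container, something like this should print the usual nvidia-smi table (the CUDA image tag is just an example, and the container toolkit still needs to be installed for the runtime hook):
# One-off test container using Docker's built-in GPU support
docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi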
So are you saying I can just remove nvidia-docker2 and restart the docker container and it'll all work?
That would depend on the version of docker you are running.
Earth:apt list | grep ^docker
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
docker-clean/jammy,jammy 2.0.4-4 all
docker-compose/jammy,jammy 1.29.2-1 all
docker-doc/jammy,jammy 20.10.12-0ubuntu4 all
docker-registry/jammy 2.8.0+ds1-4 amd64
docker.io/jammy,now 20.10.12-0ubuntu4 amd64 [installed]
docker2aci/jammy 0.17.2+dfsg-2.1 amd64
docker/jammy,jammy 1.5-2 all
Earth:
Not massively familiar with the 'apt' set of commands, but would that not show what is available rather than what is installed? apt list --installed | grep -i docker might work? Or perhaps docker --version?
Edit: duh, I just saw the [installed]. I guess that means version 20.x, so in theory...
Earth:apt list --installed | grep -i docker
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
docker.io/jammy,now 20.10.12-0ubuntu4 amd64 [installed]
nvidia-docker2/bionic,now 2.9.1-1 all [installed,upgradable to: 2.11.0-1]
Earth:
The [installed] marks these as installed. Note the downgrade of nvidia-docker2 to 2.9.1-1 from 2.11.0-1.
Apologies, noticed that a little late. I have downgraded my installs by one release also; we'll see if it has any impact...
nvidia-x11-drv-515.65.01-1.el8_6.elrepo.x86_64
libnvidia-container-tools-1.9.0-1.x86_64
libnvidia-container1-1.9.0-1.x86_64
nvidia-x11-drv-libs-515.65.01-1.el8_6.elrepo.x86_64
kmod-nvidia-515.65.01-1.el8_6.elrepo.x86_64
nvidia-container-toolkit-1.9.0-1.x86_64
podman version 4.1.1
🤞
OK, I think I have got it working WITHOUT the nvidia-docker2 package at all.
It requires docker-compose 1.29+ and docker version 20.x to use the built-in NVIDIA functionality.
The compose file needs to be modified to let it know about the GPU(s):
tdarr-node:
  container_name: tdarr-node
  image: haveagitgat/tdarr_node:latest
  restart: unless-stopped
  network_mode: service:tdarr
  # runtime: nvidia                     # Comment this out. Not needed with the built-in NVIDIA support
  deploy:                               # ADD this section
    resources:
      reservations:
        devices:
          - capabilities: [gpu]
  environment:
    - TZ=America/New_York
    - PUID=1000
    - PGID=1000
    - UMASK_SET=002
    - nodeID=Odin
    - nodeIP=0.0.0.0
    - nodePort=8267
    - serverIP=0.0.0.0
    - serverPort=8266
    - NVIDIA_VISIBLE_DEVICES=all        # Not sure if these are still needed
    - NVIDIA_DRIVER_CAPABILITIES=all
  volumes:
    - <Your volumes here>
  depends_on:
    - tdarr
So far (just an hour and 3-4 transcodes) the GPU is still working in the node.
Thanks to @Sc0th for pointing out that nvidia-docker2 is deprecated. I recommend that the examples and documentation be updated to match the new approach and eliminate the need for nvidia-docker2, at least on newer versions of docker.
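For anyone sticking with docker run instead of compose, the equivalent should just be the --gpus flag; untested on my side, and the paths below are placeholders:
# docker run equivalent of the compose snippet above (adjust volumes, IDs and server address)
docker run -d --name tdarr-node --gpus all \
  -e PUID=1000 -e PGID=1000 \
  -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e serverIP=x.x.x.x -e serverPort=8266 \
  -v /path/to/configs:/app/configs -v /path/to/media:/media \
  ghcr.io/haveagitgat/tdarr_node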
How are you verifying that the GPUs are being used? I ask because I had to compile up nvtop to monitor GPU usage.
Also, when I recreate my Tdarr_node container (I use docker run), it basically does a chown for about an hour, which I find pointless. It seems to be chowning the database to set the owner to 1000:1000, but with 31k entries in the database from transcoding my entire library, it takes around an hour before the process stops and the node is registered. Until this finishes, the Tdarr node is not registered and does not show up in the web interface.
Do you experience this too?
I kicked off a transcode and ran nvidia-smi in another window. The ffmpeg process showed up in the process list. Plus the transcode plug-in I use only works with NVENC.
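If you don't want to build nvtop, watching nvidia-smi in a loop while a job runs also shows the ffmpeg process and encoder utilisation:
# Poll GPU usage once a second during a transcode
watch -n 1 nvidia-smi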
As for the chown: no, but all my files were already owned by 1000:1000. I just copied a chunk from my compose file, so it would need to be modified to match your configuration. Your tdarr server is likely not named Odin either.
Correction: the script I have that does docker run first stops the old container, removes it, then uses docker run to re-create the container. I believe this is what causes the long-running chown.
Not sure what you mean by tdarr server being named Odin...
The "NodeID" in my chunk of compose file above. The server I run this on is named "Odin" and it says that in the file. You will likely want to change that name.
Also: really consider moving to docker-compose from docker run. There is a learning curve but MAN is it easier to manage once you get it working.
Not looking to move over to docker compose. I have a simple script. I run it once. Tdarr should then run in a container in the background with no more input from me.
There's also podman or something like that and lots of other Docker technologies. I'm not really that interested in having yet another learning curve. I have enough of them already.
On the plus side, Tdarr node seems to be holding up without that pesky nvidia-docker2 module...
Alas, this seemed to be working but just failed today. Note that I'm running without nvidia-docker2 installed or used by the Docker container. I did remove the package, but it turns out I didn't fully purge it.
Earth:apt list | grep nvidia-docker
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
nvidia-docker2/bionic,now 2.11.0-1 all [residual-config]
Earth:
Not sure why the Docker container would use nvidia-docker2 if it's not installed, but there was residual configuration, so I did a complete removal of the nvidia-docker2 configuration and restarted the Docker container.
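In apt terms the cleanup was roughly:
dpkg -l | grep nvidia-docker     # an 'rc' status means removed but config files remain
sudo apt purge nvidia-docker2    # purge also removes the residual configuration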
Since dropping libnvidia-container-tools down to version 1.9.0-1 I have not seen this issue reoccur; will report back if that changes.
As I am still seeing CUDA_ERROR_NO_DEVICE, I've downgraded libnvidia-container-tools down to version 1.9.0-1 too. Fingers crossed...
Spoke too soon:
Earth:docker start tdarr_node
Error response from daemon: exec: "nvidia-container-runtime-hook": executable file not found in $PATH
Error: failed to start containers: tdarr_node
Earth:sudo find / -mount -type f -name nvidia-container-runtime-hook
Earth:
Hmmm... now docker won't start. It seems doing an apt purge nvidia-docker2 also removed nvidia-container-toolkit, which provided /bin/nvidia-container-runtime-hook, which is apparently still required.
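Getting the hook back was just a reinstall of the toolkit; the daemon restart is my assumption of good practice rather than something I verified as strictly required:
sudo apt install nvidia-container-toolkit   # restores /bin/nvidia-container-runtime-hook
sudo systemctl restart docker               # probably prudent after changing the runtime hook
docker start tdarr_node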
More insights: if I downgrade libnvidia-container-tools to 1.9.0-1, then /bin/nvidia-container-runtime-hook goes away, as does nvidia-container-toolkit. If I then install nvidia-container-toolkit, it updates libnvidia-container-tools to 1.10.0-1.
So, I have tdarr_node running now with nvidia-container-toolkit and libnvidia-container-tools set to 1.10.0-1. We'll see what happens, but I think I'm still getting the CUDA_ERROR_NO_DEVICE error after about a day.
@Sc0th, perhaps you could detail to me what versions/setup you have and if that's still working for you.
[INF] infra01:~# find / -mount -type f -name nvidia-container-runtime-hook
[INF] infra01:~#
You should really consider moving away from docker, given that it is ~~dead~~ very much on its way out. cri-o & podman are painless.
FWIW I run the tdarr server in a k8s (v1.24.4) cluster (flannel/MetalLB/cri-o/podman):
[INF] infra01:~# kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
control Ready control-plane 10d v1.24.4 192.168.60.130 <none> AlmaLinux 8.6 (Sky Tiger) 4.18.0-372.19.1.el8_6.x86_64 cri-o://1.24.2
node01 Ready <none> 10d v1.24.4 192.168.60.131 <none> AlmaLinux 8.6 (Sky Tiger) 4.18.0-372.19.1.el8_6.x86_64 cri-o://1.24.2
node02 Ready <none> 10d v1.24.4 192.168.60.132 <none> AlmaLinux 8.6 (Sky Tiger) 4.18.0-372.19.1.el8_6.x86_64 cri-o://1.24.2
node03 Ready <none> 10d v1.24.4 192.168.60.133 <none> AlmaLinux 8.6 (Sky Tiger) 4.18.0-372.19.1.el8_6.x86_64 cri-o://1.24.2
node04 Ready <none> 10d v1.24.4 192.168.60.134 <none> AlmaLinux 8.6 (Sky Tiger) 4.18.0-372.19.1.el8_6.x86_64 cri-o://1.24.2
node05 Ready <none> 10d v1.24.4 192.168.60.135 <none> AlmaLinux 8.6 (Sky Tiger) 4.18.0-372.19.1.el8_6.x86_64 cri-o://1.24.2
node06 Ready <none> 10d v1.24.4 192.168.60.136 <none> AlmaLinux 8.6 (Sky Tiger) 4.18.0-372.19.1.el8_6.x86_64 cri-o://1.24.2
node07 Ready <none> 10d v1.24.4 192.168.60.138 <none> AlmaLinux 8.6 (Sky Tiger) 4.18.0-372.19.1.el8_6.x86_64 cri-o://1.24.2
I run the tdarr node on a VM as a non-orchestrated container (cri-o/podman), all running on Alma 8 sitting on Proxmox. A restart of the tdarr node takes ~4 seconds.
[INF] infra01:~# time podman restart tdarr-node
41a3418756829d48a774d9760637bdb335543543f6a8edb10d13e9bb1b621291
real 0m4.038s
user 0m0.033s
sys 0m0.023s
[INF] infra01:~#
Server restart ~1.5 seconds
[INF] infra01:~# time admin restart tdarr
deployment.apps/tdarr scaled
deployment.apps/tdarr scaled
real 0m1.496s
user 0m0.217s
sys 0m0.022s