k8s-device-plugin
nvidia-device-plugin-daemonset CrashLoopBackOff in TrueNAS SCALE Dragonfish
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): TrueNAS SCALE Dragonfish-24.04.1.1
- Kernel Version: 6.6.29-production+truenas
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): k3s
2. Issue or feature description
I'm attempting to use a P2000 card as a hardware transcoder. The GPU is listed for VM isolation purposes but not for container sharing, and the nvidia-device-plugin-daemonset-XXXXX pod keeps being restarted by something unknown.
I've approached the wider TrueNAS SCALE community: a number of people are reporting the problem with the latest SCALE version, yet others are fine. I filed a bug with iXsystems, who promptly pointed me back to the forums (my post). I'm quite happy to debug my system, but my background is Ubuntu/Docker, so I need help with K3s/K8s here. Tell me what to type and I'll feed back.
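To get the ball rolling, here is what I intend to run first (the pod name is taken from the logs further down and changes on every restart, so treat it as a placeholder):

# List the device-plugin pods and their restart counts
sudo k3s kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin
# "Last State", "Reason" and "Exit Code" here should say why the previous instance stopped
sudo k3s kubectl describe pod -n kube-system nvidia-device-plugin-daemonset-k58cb

Happy to paste the full output of either command if that's useful.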
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- [ ] The output of nvidia-smi -a on your host
root@truenas[~]# nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Mon Jul 8 23:06:00 2024
Driver Version : 545.23.08
CUDA Version : 12.3
Attached GPUs : 1
GPU 00000000:04:00.0
Product Name : Quadro P2000
Product Brand : Quadro
Product Architecture : Pascal
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : N/A
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Let me know if you need further.
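If it helps, I can also verify that the NVIDIA container stack sees the card outside of Kubernetes. These are suggestions on my side rather than output I've already captured (nvidia-container-cli itself is clearly installed, see its -V output further down):

# Driver/CUDA details as seen by the NVIDIA container library
nvidia-container-cli info
# Device nodes and driver libraries the library would mount into a container
nvidia-container-cli list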
- [ ] Your docker configuration file (e.g. /etc/docker/daemon.json). This might be the relevant file here:
root@truenas[...applications/k3s/agent/etc/containerd]# cat config.toml
# File generated by k3s. DO NOT EDIT. Use config.toml.tmpl instead.
version = 2
[plugins."io.containerd.internal.v1.opt"]
path = "/mnt/orion/ix-applications/k3s/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
stream_server_address = "127.0.0.1"
stream_server_port = "10010"
enable_selinux = false
enable_unprivileged_ports = true
enable_unprivileged_icmp = true
sandbox_image = "rancher/mirrored-pause:3.6"
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
disable_snapshot_annotations = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
BinaryName = "/usr/bin/nvidia-container-runtime"
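Since the nvidia runtime is registered here but runc remains the default, a check I can run (my own suggestion, not something the template asks for) is whether a RuntimeClass exists and whether the device-plugin pod actually requests it:

# Any RuntimeClass objects defined in the cluster (one named "nvidia" would be the usual convention)
sudo k3s kubectl get runtimeclass
# The runtimeClassName the device-plugin pod runs with; empty output means it fell back to the default runc runtime
sudo k3s kubectl get pod -n kube-system nvidia-device-plugin-daemonset-k58cb -o jsonpath='{.spec.runtimeClassName}'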
- [ ] The k8s-device-plugin container logs
truenas% sudo k3s kubectl logs -p pod/nvidia-device-plugin-daemonset-k58cb --namespace=kube-system
2024/07/04 20:22:39 Starting FS watcher.
2024/07/04 20:22:39 Starting OS watcher.
2024/07/04 20:22:39 Starting Plugins.
2024/07/04 20:22:39 Loading configuration.
2024/07/04 20:22:39 Updating config with default resource matching patterns.
2024/07/04 20:22:39
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": true,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": "envvar",
"deviceIDStrategy": "uuid"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {
"resources": [
{
"name": "nvidia.com/gpu",
"devices": "all",
"replicas": 5
}
]
}
}
}
2024/07/04 20:22:39 Retreiving plugins.
2024/07/04 20:22:39 Detected NVML platform: found NVML library
2024/07/04 20:22:39 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2024/07/04 20:22:39 Starting GRPC server for 'nvidia.com/gpu'
2024/07/04 20:22:39 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2024/07/04 20:22:39 Registered device plugin for 'nvidia.com/gpu' with Kubelet
2024/07/04 20:23:39 Received signal "terminated", shutting down.
2024/07/04 20:23:39 Stopping plugins.
2024/07/04 20:23:39 Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
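One thing that stands out in this log: the plugin registers with the kubelet at 20:22:39 and receives SIGTERM exactly one minute later, so the container isn't crashing by itself; something is telling it to stop. To narrow down what, I can check the recorded events and the DaemonSet status (again, suggestions rather than captured output):

# Recent events in kube-system, which normally record who killed or rescheduled the pod
sudo k3s kubectl get events -n kube-system --sort-by=.lastTimestamp | grep -i nvidia
# Desired vs. ready counts for the DaemonSet itself, in case the controller is recreating pods
sudo k3s kubectl get daemonset -n kube-system | grep -i nvidia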
- [ ] The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet). This appears to repeat itself:
root@truenas[.../ix-applications/k3s/agent/containerd]# grep f19967bd-8ec3-4227-9449-9e5a51dc752e containerd.log
time="2024-07-08T22:11:43.936476667+01:00" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:nvidia-device-plugin-daemonset-bdx5d,Uid:f19967bd-8ec3-4227-9449-9e5a51dc752e,Namespace:kube-system,Attempt:0,}"
I0708 22:11:44.249718 7550 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"nvidia-device-plugin-daemonset-bdx5d", UID:"f19967bd-8ec3-4227-9449-9e5a51dc752e", APIVersion:"v1", ResourceVersion:"545238", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [172.16.5.69/16] from ix-net
time="2024-07-08T22:11:45.347893032+01:00" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:nvidia-device-plugin-daemonset-bdx5d,Uid:f19967bd-8ec3-4227-9449-9e5a51dc752e,Namespace:kube-system,Attempt:0,} returns sandbox id \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""
Alongside:
root@truenas[.../ix-applications/k3s/agent/containerd]# grep 7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc containerd.log
time="2024-07-08T22:11:45.347893032+01:00" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:nvidia-device-plugin-daemonset-bdx5d,Uid:f19967bd-8ec3-4227-9449-9e5a51dc752e,Namespace:kube-system,Attempt:0,} returns sandbox id \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""
time="2024-07-08T22:11:45.349271612+01:00" level=info msg="CreateContainer within sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" for container &ContainerMetadata{Name:nvidia-device-plugin-ctr,Attempt:0,}"
time="2024-07-08T22:11:46.443324067+01:00" level=info msg="CreateContainer within sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" for &ContainerMetadata{Name:nvidia-device-plugin-ctr,Attempt:0,} returns container id \"3609fcda6afb625b8a195815bf146a67563f22be0ea4cf01ac3f1722368709fb\""
time="2024-07-08T22:11:58.839012991+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""
time="2024-07-08T22:12:03.888882992+01:00" level=info msg="shim disconnected" id=7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc namespace=k8s.io
time="2024-07-08T22:12:03.888929760+01:00" level=warning msg="cleaning up after shim disconnected" id=7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc namespace=k8s.io
time="2024-07-08T22:12:04.024555173+01:00" level=info msg="TearDown network for sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" successfully"
time="2024-07-08T22:12:04.024579749+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" returns successfully"
time="2024-07-08T22:12:04.748771828+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""
time="2024-07-08T22:12:04.771453268+01:00" level=info msg="TearDown network for sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" successfully"
time="2024-07-08T22:12:04.771471662+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" returns successfully"
time="2024-07-08T22:13:25.936552697+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""
time="2024-07-08T22:13:25.955115984+01:00" level=info msg="TearDown network for sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" successfully"
time="2024-07-08T22:13:25.955143075+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" returns successfully"
time="2024-07-08T22:13:26.933607807+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""
time="2024-07-08T22:13:26.953210280+01:00" level=info msg="TearDown network for sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" successfully"
time="2024-07-08T22:13:26.953226791+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" returns successfully"
time="2024-07-08T22:13:36.266098176+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""
time="2024-07-08T22:13:36.297595621+01:00" level=info msg="TearDown network for sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" successfully"
time="2024-07-08T22:13:36.297611210+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" returns successfully"
time="2024-07-08T22:13:36.297799182+01:00" level=info msg="RemovePodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""
time="2024-07-08T22:13:36.297817045+01:00" level=info msg="Forcibly stopping sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""
time="2024-07-08T22:13:36.316115176+01:00" level=info msg="TearDown network for sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" successfully"
time="2024-07-08T22:13:36.398557138+01:00" level=warning msg="Failed to get podSandbox status for container event for sandboxID \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\": an error occurred when try to find sandbox: not found. Sending the event with nil podSandboxStatus."
time="2024-07-08T22:13:36.398630465+01:00" level=info msg="RemovePodSandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" returns successfully"
Additional information that might help better understand your environment and reproduce the bug:
- [ ] Docker version from docker version
- [ ] Docker command, image and tag used
- [ ] Kernel version from uname -a
- [ ] Any relevant kernel output lines from dmesg
[ 8.435667] nvidia-nvlink: Nvlink Core is being initialized, major device number 243
[ 8.437103] nvidia 0000:04:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=io+mem
[ 8.554009] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 545.23.08 Mon Nov 6 23:49:37 UTC 2023
[ 8.581595] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 545.23.08 Mon Nov 6 23:23:07 UTC 2023
[ 8.586897] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[ 8.587354] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 0
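Nothing alarming there from boot time, but I can re-check dmesg around the moment of a restart in case the driver logs a late fault (Xid messages are how the NVIDIA kernel driver reports runtime errors):

# Human-readable timestamps, filtered for NVIDIA driver messages and Xid faults
dmesg -T | grep -i -E 'nvrm|xid'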
- [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
- [ ] NVIDIA container library version from nvidia-container-cli -V
root@truenas[.../ix-applications/k3s/agent/containerd]# nvidia-container-cli -V
cli-version: 1.14.3
lib-version: 1.14.3
build date: 2023-10-19T11:32+00:00
build revision: 1eb5a30a6ad0415550a9df632ac8832bf7e2bbba
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections