
nvidia-device-plugin-daemonset CrashLoopBackOff in TrueNAS SCALE Dragonfish

Open · jmkgreen opened this issue 1 year ago • 8 comments

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): TrueNAS SCALE Dragonfish-24.04.1.1
  • Kernel Version: 6.6.29-production+truenas
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): k3s

2. Issue or feature description

Attempting to use a P2000 card as a hardware transcoder: the card is listed for VM isolation purposes, but not for container sharing. It seems the pod nvidia-device-plugin-daemonset-XXXXX is being restarted by something unknown.

I've approached the wider TrueNAS SCALE community - a number of people are reporting the problem with the latest SCALE version, yet others are fine. I filed a bug with iXsystems, who promptly pointed me back to the forums (my post). I'm quite happy to debug my system, but my background is Ubuntu/Docker, so I need help with K3s/K8s here. Tell me what to type and I'll report back.
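
For reference, here is a minimal sketch of the commands I can start with and paste back (the pod name is copied from the logs further down; it changes on every restart):

# Pod status, restart count and last termination reason as Kubernetes recorded them
sudo k3s kubectl -n kube-system describe pod nvidia-device-plugin-daemonset-k58cb

# Recent events in kube-system, which usually record what stopped or rescheduled the pod
sudo k3s kubectl -n kube-system get events --sort-by=.metadata.creationTimestamp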

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • [ ] The output of nvidia-smi -a on your host
root@truenas[~]# nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Mon Jul  8 23:06:00 2024
Driver Version                            : 545.23.08
CUDA Version                              : 12.3

Attached GPUs                             : 1
GPU 00000000:04:00.0
   Product Name                          : Quadro P2000
   Product Brand                         : Quadro
   Product Architecture                  : Pascal
   Display Mode                          : Disabled
   Display Active                        : Disabled
   Persistence Mode                      : Disabled
   Addressing Mode                       : N/A
   MIG Mode
       Current                           : N/A
       Pending                           : N/A
   Accounting Mode                       : Disabled
   Accounting Mode Buffer Size           : 4000
   Driver Model
       Current                           : N/A
       Pending                           : N/A

Let me know if you need anything further.

  • [ ] Your docker configuration file (e.g: /etc/docker/daemon.json). This might be the relevant file here:
root@truenas[...applications/k3s/agent/etc/containerd]# cat config.toml 

# File generated by k3s. DO NOT EDIT. Use config.toml.tmpl instead.
version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/mnt/orion/ix-applications/k3s/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

  • [ ] The k8s-device-plugin container logs
truenas% sudo k3s kubectl logs -p pod/nvidia-device-plugin-daemonset-k58cb --namespace=kube-system
2024/07/04 20:22:39 Starting FS watcher.
2024/07/04 20:22:39 Starting OS watcher.
2024/07/04 20:22:39 Starting Plugins.
2024/07/04 20:22:39 Loading configuration.
2024/07/04 20:22:39 Updating config with default resource matching patterns.
2024/07/04 20:22:39 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "devices": "all",
          "replicas": 5
        }
      ]
    }
  }
}
2024/07/04 20:22:39 Retreiving plugins.
2024/07/04 20:22:39 Detected NVML platform: found NVML library
2024/07/04 20:22:39 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2024/07/04 20:22:39 Starting GRPC server for 'nvidia.com/gpu'
2024/07/04 20:22:39 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2024/07/04 20:22:39 Registered device plugin for 'nvidia.com/gpu' with Kubelet
2024/07/04 20:23:39 Received signal "terminated", shutting down.
2024/07/04 20:23:39 Stopping plugins.
2024/07/04 20:23:39 Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock

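The log above shows the plugin registering cleanly and then receiving SIGTERM about a minute later, so whatever restarts it is outside the container. A sketch of what I plan to check next (the DaemonSet name is inferred from the pod names above, and I'm assuming SCALE runs k3s as a systemd unit named k3s):

# Look for liveness/readiness probes or a recent rollout in the DaemonSet spec
sudo k3s kubectl -n kube-system get daemonset nvidia-device-plugin-daemonset -o yaml

# k3s runs the kubelet in-process, so its service log should say why the pod was stopped
# (assumes the k3s systemd unit name on SCALE)
sudo journalctl -r -u k3s | grep -i nvidia-device-plugin | head -n 50
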
  • [ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet). This appears to repeat itself:
root@truenas[.../ix-applications/k3s/agent/containerd]# grep f19967bd-8ec3-4227-9449-9e5a51dc752e containerd.log 
time="2024-07-08T22:11:43.936476667+01:00" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:nvidia-device-plugin-daemonset-bdx5d,Uid:f19967bd-8ec3-4227-9449-9e5a51dc752e,Namespace:kube-system,Attempt:0,}"
I0708 22:11:44.249718    7550 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"nvidia-device-plugin-daemonset-bdx5d", UID:"f19967bd-8ec3-4227-9449-9e5a51dc752e", APIVersion:"v1", ResourceVersion:"545238", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [172.16.5.69/16] from ix-net
time="2024-07-08T22:11:45.347893032+01:00" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:nvidia-device-plugin-daemonset-bdx5d,Uid:f19967bd-8ec3-4227-9449-9e5a51dc752e,Namespace:kube-system,Attempt:0,} returns sandbox id \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""

Alongside:

root@truenas[.../ix-applications/k3s/agent/containerd]# grep 7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc containerd.log 
time="2024-07-08T22:11:45.347893032+01:00" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:nvidia-device-plugin-daemonset-bdx5d,Uid:f19967bd-8ec3-4227-9449-9e5a51dc752e,Namespace:kube-system,Attempt:0,} returns sandbox id \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""
time="2024-07-08T22:11:45.349271612+01:00" level=info msg="CreateContainer within sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" for container &ContainerMetadata{Name:nvidia-device-plugin-ctr,Attempt:0,}"
time="2024-07-08T22:11:46.443324067+01:00" level=info msg="CreateContainer within sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" for &ContainerMetadata{Name:nvidia-device-plugin-ctr,Attempt:0,} returns container id \"3609fcda6afb625b8a195815bf146a67563f22be0ea4cf01ac3f1722368709fb\""
time="2024-07-08T22:11:58.839012991+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""
time="2024-07-08T22:12:03.888882992+01:00" level=info msg="shim disconnected" id=7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc namespace=k8s.io
time="2024-07-08T22:12:03.888929760+01:00" level=warning msg="cleaning up after shim disconnected" id=7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc namespace=k8s.io
time="2024-07-08T22:12:04.024555173+01:00" level=info msg="TearDown network for sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" successfully"
time="2024-07-08T22:12:04.024579749+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" returns successfully"
time="2024-07-08T22:12:04.748771828+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""
time="2024-07-08T22:12:04.771453268+01:00" level=info msg="TearDown network for sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" successfully"
time="2024-07-08T22:12:04.771471662+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" returns successfully"
time="2024-07-08T22:13:25.936552697+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""
time="2024-07-08T22:13:25.955115984+01:00" level=info msg="TearDown network for sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" successfully"
time="2024-07-08T22:13:25.955143075+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" returns successfully"
time="2024-07-08T22:13:26.933607807+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""
time="2024-07-08T22:13:26.953210280+01:00" level=info msg="TearDown network for sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" successfully"
time="2024-07-08T22:13:26.953226791+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" returns successfully"
time="2024-07-08T22:13:36.266098176+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""
time="2024-07-08T22:13:36.297595621+01:00" level=info msg="TearDown network for sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" successfully"
time="2024-07-08T22:13:36.297611210+01:00" level=info msg="StopPodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" returns successfully"
time="2024-07-08T22:13:36.297799182+01:00" level=info msg="RemovePodSandbox for \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""
time="2024-07-08T22:13:36.297817045+01:00" level=info msg="Forcibly stopping sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\""
time="2024-07-08T22:13:36.316115176+01:00" level=info msg="TearDown network for sandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" successfully"
time="2024-07-08T22:13:36.398557138+01:00" level=warning msg="Failed to get podSandbox status for container event for sandboxID \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\": an error occurred when try to find sandbox: not found. Sending the event with nil podSandboxStatus."
time="2024-07-08T22:13:36.398630465+01:00" level=info msg="RemovePodSandbox \"7f0de1c3f6d5e7a8eabc4da89c9b4df3fbee755bd9f0a53cc5d4d546df8544dc\" returns successfully"

Additional information that might help better understand your environment and reproduce the bug:

  • [ ] Docker version from docker version
  • [ ] Docker command, image and tag used
  • [ ] Kernel version from uname -a
  • [ ] Any relevant kernel output lines from dmesg
[    8.435667] nvidia-nvlink: Nvlink Core is being initialized, major device number 243

[    8.437103] nvidia 0000:04:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=io+mem
[    8.554009] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  545.23.08  Mon Nov  6 23:49:37 UTC 2023
[    8.581595] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  545.23.08  Mon Nov  6 23:23:07 UTC 2023
[    8.586897] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[    8.587354] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 0

  • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • [ ] NVIDIA container library version from nvidia-container-cli -V
root@truenas[.../ix-applications/k3s/agent/containerd]# nvidia-container-cli -V
cli-version: 1.14.3
lib-version: 1.14.3
build date: 2023-10-19T11:32+00:00
build revision: 1eb5a30a6ad0415550a9df632ac8832bf7e2bbba
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
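
In case it helps, the same CLI can also report what it actually sees on the host; a sketch using its standard subcommands:

# GPU and driver as seen by the container library
sudo nvidia-container-cli info

# Device nodes, binaries and libraries it would inject into a container
sudo nvidia-container-cli list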

jmkgreen · Jul 08 '24 22:07