k8s-device-plugin

GPU gets marked as unhealthy on systemctl daemon-reloads + kubelet restarts (on Kubernetes Upgrades)

Open sstrk opened this issue 2 years ago • 5 comments

1. Issue or feature description

During a Kubernetes upgrade, or more specifically after a systemctl daemon-reload followed by a kubelet restart, the nvdp-Pod marks the GPU as unhealthy but does not crash or fail. As a result, the GPU becomes unavailable. Because the Pod merely marks the GPU as unhealthy while it keeps running and looking healthy itself, one cannot tell that the GPU is not working without checking the logs of the nvdp-Pod.

To fix the situation, one can delete the corresponding nvdp-Pod. A new one is created (the nvdp-Pods are managed by a DaemonSet) and the GPU becomes available again.
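A minimal sketch of that workaround, assuming the plugin runs in the kube-system namespace with the label used by the upstream static manifests (both assumptions may differ per deployment):

# Find the device plugin pod on the affected node (namespace and label are assumptions)
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide

# Delete it; the DaemonSet recreates it and the GPU is advertised as healthy again
kubectl -n kube-system delete pod <nvdp-pod-name>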

We propose adding a liveness probe to the nvdp-Pod that checks whether the GPU is considered healthy and fails if it is not, so that the kubelet restarts the Pod. However, we are not entirely sure what impact a restart of the nvdp-Pod has on existing workloads.
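As a rough illustration of the idea (not a tested configuration), an exec liveness probe could be patched onto the DaemonSet. The DaemonSet and container names, the namespace, and the availability of nvidia-smi inside the plugin container are all assumptions, and running nvidia-smi only approximates the plugin's own notion of device health:

# Hypothetical sketch: add an exec liveness probe to the plugin container.
# Names and the presence of nvidia-smi in the container are assumptions.
kubectl -n kube-system patch daemonset nvidia-device-plugin-daemonset --patch '
spec:
  template:
    spec:
      containers:
      - name: nvidia-device-plugin-ctr
        livenessProbe:
          exec:
            command: ["nvidia-smi", "-L"]
          initialDelaySeconds: 30
          periodSeconds: 60
'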

2. Steps to reproduce the issue

  • Create a Kubernetes cluster (we confirmed this problem on K8s v1.24 and K8s v1.25) with GPU workers
  • Verify the nvdp-Pod is working
  • Jump on the GPU host
  • systemctl daemon-reload
  • systemctl restart kubelet
  • Check the nvdp-Pod logs (a condensed snippet for these steps follows this list)
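Condensed into commands (namespaces, labels, and node names are assumptions):

# On the GPU node: trigger the reload and kubelet restart
systemctl daemon-reload
systemctl restart kubelet

# From a machine with cluster access: the plugin logs show the device
# being marked unhealthy (namespace and label are assumptions)
kubectl -n kube-system logs -l name=nvidia-device-plugin-ds --tail=20

# The drop in allocatable nvidia.com/gpu on the node confirms the GPU is gone
kubectl describe node <gpu-node> | grep -A 7 Allocatable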

3. Information to attach (optional if deemed irrelevant)

nvdp logs:

  }
}
I0724 06:37:04.083629       1 main.go:256] Retreiving plugins.
I0724 06:37:04.084401       1 factory.go:107] Detected NVML platform: found NVML library
I0724 06:37:04.084467       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0724 06:37:04.106831       1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0724 06:37:04.107703       1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0724 06:37:04.110323       1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I0724 06:48:39.381063       1 main.go:202] inotify: /var/lib/kubelet/device-plugins/kubelet.sock created, restarting.
I0724 06:48:39.382317       1 main.go:294] Stopping plugins.
I0724 06:48:39.382373       1 server.go:142] Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0724 06:48:39.382576       1 main.go:176] Starting Plugins.
I0724 06:48:39.382587       1 main.go:234] Loading configuration.
I0724 06:48:39.382913       1 main.go:242] Updating config with default resource matching patterns.
I0724 06:48:39.383190       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0724 06:48:39.383199       1 main.go:256] Retreiving plugins.
I0724 06:48:39.383365       1 factory.go:107] Detected NVML platform: found NVML library
I0724 06:48:39.383407       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0724 06:48:39.385318       1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0724 06:48:39.388256       1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0724 06:48:39.395465       1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I0724 06:58:18.693207       1 main.go:202] inotify: /var/lib/kubelet/device-plugins/kubelet.sock created, restarting.
I0724 06:58:18.693347       1 main.go:294] Stopping plugins.
I0724 06:58:18.693390       1 server.go:142] Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0724 06:58:18.693627       1 main.go:176] Starting Plugins.
I0724 06:58:18.693655       1 main.go:234] Loading configuration.
I0724 06:58:18.694173       1 main.go:242] Updating config with default resource matching patterns.
I0724 06:58:18.694783       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0724 06:58:18.694906       1 main.go:256] Retreiving plugins.
I0724 06:58:18.695315       1 factory.go:107] Detected NVML platform: found NVML library
I0724 06:58:18.695362       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0724 06:58:18.697166       1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0724 06:58:18.698602       1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0724 06:58:18.706961       1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I0724 06:58:18.707468       1 health.go:126] Marking device GPU-aa8e5d5f-b2cd-6464-cbec-3bc0feaddce0 as unhealthy: Unknown Error
I0724 06:58:18.707579       1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-aa8e5d5f-b2cd-6464-cbec-3bc0feaddce0

sstrk avatar Jul 24 '23 07:07 sstrk

Please see this notice from February: https://github.com/NVIDIA/nvidia-docker/issues/1730

The long-term fix is to start using CDI instead of the legacy GPU container stack (which we have support for in the device plugin, but unfortunately it’s not well documented yet).
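For anyone who wants to experiment with this, a rough sketch of switching the plugin to a CDI-based device list strategy via the Helm chart; the value name and the cdi-annotations strategy are assumptions based on recent plugin releases, and a working CDI setup on the node (nvidia-container-toolkit) is a prerequisite:

# Hypothetical sketch: enable a CDI-based device list strategy.
# Check the chart values of your plugin version before using this.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --set deviceListStrategy=cdi-annotations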

klueska avatar Jul 24 '23 07:07 klueska

What I mention above should fix the underlying issue of the GPU going unhealthy. We are also looking to improve the error-handling story of the plugin (as you suggest), but we don't have a concrete design in place yet. The overall idea is to leverage DCGM (when available) and fall back to (improved) NVML monitoring with better visibility when something goes wrong.

klueska avatar Jul 24 '23 07:07 klueska

Hello @klueska ,

thanks for that quick response and sorry for my delayed answer!

Also thanks for pointing me to https://github.com/NVIDIA/nvidia-docker/issues/1730; I was not aware of that issue. We'll investigate the workarounds suggested there. The only option for our specific case seems to be switching to the GPU operator. The udev rule did not work out in a test scenario (although I have to admit there may be a problem on our side); we'll dig deeper. Downgrading containerd is not an option for us due to compliance reasons.

Regarding the fourth option, disabling systemd cgroup management in containerd: we observed some time ago that this causes problems which, in some circumstances, lead to an unusable GPU: https://gitlab.com/yaook/k8s/-/commit/36dbbaeb6e091e8daa81ab892ed3fdf1bb1fe4d5

sstrk avatar Aug 03 '23 13:08 sstrk

@klueska I can confirm that the issue still exists with nvidia-container-toolkit==1.12.0-1 and symlinks present. We investigated this deeper and came to the following conclusions:

  1. There are two separate problems (causes) leading to unhealthy GPUs:
     a. reloading the systemd daemon (systemctl daemon-reload) while the /dev/char symlinks are missing
     b. restarting the kubelet service (even when the symlinks are present)
  2. Problem "a" is fixed by executing sudo nvidia-ctk system create-dev-char-symlinks --create-all with toolkit version 1.12 or later (as stated in NVIDIA/nvidia-docker#1730); see the commands after this list.
  3. Problem "b" is still reproducible by restarting the kubelet via systemctl restart kubelet. It has the same effect on the GPUs: they become unhealthy in existing containers, and any subsequent call of nvidia-smi inside these containers fails with "Unknown error". The same error messages as in the issue description appear in the nvidia-device-plugin pod.
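For reference, the commands involved:

# Problem "a": recreate the /dev/char symlinks
# (requires nvidia-container-toolkit >= 1.12, as per NVIDIA/nvidia-docker#1730)
sudo nvidia-ctk system create-dev-char-symlinks --create-all

# Verify the symlinks point at the NVIDIA device nodes
ls -l /dev/char | grep nvidia

# Problem "b": still reproducible even with the symlinks in place
sudo systemctl restart kubelet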

Elentary avatar Sep 05 '23 14:09 Elentary

This seems to be fixed on Kubernetes 1.29 when running the device plugin (latest version) with PASS_DEVICE_SPECS and a privileged security context (securityContext: privileged: true).

Without the privileged security context we get the following error:

I0721 08:33:33.031529       1 health.go:125] Marking device GPU-49289966-6d48-6232-8557-da2a26a62fe0 as unhealthy: Insufficient Permissions

The error comes from the gpu.RegisterEvents() call during checkHealth. Not sure why the permission issue happens only after a kubelet restart and not during the initial device plugin pod startup.
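A sketch of that setup as a strategic-merge patch on the DaemonSet; the DaemonSet, container, and namespace names are assumptions and will differ between deployments:

# Hypothetical sketch of the setup described above (names are assumptions).
kubectl -n kube-system patch daemonset nvidia-device-plugin-daemonset --patch '
spec:
  template:
    spec:
      containers:
      - name: nvidia-device-plugin-ctr
        securityContext:
          privileged: true
        env:
        - name: PASS_DEVICE_SPECS
          value: "true"
'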

rppala90 avatar Jul 22 '24 04:07 rppala90

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Feb 11 '25 04:02 github-actions[bot]