k8s-device-plugin
GPU gets marked as unhealthy on systemctl daemon-reloads + kubelet restarts (on Kubernetes Upgrades)
1. Issue or feature description
During a Kubernetes upgrade, or more specifically after a systemctl daemon-reload followed by a kubelet restart, the nvdp-Pod marks the GPU as unhealthy, but does not crash or fail.
As a result, the GPU becomes unavailable.
Because the Pod merely marks the GPU as unhealthy while itself continuing to appear healthy and keep running, there is no way to detect that the GPU isn't working without checking the logs of the nvdp-Pod.
To fix the situation, one can delete the corresponding nvdp-Pod. A new one will be created (as the nvdp-Pods are managed by a DaemonSet) and the GPU becomes available again.
We propose adding a liveness probe to the nvdp-Pod that checks whether the GPU is considered healthy and fails if it is not, so that the Pod is restarted automatically. However, we are not entirely sure what impact restarting the nvdp-Pod has on existing workloads.
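To make the proposal concrete, here is a minimal sketch of what such a probe could look like on the device-plugin container. This is purely illustrative: the plugin does not ship a health endpoint today, and both the probe binary (grpc_health_probe) and the assumption that the plugin's gRPC socket would reflect device health are ours, not part of the project.

```yaml
# Illustrative sketch only: assumes the plugin exposed a gRPC health
# service that starts reporting NotServing once a device is marked
# unhealthy. grpc_health_probe, its unix-socket addressing and the
# thresholds below are assumptions, not shipped behaviour.
livenessProbe:
  exec:
    command:
      - /bin/grpc_health_probe
      - -addr=unix:///var/lib/kubelet/device-plugins/nvidia-gpu.sock
  initialDelaySeconds: 30
  periodSeconds: 60
  failureThreshold: 3
```

Whether such a restart is safe for Pods that already have the GPU allocated is exactly the open question raised above.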
2. Steps to reproduce the issue
- Create a Kubernetes cluster (we confirmed this problem on K8s v1.24 and K8s v1.25) with GPU workers
- Verify the nvdp-Pod is working
- Jump onto the GPU host
- Run systemctl daemon-reload
- Run systemctl restart kubelet
- Check the nvdp-Pod logs
3. Information to attach (optional if deemed irrelevant)
nvdp logs:
I0724 06:37:04.083629 1 main.go:256] Retreiving plugins.
I0724 06:37:04.084401 1 factory.go:107] Detected NVML platform: found NVML library
I0724 06:37:04.084467 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0724 06:37:04.106831 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0724 06:37:04.107703 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0724 06:37:04.110323 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I0724 06:48:39.381063 1 main.go:202] inotify: /var/lib/kubelet/device-plugins/kubelet.sock created, restarting.
I0724 06:48:39.382317 1 main.go:294] Stopping plugins.
I0724 06:48:39.382373 1 server.go:142] Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0724 06:48:39.382576 1 main.go:176] Starting Plugins.
I0724 06:48:39.382587 1 main.go:234] Loading configuration.
I0724 06:48:39.382913 1 main.go:242] Updating config with default resource matching patterns.
I0724 06:48:39.383190 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": true,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0724 06:48:39.383199 1 main.go:256] Retreiving plugins.
I0724 06:48:39.383365 1 factory.go:107] Detected NVML platform: found NVML library
I0724 06:48:39.383407 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0724 06:48:39.385318 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0724 06:48:39.388256 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0724 06:48:39.395465 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I0724 06:58:18.693207 1 main.go:202] inotify: /var/lib/kubelet/device-plugins/kubelet.sock created, restarting.
I0724 06:58:18.693347 1 main.go:294] Stopping plugins.
I0724 06:58:18.693390 1 server.go:142] Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0724 06:58:18.693627 1 main.go:176] Starting Plugins.
I0724 06:58:18.693655 1 main.go:234] Loading configuration.
I0724 06:58:18.694173 1 main.go:242] Updating config with default resource matching patterns.
I0724 06:58:18.694783 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": true,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0724 06:58:18.694906 1 main.go:256] Retreiving plugins.
I0724 06:58:18.695315 1 factory.go:107] Detected NVML platform: found NVML library
I0724 06:58:18.695362 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0724 06:58:18.697166 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0724 06:58:18.698602 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0724 06:58:18.706961 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I0724 06:58:18.707468 1 health.go:126] Marking device GPU-aa8e5d5f-b2cd-6464-cbec-3bc0feaddce0 as unhealthy: Unknown Error
I0724 06:58:18.707579 1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-aa8e5d5f-b2cd-6464-cbec-3bc0feaddce0
Please see this notice from February: https://github.com/NVIDIA/nvidia-docker/issues/1730
The long-term fix is to start using CDI instead of the legacy GPU container stack (which we have support for in the device plugin, but unfortunately it's not well documented yet).
What I mention above should fix the underlying issue of the GPU going unhealthy. We are also looking to improve the error handling story of the plugin (as you suggest), but we don't have a concrete design in place yet. The overall idea will be to leverage DCGM (when available) and fall back to (improved) NVML monitoring with better visibility when something goes wrong.
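For reference, a rough sketch of what switching the plugin to a CDI-based device list strategy might look like in its config file, derived from the JSON config dump in the logs above. The YAML layout and the strategy value cdi-annotations are assumptions, not documented guidance.

```yaml
# Rough sketch inferred from the JSON config dump above; the
# "cdi-annotations" strategy value and this file layout are assumptions,
# not a documented recipe.
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  plugin:
    deviceListStrategy:
      - "cdi-annotations"          # instead of the default "envvar"
    deviceIDStrategy: "uuid"
    cdiAnnotationPrefix: "cdi.k8s.io/"
resources:
  gpus:
    - pattern: "*"
      name: "nvidia.com/gpu"
```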
Hello @klueska ,
thanks for that quick response and sorry for my delayed answer!
Also thanks for pointing me to https://github.com/NVIDIA/nvidia-docker/issues/1730, I was not aware of this issue. We'll investigate the suggested workarounds there. The only option for our specific case seems to be to switch to the GPU operator. The udev rule did not work out in a test scenario (although I have to admit there may be a problem on our side); we'll dig deeper. Downgrading containerd is not an option for us due to compliance reasons.
Regarding the fourth option, disabling systemd cgroup management in containerd, we observed some time ago that this causes problems which lead to an unusable GPU in some circumstances: https://gitlab.com/yaook/k8s/-/commit/36dbbaeb6e091e8daa81ab892ed3fdf1bb1fe4d5
@klueska I can confirm that the issue still exists with nvidia-container-toolkit==1.12.0-1 and symlinks present.
We investigated this deeper and came to the following conclusions:
- There are two separate problems (causes) leading to unhealthy GPUs:
  a. reloading the systemd daemon (systemctl daemon-reload) with missing symlinks
  b. restarting the kubelet service (even when symlinks are present)
- Problem "A" was fixed by executing sudo nvidia-ctk system create-dev-char-symlinks --create-all after the 1.12 version of the toolkit (as stated in NVIDIA/nvidia-docker#1730)
- Problem "B" is still reproducible by restarting kubelet via systemctl restart kubelet. It has the same effect for GPUs - they become unhealthy in existing containers, and any subsequent call of nvidia-smi inside these containers leads to "Unknown error". The same error messages as mentioned in the issue description are observed in the nvidia-device-plugin pod.
This seems to be fixed in Kubernetes 1.29 when running the device plugin (latest version) with PASS_DEVICE_SPECS and a privileged security context:
securityContext:
  privileged: true
Getting the following error without a privileged security context:
I0721 08:33:33.031529 1 health.go:125] Marking device GPU-49289966-6d48-6232-8557-da2a26a62fe0 as unhealthy: Insufficient Permissions
The above error comes from the gpu.RegisterEvents() call during checkHealth.
Not sure why permission issues happen only after a kubelet restart and not during the initial device plugin pod startup.
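For reference, the fields mentioned above would sit on the device-plugin container roughly as sketched below; only PASS_DEVICE_SPECS and the privileged securityContext come from this thread, the rest (image tag, volume mounts) is assumed boilerplate.

```yaml
# Illustrative fragment of the device-plugin DaemonSet pod spec; only
# PASS_DEVICE_SPECS and the privileged securityContext come from the
# comment above, the remaining fields are assumed boilerplate.
containers:
  - name: nvidia-device-plugin-ctr
    image: nvcr.io/nvidia/k8s-device-plugin:latest   # "latest version" per the comment
    env:
      - name: PASS_DEVICE_SPECS
        value: "true"
    securityContext:
      privileged: true
    volumeMounts:
      - name: device-plugin
        mountPath: /var/lib/kubelet/device-plugins
volumes:
  - name: device-plugin
    hostPath:
      path: /var/lib/kubelet/device-plugins
```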
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.