Device plugin does not start on MIG-enabled host due to insufficient permissions
1. Quick Debug Information
- OS/Version: Ubuntu 20.04
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): EKS 1.23
2. Issue or feature description
The MIG manager completes successfully, but gpu-feature-discovery and nvidia-device-plugin-daemonset then fail to start. I'm using the GPU Operator.
3. Information to attach (optional if deemed irrelevant)
device plugin error logs:
I0322 03:47:53.436006 1 main.go:154] Starting FS watcher.
I0322 03:47:53.436082 1 main.go:161] Starting OS watcher.
I0322 03:47:53.436353 1 main.go:176] Starting Plugins.
I0322 03:47:53.436370 1 main.go:234] Loading configuration.
I0322 03:47:53.436483 1 main.go:242] Updating config with default resource matching patterns.
I0322 03:47:53.436655 1 main.go:253]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0322 03:47:53.436664 1 main.go:256] Retreiving plugins.
I0322 03:47:53.437106 1 factory.go:107] Detected NVML platform: found NVML library
I0322 03:47:53.437140 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0322 03:47:53.548180 1 main.go:123] error starting plugins: error getting plugins: failed to construct NVML resource managers: error building device map: error building device map from config.resources: error building MIG device map: error visiting devices: error visiting device: error visiting device: error visiting MIG device: error visiting MIG device: error getting MIG profile for MIG device at index '(0, 0)': error getting parent memory info: Insufficient Permissions
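The error line above is a long chain of wrapped errors, and the part that matters is the innermost one. A minimal Python sketch for pulling it out, assuming the usual klog line layout (severity and date, time, PID, file:line, then the message) and that the layers are joined with ": ". The helper names `unwrap_error_chain` and `root_cause` are made up for illustration and are not part of the device plugin:

```python
import re

# Assumed klog header: severity letter + MMDD, time, PID, "file:line]", message.
KLOG_RE = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)\s+(\d+) ([^\]]+)\] (.*)$"
)


def unwrap_error_chain(line):
    """Return the wrapped error layers of a log line, outermost first."""
    match = KLOG_RE.match(line.strip())
    # Fall back to the whole line if the header does not match.
    message = match.group(6) if match else line.strip()
    return [part.strip() for part in message.split(": ") if part.strip()]


def root_cause(line):
    """Return the innermost error of the chain, or '' for an empty line."""
    layers = unwrap_error_chain(line)
    return layers[-1] if layers else ""


if __name__ == "__main__":
    sample = (
        "E0322 03:47:53.548180 1 main.go:123] error starting plugins: "
        "error getting MIG profile for MIG device at index '(0, 0)': "
        "error getting parent memory info: Insufficient Permissions"
    )
    print(root_cause(sample))
```

For the log above, the innermost layer is the NVML "Insufficient Permissions" error returned while reading the parent GPU's memory info.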
https://github.com/NVIDIA/k8s-device-plugin/issues/399 pointed to the environment variable NVIDIA_MIG_MONITOR_DEVICES; it is set to all on the container. The container is started with privileged: true in its securityContext.
The device plugin is v0.14.0. MIG is enabled with nvidia.com/mig.config: all-3g.40gb and its state is nvidia.com/mig.config.state: success
This definitely points to NVIDIA_MIG_MONITOR_DEVICES not being set correctly. Can you verify that this setting is actually being picked up in the container? That is, exec into the container and run export to see which envvars are set.
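One way to make that check repeatable is to capture the container's environment and test it in a script. A minimal Python sketch, assuming the captured text is the output of `env` or `export` and that the expected value is `all`, as reported in this issue. The function names `parse_env` and `mig_monitor_enabled` are hypothetical:

```python
def parse_env(text):
    """Parse `env` output (KEY=VALUE) or `export` output into a dict."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        # `export` in bash prints `declare -x KEY="VALUE"`; other shells print `export KEY=...`.
        for prefix in ("declare -x ", "export "):
            if line.startswith(prefix):
                line = line[len(prefix):]
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key] = value.strip('"').strip("'")
    return env


def mig_monitor_enabled(text, var="NVIDIA_MIG_MONITOR_DEVICES"):
    """Return True if the variable is present and set to 'all'."""
    return parse_env(text).get(var) == "all"


if __name__ == "__main__":
    captured = 'PATH=/usr/bin\nNVIDIA_MIG_MONITOR_DEVICES=all\n'
    print(mig_monitor_enabled(captured))
```

The input could come from something like `kubectl exec -n <namespace> <device-plugin-pod> -- env`, where the namespace and pod name are placeholders for your cluster.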
@yunfeng-scale since you mention the GPU Operator being used, could you please confirm the GPU Operator version that is being used to deploy the v0.14.0 version of the device plugin?
GPU Operator version is 22.9.1, driver version is 470.161.03. I will circle back on checking NVIDIA_MIG_MONITOR_DEVICES next week.
This definitely points to NVIDIA_MIG_MONITOR_DEVICES not being set correctly. Can you verify that this setting is actually being picked up in the container? That is, exec into the container and run export to see which envvars are set.
Sorry for the late reply. @klueska yes, I can confirm it is set correctly: I exec'd into a container and grepped the env vars.
The problem also persists with the latest GPU Operator, 23.9.2.
I tried installing gpu-feature-discovery from its own Helm chart (removing it from the GPU Operator), and that didn't work either.
Also, can you help me understand what sets the permissions based on the env var NVIDIA_MIG_MONITOR_DEVICES, so that I can investigate further?
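As a starting point for that investigation, here is a rough Python sketch that inspects the MIG monitor capability from inside the container. It rests on assumptions that are not confirmed in this thread: that the driver exposes the capability at /proc/driver/nvidia/capabilities/mig/monitor as "Key: value" lines including DeviceFileMinor, and that the matching device node is /dev/nvidia-caps/nvidia-cap<minor>. Both function names are hypothetical:

```python
import os

# Assumed location of the MIG monitor capability file (not confirmed in this thread).
DEFAULT_CAP_PATH = "/proc/driver/nvidia/capabilities/mig/monitor"


def parse_capability_file(text):
    """Parse 'Key: value' lines into a dict, keeping only integer values."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            fields[key.strip()] = int(value.strip())
    return fields


def describe_mig_monitor(cap_path=DEFAULT_CAP_PATH):
    """Report whether the capability file and its device node are accessible."""
    try:
        with open(cap_path) as handle:
            fields = parse_capability_file(handle.read())
    except OSError as exc:
        return f"cannot read {cap_path}: {exc}"
    minor = fields.get("DeviceFileMinor")
    if minor is None:
        return f"{cap_path} has no DeviceFileMinor entry"
    # Assumed device node naming: /dev/nvidia-caps/nvidia-cap<minor>.
    device = f"/dev/nvidia-caps/nvidia-cap{minor}"
    if not os.path.exists(device):
        return f"{device} is missing"
    return f"{device} present, readable={os.access(device, os.R_OK)}"


if __name__ == "__main__":
    print(describe_mig_monitor())
```

If the capability file or the device node turns out to be missing inside the container, that would suggest the runtime hook did not inject it, which is worth comparing between a working and a failing node.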
For others encountering the same issue: we upgraded EKS from 1.23 to 1.29 and the permission issue was resolved.