Device plugin does not start on MIG-enabled host due to insufficient permissions
1. Quick Debug Information
- OS/Version: Ubuntu 20.04
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): EKS 1.23
2. Issue or feature description
The MIG manager runs to completion successfully, but gpu-feature-discovery and nvidia-device-plugin-daemonset subsequently fail to start. I'm using the GPU operator.
3. Information to attach (optional if deemed irrelevant)
gpu-feature-discovery error logs:
I0322 03:45:26.435033 1 main.go:122] Starting OS watcher.
I0322 03:45:26.435190 1 main.go:127] Loading configuration.
I0322 03:45:26.435407 1 main.go:139]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "single",
"failOnInitError": true,
"gdsEnabled": null,
"mofedEnabled": null,
"gfd": {
"oneshot": false,
"noTimestamp": false,
"sleepInterval": "1m0s",
"outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
"machineTypeFile": "/sys/class/dmi/id/product_name"
}
},
"resources": {
"gpus": null
},
"sharing": {
"timeSlicing": {}
}
}
I0322 03:45:26.435751 1 factory.go:48] Detected NVML platform: found NVML library
I0322 03:45:26.435780 1 factory.go:48] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0322 03:45:26.435785 1 factory.go:64] Using NVML manager
I0322 03:45:26.435792 1 main.go:144] Start running
W0322 03:45:26.522503 1 main.go:161] Error removing output file: failed to remove output file: remove /etc/kubernetes/node-feature-discovery/features.d/gfd: no such file or directory
I0322 03:45:26.522512 1 main.go:119] Exiting
E0322 03:45:26.522523 1 main.go:95] error creating NVML labeler: error creating resource labeler: failed to construct GPU labeler: failed to construct labeler: failed to get memory info for device: Insufficient Permissions
device plugin error logs:
I0322 03:47:53.436006 1 main.go:154] Starting FS watcher.
I0322 03:47:53.436082 1 main.go:161] Starting OS watcher.
I0322 03:47:53.436353 1 main.go:176] Starting Plugins.
I0322 03:47:53.436370 1 main.go:234] Loading configuration.
I0322 03:47:53.436483 1 main.go:242] Updating config with default resource matching patterns.
I0322 03:47:53.436655 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "single",
"failOnInitError": true,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": true,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
],
"mig": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0322 03:47:53.436664 1 main.go:256] Retreiving plugins.
I0322 03:47:53.437106 1 factory.go:107] Detected NVML platform: found NVML library
I0322 03:47:53.437140 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0322 03:47:53.548180 1 main.go:123] error starting plugins: error getting plugins: failed to construct NVML resource managers: error building device map: error building device map from config.resources: error building MIG device map: error visiting devices: error visiting device: error visiting device: error visiting MIG device: error visiting MIG device: error getting MIG profile for MIG device at index '(0, 0)': error getting parent memory info: Insufficient Permissions
https://github.com/NVIDIA/k8s-device-plugin/issues/399 pointed to the environment variable NVIDIA_MIG_MONITOR_DEVICES; it is already set to all on the containers. The containers are also started with privileged: true in their securityContext.
The device plugin is v0.14.0. MIG is enabled with nvidia.com/mig.config: all-3g.40gb, and its state is nvidia.com/mig.config.state: success.
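Both logs above end in the same NVML error once the wrapped error chain is unwound. As a rough sketch for comparing pods quickly, the helper below prints only the innermost cause of each error line. The function name root_cause is hypothetical, and the sketch assumes glog-style lines in which errors start with "E" followed by the date and the wrapped messages are joined with ": ".

```shell
# Hypothetical helper: print the innermost cause of each glog error line.
# Assumes error lines start with "E<MMDD>" and that wrapped errors are
# joined with ": ", so the last field is the root cause.
root_cause() {
  grep '^E[0-9]' | awk -F': ' '{ print $NF }'
}
```

It reads log text on stdin, so the output of kubectl logs for either pod can be piped into it (for example, kubectl logs -n kube-system <pod-name> | root_cause, where <pod-name> is a placeholder for the failing pod).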
@yunfeng-scale please don't override any versions manually during helm install; use the defaults that come with the latest operator install. Only the --set migManager.env[0].name=WITH_REBOOT and --set-string migManager.env[0].value="true" options are required.
Thanks for the suggestion, I'll give it a try.
@shivamerla I used the following command, but it didn't work:
helm upgrade gpu-operator \
nvidia/gpu-operator \
--namespace kube-system \
--set mig.strategy=single \
--set "migManager.env[0].name=WITH_REBOOT" \
--set-string "migManager.env[0].value=true" \
--set migManager.enabled=true