
Device plugin does not start on MIG-enabled host due to insufficient permissions

Open · yunfeng-scale opened this issue 1 year ago · 5 comments

1. Quick Debug Information

  • OS/Version: Ubuntu 20.04
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): EKS 1.23

2. Issue or feature description

The MIG manager runs through successfully, but gpu-feature-discovery and the nvidia-device-plugin-daemonset subsequently fail to start. I'm using the GPU Operator.

3. Information to attach (optional if deemed irrelevant)

device plugin error logs:

I0322 03:47:53.436006       1 main.go:154] Starting FS watcher.
I0322 03:47:53.436082       1 main.go:161] Starting OS watcher.
I0322 03:47:53.436353       1 main.go:176] Starting Plugins.
I0322 03:47:53.436370       1 main.go:234] Loading configuration.
I0322 03:47:53.436483       1 main.go:242] Updating config with default resource matching patterns.
I0322 03:47:53.436655       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0322 03:47:53.436664       1 main.go:256] Retreiving plugins.
I0322 03:47:53.437106       1 factory.go:107] Detected NVML platform: found NVML library
I0322 03:47:53.437140       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0322 03:47:53.548180       1 main.go:123] error starting plugins: error getting plugins: failed to construct NVML resource managers: error building device map: error building device map from config.resources: error building MIG device map: error visiting devices: error visiting device: error visiting device: error visiting MIG device: error visiting MIG device: error getting MIG profile for MIG device at index '(0, 0)': error getting parent memory info: Insufficient Permissions

https://github.com/NVIDIA/k8s-device-plugin/issues/399 pointed to the environment variable NVIDIA_MIG_MONITOR_DEVICES; it is set to all in the container. The container is started with privileged: true in its securityContext.

The device plugin is v0.14.0. MIG is enabled with nvidia.com/mig.config: all-3g.40gb and its state is nvidia.com/mig.config.state: success.

yunfeng-scale, Mar 22 '24
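For context, the MIG labels mentioned above live on the node object and can be checked directly; a minimal check, assuming kubectl access to the cluster:

# MIG-related labels set on the nodes by the GPU Operator / MIG manager
kubectl get nodes -L nvidia.com/mig.config -L nvidia.com/mig.config.state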

This definitely points to NVIDIA_MIG_MONITOR_DEVICES not being set correctly. Can you verify that this setting is actually being picked up in the container? That is, exec into the container and run export to observe the envvars that are set.

klueska, Mar 23 '24
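For anyone following along, one way to run that check, assuming the plugin pod is deployed by the GPU Operator (namespace and pod name below are placeholders):

# Locate the device-plugin pod on the affected node, then inspect its environment
kubectl get pods -A -o wide | grep nvidia-device-plugin
kubectl exec -n <namespace> <nvidia-device-plugin-pod> -- env | grep NVIDIA_MIG_MONITOR_DEVICES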

@yunfeng-scale since you mention the GPU Operator being used, could you please confirm the GPU Operator version that is being used to deploy the v0.14.0 version of the device plugin?

elezar, Mar 25 '24

The GPU Operator version is 22.9.1 and the driver version is 470.161.03. Will circle back on checking NVIDIA_MIG_MONITOR_DEVICES next week.

yunfeng-scale, Mar 29 '24
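As an aside, the MIG state on the host can also be confirmed with nvidia-smi, run either on the node or from the driver container; two read-only queries that should work with an R470-series driver:

# Current MIG mode per GPU (expected to report Enabled on a MIG-enabled host)
nvidia-smi --query-gpu=index,mig.mode.current --format=csv

# List GPUs and their MIG devices (the 3g.40gb instances should appear here)
nvidia-smi -L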

> This definitely points to NVIDIA_MIG_MONITOR_DEVICES not being set correctly. Can you verify that this setting is actually being picked up in the container? That is, exec into the container and run export to observe the envvars that are set.

Sorry for the late reply. @klueska yes, I can confirm this is set correctly by exec'ing into a container and grepping the env vars.

The problem also persists with the latest GPU Operator, 23.9.2.

yunfeng-scale, Apr 26 '24

I tried installing gpu-feature-discovery from its own Helm chart (removing it from the GPU Operator) and that didn't work either.

Also, can you help me understand what sets the permissions based on the env var NVIDIA_MIG_MONITOR_DEVICES, so I can do some investigation?

yunfeng-scale, Apr 26 '24
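On the permission question: per the NVIDIA MIG documentation, NVIDIA_MIG_MONITOR_DEVICES=all is interpreted by the NVIDIA container runtime (libnvidia-container), which injects the MIG monitor capability device node into the container; without that capability, NVML calls that read parent-GPU or other-process MIG information fail with Insufficient Permissions. A rough way to check from inside the plugin container whether the capability is actually visible (paths can vary with driver version and host configuration):

# The monitor capability is described under the driver's capabilities tree
cat /proc/driver/nvidia/capabilities/mig/monitor

# The corresponding nvidia-caps device nodes must be exposed and readable in the container
ls -l /dev/nvidia-caps/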

For others encountering the same issue: we upgraded EKS from 1.23 to 1.29 and the permission issue was resolved.

yunfeng-scale, May 08 '24