gpu-operator

Device plugin does not start on MIG-enabled host due to insufficient permissions

Open yunfeng-scale opened this issue 1 year ago • 2 comments

1. Quick Debug Information

  • OS/Version: Ubuntu 20.04
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): EKS 1.23

2. Issue or feature description

MIG manager runs to completion successfully, but afterwards gpu-feature-discovery and nvidia-device-plugin-daemonset fail to start. I'm using the GPU Operator.

3. Information to attach (optional if deemed irrelevant)

gpu-feature-discovery error logs:

I0322 03:45:26.435033       1 main.go:122] Starting OS watcher.
I0322 03:45:26.435190       1 main.go:127] Loading configuration.
I0322 03:45:26.435407       1 main.go:139] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "gdsEnabled": null,
    "mofedEnabled": null,
    "gfd": {
      "oneshot": false,
      "noTimestamp": false,
      "sleepInterval": "1m0s",
      "outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
      "machineTypeFile": "/sys/class/dmi/id/product_name"
    }
  },
  "resources": {
    "gpus": null
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0322 03:45:26.435751       1 factory.go:48] Detected NVML platform: found NVML library
I0322 03:45:26.435780       1 factory.go:48] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0322 03:45:26.435785       1 factory.go:64] Using NVML manager
I0322 03:45:26.435792       1 main.go:144] Start running
W0322 03:45:26.522503       1 main.go:161] Error removing output file: failed to remove output file: remove /etc/kubernetes/node-feature-discovery/features.d/gfd: no such file or directory
I0322 03:45:26.522512       1 main.go:119] Exiting
E0322 03:45:26.522523       1 main.go:95] error creating NVML labeler: error creating resource labeler: failed to construct GPU labeler: failed to construct labeler: failed to get memory info for device: Insufficient Permissions

device plugin error logs:

I0322 03:47:53.436006       1 main.go:154] Starting FS watcher.
I0322 03:47:53.436082       1 main.go:161] Starting OS watcher.
I0322 03:47:53.436353       1 main.go:176] Starting Plugins.
I0322 03:47:53.436370       1 main.go:234] Loading configuration.
I0322 03:47:53.436483       1 main.go:242] Updating config with default resource matching patterns.
I0322 03:47:53.436655       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0322 03:47:53.436664       1 main.go:256] Retreiving plugins.
I0322 03:47:53.437106       1 factory.go:107] Detected NVML platform: found NVML library
I0322 03:47:53.437140       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0322 03:47:53.548180       1 main.go:123] error starting plugins: error getting plugins: failed to construct NVML resource managers: error building device map: error building device map from config.resources: error building MIG device map: error visiting devices: error visiting device: error visiting device: error visiting MIG device: error visiting MIG device: error getting MIG profile for MIG device at index '(0, 0)': error getting parent memory info: Insufficient Permissions

https://github.com/NVIDIA/k8s-device-plugin/issues/399 pointed to the environment variable NVIDIA_MIG_MONITOR_DEVICES; it is set to all on the containers. The containers are started with privileged: true in their securityContext.

The device plugin is v0.14.0. MIG is enabled with nvidia.com/mig.config: all-3g.40gb, and its state is nvidia.com/mig.config.state: success.
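
For anyone reproducing this, the settings described above (the NVIDIA_MIG_MONITOR_DEVICES variable, the privileged securityContext, and the MIG labels) could be checked with something like the sketch below. The helper names are hypothetical, and the kube-system namespace is an assumption taken from the helm command later in this thread; adjust both to your install.

```shell
#!/usr/bin/env bash
# Sketch only: kube-system is assumed from the helm command in this thread;
# the helper names below are hypothetical.
NAMESPACE="${NAMESPACE:-kube-system}"

# List every container env entry of a daemonset as name=value lines,
# e.g. to look for NVIDIA_MIG_MONITOR_DEVICES=all.
ds_env() {
  kubectl -n "$NAMESPACE" get daemonset "$1" \
    -o jsonpath='{range .spec.template.spec.containers[*].env[*]}{.name}={.value}{"\n"}{end}'
}

# Show each container's name and its securityContext.privileged value.
ds_privileged() {
  kubectl -n "$NAMESPACE" get daemonset "$1" \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.securityContext.privileged}{"\n"}{end}'
}

# Show the MIG config labels on a node as extra columns.
mig_labels() {
  kubectl get node "$1" -L nvidia.com/mig.config -L nvidia.com/mig.config.state
}
```

For example, `ds_env nvidia-device-plugin-daemonset | grep NVIDIA_MIG_MONITOR_DEVICES` and `ds_privileged gpu-feature-discovery` would confirm the two container settings, and `mig_labels <node-name>` the label state.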

yunfeng-scale avatar Mar 22 '24 04:03 yunfeng-scale

@yunfeng-scale please don't override any versions manually during helm install; use the defaults that come with the latest operator install. Only the --set migManager.env[0].name=WITH_REBOOT and --set-string migManager.env[0].value="true" options are required.

shivamerla avatar Mar 25 '24 23:03 shivamerla

thanks for the suggestion, i'll give it a try

yunfeng-scale avatar Mar 29 '24 23:03 yunfeng-scale

@shivamerla I used the following command, but it didn't work:

helm upgrade gpu-operator \
   nvidia/gpu-operator \
   --namespace kube-system \
   --set mig.strategy=single \
   --set "migManager.env[0].name=WITH_REBOOT" \
   --set-string "migManager.env[0].value=true" \
   --set migManager.enabled=true
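
A possible follow-up check, sketched under assumptions: confirm which values the release actually stores after the upgrade, then pull recent logs from the failing daemonsets. The release name and namespace come from the command above; the helper names and the default of 50 log lines are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: release name and namespace are taken from the helm command above;
# the helper names are hypothetical.
RELEASE="${RELEASE:-gpu-operator}"
NAMESPACE="${NAMESPACE:-kube-system}"

# Print the user-supplied values stored for the release.
release_values() {
  helm get values "$RELEASE" --namespace "$NAMESPACE"
}

# Print the last lines of log output from one pod of a daemonset
# (second argument: number of lines, 50 if omitted).
ds_logs() {
  kubectl -n "$NAMESPACE" logs "daemonset/$1" --tail="${2:-50}"
}
```

For example, `release_values` followed by `ds_logs nvidia-device-plugin-daemonset` and `ds_logs gpu-feature-discovery` would show whether the WITH_REBOOT setting was recorded and whether the Insufficient Permissions error persists.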

yunfeng-scale avatar Apr 26 '24 03:04 yunfeng-scale