GKE support

Open lmyslinski opened this issue 2 years ago • 3 comments

Hi, I'm trying to use gpu-feature-discovery on GKE, but I'm having trouble getting it to work.

  1. Initially the daemonset fails to create the discovery pods, as they require a system-critical priorityClassName, which by default is only allowed in the kube-system namespace. That is easy to fix by creating a custom ResourceQuota.
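For readers hitting the same admission error, here is a minimal sketch of such a ResourceQuota, built as a Python dict and printed as JSON (which `kubectl apply -f -` accepts). The quota name `gfd-critical-pods`, the pod limit of 100, and the choice of the `system-node-critical` and `system-cluster-critical` classes are assumptions for illustration; the namespace is taken from the log further down. Adjust them to match your deployment.

```python
import json


def priority_class_quota(namespace, priority_classes, max_pods):
    """Build a ResourceQuota that admits pods using the given priority classes.

    The quota name below is a hypothetical placeholder, not something the
    gpu-feature-discovery chart defines.
    """
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "gfd-critical-pods", "namespace": namespace},
        "spec": {
            # Quota values are strings in the Kubernetes API.
            "hard": {"pods": str(max_pods)},
            "scopeSelector": {
                "matchExpressions": [
                    {
                        "operator": "In",
                        "scopeName": "PriorityClass",
                        "values": list(priority_classes),
                    }
                ]
            },
        },
    }


if __name__ == "__main__":
    quota = priority_class_quota(
        "gpu-feature-discovery",  # namespace seen in the log output
        ["system-node-critical", "system-cluster-critical"],  # assumed classes
        100,  # assumed pod limit
    )
    print(json.dumps(quota, indent=2))
```

Saved as, say, `quota.py`, the output can be piped straight into the cluster with `python quota.py | kubectl apply -f -`.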

Once the feature discovery pod starts, it fails to find the NVML library:

I0524 15:35:55.517312       1 main.go:122] Starting OS watcher.
I0524 15:35:55.517547       1 main.go:127] Loading configuration.
I0524 15:35:55.517817       1 main.go:139]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "gdsEnabled": null,
    "mofedEnabled": null,
    "gfd": {
      "oneshot": false,
      "noTimestamp": false,
      "sleepInterval": "1m0s",
      "outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
      "machineTypeFile": "/sys/class/dmi/id/product_name"
    }
  },
  "resources": {
    "gpus": null
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0524 15:35:55.518183       1 factory.go:48] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0524 15:35:55.518224       1 factory.go:48] Detected non-Tegra platform: /sys/devices/soc0/family file not found
W0524 15:35:55.518252       1 factory.go:71] No valid resources detected; using empty manager.
I0524 15:35:55.518263       1 main.go:144] Start running
W0524 15:35:55.519177       1 main.go:184] No labels generated from any source
I0524 15:35:55.519191       1 main.go:187] Creating Labels
2023/05/24 15:35:55 Writing labels to NodeFeature CR
I0524 15:35:55.537127       1 main.go:119] Exiting
E0524 15:35:55.537151       1 main.go:95] failed to get NodeFeature object: nodefeatures.nfd.k8s-sigs.io "nvidia-features-for-" is forbidden: User "system:serviceaccount:gpu-feature-discovery:default" cannot get resource "nodefeatures" in API group "nfd.k8s-sigs.io" in the namespace "gpu-feature-discovery"
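
Note that the last line of the log is an RBAC denial and is separate from the NVML problem: the `default` service account in the `gpu-feature-discovery` namespace is not allowed to read `nodefeatures` objects. A minimal sketch of a Role and RoleBinding that would grant that access is shown below, again as Python emitting JSON for `kubectl`. The object name `gfd-nodefeatures` and the exact verb list are assumptions; only the API group, resource, namespace, and service account come from the error message.

```python
import json


def nodefeature_rbac(namespace, service_account):
    """Build a Role and RoleBinding granting access to NodeFeature objects."""
    name = "gfd-nodefeatures"  # hypothetical name, chosen for illustration
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": ["nfd.k8s-sigs.io"],
                "resources": ["nodefeatures"],
                # Assumed verbs; the log only proves that "get" is needed.
                "verbs": ["get", "list", "watch", "create", "update"],
            }
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": name, "namespace": namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": name,
        },
        "subjects": [
            {
                "kind": "ServiceAccount",
                "name": service_account,
                "namespace": namespace,
            }
        ],
    }
    # A v1 List lets both objects be applied in one kubectl call.
    return {"apiVersion": "v1", "kind": "List", "items": [role, binding]}


if __name__ == "__main__":
    print(json.dumps(nodefeature_rbac("gpu-feature-discovery", "default"), indent=2))
```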

I'm guessing that the reason for these is the lack of these 2 prerequisites:

The problem is that, based on the GKE docs, it's not possible to use Docker as the runtime anymore. All node images use containerd, and the NVIDIA driver is only installed via a daemonset after startup. This works fine for getting basic GPU support up:

Output from a cuda pod on the GPU:

root@my-gpu-pod:/# nvidia-smi
Wed May 24 15:42:06 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8     7W /  75W |      0MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

So... can you please share some insight as to whether this could work on GKE? I'm skeptical given the lack of nvidia-docker, but I'd love to get a confirmation, and if so is there any plan to address this?

Thanks a lot

lmyslinski avatar May 24 '23 15:05 lmyslinski

You're right in pointing out that the issue is that the NVIDIA Container Toolkit and its components are not available by default on GKE systems. This means that the logic to inject the required devices and libraries into the GFD container (based on the content of the NVIDIA_VISIBLE_DEVICES environment variable) is never triggered.

Note that it's not a requirement that Docker be used, and containerd (which I understand is used in this context) can be configured to use the NVIDIA Container Runtime as a runtime. If you're able to modify the containerd config on your nodes, this may be an option. The NVIDIA Container Runtime can be configured in containerd and as a runtime class in k8s. If the GFD pod is started using this runtime class, then it should have access to the required devices.
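
As a rough illustration of the Kubernetes side of that setup, the sketch below builds a RuntimeClass and a patch that sets `runtimeClassName` on a daemonset's pod template. The class and handler name `nvidia` is an assumption and must match whatever runtime name is registered in the node's containerd config; the containerd configuration itself is not covered here.

```python
import json


def runtime_class(name, handler):
    """Build a RuntimeClass that maps to a containerd runtime handler."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": handler,
    }


def runtime_class_patch(name):
    """Build a merge patch that sets runtimeClassName on a pod template."""
    return {"spec": {"template": {"spec": {"runtimeClassName": name}}}}


if __name__ == "__main__":
    # "nvidia" is an assumed name for both the class and the handler.
    print(json.dumps(runtime_class("nvidia", "nvidia"), indent=2))
    print(json.dumps(runtime_class_patch("nvidia")))
```

The first object can be applied with `kubectl apply -f -`; the second is the kind of JSON that `kubectl patch daemonset <name> -p '<json>'` takes, where the daemonset name depends on how GFD was installed.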

I don't have any more concrete answers as to when this will be addressed, but we are aware of the different experience on GKE and are working to address it. For now, since the driver files -- including libnvidia-ml.so.1 -- are available at a well-known location on the host and consumed by the device plugin used on GKE, it may be possible to modify your GFD configuration to also have access to these files so that they can be found.
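
One way to check whether such a change has worked is to probe for the library from inside the container. The sketch below tries to load `libnvidia-ml.so.1` from a list of candidate directories using `ctypes`, which fails with the same kind of "cannot open shared object file" error that GFD logs. The directory `/usr/local/nvidia/lib64` is an assumption about where the GKE driver libraries end up being mounted; substitute the path used in your pod spec.

```python
import ctypes
import os


def find_loadable_library(lib_name, search_dirs):
    """Return the first path under search_dirs where lib_name loads, else None."""
    for directory in search_dirs:
        candidate = os.path.join(directory, lib_name)
        if not os.path.isfile(candidate):
            continue
        try:
            # CDLL raises OSError when the file cannot be loaded as a library.
            ctypes.CDLL(candidate)
        except OSError:
            continue
        return candidate
    return None


if __name__ == "__main__":
    # Assumed mount point for the host driver libraries; adjust as needed.
    dirs = ["/usr/local/nvidia/lib64"]
    found = find_loadable_library("libnvidia-ml.so.1", dirs)
    if found:
        print(f"NVML library loaded from {found}")
    else:
        print(f"NVML library not loadable from any of: {dirs}")
```

If the probe finds the library but GFD still reports a non-NVML platform, the directory is probably missing from the container's library search path (for example `LD_LIBRARY_PATH`).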

elezar avatar May 25 '23 04:05 elezar

Thanks a lot for an in-depth answer @elezar, I'll poke around and see what I can find

lmyslinski avatar May 25 '23 11:05 lmyslinski

+1 - looking forward to seeing GFD working on GKE clusters.

romilbhardwaj avatar Jun 26 '23 18:06 romilbhardwaj

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Aug 26 '24 04:08 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Sep 25 '24 04:09 github-actions[bot]