
Following gpu-operator documentation will break RKE2 cluster after reboot

aiicore opened this issue 1 year ago • 4 comments

The RKE2 docs only mention passing RKE2's internal CONTAINERD_SOCKET to the toolkit: https://docs.rke2.io/advanced?_highlight=gpu#deploy-nvidia-operator

NVIDIA's docs additionally set CONTAINERD_CONFIG: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#rancher-kubernetes-engine-2

Following the gpu-operator documentation, the following happens:

  • gpu-operator will write its containerd config to /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
  • RKE2 will pick that file up as a template and render its own containerd config: /var/lib/rancher/rke2/agent/etc/containerd/config.toml
  • the cluster will not come up after a reboot, since the config provided by gpu-operator does not work with RKE2

The most significant errors in the logs would be:

Sep 13 14:08:23 rke2 rke2[10318]: time="2024-09-13T14:08:23Z" level=info msg="Pod for etcd not synced (pod sandbox has changed), retrying"
Sep 13 14:08:23 rke2 rke2[10318]: time="2024-09-13T14:08:23Z" level=info msg="Waiting for API server to become available"
Sep 13 14:08:25 rke2 rke2[10318]: time="2024-09-13T14:08:25Z" level=warning msg="Failed to list nodes with etcd role: runtime core not ready"
Sep 13 14:08:25 rke2 rke2[10318]: time="2024-09-13T14:08:25Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
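If a node is already stuck in this state, one way out is to remove the toolkit-written template and let RKE2 regenerate its config (a sketch; it assumes the toolkit daemonset has been reconfigured or scaled down first, so it does not immediately rewrite the file):

# Remove the template written by the nvidia toolkit
rm /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
# Restart RKE2 so it regenerates /var/lib/rancher/rke2/agent/etc/containerd/config.toml from its defaults
systemctl restart rke2-server    # use rke2-agent on worker nodes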

Following the RKE2 docs and passing only CONTAINERD_SOCKET works, because gpu-operator then writes its config (which does not work with RKE2) to /etc/containerd/config.toml, where it is harmless: containerd is not even installed at the OS level.

root@rke2:~# apt list --installed | grep containerd

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

root@rke2:~#
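A quick check of which config file the embedded containerd actually loads, versus the file the toolkit wrote (a sketch; the exact flags can vary between RKE2 versions):

# RKE2 starts its own containerd and points it at the rendered config
ps -ef | grep containerd | grep -v grep
# expect something like: containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml ...
# the file the toolkit wrote at the default path is not read by RKE2 at all
ls -l /etc/containerd/config.toml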

It looks like the containerd config provided by gpu-operator doesn't matter for RKE2 anyway, since RKE2 detects nvidia-container-runtime on its own and adds an nvidia runtime class to its own containerd config:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
  SystemdCgroup = true
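To confirm that the runtime RKE2 detected actually works, a minimal test pod (a sketch; it assumes a RuntimeClass named nvidia exists — gpu-operator normally creates it, and the RKE2 docs show creating it manually — and the CUDA image tag is only an example):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
  - name: nvidia-smi
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
kubectl logs nvidia-smi-test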

Steps to reproduce on Ubuntu 22.04:

Following NVIDIA's docs breaks the RKE2 cluster after a reboot:

helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia \
    --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
    --set-string toolkit.env[3].value=true

Following RKE2's docs works fine:

helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
    --set toolkit.env[0].name=CONTAINERD_SOCKET \
    --set toolkit.env[0].value=/run/k3s/containerd/containerd.sock
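The same install expressed as a values file, in case you manage the release declaratively (an equivalent sketch of the --set flags above, assuming the same chart):

cat <<EOF > rke2-values.yaml
toolkit:
  env:
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
EOF

helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator -f rke2-values.yaml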

Could someone verify the docs?

aiicore avatar Sep 16 '24 11:09 aiicore

Anyone on the NVIDIA team object to replacing our sample command with a reference to the RKE2 docs? That's my preference.

https://docs.rke2.io/advanced#deploy-nvidia-operator

mikemckiernan avatar Sep 16 '24 16:09 mikemckiernan

I'm using Ubuntu 22.04 with an NVIDIA RTX A2000 12GB and K8s 1.27.11+RKE2r1.

Is there any problem with using driver version 560 instead of the 535 indicated in the RKE2 doc?

DevFontes avatar Sep 19 '24 17:09 DevFontes

I'm fairly confident that using the 560 driver, or any driver covered in the product docs, is OK.

However, I'd like SME input from my teammates. When I followed the RKE2 doc, I found that I needed to specify runtimeClassName, like the sample nbody workload does. I can't choose what other people prefer or dislike, but I happen to dislike that approach.

mikemckiernan avatar Sep 20 '24 12:09 mikemckiernan

@mikemckiernan I think it's due to gpu-operator setting the nvidia runtime class as the default in containerd. RKE2 just adds another runtime, which in my opinion is the clearer approach. I don't know why gpu-operator has this option; maybe it's for consistency with Docker? I remember that a long time ago I needed to install the nvidia runtime for Docker and change Docker's default runtime to nvidia to make it work.

If gpu-operator worked correctly with RKE2, i.e. produced a valid config.toml.tmpl, the nvidia runtime class would be the default whenever CONTAINERD_SET_AS_DEFAULT=true.
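One way to check which runtime the rendered RKE2 config actually defaults to (a sketch; default_runtime_name is the standard containerd CRI setting):

grep -B2 -A2 'default_runtime_name' /var/lib/rancher/rke2/agent/etc/containerd/config.toml
# a stock RKE2 config keeps runc as the default and only adds nvidia as an extra runtime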

aiicore avatar Sep 20 '24 12:09 aiicore

I am not sure what version of the GPU operator you are using, but would the following values file work for you, @aiicore?

https://github.com/defenseunicorns/uds-rke2/blob/main/packages/nvidia-gpu-operator/values/nvidia-gpu-operator-values.yaml

justinthelaw avatar Nov 13 '24 19:11 justinthelaw

Is it possible to configure the nvidia toolkit not to restart/configure containerd at all? RKE2 configures the nvidia runtime as well.

xhejtman avatar Nov 23 '24 12:11 xhejtman

Anyone on the NVIDIA team object to replacing our sample command with a reference to the RKE2 docs? That's my preference.

https://docs.rke2.io/advanced#deploy-nvidia-operator

I tried this approach today, but it did not work. I did not install the drivers manually on the worker nodes since the documentation states that the nvidia-driver-daemonset can handle it.

 k get po 
NAME                                                          READY   STATUS     RESTARTS          AGE
gpu-feature-discovery-5wmwq                                   0/1     Init:0/1   0                 79s
gpu-feature-discovery-jfp2t                                   0/1     Init:0/1   0                 3m32s
gpu-feature-discovery-rnc67                                   0/1     Init:0/1   0                 2m23s
gpu-feature-discovery-sbgss                                   0/1     Init:0/1   0                 4m11s
gpu-operator-868d98fc79-qvbmf                                 1/1     Running    3 (4m36s ago)     33m
gpu-operator-node-feature-discovery-gc-74d9855689-8txzj       1/1     Running    6 (2m47s ago)     33m
gpu-operator-node-feature-discovery-master-5cb7f479cb-tfz26   1/1     Running    6 (2m47s ago)     33m
gpu-operator-node-feature-discovery-worker-h6bk4              1/1     Running    124 (103s ago)    17h
gpu-operator-node-feature-discovery-worker-n6djx              1/1     Running    147 (2m47s ago)   17h
gpu-operator-node-feature-discovery-worker-w8nl6              1/1     Running    136 (4m36s ago)   17h
gpu-operator-node-feature-discovery-worker-x99sd              1/1     Running    144 (3m56s ago)   17h
nvidia-container-toolkit-daemonset-cdttk                      0/1     Init:0/1   0                 79s
nvidia-container-toolkit-daemonset-lgqcv                      0/1     Init:0/1   0                 2m23s
nvidia-container-toolkit-daemonset-pggvt                      0/1     Init:0/1   0                 3m32s
nvidia-container-toolkit-daemonset-z9ss8                      0/1     Init:0/1   0                 4m11s
nvidia-dcgm-exporter-742dv                                    0/1     Init:0/1   0                 4m11s
nvidia-dcgm-exporter-bf752                                    0/1     Init:0/1   0                 3m32s
nvidia-dcgm-exporter-vc2lp                                    0/1     Init:0/1   0                 79s
nvidia-dcgm-exporter-xmc6f                                    0/1     Init:0/1   0                 2m23s
nvidia-device-plugin-daemonset-9lq2p                          0/1     Init:0/1   0                 4m11s
nvidia-device-plugin-daemonset-lzjrj                          0/1     Init:0/1   0                 3m32s
nvidia-device-plugin-daemonset-sjh8m                          0/1     Init:0/1   0                 79s
nvidia-device-plugin-daemonset-t5466                          0/1     Init:0/1   0                 2m23s
nvidia-driver-daemonset-g7b5j                                 0/1     Running    215 (103s ago)    17h
nvidia-driver-daemonset-v6g9b                                 0/1     Running    197 (4m36s ago)   17h
nvidia-driver-daemonset-wlbjd                                 0/1     Running    220 (3m56s ago)   17h
nvidia-driver-daemonset-zc24q                                 0/1     Running    209 (2m47s ago)   17h
nvidia-operator-validator-6kdjz                               0/1     Init:0/4   0                 4m11s
nvidia-operator-validator-bjm2r                               0/1     Init:0/4   0                 3m32s
nvidia-operator-validator-jdjmt                               0/1     Init:0/4   0                 79s
nvidia-operator-validator-z8rpb                               0/1     Init:0/4   0                 2m23s

 k logs -f nvidia-container-toolkit-daemonset-jts8j
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
Error from server: Get "https://10.250.23.106:10250/containerLogs/gpu-operator/nvidia-container-toolkit-daemonset-jts8j/nvidia-container-toolkit-ctr?follow=true": proxy error from 127.0.0.1:9345 while dialing 10.250.23.106:10250, code 502: 502 Bad Gateway
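When the daemonsets all sit in Init like this, the init containers and the driver pods usually say why; a generic triage sketch (pod names taken from the listing above):

kubectl -n gpu-operator logs nvidia-driver-daemonset-g7b5j --all-containers --tail=50
kubectl -n gpu-operator describe pod nvidia-operator-validator-6kdjz
# the 502 via 127.0.0.1:9345 suggests the agent/kubelet on that node is itself unhealthy
kubectl get nodes -o wide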

govindkailas avatar Mar 27 '25 18:03 govindkailas

Same problem here @govindkailas with v1.31.7+rke2r1

danielphilipp avatar Apr 11 '25 10:04 danielphilipp

Maybe I found a fix. On the GPU nodes, add /usr/local/nvidia/toolkit/ to your PATH environment variable, because since RKE2 1.28.15 runtimes are detected based on PATH (see the bottom of https://docs.rke2.io/advanced#operator-installation). The NVIDIA runtime binary is placed in this directory by the GPU Operator toolkit. With the modified PATH and the instructions at https://docs.rke2.io/advanced it works for me. You still have to use runtimeClassName: nvidia in your pods, though.
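For reference, one way to make that PATH change persist for the RKE2 service is a systemd drop-in (a sketch; it assumes the default rke2-agent unit name — use rke2-server on server nodes — and note that Environment= replaces PATH entirely, so keep the usual system directories in it):

mkdir -p /etc/systemd/system/rke2-agent.service.d
cat <<EOF > /etc/systemd/system/rke2-agent.service.d/nvidia-path.conf
[Service]
Environment="PATH=/usr/local/nvidia/toolkit:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
EOF
systemctl daemon-reload
systemctl restart rke2-agent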

danielphilipp avatar Apr 11 '25 11:04 danielphilipp

Hey @danielphilipp , I've taken down the cluster. I'll give this approach a shot in the next few days.

govindkailas avatar Apr 16 '25 03:04 govindkailas

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] avatar Nov 04 '25 22:11 github-actions[bot]