GPU Operator installation fails; pods are not Running after install
My system is as follows:

Static hostname: ws
Icon name: computer-server
Chassis: server
Machine ID: 46899e8e80ed4cac8d5f165ba5bc609e
Boot ID: 5549cc0f6001472db57cda9c8ba159a3
Operating System: Ubuntu 22.04.4 LTS
Kernel: Linux 5.15.0-117-generic
Architecture: x86-64
Hardware Vendor: New H3C Technologies Co., Ltd.

I set up a Kubernetes cluster with containerd as the container runtime, and ran into problems deploying gpu-operator.

1. First, I installed with the following commands:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --version 22.9.1 --create-namespace --namespace gpu-operator-resources --devel nvidia/gpu-operator --wait --generate-name

2. After installation, I checked the pods:

kubectl get pods --all-namespaces | grep -v kube-system

NAMESPACE                NAME                                                              READY   STATUS                  RESTARTS         AGE
gpu-operator-resources   gpu-feature-discovery-mxj26                                       0/1     Init:CrashLoopBackOff   12 (45s ago)     37m
gpu-operator-resources   gpu-operator-1724229368-node-feature-discovery-master-d6fbhh9nt   1/1     Running                 0                37m
gpu-operator-resources   gpu-operator-1724229368-node-feature-discovery-worker-v4pnf       0/1     CrashLoopBackOff        10 (15s ago)     37m
gpu-operator-resources   gpu-operator-7bd648f56b-tqb9l                                     1/1     Running                 0                37m
gpu-operator-resources   nvidia-container-toolkit-daemonset-f57c5                          0/1     Init:0/1                0                37m
gpu-operator-resources   nvidia-dcgm-exporter-srw42                                        0/1     Init:CrashLoopBackOff   12 (63s ago)     37m
gpu-operator-resources   nvidia-device-plugin-daemonset-2psnf                              0/1     Init:CrashLoopBackOff   12 (52s ago)     37m
gpu-operator-resources   nvidia-driver-daemonset-mbxnh                                     0/1     Running                 10 (5m12s ago)   37m
gpu-operator-resources   nvidia-operator-validator-q7q8r                                   0/1     Init:CrashLoopBackOff   12 (35s ago)     37m
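Almost all of the failing pods above are stuck in an init container, which suggests they are blocked on a shared dependency rather than failing independently. A minimal sketch to confirm this pattern (the custom-columns output is rough, but enough to see which init containers are waiting and why):

kubectl get pods -n gpu-operator-resources \
  -o custom-columns='NAME:.metadata.name,INIT_WAIT_REASON:.status.initContainerStatuses[*].state.waiting.reason'

In the GPU Operator, the validator init containers wait for the driver to come up, so if they are all crashlooping together, the driver pod is the first place to look.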
3. The describe and logs of most pods point at the driver, so it looks like nvidia-driver-daemonset-mbxnh is what blocks the other pods. Here are its describe events and logs.

describe:

Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  40m                  default-scheduler  Successfully assigned gpu-operator-resources/nvidia-driver-daemonset-mbxnh to ws
  Normal   Pulled     40m                  kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.5.1" already present on machine
  Normal   Created    40m                  kubelet            Created container k8s-driver-manager
  Normal   Started    40m                  kubelet            Started container k8s-driver-manager
  Normal   Pulled     39m (x2 over 40m)    kubelet            Container image "nvcr.io/nvidia/driver:525.60.13-ubuntu22.04" already present on machine
  Normal   Created    39m (x2 over 40m)    kubelet            Created container nvidia-driver-ctr
  Normal   Started    39m (x2 over 40m)    kubelet            Started container nvidia-driver-ctr
  Warning  Unhealthy  37m                  kubelet            Startup probe errored: rpc error: code = NotFound desc = failed to exec in container: failed to load task: no running task found: task a814593a4fafebc899239c3ac84ea4a78abb76080f87988ad1af4d3380601391 not found: not found
  Warning  Unhealthy  10m (x59 over 40m)   kubelet            Startup probe failed:
  Warning  BackOff    50s (x108 over 37m)  kubelet            Back-off restarting failed container

logs:

kubectl logs nvidia-driver-daemonset-mbxnh -n gpu-operator-resources
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-525.60.13
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.60.13 ...
WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 525.60.13 for Linux kernel version 5.15.0-117-generic
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jammy/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jammy-updates/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jammy-security/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

The log shows the pod cannot reach that site, but I can ping it from the host:

ping archive.ubuntu.com
PING archive.ubuntu.com (91.189.91.82) 56(84) bytes of data.
64 bytes from ubuntu-mirror-2.ps6.canonical.com (91.189.91.82): icmp_seq=1 ttl=41 time=415 ms
64 bytes from ubuntu-mirror-2.ps6.canonical.com (91.189.91.82): icmp_seq=3 ttl=41 time=419 ms
64 bytes from ubuntu-mirror-2.ps6.canonical.com (91.189.91.82): icmp_seq=4 ttl=41 time=412 ms
64 bytes from ubuntu-mirror-2.ps6.canonical.com (91.189.91.82): icmp_seq=5 ttl=41 time=411 ms
64 bytes from ubuntu-mirror-2.ps6.canonical.com (91.189.91.82): icmp_seq=6 ttl=41 time=413 ms

I have also reinstalled several times, and sometimes the logs do not contain this unreachable-repository error. Note that the "Could not resolve Linux kernel version" failure appears to follow directly from the failed package-cache update: the installer needs the apt index to locate kernel packages matching 5.15.0-117-generic, and without it the install aborts.

4. Later I checked the daemon configuration and it is correct. I had previously pinned (held) the system kernel, and I do not know whether that affects installing the driver inside Kubernetes.

5. I want to know how to get all the pods into the Running state.
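A successful ping from the host does not prove much here: the host uses its own resolver, while the driver pod (which has a Calico pod IP, not the host IP) resolves names through the cluster DNS. A minimal sketch to test resolution from inside the cluster (the busybox image is just an example):

kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 \
  -- nslookup archive.ubuntu.com

kubectl get pods -n kube-system -l k8s-app=kube-dns

If nslookup fails here while it works on the host, the problem is the cluster DNS (or the upstream resolver it forwards to), not the mirror itself.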
For reference, my helm is version.BuildInfo{Version:"v3.10.2", GitCommit:"50f003e5ee8704ec937a756c646870227d7c8b58", GitTreeState:"clean", GoVersion:"go1.18.8"}, and the gpu-operator version I am using is 22.9.1. Below are the logs and describe output from another attempt (a new driver pod, nvidia-driver-daemonset-hqd5x):
kubectl logs nvidia-driver-daemonset-hqd5x -n gpu-operator-resources
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-525.60.13
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.60.13 ...
WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 525.60.13 for Linux kernel version 5.15.0-117-generic
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...

(the log ends here, apparently stuck at the package-cache update)
kubectl describe pod nvidia-driver-daemonset-hqd5x -n gpu-operator-resources
Name: nvidia-driver-daemonset-hqd5x
Namespace: gpu-operator-resources
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-driver
Node: ws/10.227.8.62
Start Time: Wed, 21 Aug 2024 17:47:51 +0800
Labels: app=nvidia-driver-daemonset
controller-revision-hash=79b496c599
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: d9b48d363449afdd8b5d9c0ddfff2e5348b04d501ec51b35d27c6882176d05fd
cni.projectcalico.org/podIP: 192.168.33.82/32
cni.projectcalico.org/podIPs: 192.168.33.82/32
Status: Running
IP: 192.168.33.82
IPs:
IP: 192.168.33.82
Controlled By: DaemonSet/nvidia-driver-daemonset
Init Containers:
k8s-driver-manager:
Container ID: containerd://4ab5e738b3d4213ed9c5ad661a13827f88f4f5e148356db5cefb283cbfc078cd
Image: nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.5.1
Image ID: nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:6240c5912aabed789c672f3179b4a65e45511d10fa8c41a5de0d91644a792b14
Port:
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  20m                 default-scheduler  Successfully assigned gpu-operator-resources/nvidia-driver-daemonset-hqd5x to ws
  Normal   Pulled     20m                 kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.5.1" already present on machine
  Normal   Created    20m                 kubelet            Created container k8s-driver-manager
  Normal   Started    20m                 kubelet            Started container k8s-driver-manager
  Normal   Pulled     20m                 kubelet            Container image "nvcr.io/nvidia/driver:525.60.13-ubuntu22.04" already present on machine
  Normal   Created    20m                 kubelet            Created container nvidia-driver-ctr
  Normal   Started    20m                 kubelet            Started container nvidia-driver-ctr
  Warning  Unhealthy  19s (x98 over 19m)  kubelet            Startup probe failed:
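The recurring "Startup probe failed" with an empty message makes it worth inspecting what the probe actually executes. A sketch (assuming the driver container is the first container in the spec, as the "Defaulted container" line above suggests):

kubectl get pod nvidia-driver-daemonset-hqd5x -n gpu-operator-resources \
  -o jsonpath='{.spec.containers[0].startupProbe}'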
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.
Hi @452256, can you try with the latest version of the gpu-operator and see if you are still hitting this issue? There have been a lot of changes since v22.9.1, which is more than three years old now; we are currently at v25.10.0, so the context may be outdated.
Given that gpu-operator v22.9.1 is EOL now, I would encourage you to try the latest version and see if you still hit this issue. If this issue is still relevant with the latest version of the NVIDIA GPU Operator, please feel free to reopen it or open a new one with updated details.
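For what it's worth, a minimal reinstall with the current chart might look like this (the release name below is a placeholder; check the available chart versions first):

helm repo update
helm search repo nvidia/gpu-operator --versions | head
helm uninstall <old-release-name> -n gpu-operator-resources
helm install --wait --generate-name \
  --create-namespace -n gpu-operator-resources \
  nvidia/gpu-operator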
V25.10.0 still has this issue
@liangzai006 please open a new issue with details about what is failing in your install. Please attach must-gather logs when submitting the issue so that someone from the team can triage it.
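If collecting a full must-gather is not possible, even a plain dump of the namespace state helps triage. A minimal sketch using only kubectl:

ns=gpu-operator-resources
kubectl get pods -n "$ns" -o wide > pods.txt
for pod in $(kubectl get pods -n "$ns" -o name); do
  kubectl describe -n "$ns" "$pod" > "${pod##*/}-describe.txt"
  kubectl logs -n "$ns" "$pod" --all-containers --prefix > "${pod##*/}-logs.txt" 2>&1
done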