
GPU-operator installation fails; pods are not in Running state after install

Open 452256 opened this issue 1 year ago • 3 comments

My system is as follows:

```
Static hostname: ws
Icon name: computer-server
Chassis: server
Machine ID: 46899e8e80ed4cac8d5f165ba5bc609e
Boot ID: 5549cc0f6001472db57cda9c8ba159a3
Operating System: Ubuntu 22.04.4 LTS
Kernel: Linux 5.15.0-117-generic
Architecture: x86-64
Hardware Vendor: New H3C Technologies Co., Ltd.
```

I set up a Kubernetes cluster that uses containerd as the container runtime, and the gpu-operator deployment is failing.

1. First I installed it with the following commands:

```
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --version 22.9.1 --create-namespace --namespace gpu-operator-resources --devel nvidia/gpu-operator --wait --generate-name
```

2. After the install I checked the pods:

```
kubectl get pods --all-namespaces | grep -v kube-system
kubectl get pods --all-namespaces
NAMESPACE                NAME                                                              READY   STATUS                  RESTARTS         AGE
gpu-operator-resources   gpu-feature-discovery-mxj26                                       0/1     Init:CrashLoopBackOff   12 (45s ago)     37m
gpu-operator-resources   gpu-operator-1724229368-node-feature-discovery-master-d6fbhh9nt   1/1     Running                 0                37m
gpu-operator-resources   gpu-operator-1724229368-node-feature-discovery-worker-v4pnf       0/1     CrashLoopBackOff        10 (15s ago)     37m
gpu-operator-resources   gpu-operator-7bd648f56b-tqb9l                                     1/1     Running                 0                37m
gpu-operator-resources   nvidia-container-toolkit-daemonset-f57c5                          0/1     Init:0/1                0                37m
gpu-operator-resources   nvidia-dcgm-exporter-srw42                                        0/1     Init:CrashLoopBackOff   12 (63s ago)     37m
gpu-operator-resources   nvidia-device-plugin-daemonset-2psnf                              0/1     Init:CrashLoopBackOff   12 (52s ago)     37m
gpu-operator-resources   nvidia-driver-daemonset-mbxnh                                     0/1     Running                 10 (5m12s ago)   37m
gpu-operator-resources   nvidia-operator-validator-q7q8r                                   0/1     Init:CrashLoopBackOff   12 (35s ago)     37m
```

3. I went through the describe output and logs of most of these pods; almost all of the errors are driver-related, so I suspect nvidia-driver-daemonset-mbxnh is what is breaking the other pods. Its describe output and logs are below.

describe:

```
Events:
  Type     Reason     Age                  From               Message
  Normal   Scheduled  40m                  default-scheduler  Successfully assigned gpu-operator-resources/nvidia-driver-daemonset-mbxnh to ws
  Normal   Pulled     40m                  kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.5.1" already present on machine
  Normal   Created    40m                  kubelet            Created container k8s-driver-manager
  Normal   Started    40m                  kubelet            Started container k8s-driver-manager
  Normal   Pulled     39m (x2 over 40m)    kubelet            Container image "nvcr.io/nvidia/driver:525.60.13-ubuntu22.04" already present on machine
  Normal   Created    39m (x2 over 40m)    kubelet            Created container nvidia-driver-ctr
  Normal   Started    39m (x2 over 40m)    kubelet            Started container nvidia-driver-ctr
  Warning  Unhealthy  37m                  kubelet            Startup probe errored: rpc error: code = NotFound desc = failed to exec in container: failed to load task: no running task found: task a814593a4fafebc899239c3ac84ea4a78abb76080f87988ad1af4d3380601391 not found: not found
  Warning  Unhealthy  10m (x59 over 40m)   kubelet            Startup probe failed:
  Warning  BackOff    50s (x108 over 37m)  kubelet            Back-off restarting failed container
```

logs:

```
kubectl logs nvidia-driver-daemonset-mbxnh -n gpu-operator-resources
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-525.60.13
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.60.13...
```

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.

WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 525.60.13 for Linux kernel version 5.15.0-117-generic

```
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jammy/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jammy-updates/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jammy-security/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
```

The log says it cannot reach that site, but I can ping it from the host:

```
ping archive.ubuntu.com
PING archive.ubuntu.com (91.189.91.82) 56(84) bytes of data.
64 bytes from ubuntu-mirror-2.ps6.canonical.com (91.189.91.82): icmp_seq=1 ttl=41 time=415 ms
64 bytes from ubuntu-mirror-2.ps6.canonical.com (91.189.91.82): icmp_seq=3 ttl=41 time=419 ms
64 bytes from ubuntu-mirror-2.ps6.canonical.com (91.189.91.82): icmp_seq=4 ttl=41 time=412 ms
64 bytes from ubuntu-mirror-2.ps6.canonical.com (91.189.91.82): icmp_seq=5 ttl=41 time=411 ms
64 bytes from ubuntu-mirror-2.ps6.canonical.com (91.189.91.82): icmp_seq=6 ttl=41 time=413 ms
```

I have also reinstalled several times, and sometimes the log does not contain this unreachable-mirror error.

4. Later I checked the daemon configuration and it is also correct. I had previously pinned the system kernel version, and I don't know whether that affects installing the driver inside Kubernetes.

5. I want to know how to get all the pods into the Running state.
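Since apt inside the driver container fails on name resolution while ping works from the host, a minimal way to check whether pod-level DNS is the problem might look like this (the busybox image and the dns-test pod name are arbitrary choices for the test):

```shell
# Run a throwaway pod and try to resolve the mirror that apt could not reach.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup archive.ubuntu.com

# Compare with the resolver configuration the failing driver container sees.
kubectl exec -n gpu-operator-resources nvidia-driver-daemonset-mbxnh \
  -c nvidia-driver-ctr -- cat /etc/resolv.conf
```

If the first command fails while the host resolves the name fine, the issue is more likely CoreDNS or the cluster DNS configuration than the mirror itself.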

452256 avatar Aug 21 '24 09:08 452256

My helm version is `version.BuildInfo{Version:"v3.10.2", GitCommit:"50f003e5ee8704ec937a756c646870227d7c8b58", GitTreeState:"clean", GoVersion:"go1.18.8"}` and I am using gpu-operator 22.9.1.
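For reference, a quick way to confirm the Helm client version and which chart version the release actually deployed (assuming it was installed into the gpu-operator-resources namespace as above):

```shell
# Show the Helm client version and the chart/app versions of the installed release.
helm version
helm list -n gpu-operator-resources
```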

452256 avatar Aug 21 '24 09:08 452256

```
kubectl logs nvidia-driver-daemonset-hqd5x -n gpu-operator-resources
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-525.60.13
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.60.13...
```

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.

WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 525.60.13 for Linux kernel version 5.15.0-117-generic

```
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
```

452256 avatar Aug 21 '24 10:08 452256

```
kubectl describe pod nvidia-driver-daemonset-hqd5x -n gpu-operator-resources
Name:                 nvidia-driver-daemonset-hqd5x
Namespace:            gpu-operator-resources
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-driver
Node:                 ws/10.227.8.62
Start Time:           Wed, 21 Aug 2024 17:47:51 +0800
Labels:               app=nvidia-driver-daemonset
                      controller-revision-hash=79b496c599
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: d9b48d363449afdd8b5d9c0ddfff2e5348b04d501ec51b35d27c6882176d05fd
                      cni.projectcalico.org/podIP: 192.168.33.82/32
                      cni.projectcalico.org/podIPs: 192.168.33.82/32
Status:               Running
IP:                   192.168.33.82
IPs:
  IP:           192.168.33.82
Controlled By:  DaemonSet/nvidia-driver-daemonset
Init Containers:
  k8s-driver-manager:
    Container ID:  containerd://4ab5e738b3d4213ed9c5ad661a13827f88f4f5e148356db5cefb283cbfc078cd
    Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.5.1
    Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:6240c5912aabed789c672f3179b4a65e45511d10fa8c41a5de0d91644a792b14
    Port:
    Host Port:
    Command:
      driver-manager
    Args:
      uninstall_driver
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 21 Aug 2024 17:47:52 +0800
      Finished:     Wed, 21 Aug 2024 17:48:00 +0800
    Ready:          True
    Restart Count:  0
    Environment:
      NODE_NAME:                   (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:      void
      ENABLE_GPU_POD_EVICTION:     true
      ENABLE_AUTO_DRAIN:           true
      DRAIN_USE_FORCE:             false
      DRAIN_POD_SELECTOR_LABEL:
      DRAIN_TIMEOUT_SECONDS:       0s
      DRAIN_DELETE_EMPTYDIR_DATA:  false
      OPERATOR_NAMESPACE:          gpu-operator-resources (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /run/nvidia from run-nvidia (rw)
      /sys from host-sys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jw5gh (ro)
Containers:
  nvidia-driver-ctr:
    Container ID:  containerd://c36a64b9876a1d05b1d94715aa17028d33c9cac5133147a51d110acad6868936
    Image:         nvcr.io/nvidia/driver:525.60.13-ubuntu22.04
    Image ID:      nvcr.io/nvidia/driver@sha256:3fd6e1869fe95f922499a005eadf27c8e3931e63d04253cb26cc47276086a0fa
    Port:
    Host Port:
    Command:
      nvidia-driver
    Args:
      init
    State:          Running
      Started:      Wed, 21 Aug 2024 18:05:19 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 21 Aug 2024 18:00:12 +0800
      Finished:     Wed, 21 Aug 2024 18:04:38 +0800
    Ready:          False
    Restart Count:  4
    Startup:        exec [sh -c lsmod | grep nvidia] delay=30s timeout=1s period=10s #success=1 #failure=60
    Environment:
    Mounts:
      /dev/log from dev-log (rw)
      /host-etc/os-release from host-os-release (ro)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /run/nvidia from run-nvidia (rw)
      /run/nvidia-topologyd from run-nvidia-topologyd (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jw5gh (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:
  run-nvidia-topologyd:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia-topologyd
    HostPathType:  DirectoryOrCreate
  mlnx-ofed-usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers/usr/src
    HostPathType:  DirectoryOrCreate
  run-mellanox-drivers:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  host-sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  kube-api-access-jw5gh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.driver=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                 From               Message
  Normal   Scheduled  20m                 default-scheduler  Successfully assigned gpu-operator-resources/nvidia-driver-daemonset-hqd5x to ws
  Normal   Pulled     20m                 kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.5.1" already present on machine
  Normal   Created    20m                 kubelet            Created container k8s-driver-manager
  Normal   Started    20m                 kubelet            Started container k8s-driver-manager
  Normal   Pulled     20m                 kubelet            Container image "nvcr.io/nvidia/driver:525.60.13-ubuntu22.04" already present on machine
  Normal   Created    20m                 kubelet            Created container nvidia-driver-ctr
  Normal   Started    20m                 kubelet            Started container nvidia-driver-ctr
  Warning  Unhealthy  19s (x98 over 19m)  kubelet            Startup probe failed:
```
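The startup probe for nvidia-driver-ctr is `sh -c lsmod | grep nvidia`, so the pod can only become Ready once the NVIDIA kernel modules are actually loaded on the node. A couple of quick host-side checks, as a sketch (treat the /run/nvidia/driver path as an assumption about where the driver container publishes its rootfs in this setup):

```shell
# On the GPU node itself: are the NVIDIA kernel modules loaded yet?
lsmod | grep nvidia

# Has the driver container published its rootfs on the host?
ls /run/nvidia/driver 2>/dev/null || echo "driver rootfs not mounted yet"
```

If the modules never appear, the probe will keep failing and every dependent pod (toolkit, device plugin, validator, DCGM) stays in Init/CrashLoopBackOff, which matches the pod listing above.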

452256 avatar Aug 21 '24 10:08 452256

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] avatar Nov 04 '25 22:11 github-actions[bot]

Hi @452256, can you try with the latest version of gpu-operator and see if you are still hitting this issue? There have been a lot of changes since v22.9.1, which is more than three years old now; we are now at v25.10.0, so the context may be outdated.

Given that gpu-operator v22.9.1 is EOL now, I would encourage you to try the latest version and see if you still hit this issue. If this issue is still relevant with the latest version of the NVIDIA GPU Operator, please feel free to reopen it or open a new one with updated details.
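A minimal sketch of what moving to a current release could look like (the release name and namespace are only examples; adjust them to match the existing install before running anything):

```shell
# Refresh the NVIDIA chart repo and upgrade (or install) a recent gpu-operator release.
helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v25.10.0 --wait
```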

rahulait avatar Nov 13 '25 16:11 rahulait


V25.10.0 still has this issue

liangzai006 avatar Nov 25 '25 07:11 liangzai006

@liangzai006 please open a new issue with details about what is failing in your install. Please attach must-gather logs when submitting the issue so that someone from the team can triage it.
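As a rough idea of the kind of information that helps with triage (the pod name is a placeholder, and the exact location of the must-gather script in the gpu-operator repository is not verified here):

```shell
# Capture the overall operator state plus per-container logs of the failing pod.
kubectl get pods -n gpu-operator -o wide
kubectl describe pod <failing-pod> -n gpu-operator
kubectl logs <failing-pod> -n gpu-operator --all-containers --previous
```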

rahulait avatar Nov 25 '25 15:11 rahulait