Enabling GPU on microk8s: pod/nvidia-driver-daemonset restarts many times with status CrashLoopBackOff
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04.4 LTS
- Kernel Version: 5.15.0-112-generic
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd://1.6.28
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): microk8s
2. Issue or feature description
When enabling GPU support on microk8s, pod/nvidia-driver-daemonset restarts many times and ends up in CrashLoopBackOff.
3. Steps to reproduce the issue
I followed the instructions from https://www.nvidia.com/en-us/on-demand/session/gtcspring21-ss33138/ to install microk8s on Ubuntu 22.04. The instructions say not to install the NVIDIA driver on the host. I tried it both with and without the host driver, with the same result.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.5.0/local_installers/cuda-repo-ubuntu2204-12-5-local_12.5.0-555.42.02-1_amd64.deb
dpkg -i cuda-repo-ubuntu2204-12-5-local_12.5.0-555.42.02-1_amd64.deb
cp /var/cuda-repo-ubuntu2204-12-5-local/cuda-*-keyring.gpg /usr/share/keyrings/
apt-get update
apt install nvidia-fabricmanager-555
snap install microk8s --classic --channel=1.30/stable
microk8s enable gpu
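Before enabling the addon, it may help to capture the host's kernel and driver state, since the operator's driver container compiles its kernel module against exactly this kernel. A minimal check sketch using plain Ubuntu tooling (nothing operator-specific is assumed):

```bash
# Kernel the driver container will build against
uname -r

# Which NVIDIA driver packages (if any) are installed on the host
dpkg -l | grep -i '^ii.*nvidia' || echo "no NVIDIA packages installed"

# Whether an NVIDIA or nouveau kernel module is currently loaded
lsmod | grep -E 'nvidia|nouveau' || echo "no nvidia/nouveau module loaded"

# Matching kernel headers must be present for the in-container build
dpkg -l "linux-headers-$(uname -r)" 2>/dev/null | grep '^ii' \
  || echo "kernel headers for $(uname -r) not installed"
```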
4. Information to attach (optional if deemed irrelevant)
- [ ] kubernetes pods status:
microk8s kubectl get pods -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-7c8n9 0/1 Init:0/1 0 28m
gpu-operator-56b6cf869d-fj2jf 1/1 Running 0 29m
gpu-operator-node-feature-discovery-gc-5fcdc8894b-4688x 1/1 Running 0 29m
gpu-operator-node-feature-discovery-master-7d84b856d7-829tk 1/1 Running 0 29m
gpu-operator-node-feature-discovery-worker-k2szk 1/1 Running 0 29m
nvidia-container-toolkit-daemonset-blswt 0/1 Init:0/1 0 28m
nvidia-dcgm-exporter-5m454 0/1 Init:0/1 0 28m
nvidia-device-plugin-daemonset-7xcjh 0/1 Init:0/1 0 28m
nvidia-driver-daemonset-99xrx 0/1 Running 6 (6m48s ago) 28m
nvidia-operator-validator-msxd8 0/1 Init:0/4 0 28m
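For context on the Init:0/x pods above: their init containers wait for the driver daemonset, whose startup probe (see the describe output below) creates a marker file under /run/nvidia/validations on the host. A quick check on the node, assuming the default /run/nvidia hostPath mount shown later:

```bash
# The .driver-ctr-ready marker only appears once nvidia-smi succeeds inside
# the driver container, so an empty directory here is consistent with the
# dependent pods staying in Init.
ls -la /run/nvidia/validations/ 2>/dev/null || echo "no validations directory yet"
```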
- [ ] kubernetes daemonset status:
microk8s kubectl get ds -n gpu-operator-resources
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 31m
gpu-operator-node-feature-discovery-worker 1 1 1 1 1 <none> 31m
nvidia-container-toolkit-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.container-toolkit=true 31m
nvidia-dcgm-exporter 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm-exporter=true 31m
nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 31m
nvidia-driver-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.driver=true 31m
nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 31m
nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 31m
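Each of these daemonsets is gated by the nvidia.com/gpu.deploy.* node label shown in the NODE SELECTOR column, which the driver-manager init container toggles (see its log further down). To list what is currently set on the node, something like this should work (node name taken from the pod description):

```bash
# Show only the gpu-operator scheduling labels on the GPU node
microk8s kubectl get node microk8s-node01 --show-labels \
  | tr ',' '\n' | grep 'nvidia.com/gpu.deploy'
```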
- [ ] If a pod/ds is in an error state or pending state
microk8s kubectl describe pod -n gpu-operator-resources nvidia-driver-daemonset
Name: nvidia-driver-daemonset-99xrx
Namespace: gpu-operator-resources
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-driver
Node: microk8s-node01/10.64.43.201
Start Time: Wed, 12 Jun 2024 03:29:39 +0000
Labels: app=nvidia-driver-daemonset
app.kubernetes.io/component=nvidia-driver
app.kubernetes.io/managed-by=gpu-operator
controller-revision-hash=7974d7cccc
helm.sh/chart=gpu-operator-v23.9.1
nvidia.com/precompiled=false
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: ed6142941d825196ff2e68a23c23f4daab594989742c6315b958541ffbb9a04a
cni.projectcalico.org/podIP: 10.1.47.71/32
cni.projectcalico.org/podIPs: 10.1.47.71/32
kubectl.kubernetes.io/default-container: nvidia-driver-ctr
Status: Running
IP: 10.1.47.71
IPs:
IP: 10.1.47.71
Controlled By: DaemonSet/nvidia-driver-daemonset
Init Containers:
k8s-driver-manager:
Container ID: containerd://158252567d0e2716f25528c4adb609600944f79879ae42bb3e898fb63aeaba79
Image: nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5
Image ID: nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:27c44f4720a4abf780217bd5e7903e4a008ebdbcf71238c4f106a0c22654776c
Port: <none>
Host Port: <none>
Command:
driver-manager
Args:
uninstall_driver
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 12 Jun 2024 03:29:57 +0000
Finished: Wed, 12 Jun 2024 03:29:59 +0000
Ready: True
Restart Count: 0
Environment:
NODE_NAME: (v1:spec.nodeName)
NVIDIA_VISIBLE_DEVICES: void
ENABLE_GPU_POD_EVICTION: true
ENABLE_AUTO_DRAIN: false
DRAIN_USE_FORCE: false
DRAIN_POD_SELECTOR_LABEL:
DRAIN_TIMEOUT_SECONDS: 0s
DRAIN_DELETE_EMPTYDIR_DATA: false
OPERATOR_NAMESPACE: gpu-operator-resources (v1:metadata.namespace)
Mounts:
/host from host-root (ro)
/run/nvidia from run-nvidia (rw)
/sys from host-sys (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4rjhs (ro)
Containers:
nvidia-driver-ctr:
Container ID: containerd://f84013acb0c49a4a7a45b691c49d42637f761e6342f1e821e37de9d447c60d0b
Image: nvcr.io/nvidia/driver:535.129.03-ubuntu22.04
Image ID: nvcr.io/nvidia/driver@sha256:3981d34191e355a8c96a926f4b00254dba41f89def7ed2c853e681a72e3f14eb
Port: <none>
Host Port: <none>
Command:
nvidia-driver
Args:
init
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Wed, 12 Jun 2024 03:54:34 +0000
Finished: Wed, 12 Jun 2024 03:59:04 +0000
Ready: False
Restart Count: 6
Startup: exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
Environment: <none>
Mounts:
/dev/log from dev-log (rw)
/host-etc/os-release from host-os-release (ro)
/lib/firmware from nv-firmware (rw)
/run/mellanox/drivers from run-mellanox-drivers (rw)
/run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
/run/nvidia from run-nvidia (rw)
/run/nvidia-topologyd from run-nvidia-topologyd (rw)
/sys/devices/system/memory/auto_online_blocks from sysfs-memory-online (rw)
/sys/module/firmware_class/parameters/path from firmware-search-path (rw)
/var/log from var-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4rjhs (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: DirectoryOrCreate
var-log:
Type: HostPath (bare host directory volume)
Path: /var/log
HostPathType:
dev-log:
Type: HostPath (bare host directory volume)
Path: /dev/log
HostPathType:
host-os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
run-nvidia-topologyd:
Type: HostPath (bare host directory volume)
Path: /run/nvidia-topologyd
HostPathType: DirectoryOrCreate
mlnx-ofed-usr-src:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers/usr/src
HostPathType: DirectoryOrCreate
run-mellanox-drivers:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-sys:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
firmware-search-path:
Type: HostPath (bare host directory volume)
Path: /sys/module/firmware_class/parameters/path
HostPathType:
sysfs-memory-online:
Type: HostPath (bare host directory volume)
Path: /sys/devices/system/memory/auto_online_blocks
HostPathType:
nv-firmware:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver/lib/firmware
HostPathType: DirectoryOrCreate
kube-api-access-4rjhs:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.driver=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 3m11s (x34 over 27m) kubelet Back-off restarting failed container nvidia-driver-ctr in pod nvidia-driver-daemonset-99xrx_gpu-operator-resources(5c2ab5b3-bcae-466e-ba84-46acbc55cc41)
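Because the container is in CrashLoopBackOff, it can also be useful to pull the logs of the previous (crashed) attempt rather than the current one, e.g.:

```bash
# Logs from the last terminated run of the driver container
microk8s kubectl logs -n gpu-operator-resources nvidia-driver-daemonset-99xrx \
  -c nvidia-driver-ctr --previous
```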
- [ ] If a pod/ds is in an error state or pending state
microk8s kubectl logs -n gpu-operator-resources nvidia-driver-daemonset-99xrx --all-containers
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Current value of AUTO_UPGRADE_POLICY_ENABLED=true'
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/microk8s-node01 labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-jpgj8 condition met
Waiting for the container-toolkit to shutdown
pod/nvidia-container-toolkit-daemonset-bkdm6 condition met
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Auto eviction of GPU pods on node microk8s-node01 is disabled by the upgrade policy
unbinding device 0000:01:00.0
Auto eviction of GPU pods on node microk8s-node01 is disabled by the upgrade policy
Auto drain of the node microk8s-node01 is disabled by the upgrade policy
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/microk8s-node01 labeled
Unloading nouveau driver...
Successfully unloaded nouveau driver
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-535.129.03
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.129.03........................
WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 535.129.03 for Linux kernel version 5.15.0-112-generic
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.15.0-112-generic
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
warning: the compiler differs from the one used to build the kernel
The kernel was built by: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
You are using: cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_perf_events_test.c: In function 'test_events':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_perf_events_test.c:83:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=]
83 | }
| ^
/usr/src/nvidia-535.129.03/kernel/nvidia-drm/nvidia-drm-crtc.c: In function '__nv_drm_plane_atomic_destroy_state':
/usr/src/nvidia-535.129.03/kernel/nvidia-drm/nvidia-drm-crtc.c:695:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
695 | struct nv_drm_plane_state *nv_drm_plane_state =
| ^~~~~~
/usr/src/nvidia-535.129.03/kernel/nvidia-peermem/nvidia-peermem.c: In function 'nv_mem_client_init':
/usr/src/nvidia-535.129.03/kernel/nvidia-peermem/nvidia-peermem.c:490:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
490 | int status = 0;
| ^~~
ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
make[2]: *** [scripts/Makefile.modpost:133: /usr/src/nvidia-535.129.03/kernel/Module.symvers] Error 1
make[2]: *** Deleting file '/usr/src/nvidia-535.129.03/kernel/Module.symvers'
make[1]: *** [Makefile:1830: modules] Error 2
make: *** [Makefile:82: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
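The actual failure is the modpost error above: the proprietary nvidia.ko references rcu_read_unlock_strict, which this Ubuntu 5.15 kernel exports as GPL-only. If the matching linux-headers package is installed on the host, this can be confirmed by looking the symbol up in the kernel's Module.symvers (path assumes Ubuntu's headers layout):

```bash
# A line ending in EXPORT_SYMBOL_GPL means a non-GPL module such as
# nvidia.ko cannot link against this symbol, matching the build error above.
grep rcu_read_unlock_strict "/usr/src/linux-headers-$(uname -r)/Module.symvers"
```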
- [ ] Output from running nvidia-smi from the driver container:
microk8s kubectl exec nvidia-driver-daemonset-99xrx -n gpu-operator-resources -c nvidia-driver-ctr -- nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
command terminated with exit code 9
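That nvidia-smi failure is expected here, since the kernel module build never succeeded. From the host it can be cross-checked with:

```bash
# No nvidia module should be loaded after the failed build
lsmod | grep nvidia || echo "nvidia kernel module not loaded"

# Any kernel messages from earlier NVIDIA module load attempts
sudo dmesg | grep -i nvidia | tail -n 20
```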
- [ ] containerd logs
journalctl -u containerd > containerd.log
-- No entries --
nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-uvm/uvm_perf_events_test.c: In function 'test_events':
nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-uvm/uvm_perf_events_test.c:83:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=]
nvidia-driver-ctr 83 | }
nvidia-driver-ctr | ^
nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-drm/nvidia-drm-crtc.c: In function '__nv_drm_plane_atomic_destroy_state':
nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-drm/nvidia-drm-crtc.c:695:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
nvidia-driver-ctr 695 | struct nv_drm_plane_state *nv_drm_plane_state =
nvidia-driver-ctr | ^~~~~~
nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-peermem/nvidia-peermem.c: In function 'nv_mem_client_init':
nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-peermem/nvidia-peermem.c:462:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
nvidia-driver-ctr 462 | int status = 0;
nvidia-driver-ctr | ^~~
nvidia-driver-ctr ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
nvidia-driver-ctr make[2]: *** [scripts/Makefile.modpost:133: /usr/src/nvidia-535.104.12/kernel/Module.symvers] Error 1
nvidia-driver-ctr make[2]: *** Deleting file '/usr/src/nvidia-535.104.12/kernel/Module.symvers'
nvidia-driver-ctr make[1]: *** [Makefile:1830: modules] Error 2
nvidia-driver-ctr make: *** [Makefile:82: modules] Error 2
nvidia-driver-ctr Stopping NVIDIA persistence daemon...
nvidia-driver-ctr Unloading NVIDIA driver kernel modules...
nvidia-driver-ctr Unmounting NVIDIA driver rootfs...
I also encountered the same problem.
I found that a different NVIDIA driver (the 555 series) was installed on the host. I reinstalled nvidia-535 and it worked. Hope this helps.
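For anyone hitting the same mismatch, a minimal sketch of that realignment on Ubuntu 22.04 (aggressive: it purges all NVIDIA packages first; package names are assumptions based on the standard Ubuntu/NVIDIA repositories, so check dpkg -l on your own host before running anything):

```bash
# See which NVIDIA driver packages are installed on the host
dpkg -l | grep -i nvidia

# Remove the existing (555-series) NVIDIA packages
sudo apt-get remove --purge -y '^nvidia-.*' '^libnvidia-.*'

# Install the 535-series host driver to match the operator's driver image,
# then reboot so the matching kernel module is loaded
sudo apt-get install -y nvidia-driver-535
sudo reboot
```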
The error `ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'` is a known issue with newer kernels; it was fixed in driver versions >= 535.183.08. Closing this issue.
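If you manage the GPU operator via Helm rather than relying solely on the microk8s addon defaults, the fixed driver can be requested through the chart's driver.version value. A hedged sketch, assuming a >= 535.183.08 tag exists for ubuntu22.04 in nvcr.io/nvidia/driver, that Helm (microk8s' bundled helm or plain helm pointed at the microk8s kubeconfig) is available, and noting that `microk8s enable gpu` already installs a release, so this may need to be an upgrade of that existing release:

```bash
# Add the NVIDIA Helm repository (no-op if already added)
microk8s helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
microk8s helm repo update

# Point the operator at a driver image containing the fix; the exact tag
# must exist in nvcr.io/nvidia/driver for ubuntu22.04
microk8s helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator-resources \
  --set driver.version=535.183.08
```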
Question for the author: I'm in a similar situation to yours, except that in my case quite a few pods are stuck in Init, and the container responsible for installing the driver fails to get it installed.