Enabling GPU on microk8s: pod/nvidia-driver-daemonset restarts many times with status CrashLoopBackOff

Open haiph-dev opened this issue 1 year ago • 2 comments

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04.4 LTS
  • Kernel Version: 5.15.0-112-generic
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd://1.6.28
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): microk8s

2. Issue or feature description

Enabling GPU on microk8s: pod/nvidia-driver-daemonset restarts many times with status CrashLoopBackOff.

3. Steps to reproduce the issue

I followed the instructions from https://www.nvidia.com/en-us/on-demand/session/gtcspring21-ss33138/ to install microk8s on Ubuntu 22.04. The instructions say not to install the NVIDIA driver on the host; I tried it both with and without the host driver, with the same result.

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.5.0/local_installers/cuda-repo-ubuntu2204-12-5-local_12.5.0-555.42.02-1_amd64.deb
dpkg -i cuda-repo-ubuntu2204-12-5-local_12.5.0-555.42.02-1_amd64.deb
cp /var/cuda-repo-ubuntu2204-12-5-local/cuda-*-keyring.gpg /usr/share/keyrings/
apt-get update
apt install nvidia-fabricmanager-555
snap install microk8s --classic --channel=1.30/stable
microk8s enable gpu
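
Before enabling the addon it can also help to record the host's driver state, since the driver container compiles against the running kernel. A minimal check with standard Ubuntu/NVIDIA tooling (a sketch added for context, not part of the original report):

uname -r                          # kernel version the driver container will build against
dpkg -l | grep -i nvidia          # NVIDIA packages installed on the host
lsmod | grep -E 'nvidia|nouveau'  # kernel modules currently loaded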

4. Information to attach (optional if deemed irrelevant)

  • [ ] kubernetes pods status: microk8s kubectl get pods -n gpu-operator-resources
 NAME                                                          READY   STATUS     RESTARTS        AGE
gpu-feature-discovery-7c8n9                                   0/1     Init:0/1   0               28m
gpu-operator-56b6cf869d-fj2jf                                 1/1     Running    0               29m
gpu-operator-node-feature-discovery-gc-5fcdc8894b-4688x       1/1     Running    0               29m
gpu-operator-node-feature-discovery-master-7d84b856d7-829tk   1/1     Running    0               29m
gpu-operator-node-feature-discovery-worker-k2szk              1/1     Running    0               29m
nvidia-container-toolkit-daemonset-blswt                      0/1     Init:0/1   0               28m
nvidia-dcgm-exporter-5m454                                    0/1     Init:0/1   0               28m
nvidia-device-plugin-daemonset-7xcjh                          0/1     Init:0/1   0               28m
nvidia-driver-daemonset-99xrx                                 0/1     Running    6 (6m48s ago)   28m
nvidia-operator-validator-msxd8                               0/1     Init:0/4   0               28m
  • [ ] kubernetes daemonset status: microk8s kubectl get ds -n gpu-operator-resources
 NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                        1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   31m
gpu-operator-node-feature-discovery-worker   1         1         1       1            1           <none>                                             31m
nvidia-container-toolkit-daemonset           1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true       31m
nvidia-dcgm-exporter                         1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true           31m
nvidia-device-plugin-daemonset               1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true           31m
nvidia-driver-daemonset                      1         1         0       1            0           nvidia.com/gpu.deploy.driver=true                  31m
nvidia-mig-manager                           0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             31m
nvidia-operator-validator                    1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true      31m
  • [ ] If a pod/ds is in an error state or pending state microk8s kubectl describe pod -n gpu-operator-resources nvidia-driver-daemonset
 Name:                 nvidia-driver-daemonset-99xrx
Namespace:            gpu-operator-resources
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-driver
Node:                 microk8s-node01/10.64.43.201
Start Time:           Wed, 12 Jun 2024 03:29:39 +0000
Labels:               app=nvidia-driver-daemonset
                      app.kubernetes.io/component=nvidia-driver
                      app.kubernetes.io/managed-by=gpu-operator
                      controller-revision-hash=7974d7cccc
                      helm.sh/chart=gpu-operator-v23.9.1
                      nvidia.com/precompiled=false
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: ed6142941d825196ff2e68a23c23f4daab594989742c6315b958541ffbb9a04a
                      cni.projectcalico.org/podIP: 10.1.47.71/32
                      cni.projectcalico.org/podIPs: 10.1.47.71/32
                      kubectl.kubernetes.io/default-container: nvidia-driver-ctr
Status:               Running
IP:                   10.1.47.71
IPs:
  IP:           10.1.47.71
Controlled By:  DaemonSet/nvidia-driver-daemonset
Init Containers:
  k8s-driver-manager:
    Container ID:  containerd://158252567d0e2716f25528c4adb609600944f79879ae42bb3e898fb63aeaba79
    Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5
    Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:27c44f4720a4abf780217bd5e7903e4a008ebdbcf71238c4f106a0c22654776c
    Port:          <none>
    Host Port:     <none>
    Command:
      driver-manager
    Args:
      uninstall_driver
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 12 Jun 2024 03:29:57 +0000
      Finished:     Wed, 12 Jun 2024 03:29:59 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      NODE_NAME:                    (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:      void
      ENABLE_GPU_POD_EVICTION:     true
      ENABLE_AUTO_DRAIN:           false
      DRAIN_USE_FORCE:             false
      DRAIN_POD_SELECTOR_LABEL:    
      DRAIN_TIMEOUT_SECONDS:       0s
      DRAIN_DELETE_EMPTYDIR_DATA:  false
      OPERATOR_NAMESPACE:          gpu-operator-resources (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /run/nvidia from run-nvidia (rw)
      /sys from host-sys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4rjhs (ro)
Containers:
  nvidia-driver-ctr:
    Container ID:  containerd://f84013acb0c49a4a7a45b691c49d42637f761e6342f1e821e37de9d447c60d0b
    Image:         nvcr.io/nvidia/driver:535.129.03-ubuntu22.04
    Image ID:      nvcr.io/nvidia/driver@sha256:3981d34191e355a8c96a926f4b00254dba41f89def7ed2c853e681a72e3f14eb
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-driver
    Args:
      init
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Wed, 12 Jun 2024 03:54:34 +0000
      Finished:     Wed, 12 Jun 2024 03:59:04 +0000
    Ready:          False
    Restart Count:  6
    Startup:        exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
    Environment:    <none>
    Mounts:
      /dev/log from dev-log (rw)
      /host-etc/os-release from host-os-release (ro)
      /lib/firmware from nv-firmware (rw)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /run/nvidia from run-nvidia (rw)
      /run/nvidia-topologyd from run-nvidia-topologyd (rw)
      /sys/devices/system/memory/auto_online_blocks from sysfs-memory-online (rw)
      /sys/module/firmware_class/parameters/path from firmware-search-path (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4rjhs (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:  
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  run-nvidia-topologyd:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia-topologyd
    HostPathType:  DirectoryOrCreate
  mlnx-ofed-usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers/usr/src
    HostPathType:  DirectoryOrCreate
  run-mellanox-drivers:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  host-sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  firmware-search-path:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/module/firmware_class/parameters/path
    HostPathType:  
  sysfs-memory-online:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/devices/system/memory/auto_online_blocks
    HostPathType:  
  nv-firmware:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver/lib/firmware
    HostPathType:  DirectoryOrCreate
  kube-api-access-4rjhs:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.driver=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Warning  BackOff  3m11s (x34 over 27m)  kubelet  Back-off restarting failed container nvidia-driver-ctr in pod nvidia-driver-daemonset-99xrx_gpu-operator-resources(5c2ab5b3-bcae-466e-ba84-46acbc55cc41)
  • [ ] If a pod/ds is in an error state or pending state microk8s kubectl logs -n gpu-operator-resources nvidia-driver-daemonset-99xrx --all-containers
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Current value of AUTO_UPGRADE_POLICY_ENABLED=true'
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/microk8s-node01 labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-jpgj8 condition met
Waiting for the container-toolkit to shutdown
pod/nvidia-container-toolkit-daemonset-bkdm6 condition met
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Auto eviction of GPU pods on node microk8s-node01 is disabled by the upgrade policy
unbinding device 0000:01:00.0
Auto eviction of GPU pods on node microk8s-node01 is disabled by the upgrade policy
Auto drain of the node microk8s-node01 is disabled by the upgrade policy
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/microk8s-node01 labeled
Unloading nouveau driver...
Successfully unloaded nouveau driver
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-535.129.03
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.129.03....................

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.


WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.


WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.


========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 535.129.03 for Linux kernel version 5.15.0-112-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.15.0-112-generic
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  You are using:           cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_perf_events_test.c: In function 'test_events':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_perf_events_test.c:83:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=]
   83 | }
      | ^
/usr/src/nvidia-535.129.03/kernel/nvidia-drm/nvidia-drm-crtc.c: In function '__nv_drm_plane_atomic_destroy_state':
/usr/src/nvidia-535.129.03/kernel/nvidia-drm/nvidia-drm-crtc.c:695:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  695 |     struct nv_drm_plane_state *nv_drm_plane_state =
      |     ^~~~~~
/usr/src/nvidia-535.129.03/kernel/nvidia-peermem/nvidia-peermem.c: In function 'nv_mem_client_init':
/usr/src/nvidia-535.129.03/kernel/nvidia-peermem/nvidia-peermem.c:490:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  490 |     int status = 0;
      |     ^~~
ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
make[2]: *** [scripts/Makefile.modpost:133: /usr/src/nvidia-535.129.03/kernel/Module.symvers] Error 1
make[2]: *** Deleting file '/usr/src/nvidia-535.129.03/kernel/Module.symvers'
make[1]: *** [Makefile:1830: modules] Error 2
make: *** [Makefile:82: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
  • [ ] Output from running nvidia-smi from the driver container: microk8s kubectl exec nvidia-driver-daemonset-99xrx -n gpu-operator-resources -c nvidia-driver-ctr -- nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

command terminated with exit code 9
  • [ ] containerd logs journalctl -u containerd > containerd.log: -- No entries --

— haiph-dev, Jun 12 '24

nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-uvm/uvm_perf_events_test.c: In function 'test_events':
nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-uvm/uvm_perf_events_test.c:83:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=]
nvidia-driver-ctr    83 | }
nvidia-driver-ctr       | ^
nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-drm/nvidia-drm-crtc.c: In function '__nv_drm_plane_atomic_destroy_state':
nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-drm/nvidia-drm-crtc.c:695:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
nvidia-driver-ctr   695 |     struct nv_drm_plane_state *nv_drm_plane_state =
nvidia-driver-ctr       |     ^~~~~~
nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-peermem/nvidia-peermem.c: In function 'nv_mem_client_init':
nvidia-driver-ctr /usr/src/nvidia-535.104.12/kernel/nvidia-peermem/nvidia-peermem.c:462:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
nvidia-driver-ctr   462 |     int status = 0;
nvidia-driver-ctr       |     ^~~
nvidia-driver-ctr ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
nvidia-driver-ctr make[2]: *** [scripts/Makefile.modpost:133: /usr/src/nvidia-535.104.12/kernel/Module.symvers] Error 1
nvidia-driver-ctr make[2]: *** Deleting file '/usr/src/nvidia-535.104.12/kernel/Module.symvers'
nvidia-driver-ctr make[1]: *** [Makefile:1830: modules] Error 2
nvidia-driver-ctr make: *** [Makefile:82: modules] Error 2
nvidia-driver-ctr Stopping NVIDIA persistence daemon...
nvidia-driver-ctr Unloading NVIDIA driver kernel modules...
nvidia-driver-ctr Unmounting NVIDIA driver rootfs...

I also encountered the same problem.

— chaunceyjiang, Jun 18 '24

I found that a different NVIDIA driver (the 555 series) was installed on the host. I reinstalled the 535 driver and it worked. Hope this helps.

— haiph-dev, Jun 18 '24
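
A quick way to confirm this kind of host/operator driver mismatch (a minimal sketch; the app=nvidia-driver-daemonset label is taken from the pod labels shown above):

cat /proc/driver/nvidia/version                              # kernel module version loaded on the host, if any
nvidia-smi --query-gpu=driver_version --format=csv,noheader  # userspace driver version on the host
microk8s kubectl get pods -n gpu-operator-resources -l app=nvidia-driver-daemonset \
  -o jsonpath='{.items[0].spec.containers[0].image}'         # driver image the operator is deploying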

The ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict' error is a known issue with newer kernels. It was fixed in driver versions >= 535.183.08. Closing this issue.

— cdesiniotis, Jul 11 '24
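
For anyone hitting this before they can upgrade, a hedged sketch of how to confirm the symbol clash and point the operator at a fixed driver build. It assumes the matching linux-headers package is installed on the host, that the addon's Helm release is named gpu-operator in the gpu-operator-resources namespace (as the pod names above suggest), that the NVIDIA Helm repo is available under the nvidia alias, and that a 535.183.08 (or newer) tag for your distro exists in nvcr.io/nvidia/driver:

# EXPORT_SYMBOL_GPL in the output means non-GPL modules such as nvidia.ko cannot link against this symbol
grep -w rcu_read_unlock_strict /lib/modules/$(uname -r)/build/Module.symvers

# Redeploy with a driver version that contains the fix
microk8s helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator-resources --reuse-values \
  --set driver.version=535.183.08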

A question for the author: I'm in a similar situation, except that in my case quite a few pods are stuck in Init, and the container responsible for installing the driver cannot get it installed.

— 452256, Aug 26 '24