
Driver crash loop on AlmaLinux 9.3 with Helm

Open · manuelsardi opened this issue 1 year ago · 0 comments

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): AlmaLinux 9.3
  • Kernel Version: 5.14.0-362.18.1.el9_3.x86_64
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): CRI-O v1.24.0
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8s v1.24.6
  • GPU Operator Version: v23.9.1

2. Issue or feature description

We have been trying to install the operator following the steps for RHEL, without success, using the ubi8-based image for the Container Toolkit and the rhel8.9 image for the driver. NFD is preinstalled on the cluster with a whitelist of labels that includes the ones the operator requires: kernel_config, kernel_version, pci-10de and system-os. The driver is not preinstalled on the node; we want the operator to provide it.

The behaviour we are seeing is fairly erratic. The gpu-operator pod is created and runs fine, but the remaining pods (gpu-feature-discovery, nvidia-container-toolkit, nvidia-dcgm-exporter, nvidia-device-plugin, nvidia-driver and nvidia-operator-validator) are stuck in the Init state. Of those, nvidia-container-toolkit and nvidia-driver cycle through Init, Pending and Terminating roughly every 5 seconds.
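To rule out the NFD whitelist as the cause, the labels can be checked directly on the node (node name gpu-node, as it appears in the driver logs below; the grep pattern is only a convenience):

kubectl get node gpu-node --show-labels | tr ',' '\n' | grep -E 'nvidia.com|feature.node.kubernetes.io'

The operator's nvidia.com/gpu.deploy.* labels and NFD's feature.node.kubernetes.io/* labels should both be visible there.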

The constant cycling makes it very difficult to debug what is going on. As I understand it, all the other pods depend on nvidia-driver successfully installing the driver and wait for it to finish. From nvidia-driver I have only managed to capture these logs:

========== NVIDIA Software Installer ==========

Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Current value of AUTO_UPGRADE_POLICY_ENABLED=true'
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/gpu-node labeled
Waiting for the operator-validator to shutdown
Waiting for the container-toolkit to shutdown
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Auto eviction of GPU pods on node gpu-node is disabled by the upgrade policy
unbinding device 0000:01:00.0
Auto eviction of GPU pods on node gpu-node is disabled by the upgrade policy
Auto drain of the node gpu-node is disabled by the upgrade policy
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/gpu-node labeled

[The same starts again]
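Since the pod is recreated every few seconds, the only practical way to capture the output above is to keep re-attaching to whichever driver pod currently exists. A rough sketch, assuming the default daemonset label app=nvidia-driver-daemonset (the container name nvidia-driver-ctr is the one referenced in section 4):

while true; do
  kubectl logs -n nvidia-gpu-operator -l app=nvidia-driver-daemonset \
      -c nvidia-driver-ctr -f --tail=-1 2>/dev/null
  sleep 1
done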

The operator is installed via Helm with the following options:

helm install --wait --generate-name \
     -n nvidia-gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set toolkit-version=1.14.3-ubi8  --set nfd.enabled=false
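For reference, the same options can be expressed through a values file. The key layout below (toolkit.version nested under toolkit, rather than a flat toolkit-version) follows my reading of the chart's documented values, so treat it as an assumption to verify against the chart version in use:

# values.yaml (sketch)
nfd:
  enabled: false
toolkit:
  version: 1.14.3-ubi8   # verify the exact image tag; some releases are prefixed with "v"

helm install --wait --generate-name \
     -n nvidia-gpu-operator --create-namespace \
     nvidia/gpu-operator -f values.yaml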

For what it's worth, the GPU is recognised by the OS:

lspci -nn | grep VGA
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)

I have tried removing the nouveau driver, since I read it can interfere with the proprietary driver installation, but got the same result. Any help or ideas would be welcome. My next step is to try building my own driver container from ubi9.
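For completeness, removing nouveau on a RHEL-family host boils down to the usual blacklist plus initramfs rebuild; a sketch, to be run as root on the GPU node and checked against the distro's documentation:

lsmod | grep -E 'nouveau|nvidia'        # see what is currently loaded

cat <<'EOF' > /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
dracut --force                          # rebuild the initramfs
reboot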

3. Steps to reproduce the issue

Add a node with AlmaLinux 9.3 and an NVIDIA GeForce GTX 1080, install NFD, and then install the operator via Helm as shown above (an NFD install sketch follows).
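If NFD is not already present, a minimal install along the lines of the upstream chart should be enough (repo URL and chart name as documented by node-feature-discovery; the label whitelist we use is cluster-specific and not shown here):

helm repo add nfd https://kubernetes-sigs.github.io/node-feature-discovery/charts
helm repo update
helm install -n node-feature-discovery --create-namespace --generate-name nfd/node-feature-discovery

Then install the operator with the Helm command from section 2.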

4. Information to attach (optional if deemed irrelevant)

  • [x] kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE (screenshot attached)

  • [x] kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE (screenshot attached)

  • [ ] If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME

  • [ ] If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers

  • [ ] Output from running nvidia-smi from the driver container (kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi): not possible, since the driver pod restarts constantly; see the polling sketch after this list.
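The closest alternative is to poll for a driver pod and run nvidia-smi during the short window before it is terminated again; a sketch, with the daemonset label assumed from the default name:

while true; do
  pod=$(kubectl get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset \
        -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
  if [ -n "$pod" ]; then
    kubectl exec "$pod" -n nvidia-gpu-operator -c nvidia-driver-ctr -- nvidia-smi && break
  fi
  sleep 2
done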

manuelsardi · Mar 08 '24 16:03