Unable to turn on advanced upgrade controller

Open age9990 opened this issue 10 months ago • 0 comments

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu20.04
  • Kernel Version: 5.15.0-69
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): crio
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • GPU Operator Version: v23.9.2 with NvidiaDriver CRD on

2. Issue or feature description

In our cluster, one GPU node has a disk issue, so its status is NotReady. When I turn on the advanced upgrade controller by setting driver.upgradePolicy.autoUpgrade to true, the controller is not enabled and logs the error messages below. I also tried setting nvidia.com/gpu-driver-upgrade.skip=true on the broken GPU node, but the same error occurred. In another k8s cluster where every node is Ready, the advanced upgrade controller works as expected. However, since some nodes may be down temporarily, would it be reasonable to bypass broken nodes rather than fail straight away?

GPU Operator error logs:

{"level":"error","ts":"2024-03-27T06:00:03.292Z","logger":"controllers.Upgrade","msg":"Failed to build node upgrade state for pod","pod":{"namespace":"gpu-operator","name":"nvidia-gpu-driver-ubuntu20.04-797bd4457c-x4czx"},"error":"unable to get node : resource name may not be empty"}
{"level":"error","ts":"2024-03-27T06:00:03.292Z","logger":"controllers.Upgrade","msg":"Failed to build cluster upgrade state","error":"unable to get node : resource name may not be empty"}
{"level":"error","ts":"2024-03-27T06:00:03.292Z","msg":"Reconciler error","controller":"upgrade-controller","object":{"name":"cluster-policy"},"namespace":"","name":"cluster-policy","reconcileID":"474846e5-07f9-445a-9107-a452581f1a69","error":"unable to get node : resource name may not be empty"}
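
To make the request concrete, here is a minimal sketch of the behavior I am asking for, written in Python for brevity. It is not the GPU Operator's code (the operator is written in Go), and `Node`, `DriverPod`, and `build_upgrade_state` are hypothetical names used only for illustration. It assumes the error above comes from a driver pod whose node name is empty, which the message "unable to get node : resource name may not be empty" suggests but does not prove. The idea is to record such pods as skipped instead of aborting the whole reconcile.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Key and value taken from this report; treated here as plain node metadata.
SKIP_KEY = "nvidia.com/gpu-driver-upgrade.skip"


@dataclass
class Node:  # hypothetical stand-in for a cluster node
    name: str
    ready: bool
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class DriverPod:  # hypothetical stand-in for a driver pod
    name: str
    node_name: str  # empty string when the pod is not bound to any node


def build_upgrade_state(
    pods: List[DriverPod], nodes: Dict[str, Node]
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Return (upgradable, skipped) rather than failing on the first bad pod."""
    upgradable: List[Tuple[str, str]] = []
    skipped: List[Tuple[str, str]] = []
    for pod in pods:
        if not pod.node_name:
            # Assumed cause of the error in the logs: nothing to look up.
            skipped.append((pod.name, "pod has no node assigned"))
            continue
        node = nodes.get(pod.node_name)
        if node is None:
            skipped.append((pod.name, "node not found"))
            continue
        if node.labels.get(SKIP_KEY) == "true":
            skipped.append((pod.name, "node is marked to skip upgrades"))
            continue
        if not node.ready:
            skipped.append((pod.name, "node is NotReady"))
            continue
        upgradable.append((pod.name, node.name))
    return upgradable, skipped
```

With something like this, a single NotReady node or unscheduled driver pod would show up in the skipped list with a reason, and the remaining healthy nodes could still be upgraded.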

age9990 avatar Mar 27 '24 10:03 age9990