GPU Operator install fails: driver pod reports "Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms'"
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): RHEL8.6
- Kernel Version: 4.18.0-372.9.1.el8.x86_64
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): cri-o://1.26.4
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s 1.27.1
- GPU Operator Version: 23.9.x
2. Issue or feature description
I am trying to install the GPU Operator using Helm. During the install, the driver pod (nvidia-driver-daemonset-fwcvl) crash-loops with the error below. The install command is sketched first, followed by the driver pod log excerpt; I have omitted the initial part and kept only the error lines.
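For reference, the install followed the standard Helm quick-start flow. The sketch below reflects that flow rather than a verbatim copy of my shell history; the repo URL and flags are taken from the documented quick-start, and the generated release name matches the gpu-operator-1700756391 prefix visible in the pod names further down.

```shell
# Minimal sketch of the install, assuming the NVIDIA Helm repository is reachable
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install into the gpu-operator namespace with a generated release name,
# as in the documented quick-start
helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator
```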
+ '[' '' '!=' builtin ']'
Updating the package cache...
+ echo 'Updating the package cache...'
+ yum -q makecache
Error: Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
FATAL: failed to reach RHEL package repositories. Ensure that the cluster can access the proper networks.
+ echo 'FATAL: failed to reach RHEL package repositories. ' 'Ensure that the cluster can access the proper networks.'
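The excerpt above is the tail of the nvidia-driver-ctr container log. Since the pod is crash-looping, the full log of the current and previous attempts can be pulled like this (standard kubectl, nothing operator-specific):

```shell
# Full log of the current driver container attempt
kubectl logs -n gpu-operator nvidia-driver-daemonset-fwcvl -c nvidia-driver-ctr

# Log of the previous failed attempt, once the container has restarted
kubectl logs -n gpu-operator nvidia-driver-daemonset-fwcvl -c nvidia-driver-ctr --previous
```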
[ ] kubernetes pods status:
kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS             RESTARTS         AGE
gpu-feature-discovery-zqm9h                                       0/1     Init:0/1           0                86m
gpu-operator-1700756391-node-feature-discovery-gc-5c546559bfmj2   1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-master-79796bzcb   1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-6ddld       1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-8c2k4       1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-nzd7b       1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-x8nx9       1/1     Running            0                93m
gpu-operator-68d85f45d-v97fz                                      1/1     Running            0                93m
nvidia-container-toolkit-daemonset-kqmtx                          0/1     Init:0/1           0                86m
nvidia-dcgm-exporter-5ncg7                                        0/1     Init:0/1           0                86m
nvidia-device-plugin-daemonset-qmvhc                              0/1     Init:0/1           0                86m
nvidia-driver-daemonset-fwcvl                                     0/1     CrashLoopBackOff   19 (3m20s ago)   87m
nvidia-operator-validator-vcztn                                   0/1     Init:0/4           0                86m
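As far as I can tell, the other pods are stuck in Init only because they are waiting for the driver to become ready, so I have been watching the driver daemonset in isolation (the app label is visible in the pod description further below):

```shell
# Watch just the driver pods while the daemonset crash-loops
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -w
```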
[ ] kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE
NAME                                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                                   1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   94m
gpu-operator-1700756391-node-feature-discovery-worker   4         4         4       4            4                                                              94m
nvidia-container-toolkit-daemonset                      1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true       94m
nvidia-dcgm-exporter                                    1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true           94m
nvidia-device-plugin-daemonset                          1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true           94m
nvidia-driver-daemonset                                 1         1         0       1            0           nvidia.com/gpu.deploy.driver=true                  94m
nvidia-mig-manager                                      0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             94m
nvidia-operator-validator                               1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true      94m
[ ] If a pod/ds is in an error state or pending state
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
k describe po nvidia-driver-daemonset-fwcvl
Name:                 nvidia-driver-daemonset-fwcvl
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-driver
Node:                 lab-worker-4/172.21.1.70
Start Time:           Thu, 23 Nov 2023 11:26:21 -0500
Labels:               app=nvidia-driver-daemonset
                      app.kubernetes.io/component=nvidia-driver
                      app.kubernetes.io/managed-by=gpu-operator
                      controller-revision-hash=5954d75477
                      helm.sh/chart=gpu-operator-v23.9.0
                      nvidia.com/precompiled=false
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: 14eb92fe162f5d1ddcf0d32343f0815ae1325dfca8eb88354d979f7cbc335c5d
                      cni.projectcalico.org/podIP: 192.168.148.114/32
                      cni.projectcalico.org/podIPs: 192.168.148.114/32
                      kubectl.kubernetes.io/default-container: nvidia-driver-ctr
Status:               Running
IP:                   192.168.148.114
IPs:
  IP:           192.168.148.114
Controlled By:  DaemonSet/nvidia-driver-daemonset
Init Containers:
  k8s-driver-manager:
    Container ID:  cri-o://b15e393c5603042c1938c49f132a706332ba76bb21dab6ea2d50a0fe2a0cf3b3
    Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.4
    Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5ca81f4f7e55f7b304dbbb7aaa235fca2656789145e4b34f47a7ab7079704dc7
    Port:
    Host Port:
    Command:
      driver-manager
    Args:
      uninstall_driver
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 23 Nov 2023 11:26:22 -0500
      Finished:     Thu, 23 Nov 2023 11:26:54 -0500
    Ready:          True
    Restart Count:  0
    Environment:
      NODE_NAME:                   (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:      void
      ENABLE_GPU_POD_EVICTION:     true
      ENABLE_AUTO_DRAIN:           false
      DRAIN_USE_FORCE:             false
      DRAIN_POD_SELECTOR_LABEL:
      DRAIN_TIMEOUT_SECONDS:       0s
      DRAIN_DELETE_EMPTYDIR_DATA:  false
      OPERATOR_NAMESPACE:          gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /run/nvidia from run-nvidia (rw)
      /sys from host-sys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qphz2 (ro)
Containers:
  nvidia-driver-ctr:
    Container ID:  cri-o://8139fed89018b0c4382884f44dfa1f7146711824baf3029b9b8b416e4e91c9f5
    Image:         nvcr.io/nvidia/driver:525.125.06-rhel8.6
    Image ID:      nvcr.io/nvidia/driver@sha256:b58167d31d34784cd7c425961234d67c5e2d22eb4a5312681d0337dae812f746
    Port:
    Host Port:
    Command:
      nvidia-driver
    Args:
      init
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 23 Nov 2023 12:49:50 -0500
      Finished:     Thu, 23 Nov 2023 12:50:24 -0500
    Ready:          False
    Restart Count:  19
    Startup:        exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
    Environment:
    Mounts:
      /dev/log from dev-log (rw)
      /host-etc/os-release from host-os-release (ro)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /run/nvidia from run-nvidia (rw)
      /run/nvidia-topologyd from run-nvidia-topologyd (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qphz2 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:
  run-nvidia-topologyd:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia-topologyd
    HostPathType:  DirectoryOrCreate
  mlnx-ofed-usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers/usr/src
    HostPathType:  DirectoryOrCreate
  run-mellanox-drivers:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  host-sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  kube-api-access-qphz2:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:              true
QoS Class:                    BestEffort
Node-Selectors:               nvidia.com/gpu.deploy.driver=true
Tolerations:                  node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                              node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                              node.kubernetes.io/not-ready:NoExecute op=Exists
                              node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                              node.kubernetes.io/unreachable:NoExecute op=Exists
                              node.kubernetes.io/unschedulable:NoSchedule op=Exists
                              nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                    From     Message
  ----     ------   ----                   ----     -------
  Warning  BackOff  3m53s (x350 over 87m)  kubelet  Back-off restarting failed container nvidia-driver-ctr in pod nvidia-driver-daemonset-fwcvl_gpu-operator(1ab5bc39-dd70-411f-9592-a6b5b69ff723)
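Additional context: to rule out a general connectivity problem (proxy, firewall, subscription/entitlement) I plan to verify repo reachability directly on lab-worker-4 with standard RHEL tooling along these lines; none of this is specific to the GPU Operator:

```shell
# Run on the worker node (lab-worker-4) itself
subscription-manager status                    # is the host registered and entitled?
curl -sSI https://cdn.redhat.com | head -n 1   # basic reachability of the Red Hat CDN
sudo yum -q makecache                          # the same command the driver container fails on
```

If the node genuinely cannot reach the Red Hat repositories (for example in a proxied or air-gapped environment), I assume the documented options for restricted networks (HTTP proxy settings or a custom repository configuration for the driver container) would be the way to go, but I would first like to confirm that this is actually a network/entitlement problem rather than something the operator is doing.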
Any help on this issue will be very much appreciated.
Help!!!