
nvidia-peermem-ctr: /usr/local/bin/nvidia-driver: line 769: RHEL_VERSION: unbound variable

Open takeshi-yoshimura opened this issue 2 years ago • 2 comments

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): RHCOS4.13
  • Kernel Version: 4.18.0-372.59.1.el8_6.x86_64
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): CRI-O
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): OCP
  • GPU Operator Version: v23.9.0

2. Issue or feature description

The nvidia-peermem-ctr container in the nvidia-driver-daemonset pod crashes. As the log posted below shows, the script fails because RHEL_VERSION is never set. I think the container should mount the host's /etc/os-release so that it can resolve RHEL_VERSION the way the other containers in the same pod do. The failure is at the line DNF_RELEASEVER="${RHEL_VERSION}" in /usr/local/bin/nvidia-driver.
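
For illustration, here is a minimal bash sketch of that failure mode, reconstructed from the trace posted below. It is not the actual nvidia-driver script, and the VERSION_ID parsing is only a stand-in for whatever the real script does:

#!/usr/bin/env bash
set -eu   # same strict mode the real entrypoint enables

_resolve_rhel_version() {
    # Without the /host-etc/os-release hostPath mount, the file test fails,
    # the function returns 0, and RHEL_VERSION is never assigned.
    [ -f /host-etc/os-release ] || return 0
    RHEL_VERSION=$(awk -F= '/^VERSION_ID=/{gsub(/"/,""); print $2}' /host-etc/os-release)
}

_resolve_rhel_version
# Under `set -u`, expanding a never-assigned variable aborts the script with
# "RHEL_VERSION: unbound variable" -- the error reported at line 769.
DNF_RELEASEVER="${RHEL_VERSION}"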

3. Steps to reproduce the issue

I recently installed the GPU Operator with mostly default settings on an RHCOS4.13/OpenShift 4.12 cluster (spec.driver.rdma.enabled=true and spec.driver.rdma.useHostMofed=false).

My workaround was to pin spec.driver.version in my ClusterPolicy to an older driver version (535.104.05) instead of the latest one (535.104.12?).

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    version: 535.104.05
    image: driver
    repository: nvcr.io/nvidia
...
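
If editing the full ClusterPolicy is inconvenient, a merge patch should achieve the same pin (untested one-liner; adjust the policy name if yours differs):

oc patch clusterpolicy gpu-cluster-policy --type merge \
  -p '{"spec": {"driver": {"version": "535.104.05"}}}'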

4. Information to attach (optional if deemed irrelevant)

  • [x] kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • [x] kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • [x] If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • [x] If a pod/ds is in an error state or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • [x] Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi

kubectl get po:

NAME                                                   READY   STATUS             RESTARTS          AGE
console-plugin-nvidia-gpu-5df7b85d4-m7knm              1/1     Running            0                 10d
gpu-feature-discovery-74jf4                            1/1     Running            0                 3h45m
gpu-feature-discovery-d9bfz                            1/1     Running            0                 2d19h
gpu-feature-discovery-k6q22                            1/1     Running            0                 37m
gpu-operator-6d95d776d6-bvdng                          1/1     Running            0                 2d20h
grafana-deployment-6b4f9fcc9d-4dkcg                    1/1     Running            0                 10d
grafana-operator-controller-manager-595c7978b9-bq95m   2/2     Running            0                 2d20h
nvidia-container-toolkit-daemonset-6hpr8               1/1     Running            0                 37m
nvidia-container-toolkit-daemonset-d76d8               1/1     Running            0                 2d19h
nvidia-container-toolkit-daemonset-wk65s               1/1     Running            0                 3h45m
nvidia-cuda-validator-f5dvk                            0/1     Completed          0                 37m
nvidia-cuda-validator-mjhhf                            0/1     Completed          0                 3h43m
nvidia-dcgm-2bj8w                                      1/1     Running            0                 3h45m
nvidia-dcgm-dm22g                                      1/1     Running            0                 37m
nvidia-dcgm-exporter-cgc6f                             1/1     Running            0                 2d19h
nvidia-dcgm-exporter-jvscf                             1/1     Running            0                 37m
nvidia-dcgm-exporter-m2jqz                             1/1     Running            0                 3h45m
nvidia-dcgm-p5mds                                      1/1     Running            0                 2d19h
nvidia-device-plugin-daemonset-6jsmf                   1/1     Running            0                 3h45m
nvidia-device-plugin-daemonset-jlkw8                   1/1     Running            0                 37m
nvidia-device-plugin-daemonset-rx7hv                   1/1     Running            0                 2d19h
nvidia-driver-daemonset-412.86.202306132230-0-fkjbj    2/3     CrashLoopBackOff   793 (69s ago)     2d19h
nvidia-driver-daemonset-412.86.202306132230-0-jnt77    2/3     CrashLoopBackOff   792 (4m45s ago)   2d19h
nvidia-driver-daemonset-412.86.202306132230-0-w9mv5    2/3     CrashLoopBackOff   48 (4m48s ago)    3h53m
nvidia-mig-manager-2nqpt                               1/1     Running            0                 3h42m
nvidia-mig-manager-b74w5                               1/1     Running            0                 2d19h
nvidia-mig-manager-hgb5r                               1/1     Running            0                 37m
nvidia-node-status-exporter-xbdc7                      1/1     Running            0                 2d19h
nvidia-node-status-exporter-xg627                      1/1     Running            0                 2d19h
nvidia-node-status-exporter-z56bw                      1/1     Running            0                 3h53m
nvidia-operator-validator-d2wjd                        1/1     Running            0                 2d19h
nvidia-operator-validator-j8cmt                        1/1     Running            0                 3h45m
nvidia-operator-validator-lf8tm                        1/1     Running            0                 37m

oc logs nvidia-driver-daemonset-412.86.202306132230-0 -c nvidia-peermem-ctr

+ set -eu
+ RUN_DIR=/run/nvidia
+ PID_FILE=/run/nvidia/nvidia-driver.pid
+ DRIVER_VERSION=535.104.12
+ KERNEL_UPDATE_HOOK=/run/kernel/postinst.d/update-nvidia-driver
+ NUM_VGPU_DEVICES=0
+ NVIDIA_MODULE_PARAMS=()
+ NVIDIA_UVM_MODULE_PARAMS=()
+ NVIDIA_MODESET_MODULE_PARAMS=()
+ NVIDIA_PEERMEM_MODULE_PARAMS=()
+ TARGETARCH=amd64
+ USE_HOST_MOFED=false
+ DNF_RELEASEVER=
+ OPENSHIFT_VERSION=
+ DRIVER_ARCH=x86_64
+ DRIVER_ARCH=x86_64
+ echo 'DRIVER_ARCH is x86_64'
DRIVER_ARCH is x86_64
+++ dirname -- /usr/local/bin/nvidia-driver
++ cd -- /usr/local/bin
++ pwd
+ SCRIPT_DIR=/usr/local/bin
+ source /usr/local/bin/common.sh
++ GPU_DIRECT_RDMA_ENABLED=false
++ GDS_ENABLED=false
+ '[' 1 -eq 0 ']'
+ command=reload_nvidia_peermem
+ shift
+ case "${command}" in
+ options=
+ '[' 0 -ne 0 ']'
+ eval set -- ''
++ set --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=4.18.0-372.59.1.el8_6.x86_64
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ '[' 0 -ne 0 ']'
+ [[ -z '' ]]
+ _resolve_rhel_version
+ '[' -f /host-etc/os-release ']'
+ return 0
/usr/local/bin/nvidia-driver: line 769: RHEL_VERSION: unbound variable

oc describe po nvidia-driver-daemonset-412.86.202306132230-0

Name:                 nvidia-driver-daemonset-412.86.202306132230-0-fkjbj
Namespace:            nvidia-gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-driver
Node:                 gdr-test-6p2kc-gdr-gpu-il-worker-3-jqpk7/10.241.128.27
Start Time:           Sat, 11 Nov 2023 13:46:21 +0900
Labels:               app=nvidia-driver-daemonset-412.86.202306132230-0
                      app.kubernetes.io/component=nvidia-driver
                      controller-revision-hash=56f9b89d7c
                      nvidia.com/precompiled=false
                      openshift.driver-toolkit=true
                      pod-template-generation=1
Annotations:          k8s.ovn.org/pod-networks:
                        {"default":{"ip_addresses":["10.130.4.10/23"],"mac_address":"0a:58:0a:82:04:0a","gateway_ips":["10.130.4.1"],"ip_address":"10.130.4.10/23"...
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "ovn-kubernetes",
                            "interface": "eth0",
                            "ips": [
                                "10.130.4.10"
                            ],
                            "mac": "0a:58:0a:82:04:0a",
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "ovn-kubernetes",
                            "interface": "eth0",
                            "ips": [
                                "10.130.4.10"
                            ],
                            "mac": "0a:58:0a:82:04:0a",
                            "default": true,
                            "dns": {}
                        }]
                      kubectl.kubernetes.io/default-container: nvidia-driver-ctr
                      openshift.io/scc: nvidia-driver
Status:               Running
IP:                   10.130.4.10
IPs:
  IP:           10.130.4.10
Controlled By:  DaemonSet/nvidia-driver-daemonset-412.86.202306132230-0
Init Containers:
  mofed-validation:
    Container ID:  cri-o://643d541ca8e6a969364807d44f564bd1d92a0bacf23fcf73e6a23d17aa3b36e6
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:c3fc8ab2d39d970e3d1a1b0ef16b06792d23cc87be68ed4927c7384ddd1f43cb
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:47a658fa7102d99a5dd9fe05f2a5b872deab266138e7955a14ba59e33095738d
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 11 Nov 2023 13:47:08 +0900
      Finished:     Sat, 11 Nov 2023 13:53:18 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:                true
      COMPONENT:                mofed
      NODE_NAME:                 (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:   void
      GPU_DIRECT_RDMA_ENABLED:  true
    Mounts:
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tsfd2 (ro)
  k8s-driver-manager:
    Container ID:  cri-o://d6dabf0b91a9bef8048c2c2c6da3dd51008e1ef5f58e607b9164b636f15411b6
    Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:a360ed5b1335436ef61cd601fa776e6d03f15f76aeaa8d88bd1506edd93843dc
    Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5ca81f4f7e55f7b304dbbb7aaa235fca2656789145e4b34f47a7ab7079704dc7
    Port:          <none>
    Host Port:     <none>
    Command:
      driver-manager
    Args:
      uninstall_driver
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 11 Nov 2023 13:53:38 +0900
      Finished:     Sat, 11 Nov 2023 13:54:11 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      NODE_NAME:                    (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:      void
      ENABLE_GPU_POD_EVICTION:     true
      ENABLE_AUTO_DRAIN:           true
      DRAIN_USE_FORCE:             false
      DRAIN_POD_SELECTOR_LABEL:    
      DRAIN_TIMEOUT_SECONDS:       0s
      DRAIN_DELETE_EMPTYDIR_DATA:  false
      OPERATOR_NAMESPACE:          nvidia-gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /run/nvidia from run-nvidia (rw)
      /sys from host-sys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tsfd2 (ro)
Containers:
  nvidia-driver-ctr:
    Container ID:  cri-o://fcbeb32582dde85e6275cd869abac95ba9df285bf004d5d2a3a763a2465bc82f
    Image:         nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
    Image ID:      nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
    Port:          <none>
    Host Port:     <none>
    Command:
      ocp_dtk_entrypoint
    Args:
      nv-ctr-run-with-dtk
    State:          Running
      Started:      Sat, 11 Nov 2023 13:54:32 +0900
    Ready:          True
    Restart Count:  0
    Startup:        exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
    Environment:
      GPU_DIRECT_RDMA_ENABLED:  true
      OPENSHIFT_VERSION:        4.12
    Mounts:
      /dev/log from dev-log (rw)
      /host-etc/os-release from host-os-release (ro)
      /mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /run/nvidia from run-nvidia (rw)
      /run/nvidia-topologyd from run-nvidia-topologyd (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tsfd2 (ro)
  nvidia-peermem-ctr:
    Container ID:  cri-o://b35eb3fd675b9c5125fd486bd44fe0fdb20eceb7d9a6c14d8511a9d738cb7db0
    Image:         nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
    Image ID:      nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-driver
    Args:
      reload_nvidia_peermem
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 14 Nov 2023 09:11:17 +0900
      Finished:     Tue, 14 Nov 2023 09:11:17 +0900
    Ready:          False
    Restart Count:  793
    Liveness:       exec [sh -c nvidia-driver probe_nvidia_peermem] delay=30s timeout=10s period=30s #success=1 #failure=1
    Startup:        exec [sh -c nvidia-driver probe_nvidia_peermem] delay=10s timeout=10s period=10s #success=1 #failure=120
    Environment:    <none>
    Mounts:
      /dev/log from dev-log (ro)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/nvidia from run-nvidia (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tsfd2 (ro)
  openshift-driver-toolkit-ctr:
    Container ID:  cri-o://d61b7fd1fb0e713c16464b2db64713cc2dca8a6b047f86f46501ee1317f9f41e
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -xc
    Args:
      until [ -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ]; do echo  Waiting for nvidia-driver-ctr container to prepare the shared directory ...; sleep 10; done; exec /mnt/shared-nvidia-driver-toolkit/ocp_dtk_entrypoint dtk-build-driver
    State:          Running
      Started:      Sat, 11 Nov 2023 13:55:04 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      RHCOS_VERSION:           412.86.202306132230-0
      NVIDIA_VISIBLE_DEVICES:  void
    Mounts:
      /host-etc/os-release from host-os-release (ro)
      /mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tsfd2 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:  
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  run-nvidia-topologyd:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia-topologyd
    HostPathType:  DirectoryOrCreate
  mlnx-ofed-usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers/usr/src
    HostPathType:  DirectoryOrCreate
  run-mellanox-drivers:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  host-sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  shared-nvidia-driver-toolkit:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-tsfd2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=412.86.202306132230-0
                             nvidia.com/gpu.deploy.driver=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                        From     Message
  ----     ------   ----                       ----     -------
  Warning  BackOff  3m23s (x20021 over 2d19h)  kubelet  Back-off restarting failed container


Name:                 nvidia-driver-daemonset-412.86.202306132230-0-jnt77
Namespace:            nvidia-gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-driver
Node:                 gdr-test-6p2kc-gdr-gpu-il-worker-3-5f7m6/10.241.128.26
Start Time:           Sat, 11 Nov 2023 13:46:21 +0900
Labels:               app=nvidia-driver-daemonset-412.86.202306132230-0
                      app.kubernetes.io/component=nvidia-driver
                      controller-revision-hash=56f9b89d7c
                      nvidia.com/precompiled=false
                      openshift.driver-toolkit=true
                      pod-template-generation=1
Annotations:          k8s.ovn.org/pod-networks:
                        {"default":{"ip_addresses":["10.129.4.15/23"],"mac_address":"0a:58:0a:81:04:0f","gateway_ips":["10.129.4.1"],"ip_address":"10.129.4.15/23"...
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "ovn-kubernetes",
                            "interface": "eth0",
                            "ips": [
                                "10.129.4.15"
                            ],
                            "mac": "0a:58:0a:81:04:0f",
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "ovn-kubernetes",
                            "interface": "eth0",
                            "ips": [
                                "10.129.4.15"
                            ],
                            "mac": "0a:58:0a:81:04:0f",
                            "default": true,
                            "dns": {}
                        }]
                      kubectl.kubernetes.io/default-container: nvidia-driver-ctr
                      openshift.io/scc: nvidia-driver
Status:               Running
IP:                   10.129.4.15
IPs:
  IP:           10.129.4.15
Controlled By:  DaemonSet/nvidia-driver-daemonset-412.86.202306132230-0
Init Containers:
  mofed-validation:
    Container ID:  cri-o://839bada30f72f1352910ddbadd423f696df70d6261612209b6f66747ab3dc0e2
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:c3fc8ab2d39d970e3d1a1b0ef16b06792d23cc87be68ed4927c7384ddd1f43cb
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:47a658fa7102d99a5dd9fe05f2a5b872deab266138e7955a14ba59e33095738d
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 11 Nov 2023 13:47:08 +0900
      Finished:     Sat, 11 Nov 2023 13:52:49 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:                true
      COMPONENT:                mofed
      NODE_NAME:                 (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:   void
      GPU_DIRECT_RDMA_ENABLED:  true
    Mounts:
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gjfks (ro)
  k8s-driver-manager:
    Container ID:  cri-o://d38add89b79b81ed16884186c202b97544da02b38f2788da6bbf996e284da0f7
    Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:a360ed5b1335436ef61cd601fa776e6d03f15f76aeaa8d88bd1506edd93843dc
    Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5ca81f4f7e55f7b304dbbb7aaa235fca2656789145e4b34f47a7ab7079704dc7
    Port:          <none>
    Host Port:     <none>
    Command:
      driver-manager
    Args:
      uninstall_driver
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 11 Nov 2023 13:53:05 +0900
      Finished:     Sat, 11 Nov 2023 13:53:38 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      NODE_NAME:                    (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:      void
      ENABLE_GPU_POD_EVICTION:     true
      ENABLE_AUTO_DRAIN:           true
      DRAIN_USE_FORCE:             false
      DRAIN_POD_SELECTOR_LABEL:    
      DRAIN_TIMEOUT_SECONDS:       0s
      DRAIN_DELETE_EMPTYDIR_DATA:  false
      OPERATOR_NAMESPACE:          nvidia-gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /run/nvidia from run-nvidia (rw)
      /sys from host-sys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gjfks (ro)
Containers:
  nvidia-driver-ctr:
    Container ID:  cri-o://cbbcb472eba95611199c2099390911ea2cbd1715b2e8ee44a165fb2b3dffd1dc
    Image:         nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
    Image ID:      nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
    Port:          <none>
    Host Port:     <none>
    Command:
      ocp_dtk_entrypoint
    Args:
      nv-ctr-run-with-dtk
    State:          Running
      Started:      Sat, 11 Nov 2023 13:53:49 +0900
    Ready:          True
    Restart Count:  0
    Startup:        exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
    Environment:
      GPU_DIRECT_RDMA_ENABLED:  true
      OPENSHIFT_VERSION:        4.12
    Mounts:
      /dev/log from dev-log (rw)
      /host-etc/os-release from host-os-release (ro)
      /mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /run/nvidia from run-nvidia (rw)
      /run/nvidia-topologyd from run-nvidia-topologyd (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gjfks (ro)
  nvidia-peermem-ctr:
    Container ID:  cri-o://0b03f994e5e99b3b8ac54b3cea06a4df1ce783fc5fb1f41ec31f18933862dd72
    Image:         nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
    Image ID:      nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-driver
    Args:
      reload_nvidia_peermem
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 14 Nov 2023 09:12:42 +0900
      Finished:     Tue, 14 Nov 2023 09:12:42 +0900
    Ready:          False
    Restart Count:  793
    Liveness:       exec [sh -c nvidia-driver probe_nvidia_peermem] delay=30s timeout=10s period=30s #success=1 #failure=1
    Startup:        exec [sh -c nvidia-driver probe_nvidia_peermem] delay=10s timeout=10s period=10s #success=1 #failure=120
    Environment:    <none>
    Mounts:
      /dev/log from dev-log (ro)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/nvidia from run-nvidia (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gjfks (ro)
  openshift-driver-toolkit-ctr:
    Container ID:  cri-o://2d2dcf8bdf1ac22860bb69253caf06106436db8ff151411664f5b113d7cfda02
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -xc
    Args:
      until [ -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ]; do echo  Waiting for nvidia-driver-ctr container to prepare the shared directory ...; sleep 10; done; exec /mnt/shared-nvidia-driver-toolkit/ocp_dtk_entrypoint dtk-build-driver
    State:          Running
      Started:      Sat, 11 Nov 2023 13:54:08 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      RHCOS_VERSION:           412.86.202306132230-0
      NVIDIA_VISIBLE_DEVICES:  void
    Mounts:
      /host-etc/os-release from host-os-release (ro)
      /mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gjfks (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:  
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  run-nvidia-topologyd:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia-topologyd
    HostPathType:  DirectoryOrCreate
  mlnx-ofed-usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers/usr/src
    HostPathType:  DirectoryOrCreate
  run-mellanox-drivers:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  host-sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  shared-nvidia-driver-toolkit:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-gjfks:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=412.86.202306132230-0
                             nvidia.com/gpu.deploy.driver=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                        From     Message
  ----     ------   ----                       ----     -------
  Normal   Pulled   43m (x786 over 2d19h)      kubelet  Container image "nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad" already present on machine
  Warning  BackOff  3m15s (x20037 over 2d19h)  kubelet  Back-off restarting failed container


Name:                 nvidia-driver-daemonset-412.86.202306132230-0-w9mv5
Namespace:            nvidia-gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-driver
Node:                 gdr-test-6p2kc-gdr-gpu-il-worker-3-rjkr9/10.241.128.30
Start Time:           Tue, 14 Nov 2023 05:19:12 +0900
Labels:               app=nvidia-driver-daemonset-412.86.202306132230-0
                      app.kubernetes.io/component=nvidia-driver
                      controller-revision-hash=56f9b89d7c
                      nvidia.com/precompiled=false
                      openshift.driver-toolkit=true
                      pod-template-generation=1
Annotations:          k8s.ovn.org/pod-networks:
                        {"default":{"ip_addresses":["10.129.6.14/23"],"mac_address":"0a:58:0a:81:06:0e","gateway_ips":["10.129.6.1"],"ip_address":"10.129.6.14/23"...
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "ovn-kubernetes",
                            "interface": "eth0",
                            "ips": [
                                "10.129.6.14"
                            ],
                            "mac": "0a:58:0a:81:06:0e",
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "ovn-kubernetes",
                            "interface": "eth0",
                            "ips": [
                                "10.129.6.14"
                            ],
                            "mac": "0a:58:0a:81:06:0e",
                            "default": true,
                            "dns": {}
                        }]
                      kubectl.kubernetes.io/default-container: nvidia-driver-ctr
                      openshift.io/scc: nvidia-driver
Status:               Running
IP:                   10.129.6.14
IPs:
  IP:           10.129.6.14
Controlled By:  DaemonSet/nvidia-driver-daemonset-412.86.202306132230-0
Init Containers:
  mofed-validation:
    Container ID:  cri-o://a2a13f79f84105dd86aae8c33a52c1ba28d94fc65bf0734ec8744751de7a4577
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:c3fc8ab2d39d970e3d1a1b0ef16b06792d23cc87be68ed4927c7384ddd1f43cb
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:47a658fa7102d99a5dd9fe05f2a5b872deab266138e7955a14ba59e33095738d
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 14 Nov 2023 05:20:01 +0900
      Finished:     Tue, 14 Nov 2023 05:25:51 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:                true
      COMPONENT:                mofed
      NODE_NAME:                 (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:   void
      GPU_DIRECT_RDMA_ENABLED:  true
    Mounts:
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcd85 (ro)
  k8s-driver-manager:
    Container ID:  cri-o://91a1bdb7341f0b22bdd25696ee1f7e009513b245b0e434c169b278d7fb3df675
    Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:a360ed5b1335436ef61cd601fa776e6d03f15f76aeaa8d88bd1506edd93843dc
    Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5ca81f4f7e55f7b304dbbb7aaa235fca2656789145e4b34f47a7ab7079704dc7
    Port:          <none>
    Host Port:     <none>
    Command:
      driver-manager
    Args:
      uninstall_driver
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 14 Nov 2023 05:26:06 +0900
      Finished:     Tue, 14 Nov 2023 05:26:38 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      NODE_NAME:                    (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:      void
      ENABLE_GPU_POD_EVICTION:     true
      ENABLE_AUTO_DRAIN:           true
      DRAIN_USE_FORCE:             false
      DRAIN_POD_SELECTOR_LABEL:    
      DRAIN_TIMEOUT_SECONDS:       0s
      DRAIN_DELETE_EMPTYDIR_DATA:  false
      OPERATOR_NAMESPACE:          nvidia-gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /run/nvidia from run-nvidia (rw)
      /sys from host-sys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcd85 (ro)
Containers:
  nvidia-driver-ctr:
    Container ID:  cri-o://afa3c3302c3632b1dbe012c4cbd98c72bf427798731dfbb7de96a3e6f834dde2
    Image:         nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
    Image ID:      nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
    Port:          <none>
    Host Port:     <none>
    Command:
      ocp_dtk_entrypoint
    Args:
      nv-ctr-run-with-dtk
    State:          Running
      Started:      Tue, 14 Nov 2023 05:26:50 +0900
    Ready:          True
    Restart Count:  0
    Startup:        exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
    Environment:
      GPU_DIRECT_RDMA_ENABLED:  true
      OPENSHIFT_VERSION:        4.12
    Mounts:
      /dev/log from dev-log (rw)
      /host-etc/os-release from host-os-release (ro)
      /mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /run/nvidia from run-nvidia (rw)
      /run/nvidia-topologyd from run-nvidia-topologyd (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcd85 (ro)
  nvidia-peermem-ctr:
    Container ID:  cri-o://9840e4da6a6c0ade2d89706174fc4cb653a80f4224c66ece44baae4dd5675521
    Image:         nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
    Image ID:      nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-driver
    Args:
      reload_nvidia_peermem
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 14 Nov 2023 09:12:50 +0900
      Finished:     Tue, 14 Nov 2023 09:12:50 +0900
    Ready:          False
    Restart Count:  49
    Liveness:       exec [sh -c nvidia-driver probe_nvidia_peermem] delay=30s timeout=10s period=30s #success=1 #failure=1
    Startup:        exec [sh -c nvidia-driver probe_nvidia_peermem] delay=10s timeout=10s period=10s #success=1 #failure=120
    Environment:    <none>
    Mounts:
      /dev/log from dev-log (ro)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/nvidia from run-nvidia (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcd85 (ro)
  openshift-driver-toolkit-ctr:
    Container ID:  cri-o://ebff88db5258e776aa02d8176fee4c780a311686fb0cf3d8b7c5f93e4e4edb70
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -xc
    Args:
      until [ -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ]; do echo  Waiting for nvidia-driver-ctr container to prepare the shared directory ...; sleep 10; done; exec /mnt/shared-nvidia-driver-toolkit/ocp_dtk_entrypoint dtk-build-driver
    State:          Running
      Started:      Tue, 14 Nov 2023 05:27:20 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      RHCOS_VERSION:           412.86.202306132230-0
      NVIDIA_VISIBLE_DEVICES:  void
    Mounts:
      /host-etc/os-release from host-os-release (ro)
      /mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcd85 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:  
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  run-nvidia-topologyd:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia-topologyd
    HostPathType:  DirectoryOrCreate
  mlnx-ofed-usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers/usr/src
    HostPathType:  DirectoryOrCreate
  run-mellanox-drivers:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  host-sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  shared-nvidia-driver-toolkit:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-tcd85:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=412.86.202306132230-0
                             nvidia.com/gpu.deploy.driver=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulled   155m (x20 over 3h48m)   kubelet  Container image "nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad" already present on machine
  Warning  BackOff  43s (x1136 over 3h47m)  kubelet  Back-off restarting failed container

Containers.nvidia-peermem-ctr should have a mount for /host-etc/os-release, but as the output above shows, it unfortunately does not.
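
For comparison, the fix would presumably just give nvidia-peermem-ctr the same host-os-release mount that nvidia-driver-ctr and openshift-driver-toolkit-ctr already have. A sketch of the relevant daemonset fragment (not the actual upstream patch):

- name: nvidia-peermem-ctr
  # ... existing fields unchanged ...
  volumeMounts:
    - name: host-os-release        # hostPath volume already defined in this pod spec
      mountPath: /host-etc/os-release
      readOnly: true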

takeshi-yoshimura · Nov 14 '23 03:11

@takeshi-yoshimura we are aware of this issue and are fixing it as part of the v23.9.1 release later this month. As a workaround, you can edit the nvidia-driver daemonset and add an env var RHEL_VERSION="" to the nvidia-peermem-ctr container.
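
For anyone hitting the same issue, a JSON patch along these lines illustrates that workaround (untested sketch; container index 1 assumes nvidia-peermem-ctr is the second container, as in the describe output above, and the daemonset name is specific to this cluster):

oc patch ds nvidia-driver-daemonset-412.86.202306132230-0 -n nvidia-gpu-operator --type json \
  -p '[{"op": "add", "path": "/spec/template/spec/containers/1/env", "value": [{"name": "RHEL_VERSION", "value": ""}]}]'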

shivamerla · Nov 15 '23 18:11

Sounds good. Thanks!

takeshi-yoshimura · Nov 16 '23 00:11