nvidia-peermem-ctr: /usr/local/bin/nvidia-driver: line 769: RHEL_VERSION: unbound variable
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): RHCOS4.13
- Kernel Version: 4.18.0-372.59.1.el8_6.x86_64
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): CRI-O
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): OCP
- GPU Operator Version: v23.9.0
2. Issue or feature description
The nvidia-peermem-ctr container in the nvidia-driver-daemonset pod crashed. As shown in the log below, it failed because RHEL_VERSION was not set: the script runs with set -eu, so expanding the unset variable at the line DNF_RELEASEVER="${RHEL_VERSION}" in /usr/local/bin/nvidia-driver aborts the container. I think the container should mount /etc/os-release so it can resolve RHEL_VERSION the way the other containers in the same pod do (see the sketch below).
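For reference, a minimal sketch of the mount the peermem container seems to be missing. This is only a guess at what the fix could look like, not the operator's actual change; the volume name (host-os-release) and mount path are copied from the nvidia-driver-ctr entries in the describe output below, and the pod already defines that hostPath volume:

# sketch: give nvidia-peermem-ctr the same os-release mount that
# nvidia-driver-ctr and openshift-driver-toolkit-ctr already have
- name: nvidia-peermem-ctr
  command: ["nvidia-driver"]
  args: ["reload_nvidia_peermem"]
  volumeMounts:
  - name: host-os-release           # hostPath volume for /etc/os-release, already in the pod spec
    mountPath: /host-etc/os-release
    readOnly: true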
3. Steps to reproduce the issue
I installed the GPU Operator recently with mostly default settings on an RHCOS4.13/OpenShift4.12 cluster (spec.driver.rdma.enabled=true and spec.driver.rdma.useHostMofed=false).
My workaround was simply to pin spec.driver.version in my ClusterPolicy to an older release (535.104.05) instead of the latest one (535.104.12?):
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    version: 535.104.05
    image: driver
    repository: nvcr.io/nvidia
  ...
4. Information to attach (optional if deemed irrelevant)
- [x] kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
- [x] kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
- [x] If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- [x] If a pod/ds is in an error state or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- [x] Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
kubectl get po:
NAME READY STATUS RESTARTS AGE
console-plugin-nvidia-gpu-5df7b85d4-m7knm 1/1 Running 0 10d
gpu-feature-discovery-74jf4 1/1 Running 0 3h45m
gpu-feature-discovery-d9bfz 1/1 Running 0 2d19h
gpu-feature-discovery-k6q22 1/1 Running 0 37m
gpu-operator-6d95d776d6-bvdng 1/1 Running 0 2d20h
grafana-deployment-6b4f9fcc9d-4dkcg 1/1 Running 0 10d
grafana-operator-controller-manager-595c7978b9-bq95m 2/2 Running 0 2d20h
nvidia-container-toolkit-daemonset-6hpr8 1/1 Running 0 37m
nvidia-container-toolkit-daemonset-d76d8 1/1 Running 0 2d19h
nvidia-container-toolkit-daemonset-wk65s 1/1 Running 0 3h45m
nvidia-cuda-validator-f5dvk 0/1 Completed 0 37m
nvidia-cuda-validator-mjhhf 0/1 Completed 0 3h43m
nvidia-dcgm-2bj8w 1/1 Running 0 3h45m
nvidia-dcgm-dm22g 1/1 Running 0 37m
nvidia-dcgm-exporter-cgc6f 1/1 Running 0 2d19h
nvidia-dcgm-exporter-jvscf 1/1 Running 0 37m
nvidia-dcgm-exporter-m2jqz 1/1 Running 0 3h45m
nvidia-dcgm-p5mds 1/1 Running 0 2d19h
nvidia-device-plugin-daemonset-6jsmf 1/1 Running 0 3h45m
nvidia-device-plugin-daemonset-jlkw8 1/1 Running 0 37m
nvidia-device-plugin-daemonset-rx7hv 1/1 Running 0 2d19h
nvidia-driver-daemonset-412.86.202306132230-0-fkjbj 2/3 CrashLoopBackOff 793 (69s ago) 2d19h
nvidia-driver-daemonset-412.86.202306132230-0-jnt77 2/3 CrashLoopBackOff 792 (4m45s ago) 2d19h
nvidia-driver-daemonset-412.86.202306132230-0-w9mv5 2/3 CrashLoopBackOff 48 (4m48s ago) 3h53m
nvidia-mig-manager-2nqpt 1/1 Running 0 3h42m
nvidia-mig-manager-b74w5 1/1 Running 0 2d19h
nvidia-mig-manager-hgb5r 1/1 Running 0 37m
nvidia-node-status-exporter-xbdc7 1/1 Running 0 2d19h
nvidia-node-status-exporter-xg627 1/1 Running 0 2d19h
nvidia-node-status-exporter-z56bw 1/1 Running 0 3h53m
nvidia-operator-validator-d2wjd 1/1 Running 0 2d19h
nvidia-operator-validator-j8cmt 1/1 Running 0 3h45m
nvidia-operator-validator-lf8tm 1/1 Running 0 37m
oc logs nvidia-driver-daemonset-412.86.202306132230-0 -c nvidia-peermem-ctr
+ set -eu
+ RUN_DIR=/run/nvidia
+ PID_FILE=/run/nvidia/nvidia-driver.pid
+ DRIVER_VERSION=535.104.12
+ KERNEL_UPDATE_HOOK=/run/kernel/postinst.d/update-nvidia-driver
+ NUM_VGPU_DEVICES=0
+ NVIDIA_MODULE_PARAMS=()
+ NVIDIA_UVM_MODULE_PARAMS=()
+ NVIDIA_MODESET_MODULE_PARAMS=()
+ NVIDIA_PEERMEM_MODULE_PARAMS=()
+ TARGETARCH=amd64
+ USE_HOST_MOFED=false
+ DNF_RELEASEVER=
+ OPENSHIFT_VERSION=
+ DRIVER_ARCH=x86_64
+ DRIVER_ARCH=x86_64
+ echo 'DRIVER_ARCH is x86_64'
DRIVER_ARCH is x86_64
+++ dirname -- /usr/local/bin/nvidia-driver
++ cd -- /usr/local/bin
++ pwd
+ SCRIPT_DIR=/usr/local/bin
+ source /usr/local/bin/common.sh
++ GPU_DIRECT_RDMA_ENABLED=false
++ GDS_ENABLED=false
+ '[' 1 -eq 0 ']'
+ command=reload_nvidia_peermem
+ shift
+ case "${command}" in
+ options=
+ '[' 0 -ne 0 ']'
+ eval set -- ''
++ set --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=4.18.0-372.59.1.el8_6.x86_64
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ '[' 0 -ne 0 ']'
+ [[ -z '' ]]
+ _resolve_rhel_version
+ '[' -f /host-etc/os-release ']'
+ return 0
/usr/local/bin/nvidia-driver: line 769: RHEL_VERSION: unbound variable
oc describe po nvidia-driver-daemonset-412.86.202306132230-0
Name: nvidia-driver-daemonset-412.86.202306132230-0-fkjbj
Namespace: nvidia-gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-driver
Node: gdr-test-6p2kc-gdr-gpu-il-worker-3-jqpk7/10.241.128.27
Start Time: Sat, 11 Nov 2023 13:46:21 +0900
Labels: app=nvidia-driver-daemonset-412.86.202306132230-0
app.kubernetes.io/component=nvidia-driver
controller-revision-hash=56f9b89d7c
nvidia.com/precompiled=false
openshift.driver-toolkit=true
pod-template-generation=1
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["10.130.4.10/23"],"mac_address":"0a:58:0a:82:04:0a","gateway_ips":["10.130.4.1"],"ip_address":"10.130.4.10/23"...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.130.4.10"
],
"mac": "0a:58:0a:82:04:0a",
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.130.4.10"
],
"mac": "0a:58:0a:82:04:0a",
"default": true,
"dns": {}
}]
kubectl.kubernetes.io/default-container: nvidia-driver-ctr
openshift.io/scc: nvidia-driver
Status: Running
IP: 10.130.4.10
IPs:
IP: 10.130.4.10
Controlled By: DaemonSet/nvidia-driver-daemonset-412.86.202306132230-0
Init Containers:
mofed-validation:
Container ID: cri-o://643d541ca8e6a969364807d44f564bd1d92a0bacf23fcf73e6a23d17aa3b36e6
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:c3fc8ab2d39d970e3d1a1b0ef16b06792d23cc87be68ed4927c7384ddd1f43cb
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:47a658fa7102d99a5dd9fe05f2a5b872deab266138e7955a14ba59e33095738d
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 11 Nov 2023 13:47:08 +0900
Finished: Sat, 11 Nov 2023 13:53:18 +0900
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: mofed
NODE_NAME: (v1:spec.nodeName)
NVIDIA_VISIBLE_DEVICES: void
GPU_DIRECT_RDMA_ENABLED: true
Mounts:
/run/mellanox/drivers from run-mellanox-drivers (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tsfd2 (ro)
k8s-driver-manager:
Container ID: cri-o://d6dabf0b91a9bef8048c2c2c6da3dd51008e1ef5f58e607b9164b636f15411b6
Image: nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:a360ed5b1335436ef61cd601fa776e6d03f15f76aeaa8d88bd1506edd93843dc
Image ID: nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5ca81f4f7e55f7b304dbbb7aaa235fca2656789145e4b34f47a7ab7079704dc7
Port: <none>
Host Port: <none>
Command:
driver-manager
Args:
uninstall_driver
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 11 Nov 2023 13:53:38 +0900
Finished: Sat, 11 Nov 2023 13:54:11 +0900
Ready: True
Restart Count: 0
Environment:
NODE_NAME: (v1:spec.nodeName)
NVIDIA_VISIBLE_DEVICES: void
ENABLE_GPU_POD_EVICTION: true
ENABLE_AUTO_DRAIN: true
DRAIN_USE_FORCE: false
DRAIN_POD_SELECTOR_LABEL:
DRAIN_TIMEOUT_SECONDS: 0s
DRAIN_DELETE_EMPTYDIR_DATA: false
OPERATOR_NAMESPACE: nvidia-gpu-operator (v1:metadata.namespace)
Mounts:
/host from host-root (ro)
/run/nvidia from run-nvidia (rw)
/sys from host-sys (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tsfd2 (ro)
Containers:
nvidia-driver-ctr:
Container ID: cri-o://fcbeb32582dde85e6275cd869abac95ba9df285bf004d5d2a3a763a2465bc82f
Image: nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
Image ID: nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
Port: <none>
Host Port: <none>
Command:
ocp_dtk_entrypoint
Args:
nv-ctr-run-with-dtk
State: Running
Started: Sat, 11 Nov 2023 13:54:32 +0900
Ready: True
Restart Count: 0
Startup: exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
Environment:
GPU_DIRECT_RDMA_ENABLED: true
OPENSHIFT_VERSION: 4.12
Mounts:
/dev/log from dev-log (rw)
/host-etc/os-release from host-os-release (ro)
/mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
/run/mellanox/drivers from run-mellanox-drivers (rw)
/run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
/run/nvidia from run-nvidia (rw)
/run/nvidia-topologyd from run-nvidia-topologyd (rw)
/var/log from var-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tsfd2 (ro)
nvidia-peermem-ctr:
Container ID: cri-o://b35eb3fd675b9c5125fd486bd44fe0fdb20eceb7d9a6c14d8511a9d738cb7db0
Image: nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
Image ID: nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
Port: <none>
Host Port: <none>
Command:
nvidia-driver
Args:
reload_nvidia_peermem
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 14 Nov 2023 09:11:17 +0900
Finished: Tue, 14 Nov 2023 09:11:17 +0900
Ready: False
Restart Count: 793
Liveness: exec [sh -c nvidia-driver probe_nvidia_peermem] delay=30s timeout=10s period=30s #success=1 #failure=1
Startup: exec [sh -c nvidia-driver probe_nvidia_peermem] delay=10s timeout=10s period=10s #success=1 #failure=120
Environment: <none>
Mounts:
/dev/log from dev-log (ro)
/run/mellanox/drivers from run-mellanox-drivers (rw)
/run/nvidia from run-nvidia (rw)
/var/log from var-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tsfd2 (ro)
openshift-driver-toolkit-ctr:
Container ID: cri-o://d61b7fd1fb0e713c16464b2db64713cc2dca8a6b047f86f46501ee1317f9f41e
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
Port: <none>
Host Port: <none>
Command:
bash
-xc
Args:
until [ -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ]; do echo Waiting for nvidia-driver-ctr container to prepare the shared directory ...; sleep 10; done; exec /mnt/shared-nvidia-driver-toolkit/ocp_dtk_entrypoint dtk-build-driver
State: Running
Started: Sat, 11 Nov 2023 13:55:04 +0900
Ready: True
Restart Count: 0
Environment:
RHCOS_VERSION: 412.86.202306132230-0
NVIDIA_VISIBLE_DEVICES: void
Mounts:
/host-etc/os-release from host-os-release (ro)
/mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
/run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
/var/log from var-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tsfd2 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: DirectoryOrCreate
var-log:
Type: HostPath (bare host directory volume)
Path: /var/log
HostPathType:
dev-log:
Type: HostPath (bare host directory volume)
Path: /dev/log
HostPathType:
host-os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
run-nvidia-topologyd:
Type: HostPath (bare host directory volume)
Path: /run/nvidia-topologyd
HostPathType: DirectoryOrCreate
mlnx-ofed-usr-src:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers/usr/src
HostPathType: DirectoryOrCreate
run-mellanox-drivers:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-sys:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
shared-nvidia-driver-toolkit:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-tsfd2:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: BestEffort
Node-Selectors: feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=412.86.202306132230-0
nvidia.com/gpu.deploy.driver=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 3m23s (x20021 over 2d19h) kubelet Back-off restarting failed container
Name: nvidia-driver-daemonset-412.86.202306132230-0-jnt77
Namespace: nvidia-gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-driver
Node: gdr-test-6p2kc-gdr-gpu-il-worker-3-5f7m6/10.241.128.26
Start Time: Sat, 11 Nov 2023 13:46:21 +0900
Labels: app=nvidia-driver-daemonset-412.86.202306132230-0
app.kubernetes.io/component=nvidia-driver
controller-revision-hash=56f9b89d7c
nvidia.com/precompiled=false
openshift.driver-toolkit=true
pod-template-generation=1
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["10.129.4.15/23"],"mac_address":"0a:58:0a:81:04:0f","gateway_ips":["10.129.4.1"],"ip_address":"10.129.4.15/23"...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.129.4.15"
],
"mac": "0a:58:0a:81:04:0f",
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.129.4.15"
],
"mac": "0a:58:0a:81:04:0f",
"default": true,
"dns": {}
}]
kubectl.kubernetes.io/default-container: nvidia-driver-ctr
openshift.io/scc: nvidia-driver
Status: Running
IP: 10.129.4.15
IPs:
IP: 10.129.4.15
Controlled By: DaemonSet/nvidia-driver-daemonset-412.86.202306132230-0
Init Containers:
mofed-validation:
Container ID: cri-o://839bada30f72f1352910ddbadd423f696df70d6261612209b6f66747ab3dc0e2
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:c3fc8ab2d39d970e3d1a1b0ef16b06792d23cc87be68ed4927c7384ddd1f43cb
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:47a658fa7102d99a5dd9fe05f2a5b872deab266138e7955a14ba59e33095738d
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 11 Nov 2023 13:47:08 +0900
Finished: Sat, 11 Nov 2023 13:52:49 +0900
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: mofed
NODE_NAME: (v1:spec.nodeName)
NVIDIA_VISIBLE_DEVICES: void
GPU_DIRECT_RDMA_ENABLED: true
Mounts:
/run/mellanox/drivers from run-mellanox-drivers (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gjfks (ro)
k8s-driver-manager:
Container ID: cri-o://d38add89b79b81ed16884186c202b97544da02b38f2788da6bbf996e284da0f7
Image: nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:a360ed5b1335436ef61cd601fa776e6d03f15f76aeaa8d88bd1506edd93843dc
Image ID: nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5ca81f4f7e55f7b304dbbb7aaa235fca2656789145e4b34f47a7ab7079704dc7
Port: <none>
Host Port: <none>
Command:
driver-manager
Args:
uninstall_driver
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 11 Nov 2023 13:53:05 +0900
Finished: Sat, 11 Nov 2023 13:53:38 +0900
Ready: True
Restart Count: 0
Environment:
NODE_NAME: (v1:spec.nodeName)
NVIDIA_VISIBLE_DEVICES: void
ENABLE_GPU_POD_EVICTION: true
ENABLE_AUTO_DRAIN: true
DRAIN_USE_FORCE: false
DRAIN_POD_SELECTOR_LABEL:
DRAIN_TIMEOUT_SECONDS: 0s
DRAIN_DELETE_EMPTYDIR_DATA: false
OPERATOR_NAMESPACE: nvidia-gpu-operator (v1:metadata.namespace)
Mounts:
/host from host-root (ro)
/run/nvidia from run-nvidia (rw)
/sys from host-sys (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gjfks (ro)
Containers:
nvidia-driver-ctr:
Container ID: cri-o://cbbcb472eba95611199c2099390911ea2cbd1715b2e8ee44a165fb2b3dffd1dc
Image: nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
Image ID: nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
Port: <none>
Host Port: <none>
Command:
ocp_dtk_entrypoint
Args:
nv-ctr-run-with-dtk
State: Running
Started: Sat, 11 Nov 2023 13:53:49 +0900
Ready: True
Restart Count: 0
Startup: exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
Environment:
GPU_DIRECT_RDMA_ENABLED: true
OPENSHIFT_VERSION: 4.12
Mounts:
/dev/log from dev-log (rw)
/host-etc/os-release from host-os-release (ro)
/mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
/run/mellanox/drivers from run-mellanox-drivers (rw)
/run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
/run/nvidia from run-nvidia (rw)
/run/nvidia-topologyd from run-nvidia-topologyd (rw)
/var/log from var-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gjfks (ro)
nvidia-peermem-ctr:
Container ID: cri-o://0b03f994e5e99b3b8ac54b3cea06a4df1ce783fc5fb1f41ec31f18933862dd72
Image: nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
Image ID: nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
Port: <none>
Host Port: <none>
Command:
nvidia-driver
Args:
reload_nvidia_peermem
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 14 Nov 2023 09:12:42 +0900
Finished: Tue, 14 Nov 2023 09:12:42 +0900
Ready: False
Restart Count: 793
Liveness: exec [sh -c nvidia-driver probe_nvidia_peermem] delay=30s timeout=10s period=30s #success=1 #failure=1
Startup: exec [sh -c nvidia-driver probe_nvidia_peermem] delay=10s timeout=10s period=10s #success=1 #failure=120
Environment: <none>
Mounts:
/dev/log from dev-log (ro)
/run/mellanox/drivers from run-mellanox-drivers (rw)
/run/nvidia from run-nvidia (rw)
/var/log from var-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gjfks (ro)
openshift-driver-toolkit-ctr:
Container ID: cri-o://2d2dcf8bdf1ac22860bb69253caf06106436db8ff151411664f5b113d7cfda02
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
Port: <none>
Host Port: <none>
Command:
bash
-xc
Args:
until [ -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ]; do echo Waiting for nvidia-driver-ctr container to prepare the shared directory ...; sleep 10; done; exec /mnt/shared-nvidia-driver-toolkit/ocp_dtk_entrypoint dtk-build-driver
State: Running
Started: Sat, 11 Nov 2023 13:54:08 +0900
Ready: True
Restart Count: 0
Environment:
RHCOS_VERSION: 412.86.202306132230-0
NVIDIA_VISIBLE_DEVICES: void
Mounts:
/host-etc/os-release from host-os-release (ro)
/mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
/run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
/var/log from var-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gjfks (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: DirectoryOrCreate
var-log:
Type: HostPath (bare host directory volume)
Path: /var/log
HostPathType:
dev-log:
Type: HostPath (bare host directory volume)
Path: /dev/log
HostPathType:
host-os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
run-nvidia-topologyd:
Type: HostPath (bare host directory volume)
Path: /run/nvidia-topologyd
HostPathType: DirectoryOrCreate
mlnx-ofed-usr-src:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers/usr/src
HostPathType: DirectoryOrCreate
run-mellanox-drivers:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-sys:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
shared-nvidia-driver-toolkit:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-gjfks:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: BestEffort
Node-Selectors: feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=412.86.202306132230-0
nvidia.com/gpu.deploy.driver=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 43m (x786 over 2d19h) kubelet Container image "nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad" already present on machine
Warning BackOff 3m15s (x20037 over 2d19h) kubelet Back-off restarting failed container
Name: nvidia-driver-daemonset-412.86.202306132230-0-w9mv5
Namespace: nvidia-gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-driver
Node: gdr-test-6p2kc-gdr-gpu-il-worker-3-rjkr9/10.241.128.30
Start Time: Tue, 14 Nov 2023 05:19:12 +0900
Labels: app=nvidia-driver-daemonset-412.86.202306132230-0
app.kubernetes.io/component=nvidia-driver
controller-revision-hash=56f9b89d7c
nvidia.com/precompiled=false
openshift.driver-toolkit=true
pod-template-generation=1
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["10.129.6.14/23"],"mac_address":"0a:58:0a:81:06:0e","gateway_ips":["10.129.6.1"],"ip_address":"10.129.6.14/23"...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.129.6.14"
],
"mac": "0a:58:0a:81:06:0e",
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.129.6.14"
],
"mac": "0a:58:0a:81:06:0e",
"default": true,
"dns": {}
}]
kubectl.kubernetes.io/default-container: nvidia-driver-ctr
openshift.io/scc: nvidia-driver
Status: Running
IP: 10.129.6.14
IPs:
IP: 10.129.6.14
Controlled By: DaemonSet/nvidia-driver-daemonset-412.86.202306132230-0
Init Containers:
mofed-validation:
Container ID: cri-o://a2a13f79f84105dd86aae8c33a52c1ba28d94fc65bf0734ec8744751de7a4577
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:c3fc8ab2d39d970e3d1a1b0ef16b06792d23cc87be68ed4927c7384ddd1f43cb
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:47a658fa7102d99a5dd9fe05f2a5b872deab266138e7955a14ba59e33095738d
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 14 Nov 2023 05:20:01 +0900
Finished: Tue, 14 Nov 2023 05:25:51 +0900
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: mofed
NODE_NAME: (v1:spec.nodeName)
NVIDIA_VISIBLE_DEVICES: void
GPU_DIRECT_RDMA_ENABLED: true
Mounts:
/run/mellanox/drivers from run-mellanox-drivers (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcd85 (ro)
k8s-driver-manager:
Container ID: cri-o://91a1bdb7341f0b22bdd25696ee1f7e009513b245b0e434c169b278d7fb3df675
Image: nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:a360ed5b1335436ef61cd601fa776e6d03f15f76aeaa8d88bd1506edd93843dc
Image ID: nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5ca81f4f7e55f7b304dbbb7aaa235fca2656789145e4b34f47a7ab7079704dc7
Port: <none>
Host Port: <none>
Command:
driver-manager
Args:
uninstall_driver
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 14 Nov 2023 05:26:06 +0900
Finished: Tue, 14 Nov 2023 05:26:38 +0900
Ready: True
Restart Count: 0
Environment:
NODE_NAME: (v1:spec.nodeName)
NVIDIA_VISIBLE_DEVICES: void
ENABLE_GPU_POD_EVICTION: true
ENABLE_AUTO_DRAIN: true
DRAIN_USE_FORCE: false
DRAIN_POD_SELECTOR_LABEL:
DRAIN_TIMEOUT_SECONDS: 0s
DRAIN_DELETE_EMPTYDIR_DATA: false
OPERATOR_NAMESPACE: nvidia-gpu-operator (v1:metadata.namespace)
Mounts:
/host from host-root (ro)
/run/nvidia from run-nvidia (rw)
/sys from host-sys (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcd85 (ro)
Containers:
nvidia-driver-ctr:
Container ID: cri-o://afa3c3302c3632b1dbe012c4cbd98c72bf427798731dfbb7de96a3e6f834dde2
Image: nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
Image ID: nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
Port: <none>
Host Port: <none>
Command:
ocp_dtk_entrypoint
Args:
nv-ctr-run-with-dtk
State: Running
Started: Tue, 14 Nov 2023 05:26:50 +0900
Ready: True
Restart Count: 0
Startup: exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
Environment:
GPU_DIRECT_RDMA_ENABLED: true
OPENSHIFT_VERSION: 4.12
Mounts:
/dev/log from dev-log (rw)
/host-etc/os-release from host-os-release (ro)
/mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
/run/mellanox/drivers from run-mellanox-drivers (rw)
/run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
/run/nvidia from run-nvidia (rw)
/run/nvidia-topologyd from run-nvidia-topologyd (rw)
/var/log from var-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcd85 (ro)
nvidia-peermem-ctr:
Container ID: cri-o://9840e4da6a6c0ade2d89706174fc4cb653a80f4224c66ece44baae4dd5675521
Image: nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
Image ID: nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
Port: <none>
Host Port: <none>
Command:
nvidia-driver
Args:
reload_nvidia_peermem
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 14 Nov 2023 09:12:50 +0900
Finished: Tue, 14 Nov 2023 09:12:50 +0900
Ready: False
Restart Count: 49
Liveness: exec [sh -c nvidia-driver probe_nvidia_peermem] delay=30s timeout=10s period=30s #success=1 #failure=1
Startup: exec [sh -c nvidia-driver probe_nvidia_peermem] delay=10s timeout=10s period=10s #success=1 #failure=120
Environment: <none>
Mounts:
/dev/log from dev-log (ro)
/run/mellanox/drivers from run-mellanox-drivers (rw)
/run/nvidia from run-nvidia (rw)
/var/log from var-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcd85 (ro)
openshift-driver-toolkit-ctr:
Container ID: cri-o://ebff88db5258e776aa02d8176fee4c780a311686fb0cf3d8b7c5f93e4e4edb70
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
Port: <none>
Host Port: <none>
Command:
bash
-xc
Args:
until [ -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ]; do echo Waiting for nvidia-driver-ctr container to prepare the shared directory ...; sleep 10; done; exec /mnt/shared-nvidia-driver-toolkit/ocp_dtk_entrypoint dtk-build-driver
State: Running
Started: Tue, 14 Nov 2023 05:27:20 +0900
Ready: True
Restart Count: 0
Environment:
RHCOS_VERSION: 412.86.202306132230-0
NVIDIA_VISIBLE_DEVICES: void
Mounts:
/host-etc/os-release from host-os-release (ro)
/mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
/run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
/var/log from var-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcd85 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: DirectoryOrCreate
var-log:
Type: HostPath (bare host directory volume)
Path: /var/log
HostPathType:
dev-log:
Type: HostPath (bare host directory volume)
Path: /dev/log
HostPathType:
host-os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
run-nvidia-topologyd:
Type: HostPath (bare host directory volume)
Path: /run/nvidia-topologyd
HostPathType: DirectoryOrCreate
mlnx-ofed-usr-src:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers/usr/src
HostPathType: DirectoryOrCreate
run-mellanox-drivers:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-sys:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
shared-nvidia-driver-toolkit:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-tcd85:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: BestEffort
Node-Selectors: feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=412.86.202306132230-0
nvidia.com/gpu.deploy.driver=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 155m (x20 over 3h48m) kubelet Container image "nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad" already present on machine
Warning BackOff 43s (x1136 over 3h47m) kubelet Back-off restarting failed container
Containers.nvidia-peermem-ctr should have had a mount for /host-etc/os-release, but unfortunately it didn't.
@takeshi-yoshimura we are aware of this issue and are fixing it as part of the v23.9.1 release later this month. As a workaround, you can edit the nvidia-driver-daemonset and add an env RHEL_VERSION="" to the nvidia-peermem-ctr container.
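For example, a minimal sketch of that workaround applied via oc edit ds nvidia-driver-daemonset-412.86.202306132230-0 -n nvidia-gpu-operator (only the env entry is new; the container name, command, and args are taken from the describe output above):

# sketch of the nvidia-peermem-ctr container spec with the workaround env added
- name: nvidia-peermem-ctr
  command: ["nvidia-driver"]
  args: ["reload_nvidia_peermem"]
  env:
  # defining the variable, even as empty, keeps set -u in nvidia-driver from aborting
  - name: RHEL_VERSION
    value: ""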
Sounds good. Thanks!