nvidia-cuda-validator pods crashlooping in OKD 4.7
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node? No
- [ ] Are you running Kubernetes v1.13+? Yes. v1.20
- [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? crio
- [ ] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes? Yes
- [ ] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)? Yes
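For reference, a minimal way to spot-check the items above could look like this (a sketch; it assumes shell access to the worker node and a kubeconfig with cluster access):

```sh
# Confirm the kernel modules from the checklist are loaded on the node
lsmod | grep -E 'i2c_core|ipmi_msghandler'

# Confirm the CRI-O version
crio --version

# Confirm the gpu-operator ClusterPolicy CRD/CR is applied
kubectl describe clusterpolicies --all-namespaces
```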
1. Issue or feature description
I deployed the gpu-operator in an OKD (4.7.0) cluster, but the nvidia-cuda-validator pods keep crashlooping, like issue #253.
2. Steps to reproduce the issue
- Install the NVIDIA driver (470.57.02) and CUDA (11.4.1) directly on the GPU machine running Fedora CoreOS, not in a container.
- Helm install the gpu-operator (1.8.1) in the cluster with the --set driver.enabled=false parameter.
- Mirror all the needed images to a local repository and change values.yaml to pull from it.
- In the gpu-operator namespace, the single pod is running normally. But in the gpu-operator-resources namespace, 5 pods are running OK, except the nvidia-cuda-validator pod, whose init container crashes all the time with this log:
  Failed to allocate device vector A (error code no CUDA-capable device is detected)! [Vector addition of 50000 elements]
  At the same time, nvidia-operator-validator is blocked at init 2/4, waiting for it to complete. The strange thing is that it does not download the cuda:11.4.1-base-ubi8 image, so I guess it is an SCC problem or something like that? Or is it related to CUDA being installed directly on the machine? Please help me with this issue, thanks.
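For anyone reproducing this, the failing init container's log and the pod events can be pulled with something like the following (a sketch; the pod name is an example, substitute the current one):

```sh
# Log of the crashing cuda-validation init container
oc logs -n gpu-operator-resources nvidia-cuda-validator-wvcbh -c cuda-validation

# Pod events and container states
oc describe pod -n gpu-operator-resources nvidia-cuda-validator-wvcbh
```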
@william0212 Can you share the output of nvidia-smi run from the driver pod or any of the plugin/GFD pods? Is the GPU an A100 80GB? Also, can you share the server model and the output of lspci -vvv -d 10de: -xxx?
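A sketch of how that output could be collected when the driver runs on the host (the app label below is assumed from a default gpu-operator install):

```sh
# Run nvidia-smi from the device-plugin pod
POD=$(oc -n gpu-operator-resources get pod -l app=nvidia-device-plugin-daemonset -o name | head -n 1)
oc -n gpu-operator-resources exec "$POD" -- nvidia-smi

# PCI details directly from the node
lspci -vvv -d 10de: -xxx
```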
My GPU is a V100 32GB. There is no driver pod, because I installed the driver directly on the host and set --set driver.enabled=false when deploying the GPU operator. The log below is from the driver-validation container of the nvidia-operator-validator pod:

running command chroot with args [/run/nvidia/driver nvidia-smi]
Thu Sep 16 01:12:40 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   34C    P0    27W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The gpu-feature-discovery pod is just waiting, like this:

gpu-feature-discovery: 2021/09/16 01:12:55 Running gpu-feature-discovery in version v0.4.1
gpu-feature-discovery: 2021/09/16 01:12:55 Loaded configuration:
gpu-feature-discovery: 2021/09/16 01:12:55 Oneshot: false
gpu-feature-discovery: 2021/09/16 01:12:55 FailOnInitError: true
gpu-feature-discovery: 2021/09/16 01:12:55 SleepInterval: 1m0s
gpu-feature-discovery: 2021/09/16 01:12:55 MigStrategy: single
gpu-feature-discovery: 2021/09/16 01:12:55 NoTimestamp: false
gpu-feature-discovery: 2021/09/16 01:12:55 OutputFilePath: /etc/kubernetes/node-feature-discovery/features.d/gfd
gpu-feature-discovery: 2021/09/16 01:12:55 Start running
gpu-feature-discovery: 2021/09/16 01:12:55 Writing labels to output file
gpu-feature-discovery: 2021/09/16 01:12:55 Sleeping for 1m0s

My server is a Dell machine running Fedora CoreOS as the base for the OKD platform. The lspci command you asked for shows:

[root@worker200 core]# lspci -vvv -d 10de: -xxx
3b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
        Subsystem: NVIDIA Corporation Device 124a
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 156
        NUMA node: 0
        Region 0: Memory at ab000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 382000000000 (64-bit, prefetchable) [size=32G]
        Region 3: Memory at 382800000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00078  Data: 0000
        Capabilities: [78] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75.000W
                DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x16 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Via message, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest- Retimer- 2Retimers-
                         CrosslinkRes: unsupported
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [258 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [ac0 v1] Designated Vendor-Specific: Vendor=10de ID=0001 Rev=1 Len=12 <?>
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia
00: de 10 b6 1d 07 04 10 00 a1 00 02 03 00 00 00 00
10: 00 00 00 ab 0c 00 00 00 20 38 00 00 0c 00 00 00
20: 28 38 00 00 00 00 00 00 00 00 00 00 de 10 4a 12
30: 00 00 00 00 60 00 00 00 00 00 00 00 0b 01 00 00
40: de 10 4a 12 00 00 00 00 00 00 00 00 00 00 00 00
50: 03 00 00 00 01 00 00 00 ce d6 23 00 00 00 00 00
60: 01 68 03 00 08 00 00 00 05 78 81 00 78 00 e0 fe
70: 00 00 00 00 00 00 00 00 10 00 02 00 e1 8d 2c 01
80: 3e 21 00 00 03 41 45 00 40 01 03 11 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 13 00 04 00
a0: 06 00 00 00 0e 00 00 00 03 00 1f 00 00 00 00 00
b0: 00 00 00 00 09 00 14 01 00 00 10 80 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Another piece of information about Node Feature Discovery: I installed version 4.8.0 of the Red Hat NFD operator from the OKD OperatorHub, and today I found that all the nfd-worker pods in the openshift-operators namespace are in CrashLoopBackOff with the log below:
1 nfd-worker.go:186] Node Feature Discovery Worker 1.16
I0916 01:06:26.742837 1 nfd-worker.go:187] NodeName: 'worker200.okd.med.thu'
I0916 01:06:26.743197 1 nfd-worker.go:422] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0916 01:06:26.743224 1 nfd-worker.go:457] worker (re-)configuration successfully completed
I0916 01:06:26.743253 1 nfd-worker.go:316] connecting to nfd-master at nfd-master:12000 ...
I0916 01:06:26.743271 1 clientconn.go:245] parsed scheme: ""
I0916 01:06:26.743281 1 clientconn.go:251] scheme "" not registered, fallback to default scheme
I0916 01:06:26.743307 1 resolver_conn_wrapper.go:172] ccResolverWrapper: sending update to cc: {[{nfd-master:12000
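Since the worker is stuck connecting to nfd-master:12000, a minimal check of the master side could look like this (a sketch; the namespace and label are assumed from the Red Hat NFD operator install):

```sh
# Does the nfd-master service exist and have endpoints?
oc -n openshift-operators get svc,endpoints nfd-master

# Is nfd-master itself healthy?
oc -n openshift-operators logs -l app=nfd-master --tail=50
```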
Today I uninstalled the Red Hat NFD operator and installed the official NFD (v0.9.0). All the pods are running.
After that, I used this command:
helm install --wait --generate-name \
  ./gpu-operator \
  --set nfd.enabled=false \
  --set operator.defaultRuntime=crio \
  --set driver.enabled=false
(nfd.enabled=false because I have already deployed NFD as described above; driver.enabled=false because I installed the driver directly on the machine.)
The result is the same. It does not download the cuda image 11.4.1-base-ubi8. Here is the YAML of the nvidia-cuda-validator pod:
kind: Pod
apiVersion: v1
metadata:
generateName: nvidia-cuda-validator-
annotations:
k8s.ovn.org/pod-networks: >-
{"default":{"ip_addresses":["10.143.0.189/23"],"mac_address":"0a:58:0a:8f:00:bd","gateway_ips":["10.143.0.1"],"ip_address":"10.143.0.189/23","gateway_ip":"10.143.0.1"}}
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "",
"interface": "eth0",
"ips": [
"10.143.0.189"
],
"mac": "0a:58:0a:8f:00:bd",
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "",
"interface": "eth0",
"ips": [
"10.143.0.189"
],
"mac": "0a:58:0a:8f:00:bd",
"default": true,
"dns": {}
}]
openshift.io/scc: restricted
selfLink: /api/v1/namespaces/gpu-operator-resources/pods/nvidia-cuda-validator-wvcbh
resourceVersion: '10539813'
name: nvidia-cuda-validator-wvcbh
uid: c852a397-37b3-45aa-8c1a-4a3874a65098
creationTimestamp: '2021-09-16T12:47:59Z'
managedFields:
- manager: nvidia-validator
operation: Update
apiVersion: v1
time: '2021-09-16T12:47:59Z'
fieldsType: FieldsV1
fieldsV1:
'f:metadata':
'f:generateName': {}
'f:labels':
.: {}
'f:app': {}
'f:ownerReferences':
.: {}
'k:{"uid":"6d897169-f161-4003-840c-0a0760d7aa65"}':
.: {}
'f:apiVersion': {}
'f:blockOwnerDeletion': {}
'f:controller': {}
'f:kind': {}
'f:name': {}
'f:uid': {}
'f:spec':
'f:nodeName': {}
'f:containers':
'k:{"name":"nvidia-cuda-validator"}':
'f:image': {}
'f:terminationMessagePolicy': {}
.: {}
'f:resources': {}
'f:args': {}
'f:command': {}
'f:securityContext':
.: {}
'f:allowPrivilegeEscalation': {}
'f:terminationMessagePath': {}
'f:imagePullPolicy': {}
'f:name': {}
'f:dnsPolicy': {}
'f:tolerations': {}
'f:serviceAccount': {}
'f:restartPolicy': {}
'f:schedulerName': {}
'f:terminationGracePeriodSeconds': {}
'f:initContainers':
.: {}
'k:{"name":"cuda-validation"}':
'f:image': {}
'f:terminationMessagePolicy': {}
.: {}
'f:resources': {}
'f:args': {}
'f:command': {}
'f:securityContext':
.: {}
'f:allowPrivilegeEscalation': {}
'f:terminationMessagePath': {}
'f:imagePullPolicy': {}
'f:name': {}
'f:serviceAccountName': {}
'f:enableServiceLinks': {}
'f:securityContext':
.: {}
'f:fsGroup': {}
'f:seLinuxOptions':
'f:level': {}
- manager: ovnkube
operation: Update
apiVersion: v1
time: '2021-09-16T12:47:59Z'
fieldsType: FieldsV1
fieldsV1:
'f:metadata':
'f:annotations':
'f:k8s.ovn.org/pod-networks': {}
- manager: multus
operation: Update
apiVersion: v1
time: '2021-09-16T12:48:01Z'
fieldsType: FieldsV1
fieldsV1:
'f:metadata':
'f:annotations':
'f:k8s.v1.cni.cncf.io/network-status': {}
'f:k8s.v1.cni.cncf.io/networks-status': {}
- manager: kubelet
operation: Update
apiVersion: v1
time: '2021-09-16T12:48:02Z'
fieldsType: FieldsV1
fieldsV1:
'f:status':
'f:conditions':
.: {}
'k:{"type":"ContainersReady"}':
.: {}
'f:lastProbeTime': {}
'f:lastTransitionTime': {}
'f:message': {}
'f:reason': {}
'f:status': {}
'f:type': {}
'k:{"type":"Initialized"}':
.: {}
'f:lastProbeTime': {}
'f:lastTransitionTime': {}
'f:message': {}
'f:reason': {}
'f:status': {}
'f:type': {}
'k:{"type":"PodScheduled"}':
.: {}
'f:lastProbeTime': {}
'f:lastTransitionTime': {}
'f:status': {}
'f:type': {}
'k:{"type":"Ready"}':
.: {}
'f:lastProbeTime': {}
'f:lastTransitionTime': {}
'f:message': {}
'f:reason': {}
'f:status': {}
'f:type': {}
'f:containerStatuses': {}
'f:hostIP': {}
'f:initContainerStatuses': {}
'f:podIP': {}
'f:podIPs':
.: {}
'k:{"ip":"10.143.0.189"}':
.: {}
'f:ip': {}
'f:startTime': {}
namespace: gpu-operator-resources
ownerReferences:
- apiVersion: nvidia.com/v1
kind: ClusterPolicy
name: cluster-policy
uid: 6d897169-f161-4003-840c-0a0760d7aa65
controller: true
blockOwnerDeletion: true
labels:
app: nvidia-cuda-validator
spec:
restartPolicy: OnFailure
initContainers:
- resources: {}
terminationMessagePath: /dev/termination-log
name: cuda-validation
command:
- sh
- '-c'
securityContext:
capabilities:
drop:
- KILL
- MKNOD
- SETGID
- SETUID
runAsUser: 1000700000
allowPrivilegeEscalation: false
imagePullPolicy: IfNotPresent
volumeMounts:
- name: nvidia-operator-validator-token-ltf8d
readOnly: true
mountPath: /var/run/secrets/kubernetes.io/serviceaccount
terminationMessagePolicy: File
image: 'base.med.thu/gpuins/gpu-operator-validator:v1.8.1'
args:
- vectorAdd
serviceAccountName: nvidia-operator-validator
imagePullSecrets:
- name: nvidia-operator-validator-dockercfg-v9vd8
priority: 0
schedulerName: default-scheduler
enableServiceLinks: true
terminationGracePeriodSeconds: 30
preemptionPolicy: PreemptLowerPriority
nodeName: worker200.okd.med.thu
securityContext:
seLinuxOptions:
level: 's0:c26,c25'
fsGroup: 1000700000
containers:
- resources: {}
terminationMessagePath: /dev/termination-log
name: nvidia-cuda-validator
command:
- sh
- '-c'
securityContext:
capabilities:
drop:
- KILL
- MKNOD
- SETGID
- SETUID
runAsUser: 1000700000
allowPrivilegeEscalation: false
imagePullPolicy: IfNotPresent
volumeMounts:
- name: nvidia-operator-validator-token-ltf8d
readOnly: true
mountPath: /var/run/secrets/kubernetes.io/serviceaccount
terminationMessagePolicy: File
image: 'base.med.thu/gpuins/gpu-operator-validator:v1.8.1'
args:
- echo cuda workload validation is successful
serviceAccount: nvidia-operator-validator
volumes:
- name: nvidia-operator-validator-token-ltf8d
secret:
secretName: nvidia-operator-validator-token-ltf8d
defaultMode: 420
dnsPolicy: ClusterFirst
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
- key: node.kubernetes.io/not-ready
operator: Exists
effect: NoExecute
tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
operator: Exists
effect: NoExecute
tolerationSeconds: 300
status:
containerStatuses:
- name: nvidia-cuda-validator
state:
waiting:
reason: PodInitializing
lastState: {}
ready: false
restartCount: 0
image: 'base.med.thu/gpuins/gpu-operator-validator:v1.8.1'
imageID: ''
started: false
qosClass: BestEffort
podIPs:
- ip: 10.143.0.189
podIP: 10.143.0.189
hostIP: 172.28.201.200
startTime: '2021-09-16T12:47:59Z'
initContainerStatuses:
- name: cuda-validation
state:
terminated:
exitCode: 1
reason: Error
startedAt: '2021-09-16T12:51:08Z'
finishedAt: '2021-09-16T12:51:08Z'
containerID: >-
cri-o://8e9118bfd55c231429750388203f9a5281609a43fa21378b89feb816b5aadda4
lastState:
terminated:
exitCode: 1
reason: Error
startedAt: '2021-09-16T12:49:37Z'
finishedAt: '2021-09-16T12:49:37Z'
containerID: >-
cri-o://6831c452a768f387cedd0db473d1bf33b5d308fe0fc1ee9cdd1836ee798810c4
ready: false
restartCount: 5
image: 'base.med.thu/gpuins/gpu-operator-validator:v1.8.1'
imageID: >-
base.med.thu/gpuins/gpu-operator-validator@sha256:7a70e95fd19c3425cd4394f4b47bbf2119a70bd22d67d72e485b4d730853262c
containerID: 'cri-o://8e9118bfd55c231429750388203f9a5281609a43fa21378b89feb816b5aadda4'
conditions:
- type: Initialized
status: 'False'
lastProbeTime: null
lastTransitionTime: '2021-09-16T12:47:59Z'
reason: ContainersNotInitialized
message: 'containers with incomplete status: [cuda-validation]'
- type: Ready
status: 'False'
lastProbeTime: null
lastTransitionTime: '2021-09-16T12:47:59Z'
reason: ContainersNotReady
message: 'containers with unready status: [nvidia-cuda-validator]'
- type: ContainersReady
status: 'False'
lastProbeTime: null
lastTransitionTime: '2021-09-16T12:47:59Z'
reason: ContainersNotReady
message: 'containers with unready status: [nvidia-cuda-validator]'
- type: PodScheduled
status: 'True'
lastProbeTime: null
lastTransitionTime: '2021-09-16T12:47:59Z'
phase: Pending
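Since an SCC problem was suspected, the openshift.io/scc annotation in the YAML above already shows which SCC admitted the pod; it can also be read back directly, e.g. (pod name taken from the YAML above):

```sh
oc -n gpu-operator-resources get pod nvidia-cuda-validator-wvcbh \
  -o jsonpath='{.metadata.annotations.openshift\.io/scc}{"\n"}'
```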
helm install --wait --generate-name ./gpu-operator \
  --set nfd.enabled=false \
  --set operator.defaultRuntime=crio \
  --set driver.enabled=false
(nfd.enabled=false because NFD is already deployed; driver.enabled=false because the driver is installed on the host.)
For a Helm install on OCP you have to override the toolkit/dcgm images as well:
helm install gpu-operator nvidia/gpu-operator --version=1.8.2 --set platform.openshift=true,operator.defaultRuntime=crio,nfd.enabled=false,toolkit.version=1.7.1-ubi8,dcgmExporter.version=2.2.9-2.4.0-ubi8,dcgm.version=2.2.3-ubi8,migManager.version=v0.1.3-ubi8
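A quick way to confirm the overridden image versions actually took effect is to read the images back from the deployed daemonsets, for example:

```sh
oc -n gpu-operator-resources get ds \
  -o custom-columns='NAME:.metadata.name,IMAGE:.spec.template.spec.containers[*].image'
```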
@shivamerla I followed your instructions. The result is the same as before.
The nvidia-cuda-validator init container still fails, and it didn't download the CUDA image. Please help me find what controls downloading the CUDA image; I think that is the problem.
Or is there some configuration problem with node-feature-discovery or gpu-feature-discovery?
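One way to rule out NFD/GFD is to inspect the labels they write onto the node, e.g. (node name taken from the pod YAML above):

```sh
oc get node worker200.okd.med.thu --show-labels | tr ',' '\n' | grep -Ei 'nvidia|feature'
```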
This is the log from nvidia-container-toolkit; there are some errors in it:
nvidia-container-toolkit-daemonset-nv6jk-nvidia-container-toolkit-ctr.log
@william0212 The cuda-validator pod doesn't download any CUDA images; the vectorAdd sample is built into the gpu-operator-validator image and is invoked at runtime. Wondering if the CUDA 11.4.1 package installed directly on the host is causing any of this.
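One way to test the host-side CUDA 11.4.1 install on its own is to run the toolkit's bundled samples directly on the node; this is a sketch that assumes a standard /usr/local/cuda-11.4 install path:

```sh
# deviceQuery ships in the CUDA toolkit demo suite and should list both V100s
/usr/local/cuda-11.4/extras/demo_suite/deviceQuery

# The host driver itself already looks fine per the nvidia-smi output above
nvidia-smi -L
```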
We should see toolkit logs on the host after adding the debug fields shown below:
$ cat /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
disable-require = false
[nvidia-container-cli]
debug = "/var/log/nvidia-container-cli.log"
environment = []
ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
load-kmods = true
path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
root = "/run/nvidia/driver"
[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
[core@ocp-mgmt-host ~]$
[core@ocp-mgmt-host ~]$ oc get pods -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-tm7nr 1/1 Running 2 6d20h
nvidia-container-toolkit-daemonset-xprxd 1/1 Running 0 6d20h
nvidia-cuda-validator-5xgst 0/1 Completed 0 6d20h
nvidia-dcgm-exporter-v29mn 1/1 Running 0 6d20h
nvidia-dcgm-q5lz7 1/1 Running 1 6d20h
nvidia-device-plugin-daemonset-92q8r 1/1 Running 1 6d20h
nvidia-device-plugin-validator-5lk29 0/1 Completed 0 6d20h
nvidia-driver-daemonset-p4cvr 1/1 Running 0 6d20h
nvidia-node-status-exporter-jc6zz 1/1 Running 0 6d20h
nvidia-operator-validator-xgmtj 1/1 Running 0 6d20h
[core@ocp-mgmt-host ~]$ oc delete pod nvidia-operator-validator-xgmtj -n gpu-operator-resources
pod "nvidia-operator-validator-xgmtj" deleted
[core@ocp-mgmt-host ~]$ ls -ltr /var/log/nvidia-container*
-rw-r--r--. 1 root root 154810 Oct 4 20:42 /var/log/nvidia-container-cli.log
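With those debug fields in place, the generated logs can be read on the node after the validator pod is recreated, e.g.:

```sh
sudo tail -n 100 /var/log/nvidia-container-cli.log /var/log/nvidia-container-runtime.log
```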
In one of our projects we faced the same issue. To fix it, try uninstalling the NVIDIA driver from the node, set driver.enabled=true, and choose the right driver version (not every NVIDIA driver release has a corresponding driver image). Setting driver.enabled=true lets the GPU Operator install the driver and CUDA itself. I also think that when we set driver.enabled=false, the CUDA validator should be disabled as well.
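A sketch of that suggestion in Helm terms (the release name and driver version are examples; driver.version must match a driver image published for the node OS):

```sh
# After removing the host-installed driver, let the operator manage it instead
helm upgrade gpu-operator nvidia/gpu-operator \
  --set driver.enabled=true \
  --set driver.version=470.57.02
```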
I have 3 nodes with Tesla T4, A100, and A30 GPUs. On the Tesla T4 node nvidia-cuda-validator completes successfully, but on the A100 and A30 nodes nvidia-cuda-validator keeps crashlooping. The cuda-validator container's log shows: "[Vector addition of 50000 elements] Failed to allocate vector A (error code initialization error)!" Is there any way to fix this?