
nvidia-cuda-validator pods crashlooping in OKD4.7

Open william0212 opened this issue 4 years ago • 9 comments

1. Quick Debug Checklist

  • [ ] Are you running on an Ubuntu 18.04 node? No
  • [ ] Are you running Kubernetes v1.13+? Yes. v1.20
  • [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? crio
  • [ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes? yes
  • [ ] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)? Yes

1. Issue or feature description

I deployed the gpu-operator in an OKD (4.7.0) cluster, but the nvidia-cuda-validator pods are crashlooping all the time, similar to issue #253.

2. Steps to reproduce the issue

  1. Install the NVIDIA driver (470.57.02) and CUDA (11.4.1) directly on the GPU machine running Fedora CoreOS, not in a container.
  2. Install the gpu-operator (1.8.1) in the cluster via Helm with the --set driver.enabled=false parameter (a sketch of the full command follows this list).
  3. Mirror all the required images to a local repository and change values.yaml to pull from it.
  4. In the gpu-operator namespace, one pod runs normally. In the gpu-operator-resources namespace, 5 pods run OK, but the nvidia-cuda-validator init container crashes all the time with the log below: Failed to allocate device vector A (error code no CUDA-capable device is detected)! [Vector addition of 50000 elements] At the same time the nvidia-operator-validator pod is blocked at init 2/4, waiting for it to complete. The strange thing I find is that it does not download the cuda:11.4.1-base-ubi8 image, so I guess it is an SCC problem or something like that? Or is it related to CUDA being installed directly on the machine? Please help me with this issue, thanks.
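For reference, a command along these lines covers steps 2-3; the registry name and the exact --set keys are illustrative, since the values.yaml layout differs between chart versions:

# skip the driver container and pull the remaining images from a local mirror
# (registry URL below is an example, not the real one)
helm install --wait --generate-name ./gpu-operator \
  --set driver.enabled=false \
  --set operator.defaultRuntime=crio \
  --set validator.repository=registry.example.com/gpu-operator \
  --set operator.repository=registry.example.com/gpu-operator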

william0212 avatar Sep 15 '21 09:09 william0212

@william0212 Can you share the output of nvidia-smi run from the driver pod or any of the plugin/GFD pods? Is the GPU an A100 80GB? Also, can you share the server model and the output of lspci -vvv -d 10de: -xxx?
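If it helps, one way to collect that from an OKD/OpenShift node is via a debug pod; this is only a sketch (substitute your own node name):

# run nvidia-smi and lspci directly on the worker node
oc debug node/<node-name> -- chroot /host nvidia-smi
oc debug node/<node-name> -- chroot /host lspci -vvv -d 10de: -xxx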

shivamerla avatar Sep 15 '21 17:09 shivamerla

My GPU is a V100 32G. There is no driver pod, because I installed the driver directly on the host and set --set driver.enabled=false when deploying the GPU operator. The log below is from the driver-validation container of the nvidia-operator-validator pod:

running command chroot with args [/run/nvidia/driver nvidia-smi]
Thu Sep 16 01:12:40 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   34C    P0    27W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The gpu-feature-discovery (GFD) pod is just waiting, like this:

gpu-feature-discovery: 2021/09/16 01:12:55 Running gpu-feature-discovery in version v0.4.1
gpu-feature-discovery: 2021/09/16 01:12:55 Loaded configuration:
gpu-feature-discovery: 2021/09/16 01:12:55 Oneshot: false
gpu-feature-discovery: 2021/09/16 01:12:55 FailOnInitError: true
gpu-feature-discovery: 2021/09/16 01:12:55 SleepInterval: 1m0s
gpu-feature-discovery: 2021/09/16 01:12:55 MigStrategy: single
gpu-feature-discovery: 2021/09/16 01:12:55 NoTimestamp: false
gpu-feature-discovery: 2021/09/16 01:12:55 OutputFilePath: /etc/kubernetes/node-feature-discovery/features.d/gfd
gpu-feature-discovery: 2021/09/16 01:12:55 Start running
gpu-feature-discovery: 2021/09/16 01:12:55 Writing labels to output file
gpu-feature-discovery: 2021/09/16 01:12:55 Sleeping for 1m0s

My server is a Dell, running Fedora CoreOS as the base OS for the OKD platform. The lspci command you asked for shows:

[root@worker200 core]# lspci -vvv -d 10de: -xxx
3b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
    Subsystem: NVIDIA Corporation Device 124a
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 156
    NUMA node: 0
    Region 0: Memory at ab000000 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at 382000000000 (64-bit, prefetchable) [size=32G]
    Region 3: Memory at 382800000000 (64-bit, prefetchable) [size=32M]
    Capabilities: [60] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee00078  Data: 0000
    Capabilities: [78] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75.000W
        DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq+ RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
                MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported
                ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+ ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s (ok), Width x16 (ok) TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR- 10BitTagComp- 10BitTagReq- OBFF Via message, ExtFmt- EETLPPrefix- EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- FRS- TPHComp- ExtTPHComp- AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- OBFF Disabled, AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+ EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest- Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [100 v1] Virtual Channel
        Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb: Fixed- WRR32- WRR64- WRR128-
        Ctrl: ArbSelect=Fixed
        Status: InProgress-
        VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
             Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
             Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
             Status: NegoPending- InProgress-
    Capabilities: [258 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+ PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1- T_CommonMode=0us LTR1.2_Threshold=0ns
        L1SubCtl2: T_PwrOn=10us
    Capabilities: [128 v1] Power Budgeting <?>
    Capabilities: [420 v2] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [ac0 v1] Designated Vendor-Specific: Vendor=10de ID=0001 Rev=1 Len=12 <?>
    Kernel driver in use: nvidia
    Kernel modules: nouveau, nvidia_drm, nvidia
00: de 10 b6 1d 07 04 10 00 a1 00 02 03 00 00 00 00
10: 00 00 00 ab 0c 00 00 00 20 38 00 00 0c 00 00 00
20: 28 38 00 00 00 00 00 00 00 00 00 00 de 10 4a 12
30: 00 00 00 00 60 00 00 00 00 00 00 00 0b 01 00 00
40: de 10 4a 12 00 00 00 00 00 00 00 00 00 00 00 00
50: 03 00 00 00 01 00 00 00 ce d6 23 00 00 00 00 00
60: 01 68 03 00 08 00 00 00 05 78 81 00 78 00 e0 fe
70: 00 00 00 00 00 00 00 00 10 00 02 00 e1 8d 2c 01
80: 3e 21 00 00 03 41 45 00 40 01 03 11 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 13 00 04 00
a0: 06 00 00 00 0e 00 00 00 03 00 1f 00 00 00 00 00
b0: 00 00 00 00 09 00 14 01 00 00 10 80 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

william0212 avatar Sep 16 '21 01:09 william0212

Another piece of information I want to share concerns Node Feature Discovery. I installed version 4.8.0 by Red Hat from the OperatorHub of OKD, and I found today that all the nfd-worker pods in the openshift-operators namespace are in CrashLoopBackOff and show the log below:

1 nfd-worker.go:186] Node Feature Discovery Worker 1.16
I0916 01:06:26.742837 1 nfd-worker.go:187] NodeName: 'worker200.okd.med.thu'
I0916 01:06:26.743197 1 nfd-worker.go:422] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0916 01:06:26.743224 1 nfd-worker.go:457] worker (re-)configuration successfully completed
I0916 01:06:26.743253 1 nfd-worker.go:316] connecting to nfd-master at nfd-master:12000 ...
I0916 01:06:26.743271 1 clientconn.go:245] parsed scheme: ""
I0916 01:06:26.743281 1 clientconn.go:251] scheme "" not registered, fallback to default scheme
I0916 01:06:26.743307 1 resolver_conn_wrapper.go:172] ccResolverWrapper: sending update to cc: {[{nfd-master:12000 0 }] }
I0916 01:06:26.743315 1 clientconn.go:674] ClientConn switching balancer to "pick_first"
I0916 01:06:26.747659 1 nfd-worker.go:468] starting feature discovery...
I0916 01:06:26.784109 1 nfd-worker.go:480] feature discovery completed
I0916 01:06:26.784132 1 nfd-worker.go:550] sending labeling request to nfd-master
E0916 01:06:26.788670 1 nfd-worker.go:557] failed to set node labels: rpc error: code = Unknown desc = nodes "worker200.okd.med.thu" is forbidden: User "system:serviceaccount:openshift-nfd:nfd-master" cannot get resource "nodes" in API group "" at the cluster scope
I0916 01:06:26.788711 1 nfd-worker.go:330] closing connection to nfd-master ...
F0916 01:06:26.788732 1 main.go:63] failed to advertise labels: rpc error: code = Unknown desc = nodes "worker200.okd.med.thu" is forbidden: User "system:serviceaccount:openshift-nfd:nfd-master" cannot get resource "nodes" in API group "" at the cluster scope

Is this the reason for the problem, and how can I fix it? Thanks again for your help.
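For what it's worth, that failure is an RBAC problem on the nfd-master service account rather than anything GPU-specific; a quick way to confirm it, with the names taken from the log above, is something like:

# check whether the nfd-master service account may read nodes at cluster scope
oc auth can-i get nodes --as=system:serviceaccount:openshift-nfd:nfd-master
# list cluster role bindings that reference NFD
oc get clusterrolebinding -o wide | grep -i nfd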

william0212 avatar Sep 16 '21 01:09 william0212

Today I uninstalled the Red Hat NFD operator and installed the official NFD (v0.9.0). All of its pods are running. After that, I used the command:

helm install --wait --generate-name ./gpu-operator \
  --set nfd.enabled=false \
  --set operator.defaultRuntime=crio \
  --set driver.enabled=false

(nfd.enabled=false because NFD is already deployed as above; driver.enabled=false because the driver is installed on the local machine.)

The result is the same, and it still does not download the cuda:11.4.1-base-ubi8 image. Here is the YAML of the nvidia-cuda-validator pod:

kind: Pod
apiVersion: v1
metadata:
  generateName: nvidia-cuda-validator-
  annotations:
    k8s.ovn.org/pod-networks: >-
      {"default":{"ip_addresses":["10.143.0.189/23"],"mac_address":"0a:58:0a:8f:00:bd","gateway_ips":["10.143.0.1"],"ip_address":"10.143.0.189/23","gateway_ip":"10.143.0.1"}}
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.143.0.189"
          ],
          "mac": "0a:58:0a:8f:00:bd",
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.143.0.189"
          ],
          "mac": "0a:58:0a:8f:00:bd",
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: restricted
  selfLink: /api/v1/namespaces/gpu-operator-resources/pods/nvidia-cuda-validator-wvcbh
  resourceVersion: '10539813'
  name: nvidia-cuda-validator-wvcbh
  uid: c852a397-37b3-45aa-8c1a-4a3874a65098
  creationTimestamp: '2021-09-16T12:47:59Z'
  managedFields:
    - manager: nvidia-validator
      operation: Update
      apiVersion: v1
      time: '2021-09-16T12:47:59Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:generateName': {}
          'f:labels':
            .: {}
            'f:app': {}
          'f:ownerReferences':
            .: {}
            'k:{"uid":"6d897169-f161-4003-840c-0a0760d7aa65"}':
              .: {}
              'f:apiVersion': {}
              'f:blockOwnerDeletion': {}
              'f:controller': {}
              'f:kind': {}
              'f:name': {}
              'f:uid': {}
        'f:spec':
          'f:nodeName': {}
          'f:containers':
            'k:{"name":"nvidia-cuda-validator"}':
              'f:image': {}
              'f:terminationMessagePolicy': {}
              .: {}
              'f:resources': {}
              'f:args': {}
              'f:command': {}
              'f:securityContext':
                .: {}
                'f:allowPrivilegeEscalation': {}
              'f:terminationMessagePath': {}
              'f:imagePullPolicy': {}
              'f:name': {}
          'f:dnsPolicy': {}
          'f:tolerations': {}
          'f:serviceAccount': {}
          'f:restartPolicy': {}
          'f:schedulerName': {}
          'f:terminationGracePeriodSeconds': {}
          'f:initContainers':
            .: {}
            'k:{"name":"cuda-validation"}':
              'f:image': {}
              'f:terminationMessagePolicy': {}
              .: {}
              'f:resources': {}
              'f:args': {}
              'f:command': {}
              'f:securityContext':
                .: {}
                'f:allowPrivilegeEscalation': {}
              'f:terminationMessagePath': {}
              'f:imagePullPolicy': {}
              'f:name': {}
          'f:serviceAccountName': {}
          'f:enableServiceLinks': {}
          'f:securityContext':
            .: {}
            'f:fsGroup': {}
            'f:seLinuxOptions':
              'f:level': {}
    - manager: ovnkube
      operation: Update
      apiVersion: v1
      time: '2021-09-16T12:47:59Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:annotations':
            'f:k8s.ovn.org/pod-networks': {}
    - manager: multus
      operation: Update
      apiVersion: v1
      time: '2021-09-16T12:48:01Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:annotations':
            'f:k8s.v1.cni.cncf.io/network-status': {}
            'f:k8s.v1.cni.cncf.io/networks-status': {}
    - manager: kubelet
      operation: Update
      apiVersion: v1
      time: '2021-09-16T12:48:02Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          'f:conditions':
            .: {}
            'k:{"type":"ContainersReady"}':
              .: {}
              'f:lastProbeTime': {}
              'f:lastTransitionTime': {}
              'f:message': {}
              'f:reason': {}
              'f:status': {}
              'f:type': {}
            'k:{"type":"Initialized"}':
              .: {}
              'f:lastProbeTime': {}
              'f:lastTransitionTime': {}
              'f:message': {}
              'f:reason': {}
              'f:status': {}
              'f:type': {}
            'k:{"type":"PodScheduled"}':
              .: {}
              'f:lastProbeTime': {}
              'f:lastTransitionTime': {}
              'f:status': {}
              'f:type': {}
            'k:{"type":"Ready"}':
              .: {}
              'f:lastProbeTime': {}
              'f:lastTransitionTime': {}
              'f:message': {}
              'f:reason': {}
              'f:status': {}
              'f:type': {}
          'f:containerStatuses': {}
          'f:hostIP': {}
          'f:initContainerStatuses': {}
          'f:podIP': {}
          'f:podIPs':
            .: {}
            'k:{"ip":"10.143.0.189"}':
              .: {}
              'f:ip': {}
          'f:startTime': {}
  namespace: gpu-operator-resources
  ownerReferences:
    - apiVersion: nvidia.com/v1
      kind: ClusterPolicy
      name: cluster-policy
      uid: 6d897169-f161-4003-840c-0a0760d7aa65
      controller: true
      blockOwnerDeletion: true
  labels:
    app: nvidia-cuda-validator
spec:
  restartPolicy: OnFailure
  initContainers:
    - resources: {}
      terminationMessagePath: /dev/termination-log
      name: cuda-validation
      command:
        - sh
        - '-c'
      securityContext:
        capabilities:
          drop:
            - KILL
            - MKNOD
            - SETGID
            - SETUID
        runAsUser: 1000700000
        allowPrivilegeEscalation: false
      imagePullPolicy: IfNotPresent
      volumeMounts:
        - name: nvidia-operator-validator-token-ltf8d
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      terminationMessagePolicy: File
      image: 'base.med.thu/gpuins/gpu-operator-validator:v1.8.1'
      args:
        - vectorAdd
  serviceAccountName: nvidia-operator-validator
  imagePullSecrets:
    - name: nvidia-operator-validator-dockercfg-v9vd8
  priority: 0
  schedulerName: default-scheduler
  enableServiceLinks: true
  terminationGracePeriodSeconds: 30
  preemptionPolicy: PreemptLowerPriority
  nodeName: worker200.okd.med.thu
  securityContext:
    seLinuxOptions:
      level: 's0:c26,c25'
    fsGroup: 1000700000
  containers:
    - resources: {}
      terminationMessagePath: /dev/termination-log
      name: nvidia-cuda-validator
      command:
        - sh
        - '-c'
      securityContext:
        capabilities:
          drop:
            - KILL
            - MKNOD
            - SETGID
            - SETUID
        runAsUser: 1000700000
        allowPrivilegeEscalation: false
      imagePullPolicy: IfNotPresent
      volumeMounts:
        - name: nvidia-operator-validator-token-ltf8d
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      terminationMessagePolicy: File
      image: 'base.med.thu/gpuins/gpu-operator-validator:v1.8.1'
      args:
        - echo cuda workload validation is successful
  serviceAccount: nvidia-operator-validator
  volumes:
    - name: nvidia-operator-validator-token-ltf8d
      secret:
        secretName: nvidia-operator-validator-token-ltf8d
        defaultMode: 420
  dnsPolicy: ClusterFirst
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
status:
  containerStatuses:
    - name: nvidia-cuda-validator
      state:
        waiting:
          reason: PodInitializing
      lastState: {}
      ready: false
      restartCount: 0
      image: 'base.med.thu/gpuins/gpu-operator-validator:v1.8.1'
      imageID: ''
      started: false
  qosClass: BestEffort
  podIPs:
    - ip: 10.143.0.189
  podIP: 10.143.0.189
  hostIP: 172.28.201.200
  startTime: '2021-09-16T12:47:59Z'
  initContainerStatuses:
    - name: cuda-validation
      state:
        terminated:
          exitCode: 1
          reason: Error
          startedAt: '2021-09-16T12:51:08Z'
          finishedAt: '2021-09-16T12:51:08Z'
          containerID: >-
            cri-o://8e9118bfd55c231429750388203f9a5281609a43fa21378b89feb816b5aadda4
      lastState:
        terminated:
          exitCode: 1
          reason: Error
          startedAt: '2021-09-16T12:49:37Z'
          finishedAt: '2021-09-16T12:49:37Z'
          containerID: >-
            cri-o://6831c452a768f387cedd0db473d1bf33b5d308fe0fc1ee9cdd1836ee798810c4
      ready: false
      restartCount: 5
      image: 'base.med.thu/gpuins/gpu-operator-validator:v1.8.1'
      imageID: >-
        base.med.thu/gpuins/gpu-operator-validator@sha256:7a70e95fd19c3425cd4394f4b47bbf2119a70bd22d67d72e485b4d730853262c
      containerID: 'cri-o://8e9118bfd55c231429750388203f9a5281609a43fa21378b89feb816b5aadda4'
  conditions:
    - type: Initialized
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2021-09-16T12:47:59Z'
      reason: ContainersNotInitialized
      message: 'containers with incomplete status: [cuda-validation]'
    - type: Ready
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2021-09-16T12:47:59Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [nvidia-cuda-validator]'
    - type: ContainersReady
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2021-09-16T12:47:59Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [nvidia-cuda-validator]'
    - type: PodScheduled
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2021-09-16T12:47:59Z'
  phase: Pending
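As an aside, the openshift.io/scc: restricted annotation in the dump above shows which SCC admission applied to the pod; a quick way to read it off a live pod (pod name copied from this dump, it will differ on each retry) is something like:

# show the SCC that admission assigned to the validator pod
oc get pod nvidia-cuda-validator-wvcbh -n gpu-operator-resources \
  -o jsonpath='{.metadata.annotations.openshift\.io/scc}{"\n"}'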

william0212 avatar Sep 16 '21 12:09 william0212

helm install --wait --generate-name ./gpu-operator \
  --set nfd.enabled=false \
  --set operator.defaultRuntime=crio \
  --set driver.enabled=false
(nfd.enabled=false because NFD is already deployed separately; driver.enabled=false because the driver is installed on the host)

For a Helm install on OCP you have to override the toolkit/DCGM images as well.

helm install gpu-operator nvidia/gpu-operator --version=1.8.2 --set platform.openshift=true,operator.defaultRuntime=crio,nfd.enabled=false,toolkit.version=1.7.1-ubi8,dcgmExporter.version=2.2.9-2.4.0-ubi8,dcgm.version=2.2.3-ubi8,migManager.version=v0.1.3-ubi8

shivamerla avatar Sep 28 '21 00:09 shivamerla

@shivamerla I followed your instructions. The result is the same as before: the nvidia-cuda-validator init container errors out, and it didn't download the cuda image. Please help me find out what controls downloading the cuda image; I think that is the problem. Or is there some configuration problem with node-feature-discovery or gpu-feature-discovery? (screenshot attached) This is the log from nvidia-container-toolkit; there are some errors in it: nvidia-container-toolkit-daemonset-nv6jk-nvidia-container-toolkit-ctr.log
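For completeness, the toolkit container log referenced above can also be pulled straight from the cluster; the pod and container names here are taken from the attached file name:

oc logs nvidia-container-toolkit-daemonset-nv6jk \
  -n gpu-operator-resources -c nvidia-container-toolkit-ctr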

william0212 avatar Oct 04 '21 10:10 william0212

@william0212 The cuda-validator pod doesn't download any CUDA images; we ship a vectorAdd sample within the gpu-operator-validator image, which gets invoked at runtime. I'm wondering if the cuda 11.4.1 package installed directly on the host is causing any of this.
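One way to narrow that down is to run the same kind of CUDA sample directly on the host, outside the operator; a rough sketch, assuming the CUDA 11.4 samples were installed alongside the toolkit (the path depends on how CUDA was installed):

# build and run the stock vectorAdd sample against the host driver/CUDA install
cd /usr/local/cuda-11.4/samples/0_Simple/vectorAdd   # path is an assumption
make
./vectorAdd   # prints "Test PASSED" when the host driver + CUDA stack is healthy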

We should be able to see the toolkit logs on the host by adding the debug fields as below.

$ cat /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
disable-require = false

[nvidia-container-cli]
  debug = "/var/log/nvidia-container-cli.log"
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"

[nvidia-container-runtime]
  debug = "/var/log/nvidia-container-runtime.log"

[core@ocp-mgmt-host ~]$ 
[core@ocp-mgmt-host ~]$ oc get pods -n gpu-operator-resources
NAME                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-tm7nr                1/1     Running     2          6d20h
nvidia-container-toolkit-daemonset-xprxd   1/1     Running     0          6d20h
nvidia-cuda-validator-5xgst                0/1     Completed   0          6d20h
nvidia-dcgm-exporter-v29mn                 1/1     Running     0          6d20h
nvidia-dcgm-q5lz7                          1/1     Running     1          6d20h
nvidia-device-plugin-daemonset-92q8r       1/1     Running     1          6d20h
nvidia-device-plugin-validator-5lk29       0/1     Completed   0          6d20h
nvidia-driver-daemonset-p4cvr              1/1     Running     0          6d20h
nvidia-node-status-exporter-jc6zz          1/1     Running     0          6d20h
nvidia-operator-validator-xgmtj            1/1     Running     0          6d20h

[core@ocp-mgmt-host ~]$ oc delete pod nvidia-operator-validator-xgmtj -n gpu-operator-resources
pod "nvidia-operator-validator-xgmtj" deleted

[core@ocp-mgmt-host ~]$ ls -ltr /var/log/nvidia-container*
-rw-r--r--. 1 root root 154810 Oct  4 20:42 /var/log/nvidia-container-cli.log
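From there, the interesting part is usually the tail of that log right after the validator pod restarts; for example:

# show the most recent nvidia-container-cli activity on the node
sudo tail -n 50 /var/log/nvidia-container-cli.log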

shivamerla avatar Oct 04 '21 20:10 shivamerla

We faced the same issue in a project. To fix it, try uninstalling the NVIDIA driver from the node, set driver.enabled=true, and choose the right driver version (not every NVIDIA driver version has a corresponding driver image). Setting driver.enabled=true lets the GPU Operator install the driver and CUDA itself. I also think that when we set driver.enabled=false, we should turn the CUDA validator off as well.
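As a hedged sketch of that suggestion (the driver.version value must match one of the published driver image tags for your node OS, so treat the version below as an example only):

# let the operator manage the driver instead of a host install
helm install --wait --generate-name nvidia/gpu-operator \
  --set operator.defaultRuntime=crio \
  --set driver.enabled=true \
  --set driver.version=470.57.02   # example tag; pick one that exists for your OS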

khanof avatar Jan 19 '22 00:01 khanof

I have 3 nodes with a Tesla T4, an A100, and an A30. On the Tesla T4 node nvidia-cuda-validator completed successfully, but on the A100 and A30 nodes nvidia-cuda-validator keeps crashlooping. "[Vector addition of 50000 elements] Failed to allocate vector A (error code initialization error)!" is in the cuda-validator container's log. Is there any way to fix this?
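For anyone hitting the same symptom, the full error is in the cuda-validation init container rather than the main container; a generic way to pull it (pod name is a placeholder):

# the failing vectorAdd output lives in the init container
oc logs <nvidia-cuda-validator-pod> -n gpu-operator-resources -c cuda-validation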

Muscule avatar Aug 04 '22 14:08 Muscule