Shiva Krishna Merla
@lohwanjing can you run `oc get imagestream driver-toolkit -n openshift -o yaml`? Also, logs from the operator pod will help.
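For example, something along these lines should capture both (the `nvidia-gpu-operator` namespace and `gpu-operator` deployment name below are assumptions based on a default OLM install; adjust to your setup):

```sh
# Imagestream used to build the driver toolkit
oc get imagestream driver-toolkit -n openshift -o yaml

# GPU operator pod logs (namespace/deployment name may differ per install)
oc logs -n nvidia-gpu-operator deployment/gpu-operator
```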
@hoangtnm Can you confirm the OS version you are using along with the runtime (containerd, docker) version? Also, is cgroup v2 enabled on the nodes? (i.e. the **systemd.unified_cgroup_hierarchy=1** kernel command line is passed...
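As a rough sketch, these checks on the node should show the cgroup mode, kernel command line, and versions:

```sh
# "cgroup2fs" means cgroup v2 is in use; "tmpfs" indicates cgroup v1
stat -fc %T /sys/fs/cgroup/

# Confirm whether systemd.unified_cgroup_hierarchy is set on the kernel command line
cat /proc/cmdline

# OS and runtime versions (use `docker version` if running Docker instead)
cat /etc/os-release
containerd --version
```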
@xhejtman This is the expected behavior with the driver container root under `/run/nvidia/driver`. If the driver is installed directly on the node, then we would see the `/dev/nvidia*` device nodes.
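For example, with the containerized driver the device nodes and `nvidia-smi` are reachable through the driver root rather than directly on the host (paths below assume the default `/run/nvidia/driver` mount):

```sh
# Device nodes created inside the driver container root
ls /run/nvidia/driver/dev/nvidia*

# Run nvidia-smi against the containerized driver from the host
chroot /run/nvidia/driver nvidia-smi
```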
@guyst16 currently we don't support changing these, but you can create custom rules based on the ones provided by the operator. Will also look into allowing this change with the...
@ReyRen from the debug bundle provided, it looks like the driver pod logs are truncated. Can you get logs from the "nvidia-driver-ctr" container within the driver pod? Looks like the NVIDIA driver install is...
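Something like this should pull the full install logs (the `gpu-operator` namespace and daemonset label are assumptions, adjust to your deployment):

```sh
# Find the driver pod on the affected node
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o wide

# Full logs from the driver install container
kubectl logs -n gpu-operator <driver-pod-name> -c nvidia-driver-ctr
```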
@shnigam2 Can you provide logs from the k8s controller-manager pod to check for errors on cleaning up these pods? Are you using images from a private registry (i.e. using pullSecrets)?
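On a kubeadm-based cluster the controller-manager runs as a static pod in `kube-system`, so something like the following should work (pod naming may differ on managed clusters):

```sh
# Locate the controller-manager pod(s)
kubectl get pods -n kube-system -l component=kube-controller-manager

# Check its logs for errors around pod cleanup / garbage collection
kubectl logs -n kube-system kube-controller-manager-<control-plane-node>
```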
@shnigam2 we have a known issue which will be [fixed](https://github.com/NVIDIA/gpu-operator/commit/e67adb4d44dc3004b34a1ab9d7066588895f496e) in the next patch, v23.9.1 (later this month). The problem is that we are adding duplicate pullSecrets in the spec. You can...
@qasmi please refer to [this](https://github.com/NVIDIA/gpu-operator/blob/master/assets/state-mig-manager/0400_configmap.yaml) file for the supported profiles for each GPU type. Ideally, mig-manager should have errored out on the invalid configuration instead of marking it as a success. We will...
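For reference, the deployed config and the mig-manager result can also be checked directly on the cluster (the ConfigMap name and `gpu-operator` namespace below assume a default install):

```sh
# MIG parted config deployed by the operator, listing profiles per GPU type
kubectl get configmap default-mig-parted-config -n gpu-operator -o yaml

# mig-manager records the outcome of applying a config in this node label
kubectl describe node <node-name> | grep nvidia.com/mig.config.state
```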
@qasmi glad to know it's working with the correct config now; will look into propagating errors in case of invalid configs.
@yuzs2 can you check dmesg for any errors from nvidia-gridd? `dmesg | grep -i gridd`
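For example (assuming the guest driver and nvidia-gridd are running on that node):

```sh
# Licensing-related messages from the guest driver
dmesg | grep -i gridd

# License status as reported by the driver
nvidia-smi -q | grep -i -A 2 license
```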