Shiva Krishna Merla
@lohwanjing can you run `oc get imagestream driver-toolkit -n openshift -o yaml`? Also, logs from the operator pod will help.
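For example, something along these lines should capture both (the `nvidia-gpu-operator` namespace and `gpu-operator` deployment name below are assumptions based on a default OLM install; adjust to your setup):

```sh
# Imagestream used to build the driver toolkit
oc get imagestream driver-toolkit -n openshift -o yaml

# GPU operator pod logs (namespace/deployment name may differ per install)
oc logs -n nvidia-gpu-operator deployment/gpu-operator
```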
@hoangtnm Can you confirm the OS version you are using along with the runtime (containerd, docker) version? Also, is cgroup v2 enabled on the nodes? (i.e. the **systemd.unified_cgroup_hierarchy=1** kernel command line is passed...
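As a rough sketch, these checks on the node should show the cgroup mode, kernel command line, and versions:

```sh
# "cgroup2fs" means cgroup v2 is in use; "tmpfs" indicates cgroup v1
stat -fc %T /sys/fs/cgroup/

# Confirm whether systemd.unified_cgroup_hierarchy is set on the kernel command line
cat /proc/cmdline

# OS and runtime versions (use `docker version` if running Docker instead)
cat /etc/os-release
containerd --version
```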
@xhejtman This is the expected behavior with the driver container root under `/run/nvidia/driver`. If the driver is installed directly on the node, then we would see the `/dev/nvidia*` device nodes.
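For example, with the containerized driver the device nodes and `nvidia-smi` are reachable through the driver root rather than directly on the host (paths below assume the default `/run/nvidia/driver` mount):

```sh
# Device nodes created inside the driver container root
ls /run/nvidia/driver/dev/nvidia*

# Run nvidia-smi against the containerized driver from the host
chroot /run/nvidia/driver nvidia-smi
```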
@guyst16 currently we don't support changing these, but you can create custom rules based on the ones provided by the operator. Will also look into allowing this change with the...
@ReyRen from the debug bundle provided, it looks like the driver pod logs are truncated. Can you get logs from the "nvidia-driver-ctr" container within the driver pod? Looks like the NVIDIA driver install is...
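Something like this should pull the full install logs (the `gpu-operator` namespace and daemonset label are assumptions, adjust to your deployment):

```sh
# Find the driver pod on the affected node
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o wide

# Full logs from the driver install container
kubectl logs -n gpu-operator <driver-pod-name> -c nvidia-driver-ctr
```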
@shnigam2 Can you provide logs from the k8s controller-manager pod to check for errors on cleaning up these pods? Are you using images from a private registry (i.e. using pullSecrets)?
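On a kubeadm-based cluster the controller-manager runs as a static pod in `kube-system`, so something like the following should work (pod naming may differ on managed clusters):

```sh
# Locate the controller-manager pod(s)
kubectl get pods -n kube-system -l component=kube-controller-manager

# Check its logs for errors around pod cleanup / garbage collection
kubectl logs -n kube-system kube-controller-manager-<control-plane-node>
```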
@shnigam2 we have a known issue which will be [fixed](https://github.com/NVIDIA/gpu-operator/commit/e67adb4d44dc3004b34a1ab9d7066588895f496e) in the next patch, v23.9.1 (later this month). The problem is that we are adding duplicate pullSecrets in the spec. You can...
@qasmi please refer to [this](https://github.com/NVIDIA/gpu-operator/blob/master/assets/state-mig-manager/0400_configmap.yaml) file for the supported profiles for each GPU type. Ideally, mig-manager should have errored out on the invalid configuration instead of marking it as a success. We will...
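For reference, the deployed config and the mig-manager result can also be checked directly on the cluster (the ConfigMap name and `gpu-operator` namespace below assume a default install):

```sh
# MIG parted config deployed by the operator, listing profiles per GPU type
kubectl get configmap default-mig-parted-config -n gpu-operator -o yaml

# mig-manager records the outcome of applying a config in this node label
kubectl describe node <node-name> | grep nvidia.com/mig.config.state
```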
@qasmi glad to know it's working with the correct config now; will look into propagating errors in case of invalid configs.
@yuzs2 can you check dmesg for any errors from nvidia-gridd? `dmesg | grep -i gridd`
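For example (assuming the guest driver and nvidia-gridd are running on that node):

```sh
# Licensing-related messages from the guest driver
dmesg | grep -i gridd

# License status as reported by the driver
nvidia-smi -q | grep -i -A 2 license
```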