GPU-Operator v1.11.1 on OKD 4.10 cluster: nvidia-driver-daemonset is not created
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node?
- [ ] Are you running Kubernetes v1.13+?
- [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [ ] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes?
- [ ] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
1. Issue or feature description
I am trying to install the GPU-Operator on my new OKD 4.10 cluster, but it fails to create the nvidia-driver-daemonset pod. My environment is:
- Kubernetes Version : 1.23.5
- OKD Version : 4.10.0-0.okd-2022-07-09-073606
- GPU : Nvidia T4 GPU
- CRIO Version : 1.23.3
The installation was done via Helm (helm install) with gpu-operator v1.11.1.
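The install command was roughly the following (release and namespace names are from memory, so treat this as a sketch rather than the exact command):

```sh
# Add the NVIDIA Helm repo and install the chart at the version in question
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --version v1.11.1 \
  -n gpu-operator --create-namespace
```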
The nvidia-driver-daemonset pod shows CreateContainerError with the error below:
```
Error: container create failed: time="2022-10-13T06:41:10Z" level=warning msg="freezer not supported: openat2 /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod3f47f193_f192_4844_908f_3ad77033d806.slice/crio-9933f072f606395d26b84c8a6c3d5661452848b175a767a475e145ae4f2554e6.scope/cgroup.freeze: no such file or directory"
time="2022-10-13T06:41:10Z" level=warning msg="lstat /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod3f47f193_f192_4844_908f_3ad77033d806.slice/crio-9933f072f606395d26b84c8a6c3d5661452848b175a767a475e145ae4f2554e6.scope: no such file or directory"
time="2022-10-13T06:41:10Z" level=error msg="runc create failed: unable to start container process: exec: \"ocp_dtk_entrypoint\": executable file not found in $PATH"
```
On the worker node, I looked for the crio-xxx entry under /sys/fs/cgroup/kubepods.slice/, and it seems the cgroup is created as /sys/fs/cgroup/kubepods.slice/crio-conmon-xxxxx/ instead of /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod3f47f193_f192_4844_908f_3ad77033d806.slice/crio-.
Although I tried changing conmon_cgroup = "pod" to conmon_cgroup = "kubepods.slice" in crio.conf (the exact change is shown below), the pod has the same error.
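For reference, the crio.conf change I tried is essentially this excerpt (the path may differ depending on how OKD manages the node config, e.g. a drop-in under /etc/crio/crio.conf.d/):

```toml
# /etc/crio/crio.conf (excerpt) -- this change did not help
[crio.runtime]
# was: conmon_cgroup = "pod"
conmon_cgroup = "kubepods.slice"
```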
Should I change any other settings in crio.conf?
How can I solve this?
Please help me.
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
- [ ] kubernetes pods status: `kubectl get pods --all-namespaces`
- [ ] kubernetes daemonset status: `kubectl get ds --all-namespaces`
- [ ] If a pod/ds is in an error state or pending state: `kubectl describe pod -n NAMESPACE POD_NAME`
- [ ] If a pod/ds is in an error state or pending state: `kubectl logs -n NAMESPACE POD_NAME`
- [ ] Output of running a container on the GPU machine: `docker run -it alpine echo foo`
- [ ] Docker configuration file: `cat /etc/docker/daemon.json`
- [ ] Docker runtime configuration: `docker info | grep runtime`
- [ ] NVIDIA shared directory: `ls -la /run/nvidia`
- [ ] NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
- [ ] NVIDIA driver directory: `ls -la /run/nvidia/driver`
- [ ] kubelet logs: `journalctl -u kubelet > kubelet.logs`
@rupang790 Looks like `operator.use_ocp_driver_toolkit` is set to true in the ClusterPolicy (CR) while the driver image is not an RHCOS one. Please set this to false if you are using a different OS for the worker nodes with OKD.
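If it helps, one way to check and flip that field on a live cluster is something like the following (assuming the default ClusterPolicy resource name `cluster-policy`; adjust to your deployment):

```sh
# Inspect the current value of the flag on the deployed ClusterPolicy
kubectl get clusterpolicy cluster-policy \
  -o jsonpath='{.spec.operator.use_ocp_driver_toolkit}{"\n"}'

# Set it to false directly on the CR
kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec":{"operator":{"use_ocp_driver_toolkit":false}}}'
```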
@shivamerla Thank you for your reply.
I tried the change you recommended, setting use_ocp_driver_toolkit: false in values.yaml as below, but it still has the same issue.
```
$ vi gpu-operator/values.yaml
...
operator:
  repository: nvcr.io/nvidia
  image: gpu-operator
  # If version is not specified, then default is to use chart.AppVersion
  #version: ""
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  priorityClassName: system-node-critical
  defaultRuntime: crio
  runtimeClass: nvidia
  use_ocp_driver_toolkit: false
...
```
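One way to check whether the chart actually renders that value into the ClusterPolicy is to template it locally and grep for the flag (chart directory as in the vi command above; this is just a quick sanity check, not something from the docs):

```sh
helm template gpu-operator ./gpu-operator | grep -i use_ocp_driver_toolkit
```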

This issue was resolved by setting use_ocp_driver_toolkit: false directly in ClusterPolicy.yaml. When I set it in values.yaml, it did not work. The issue is now cleared and I can use the GPU-Operator without problems.
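For completeness, the relevant part of the edited ClusterPolicy.yaml looks roughly like this (surrounding fields omitted; resource name and apiVersion assumed to match the chart defaults):

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  operator:
    # Do not use the OpenShift Driver Toolkit entrypoint for a non-RHCOS driver image
    use_ocp_driver_toolkit: false
```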