GPU-Operator v1.11.1 on OKD 4.10 cluster: nvidia-driver-daemonset is not created
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node?
- [ ] Are you running Kubernetes v1.13+?
- [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [ ] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes?
- [ ] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
1. Issue or feature description
I am trying to install the GPU-Operator on my new OKD 4.10 cluster, but it fails to create the nvidia-driver-daemonset pod. My environment is:
- Kubernetes Version : 1.23.5
- OKD Version : 4.10.0-0.okd-2022-07-09-073606
- GPU : Nvidia T4 GPU
- CRIO Version : 1.23.3
The installation was done via Helm (helm install) with gpu-operator v1.11.1.
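The install command was roughly the following (release and namespace names are from memory, so treat this as a sketch rather than the exact command):

```sh
# Add the NVIDIA Helm repo and install the chart at the version in question
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --version v1.11.1 \
  -n gpu-operator --create-namespace
```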
The nvidia-driver-daemonset pod shows CreateContainerError with the error below:
```
Error: container create failed: time="2022-10-13T06:41:10Z" level=warning msg="freezer not supported: openat2 /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod3f47f193_f192_4844_908f_3ad77033d806.slice/crio-9933f072f606395d26b84c8a6c3d5661452848b175a767a475e145ae4f2554e6.scope/cgroup.freeze: no such file or directory"
time="2022-10-13T06:41:10Z" level=warning msg="lstat /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod3f47f193_f192_4844_908f_3ad77033d806.slice/crio-9933f072f606395d26b84c8a6c3d5661452848b175a767a475e145ae4f2554e6.scope: no such file or directory"
time="2022-10-13T06:41:10Z" level=error msg="runc create failed: unable to start container process: exec: \"ocp_dtk_entrypoint\": executable file not found in $PATH"
```
On the worker node, I looked for the crio-xxx entry under /sys/fs/cgroup/kubepods.slice/, and it seems the cgroup is created as /sys/fs/cgroup/kubepods.slice/crio-conmon-xxxxx/ instead of /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod3f47f193_f192_4844_908f_3ad77033d806.slice/crio-.
Although I tried changing conmon_cgroup = "pod" to conmon_cgroup = "kubepods.slice" in crio.conf (the exact change is shown below), the pod has the same error.
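For reference, the crio.conf change I tried is essentially this excerpt (the path may differ depending on how OKD manages the node config, e.g. a drop-in under /etc/crio/crio.conf.d/):

```toml
# /etc/crio/crio.conf (excerpt) -- this change did not help
[crio.runtime]
# was: conmon_cgroup = "pod"
conmon_cgroup = "kubepods.slice"
```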
Should I change any other settings in crio.conf?
How can I solve this?
Please help me.
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
- [ ] kubernetes pods status: `kubectl get pods --all-namespaces`
- [ ] kubernetes daemonset status: `kubectl get ds --all-namespaces`
- [ ] If a pod/ds is in an error state or pending state: `kubectl describe pod -n NAMESPACE POD_NAME`
- [ ] If a pod/ds is in an error state or pending state: `kubectl logs -n NAMESPACE POD_NAME`
- [ ] Output of running a container on the GPU machine: `docker run -it alpine echo foo`
- [ ] Docker configuration file: `cat /etc/docker/daemon.json`
- [ ] Docker runtime configuration: `docker info | grep runtime`
- [ ] NVIDIA shared directory: `ls -la /run/nvidia`
- [ ] NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
- [ ] NVIDIA driver directory: `ls -la /run/nvidia/driver`
- [ ] kubelet logs: `journalctl -u kubelet > kubelet.logs`
@rupang790 Looks like `operator.use_ocp_driver_toolkit` is set to true in the ClusterPolicy (CR) while the driver image is not an RHCOS one. Please set this to false if you are using a different OS for the worker nodes with OKD.
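If it helps, one way to check and flip that field on a live cluster is something like the following (assuming the default ClusterPolicy resource name `cluster-policy`; adjust to your deployment):

```sh
# Inspect the current value of the flag on the deployed ClusterPolicy
kubectl get clusterpolicy cluster-policy \
  -o jsonpath='{.spec.operator.use_ocp_driver_toolkit}{"\n"}'

# Set it to false directly on the CR
kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec":{"operator":{"use_ocp_driver_toolkit":false}}}'
```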
@shivamerla Thank you for your reply.
I tried the change you recommended, setting use_ocp_driver_toolkit: false in values.yaml as below, but it still has the same issue.
```
$ vi gpu-operator/values.yaml
...
operator:
  repository: nvcr.io/nvidia
  image: gpu-operator
  # If version is not specified, then default is to use chart.AppVersion
  #version: ""
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  priorityClassName: system-node-critical
  defaultRuntime: crio
  runtimeClass: nvidia
  use_ocp_driver_toolkit: false
...
```
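One way to check whether the chart actually renders that value into the ClusterPolicy is to template it locally and grep for the flag (chart directory as in the vi command above; this is just a quick sanity check, not something from the docs):

```sh
helm template gpu-operator ./gpu-operator | grep -i use_ocp_driver_toolkit
```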

This issue was resolved by setting use_ocp_driver_toolkit: false directly in ClusterPolicy.yaml. When I set it in values.yaml, it did not work. The issue is now cleared and I can use the GPU-Operator without problems.
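For completeness, the relevant part of the edited ClusterPolicy.yaml looks roughly like this (surrounding fields omitted; resource name and apiVersion assumed to match the chart defaults):

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  operator:
    # Do not use the OpenShift Driver Toolkit entrypoint for a non-RHCOS driver image
    use_ocp_driver_toolkit: false
```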