GPU Operator on an OKD 4.5 cluster in a restricted network
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node?
- [ ] Are you running Kubernetes v1.13+?
- [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [ ] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes?
- [ ] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
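For reference, the node-level checks above can be run roughly like this (a sketch; adjust to your own nodes):
$ lsmod | grep -e i2c_core -e ipmi_msghandler    # kernel modules from the checklist
$ kubectl describe clusterpolicies --all-namespaces    # confirms the ClusterPolicy CRD is applied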
1. Issue or feature description
I am trying to install the GPU Operator on an OKD 4.5 cluster in a restricted network environment.
For that, I cloned nvidia/gpu-operator and changed some values in values.yaml and operator.yaml for my cluster.
I then installed with `helm install --devel ./gpu-operator --set platform.openshift=true,operator.defaultRuntime=crio,toolkit.version=1.3.0-ubi8,nfd.enabled=false --wait --generate-name` and confirmed that the gpu-operator pod was running well (without any errors), but there is no "gpu-operator-resources" namespace and none of the pods such as dcgm, toolkit, validation, etc.
I have already checked that the gpu-operator installs fine with `helm install nvidia/gpu-operator`. What am I missing?
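For context, this is roughly what I checked after the --devel install (a sketch):
$ kubectl get pods --all-namespaces | grep gpu-operator    # operator pod itself is Running
$ kubectl get ns gpu-operator-resources    # namespace is missing, as described above
Error from server (NotFound): namespaces "gpu-operator-resources" not found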
2. Steps to reproduce the issue
- `git clone https://github.com/NVIDIA/gpu-operator.git`
- Pull all images and push them to my local (restricted) registry, as sketched after this list
- Change values.yaml and operator.yaml (image URLs and some values)
- `helm install` to install
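The image mirroring step was done roughly like this for each image (a sketch; the image name and tag are just an example, and the registry host is my local mirror):
$ podman pull nvcr.io/nvidia/gpu-operator:1.3.0
$ podman tag nvcr.io/nvidia/gpu-operator:1.3.0 mirror.eluon.okd.com:5000/nvidia/gpu-operator:1.3.0
$ podman push mirror.eluon.okd.com:5000/nvidia/gpu-operator:1.3.0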
3. Information to attach (optional if deemed irrelevant)
- [ ] kubernetes pods status: `kubectl get pods --all-namespaces`
- [ ] kubernetes daemonset status: `kubectl get ds --all-namespaces`
- [ ] If a pod/ds is in an error state or pending state: `kubectl describe pod -n NAMESPACE POD_NAME`
- [ ] If a pod/ds is in an error state or pending state: `kubectl logs -n NAMESPACE POD_NAME`
- [ ] Output of running a container on the GPU machine: `docker run -it alpine echo foo`
- [ ] Docker configuration file: `cat /etc/docker/daemon.json`
- [ ] Docker runtime configuration: `docker info | grep runtime`
- [ ] NVIDIA shared directory: `ls -la /run/nvidia`
- [ ] NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
- [ ] NVIDIA driver directory: `ls -la /run/nvidia/driver`
- [ ] kubelet logs: `journalctl -u kubelet > kubelet.logs`
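Since this cluster runs CRI-O rather than Docker, the runtime items above map to roughly these CRI-O checks instead (a sketch):
$ crictl info | grep -i runtime    # runtime configuration as seen by the kubelet
$ cat /etc/crio/crio.conf          # CRI-O configuration file
$ journalctl -u crio > crio.logs   # CRI-O service logs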
Can you attach the logs of the gpu-operator pod to debug?
@shivamerla, sorry, I forgot to attach the logs of the gpu-operator pod: gpu-operator-76d5d98454-6g727-gpu-operator.log
@rupang790 Looks like you are installing a very old version (--devel). Any reason for that? It's failing because the gpu-operator-resources namespace is missing; newer versions create this namespace automatically.
{"level":"info","ts":1625012388.7674541,"logger":"controller_clusterpolicy","msg":"Couldn't create","ServiceAccount":"nvidia-driver","Namespace":"gpu-operator-resources","Error":"namespaces \"gpu-operator-resources\" not found"}
{"level":"error","ts":1625012388.7676754,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"clusterpolicy-controller","request":"/cluster-policy","error":"namespaces \"gpu-operator-resources\" not found","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88"}
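For reference, with such older charts a manual workaround would be to create the namespace yourself before installing (a sketch; as noted above, newer versions create it automatically):
$ kubectl create namespace gpu-operator-resources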
@shivamerla, a few weeks ago I tried to install a newer version of the gpu-operator (v1.6.0, maybe) on OKD cluster 4.5.0-0.okd-2020-10-15-235428 and some pods were not running well (sorry, I do not have any logs from that). Then I installed 1.3.0 and it worked well. That is the reason I am using 1.3.0 on my cluster.
@shivamerla, as you said, I was using a very old version. So I am now trying to install version 1.7.1 on my cluster, but it seems to have an issue with the toolkit.
The nvidia-operator-validator pod is stuck in Init:CrashLoopBackOff status and I can see the error on the toolkit validator as below:
[screenshot: toolkit validator error]
On the nvidia-container-toolkit-daemonset pod, the driver-validation container shows the results of the nvidia-smi command, and the nvidia-container-toolkit-ctr container shows these logs:
time="2021-07-07T04:22:19Z" level=info msg="Starting nvidia-toolkit"
time="2021-07-07T04:22:19Z" level=info msg="Parsing arguments"
time="2021-07-07T04:22:19Z" level=info msg="Verifying Flags"
time="2021-07-07T04:22:19Z" level=info msg=Initializing
time="2021-07-07T04:22:19Z" level=info msg="Installing toolkit"
time="2021-07-07T04:22:19Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2021-07-07T04:22:19Z" level=info msg="Successfully parsed arguments"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Removing existing NVIDIA container toolkit installation"
time="2021-07-07T04:22:19Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'"
time="2021-07-07T04:22:19Z" level=info msg="Resolved link: '/usr/lib64/libnvidia-container.so.1' => '/usr/lib64/libnvidia-container.so.1.4.0'"
time="2021-07-07T04:22:19Z" level=info msg="Installing '/usr/lib64/libnvidia-container.so.1.4.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.4.0'"
time="2021-07-07T04:22:19Z" level=info msg="Installed '/usr/lib64/libnvidia-container.so.1' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.4.0'"
time="2021-07-07T04:22:19Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 'libnvidia-container.so.1.4.0'"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container runtime from '/usr/bin/nvidia-container-runtime'"
time="2021-07-07T04:22:19Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime'"
time="2021-07-07T04:22:19Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'"
time="2021-07-07T04:22:19Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli'"
time="2021-07-07T04:22:19Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Installing executable '/usr/bin/nvidia-container-toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Installing '/usr/bin/nvidia-container-toolkit' to '/usr/local/nvidia/toolkit/nvidia-container-toolkit.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created '/usr/local/nvidia/toolkit/nvidia-container-toolkit.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created wrapper '/usr/local/nvidia/toolkit/nvidia-container-toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook' -> 'nvidia-container-toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'"
time="2021-07-07T04:22:19Z" level=info msg="Setting up runtime"
time="2021-07-07T04:22:19Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2021-07-07T04:22:19Z" level=info msg="Successfully parsed arguments"
time="2021-07-07T04:22:19Z" level=info msg="Starting 'setup' for crio"
time="2021-07-07T04:22:19Z" level=info msg="Waiting for signal"
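To confirm the toolkit really landed on the host, I also looked at the install paths the log mentions (a sketch; paths taken from the log above):
$ ls -la /usr/local/nvidia/toolkit
$ cat /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml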
According to https://github.com/NVIDIA/gpu-operator/issues/167#issuecomment-808524121, I also tried version 1.6.2, but it shows an error on the validation pod as below:
[screenshot: validation pod error]
I changed the CRI-O hooks.d configuration to /run/containers/oci/hooks.d and restarted the crio service, roughly as sketched below.
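This is roughly what the check and restart looked like (a sketch; the exact layout of crio.conf depends on the CRI-O version):
$ grep -A 2 "hooks_dir" /etc/crio/crio.conf
hooks_dir = [
        "/run/containers/oci/hooks.d",
]
$ sudo systemctl restart crio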
How can I solve this? Once it is solved, I will stop using the old version for testing the local helm installation.
@shivamerla, installing GPU Operator version 1.5.2 on the OKD 4.5 cluster succeeded, so version 1.5.2 will be used for my project. For the restricted installation of GPU Operator 1.5.2, I would like to confirm how I prepare and install:
- Check all images, create a local registry for them, then push the images to the local registry.
- `git clone https://github.com/NVIDIA/gpu-operator.git`
- Change the image repository in gpu-operator/deployment/gpu-operator/values.yaml (see the sketch after the commands below)
- Use the helm CLI to install as below:
$ helm lint ./gpu-operator
$ kubectl create ns gpu-operator
$ helm install ./gpu-operator -n gpu-operator --version 1.5.2 --set operator.defaultRuntime=crio,toolkit.version=1.4.0-ubi8 --wait --generate-name
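The repository change in step 3 essentially points the operator images at my local mirror; after the edit, the relevant part of values.yaml looks roughly like this (a sketch; the keys are as I recall them from the 1.5.2 chart, so treat them as illustrative):
$ grep -m 1 -A 2 "repository:" gpu-operator/deployment/gpu-operator/values.yaml
  repository: mirror.eluon.okd.com:5000/nvidia
  image: gpu-operator
  version: 1.5.2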
I ask because, after that procedure, the gpu-operator pod went into CrashLoopBackOff, and I saw these logs from the pod:
$ kubectl logs -n gpu-operator gpu-operator-8678476587-jr24j
unknown flag: --leader-elect
Usage of gpu-operator:
unknown flag: --leader-elect
--zap-devel Enable zap development mode (changes defaults to console encoder, debug log level, disables sampling and stacktrace from 'warning' level)
--zap-encoder encoder Zap log encoding ('json' or 'console')
--zap-level level Zap log level (one of 'debug', 'info', 'error' or any integer value > 0) (default info)
--zap-sample sample Enable zap log sampling. Sampling will be disabled for integer log levels > 1
--zap-stacktrace-level level Set the minimum log level that triggers stacktrace generation (default error)
--zap-time-encoding timeEncoding Sets the zap time format ('epoch', 'millis', 'nano', or 'iso8601') (default )
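To see where the unrecognized --leader-elect flag comes from, the image and args that the chart rendered into the deployment can be inspected (a sketch; the deployment name is taken from the events below):
$ kubectl get deployment gpu-operator -n gpu-operator -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
$ kubectl get deployment gpu-operator -n gpu-operator -o jsonpath='{.spec.template.spec.containers[0].args}{"\n"}'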
And the events in the gpu-operator namespace are:
$ kubectl get events -n gpu-operator
LAST SEEN TYPE REASON OBJECT MESSAGE
3m43s Normal Scheduled pod/gpu-operator-8678476587-jr24j Successfully assigned gpu-operator/gpu-operator-8678476587-jr24j to k8s-master01
3m43s Normal AddedInterface pod/gpu-operator-8678476587-jr24j Add eth0 [10.244.32.142/32] from k8s-pod-network
2m8s Normal Pulled pod/gpu-operator-8678476587-jr24j Container image "mirror.eluon.okd.com:5000/nvidia/gpu-operator:1.5.2" already present on machine
2m8s Normal Created pod/gpu-operator-8678476587-jr24j Created container gpu-operator
2m7s Normal Started pod/gpu-operator-8678476587-jr24j Started container gpu-operator
2m6s Warning BackOff pod/gpu-operator-8678476587-jr24j Back-off restarting failed container
3m44s Normal SuccessfulCreate replicaset/gpu-operator-8678476587 Created pod: gpu-operator-8678476587-jr24j
3m44s Normal ScalingReplicaSet deployment/gpu-operator Scaled up replica set gpu-operator-8678476587 to 1
Do you have any idea about this?