
GPU-Operator on OKD 4.5 cluster in restricted Network

Open · rupang790 opened this issue 4 years ago · 6 comments

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • [ ] Are you running on an Ubuntu 18.04 node?
  • [ ] Are you running Kubernetes v1.13+?
  • [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • [ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • [ ] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description

I am trying to install the GPU Operator on an OKD 4.5 cluster in a restricted-network environment. For that, I cloned nvidia/gpu-operator and changed some values in values.yaml and operator.yaml for my cluster. I then installed it with helm install --devel ./gpu-operator --set platform.openshift=true,operator.defaultRuntime=crio,toolkit.version=1.3.0-ubi8,nfd.enabled=false --wait --generate-name and confirmed that the gpu-operator pod was running without any error, but the "gpu-operator-resources" namespace and its pods (dcgm, toolkit, validation, etc.) were never created.

I have already checked that the gpu-operator installs correctly with helm install nvidia/gpu-operator. What am I missing?

2. Steps to reproduce the issue

  1. git clone https://github.com/NVIDIA/gpu-operator.git
  2. Pull all images and push them to my local (restricted) registry
  3. Change values.yaml and operator.yaml (image URLs and some values)
  4. Install with helm install (a hedged sketch of steps 2 and 4 follows this list)
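
For reference, a minimal sketch of steps 2 and 4, assuming podman is available on the mirroring host; the registry name is the one that appears later in this thread, and the image names and tags are hypothetical placeholders, so the real list should be taken from the chart's values.yaml:

# Step 2: mirror the required images into the restricted registry
# (image names/tags below are illustrative placeholders for v1.3.0)
LOCAL_REG=mirror.eluon.okd.com:5000
for img in nvidia/gpu-operator:1.3.0 nvidia/container-toolkit:1.3.0-ubi8; do
  podman pull nvcr.io/$img
  podman tag nvcr.io/$img $LOCAL_REG/$img
  podman push $LOCAL_REG/$img
done

# Step 4: install the edited local chart
helm install --devel ./gpu-operator \
  --set platform.openshift=true,operator.defaultRuntime=crio,toolkit.version=1.3.0-ubi8,nfd.enabled=false \
  --wait --generate-name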

3. Information to attach (optional if deemed irrelevant)

  • [ ] kubernetes pods status: kubectl get pods --all-namespaces

  • [ ] kubernetes daemonset status: kubectl get ds --all-namespaces

  • [ ] If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME

  • [ ] If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME

  • [ ] Output of running a container on the GPU machine: docker run -it alpine echo foo

  • [ ] Docker configuration file: cat /etc/docker/daemon.json

  • [ ] Docker runtime configuration: docker info | grep runtime

  • [ ] NVIDIA shared directory: ls -la /run/nvidia

  • [ ] NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

  • [ ] NVIDIA driver directory: ls -la /run/nvidia/driver

  • [ ] kubelet logs journalctl -u kubelet > kubelet.logs

rupang790 · Jun 29 '21

Can you attach the logs of the gpu-operator pod so we can debug?

shivamerla · Jun 29 '21

@shivamerla, sorry, I forgot to attach the gpu-operator pod logs: gpu-operator-76d5d98454-6g727-gpu-operator.log

rupang790 · Jun 30 '21

@rupang790 It looks like you are installing a very old version (--devel). Any reason for that? It is failing because the gpu-operator-resources namespace is missing; newer versions create this namespace automatically.

{"level":"info","ts":1625012388.7674541,"logger":"controller_clusterpolicy","msg":"Couldn't create","ServiceAccount":"nvidia-driver","Namespace":"gpu-operator-resources","Error":"namespaces \"gpu-operator-resources\" not found"}
{"level":"error","ts":1625012388.7676754,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"clusterpolicy-controller","request":"/cluster-policy","error":"namespaces \"gpu-operator-resources\" not found","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88"}

shivamerla · Jun 30 '21

@shivamerla, a few weeks ago I tried to install a newer version of the gpu-operator (v1.6.0, maybe) on OKD cluster 4.5.0-0.okd-2020-10-15-235428 and some pods did not run properly (sorry, I don't have any logs from that attempt). I then installed 1.3.0 and it worked well, which is why I am using 1.3.0 on my cluster.

rupang790 · Jun 30 '21

@shivamerla, as you said, I was using a very old version, so I am now trying to install version 1.7.1 on my cluster, but it seems to have an issue with the toolkit. The nvidia-operator-validator pod is stuck in Init:CrashLoopBackOff status, and I can see the error on the toolkit validator as below (screenshot attached).

In the nvidia-container-toolkit-daemonset pod, the driver-validation container shows the output of the nvidia-smi command, and the nvidia-container-toolkit-ctr container shows the following logs:

time="2021-07-07T04:22:19Z" level=info msg="Starting nvidia-toolkit"
time="2021-07-07T04:22:19Z" level=info msg="Parsing arguments"
time="2021-07-07T04:22:19Z" level=info msg="Verifying Flags"
time="2021-07-07T04:22:19Z" level=info msg=Initializing
time="2021-07-07T04:22:19Z" level=info msg="Installing toolkit"
time="2021-07-07T04:22:19Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2021-07-07T04:22:19Z" level=info msg="Successfully parsed arguments"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Removing existing NVIDIA container toolkit installation"
time="2021-07-07T04:22:19Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'"
time="2021-07-07T04:22:19Z" level=info msg="Resolved link: '/usr/lib64/libnvidia-container.so.1' => '/usr/lib64/libnvidia-container.so.1.4.0'"
time="2021-07-07T04:22:19Z" level=info msg="Installing '/usr/lib64/libnvidia-container.so.1.4.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.4.0'"
time="2021-07-07T04:22:19Z" level=info msg="Installed '/usr/lib64/libnvidia-container.so.1' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.4.0'"
time="2021-07-07T04:22:19Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 'libnvidia-container.so.1.4.0'"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container runtime from '/usr/bin/nvidia-container-runtime'"
time="2021-07-07T04:22:19Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime'"
time="2021-07-07T04:22:19Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'"
time="2021-07-07T04:22:19Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli'"
time="2021-07-07T04:22:19Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Installing executable '/usr/bin/nvidia-container-toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Installing '/usr/bin/nvidia-container-toolkit' to '/usr/local/nvidia/toolkit/nvidia-container-toolkit.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created '/usr/local/nvidia/toolkit/nvidia-container-toolkit.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created wrapper '/usr/local/nvidia/toolkit/nvidia-container-toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook' -> 'nvidia-container-toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'"
time="2021-07-07T04:22:19Z" level=info msg="Setting up runtime"
time="2021-07-07T04:22:19Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2021-07-07T04:22:19Z" level=info msg="Successfully parsed arguments"
time="2021-07-07T04:22:19Z" level=info msg="Starting 'setup' for crio"
time="2021-07-07T04:22:19Z" level=info msg="Waiting for signal"

According to https://github.com/NVIDIA/gpu-operator/issues/167#issuecomment-808524121, I also tried version 1.6.2, but it shows an error on the validation pod as well (screenshot attached).

I changed the CRI-O hooks configuration to use /run/containers/oci/hooks.d and restarted the crio service. How can I solve this? Once it is solved, I will stop using the old version for testing the local helm installation.
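
For reference, the hooks.d change described above could look like the following CRI-O drop-in; this is only a sketch, and the drop-in directory and [crio.runtime] section should be verified against the crio version running on the node:

# add the OCI hooks directory via a crio.conf drop-in, then restart crio
cat <<'EOF' | sudo tee /etc/crio/crio.conf.d/99-oci-hooks.conf
[crio.runtime]
hooks_dir = ["/run/containers/oci/hooks.d", "/usr/share/containers/oci/hooks.d"]
EOF
sudo systemctl restart crio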

rupang790 · Jul 07 '21

@shivamerla, installing GPU Operator version 1.5.2 on the OKD 4.5 cluster succeeded, so version 1.5.2 will be used for my project. For the restricted-network installation of 1.5.2, I would like to confirm how to prepare and install it:

  1. Check all required images, create a local repository for them, and push the images to it.
  2. git clone https://github.com/NVIDIA/gpu-operator.git
  3. Change the image repositories in gpu-operator/deployment/gpu-operator/values.yaml (a hedged values sketch follows the commands below).
  4. Install with the helm CLI as below:
$ helm lint ./gpu-operator
$ kubectl create ns gpu-operator
$ helm install ./gpu-operator -n gpu-operator --version 1.5.2 --set operator.defaultRuntime=crio,toolkit.version=1.4.0-ubi8 --wait --generate-name
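
As a sketch of step 3, the same overrides can also be supplied as a separate values file instead of editing values.yaml in place; the component and key names below are assumed from the v1.5.x chart layout and the registry name from this cluster, so they should be checked against the cloned chart before use:

# point the chart's image repositories at the local mirror (key names assumed)
cat <<'EOF' > local-registry-values.yaml
operator:
  repository: mirror.eluon.okd.com:5000/nvidia
driver:
  repository: mirror.eluon.okd.com:5000/nvidia
toolkit:
  repository: mirror.eluon.okd.com:5000/nvidia
devicePlugin:
  repository: mirror.eluon.okd.com:5000/nvidia
dcgmExporter:
  repository: mirror.eluon.okd.com:5000/nvidia
gfd:
  repository: mirror.eluon.okd.com:5000/nvidia
EOF

helm install ./gpu-operator -n gpu-operator \
  --set operator.defaultRuntime=crio,toolkit.version=1.4.0-ubi8 \
  -f local-registry-values.yaml --wait --generate-name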

I ask because, after that procedure, the gpu-operator pod went into CrashLoopBackOff, and its logs show:

$ kubectl logs -n gpu-operator gpu-operator-8678476587-jr24j
unknown flag: --leader-elect
Usage of gpu-operator:
unknown flag: --leader-elect
      --zap-devel                        Enable zap development mode (changes defaults to console encoder, debug log level, disables sampling and stacktrace from 'warning' level)
      --zap-encoder encoder              Zap log encoding ('json' or 'console')
      --zap-level level                  Zap log level (one of 'debug', 'info', 'error' or any integer value > 0) (default info)
      --zap-sample sample                Enable zap log sampling. Sampling will be disabled for integer log levels > 1
      --zap-stacktrace-level level       Set the minimum log level that triggers stacktrace generation (default error)
      --zap-time-encoding timeEncoding   Sets the zap time format ('epoch', 'millis', 'nano', or 'iso8601') (default )
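
The unknown flag: --leader-elect error suggests (as an assumption, not confirmed in this thread) that the deployed chart templates newer operator arguments than the 1.5.2 image understands, for example because the cloned chart came from a newer branch. One way to compare the deployed args against the image tag:

# show the operator image and the args the chart passed to it
kubectl -n gpu-operator get deploy gpu-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}{.spec.template.spec.containers[0].args}{"\n"}'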

And the events in the gpu-operator namespace:

$ kubectl get events -n gpu-operator
LAST SEEN   TYPE      REASON              OBJECT                               MESSAGE
3m43s       Normal    Scheduled           pod/gpu-operator-8678476587-jr24j    Successfully assigned gpu-operator/gpu-operator-8678476587-jr24j to k8s-master01
3m43s       Normal    AddedInterface      pod/gpu-operator-8678476587-jr24j    Add eth0 [10.244.32.142/32] from k8s-pod-network
2m8s        Normal    Pulled              pod/gpu-operator-8678476587-jr24j    Container image "mirror.eluon.okd.com:5000/nvidia/gpu-operator:1.5.2" already present on machine
2m8s        Normal    Created             pod/gpu-operator-8678476587-jr24j    Created container gpu-operator
2m7s        Normal    Started             pod/gpu-operator-8678476587-jr24j    Started container gpu-operator
2m6s        Warning   BackOff             pod/gpu-operator-8678476587-jr24j    Back-off restarting failed container
3m44s       Normal    SuccessfulCreate    replicaset/gpu-operator-8678476587   Created pod: gpu-operator-8678476587-jr24j
3m44s       Normal    ScalingReplicaSet   deployment/gpu-operator              Scaled up replica set gpu-operator-8678476587 to 1

Do you have any idea about it?

rupang790 · Aug 13 '21