gpu-operator
Entitlement-free build documentation is out of date for RHCOS on OCP 4.11 (manual image tag needed)
When following the steps defined at: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/install-gpu-ocp.html#create-the-clusterpolicy-instance
I don't end up with an entitlement-free build system. I do see the driver toolkit imagestream:
Tylers-MacBook-Pro:~ tylerlisowski$ oc get imagestream -n openshift driver-toolkit
NAME IMAGE REPOSITORY TAGS UPDATED
driver-toolkit image-registry.openshift-image-registry.svc:5000/openshift/driver-toolkit 411.86.202210032349-0,latest 55 minutes ago
I am using NVIDIA GPU Operator 22.9, and I followed these cluster policy steps for 4.11:
oc get csv -n nvidia-gpu-operator gpu-operator-certified.v22.9.0 -ojsonpath={.metadata.annotations.alm-examples} | jq .[0] > clusterpolicy.json
oc apply -f clusterpolicy.json
clusterpolicy.nvidia.com/gpu-cluster-policy configured
However, the driver container still tries to fall back and pull from yum repos:
error: a container name must be specified for pod nvidia-driver-daemonset-411.86.202210072320-0-lvwjl, choose one of: [nvidia-driver-ctr openshift-driver-toolkit-ctr] or one of the init containers: [k8s-driver-manager]
Tylers-MacBook-Pro:~ tylerlisowski$ oc logs -n nvidia-gpu-operator nvidia-driver-daemonset-411.86.202210072320-0-lvwjl -c nvidia-driver-ctr -f
Running nv-ctr-run-with-dtk
+ [[ true == \t\r\u\e ]]
+ echo 'WARNING: RHCOS '\''411.86.202210072320-0'\'' imagetag missing, using entitlement-based fallback'
WARNING: RHCOS '411.86.202210072320-0' imagetag missing, using entitlement-based fallback
+ exec bash -x nvidia-driver init
+ set -eu
Does a manual image tag need to be created to use this path?
I fixed it by manually creating the tag:
oc -n openshift tag quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1f5f1ae25db67aa82707e1b1dc96c8a53ef7094f320b7eeaef12be9a13fa251d driver-toolkit:411.86.202210072320-0
I built that command by looking up the digest of the existing driver-toolkit image with:
Tylers-MacBook-Pro:~ tylerlisowski$ oc -n openshift get imagetag | grep driver
driver-toolkit:411.86.202210032349-0 Scheduled image/sha256:1f5f1ae25db67aa82707e1b1dc96c8a53ef7094f320b7eeaef12be9a13fa251d 1 About an hour ago
driver-toolkit:411.86.202210072320-0 Tag image/sha256:1f5f1ae25db67aa82707e1b1dc96c8a53ef7094f320b7eeaef12be9a13fa251d 1 3 minutes ago
driver-toolkit:latest Scheduled image/sha256:1f5f1ae25db67aa82707e1b1dc96c8a53ef7094f320b7eeaef12be9a13fa251d 1 About an hour ago
Thanks for reporting this @relyt0925. We are working with RH to understand why the tag was missing from the imagestream. The version is picked from the NFD label feature.node.kubernetes.io/system-os_release.OSTREE_VERSION, and we expect a tag matching the currently running RHCOS version to be present.
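To check for this mismatch on a cluster, the comparison described above can be sketched as a small shell helper. This is only an illustration, not the operator's own logic: `dtk_tag_present` is a hypothetical name, and the two `oc` lookups in the comments assume the node label and imagestream layout shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the RHCOS version reported by NFD
# appears among the driver-toolkit imagestream tags, fail (exit 1) otherwise.
#   $1: value of the OSTREE_VERSION node label
#   $2: imagestream tags, separated by commas and/or whitespace
dtk_tag_present() {
    want=$1
    # Split the tag list on commas and whitespace and compare each entry.
    for tag in $(printf '%s' "$2" | tr ',' ' '); do
        [ "$tag" = "$want" ] && return 0
    done
    return 1
}

# Assumed way to collect the inputs on a live cluster (NODE_NAME is a placeholder):
#   version=$(oc get node NODE_NAME -o jsonpath='{.metadata.labels.feature\.node\.kubernetes\.io/system-os_release\.OSTREE_VERSION}')
#   tags=$(oc get imagestream driver-toolkit -n openshift -o jsonpath='{.spec.tags[*].name}')
#   dtk_tag_present "$version" "$tags" || echo "driver-toolkit:$version is missing"
```

With the values from the original report (label 411.86.202210072320-0, tags 411.86.202210032349-0 and latest), the helper fails, which matches the "imagetag missing" warning in the driver container log.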
Thanks @shivamerla . Yeah we figured as much from a past conversation, but couldn't tell if we'd managed to hit some degenerate case bc of a particular version of either the operator or OpenShift itself. For our mutual clients, what would we want to offer as guidance, ie "if you see ^^, then ___." Would @relyt0925 's workaround be a reasonable interim fix?
@mikehollinger yes, the workaround of tagging manually seems reasonable until the root cause is identified. @fabiendupont Can you help identify why there is a version mismatch here? Maybe NFD didn't label the version correctly?
@relyt0925 @mikehollinger we tried to reproduce this and saw that the tag 411.86.202210072320-0 was present by default.
$ oc get imagestream -n openshift driver-toolkit
NAME IMAGE REPOSITORY TAGS UPDATED
driver-toolkit 411.86.202210072320-0,latest 3 days ago
also, label:
feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202210072320-0
We need to understand why the DTK tag is missing in your case. Do you see this happening with other clusters too?
@relyt0925 ^^ ?
I hit the same issue while installing the GPU Operator on IBM RH OpenShift 4.10:
Running nv-ctr-run-with-dtk
+ [[ true == \t\r\u\e ]]
+ echo 'WARNING: RHCOS '\''410.84.202303060052-0'\'' imagetag missing, using entitlement-based fallback'
+ exec bash -x nvidia-driver init
WARNING: RHCOS '410.84.202303060052-0' imagetag missing, using entitlement-based fallback
de213022@Rainers-MBP ~ % oc get imagestream -n openshift driver-toolkit
NAME IMAGE REPOSITORY TAGS UPDATED
driver-toolkit image-registry.openshift-image-registry.svc:5000/openshift/driver-toolkit 410.84.202302090253-0,latest 2 hours ago
Creating the tag manually fixed it.
@fabiendupont can you please check why the DTK image tags can be missing in this case?
@shivamerla this is expected in the HyperShift flavor of OpenShift (https://github.com/openshift/hypershift/tree/main/hypershift-operator/controllers): masters and workers can be at different patch versions, hence the difference in tags and the current need for the additional tagging. We have clusters that can be looked at to see this.
Ultimately, I believe the operator needs a bit more logic to choose the right tags in these environments (and in the interim this workaround can be used).
Note: the algorithm, which I believe the NVIDIA GPU operator should handle long term in HyperShift environments, is the following.
With NFD deployed, each node is labeled with its OSTree version:
feature.node.kubernetes.io/system-os_release.OSTREE_VERSION: 412.86.202310141028-0
(note that this value is the last part of the driver-toolkit tag)
Then, if the release image associated with the RHCOS node can be looked up, the following can be run:
oc adm release info RELEASEIMAGE | grep driver-toolkit
Example:
oc adm release info us.icr.io/armada-master/ocp-release:4.12.47-x86_64 | grep driver-toolkit
driver-toolkit sha256:5e4c83a34f34bbb8d07891afa5090539aed9c0a6511c3be5655e18f1a32f90ab
With that data, everything needed to create the tag is available:
oc -n openshift tag quay.io/openshift-release-dev/ocp-v4.0-art-dev@DRIVER_TOOLKIT_SHA_VAL driver-toolkit:OSTREE_VERSION_VAL
From the example above:
oc -n openshift tag quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5e4c83a34f34bbb8d07891afa5090539aed9c0a6511c3be5655e18f1a32f90ab driver-toolkit:412.86.202310141028-0
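The steps above can be strung together in a short script. The following is a minimal sketch, assuming the `oc adm release info` output contains a line of the form `driver-toolkit sha256:...` as in the example, and that the image lives in the quay.io/openshift-release-dev/ocp-v4.0-art-dev repository used above. `dtk_tag_command` is a hypothetical helper name; it only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Hypothetical helper: build the `oc tag` command for a node's RHCOS version.
#   stdin: output of `oc adm release info RELEASEIMAGE`
#   $1:    OSTREE_VERSION value from the NFD node label
dtk_tag_command() {
    ostree_version=$1
    # Take the digest from the first line whose first field is "driver-toolkit".
    digest=$(awk '$1 == "driver-toolkit" && !found { print $2; found = 1 }')
    if [ -z "$digest" ]; then
        echo "driver-toolkit digest not found in release info" >&2
        return 1
    fi
    # Repository path is taken from the workaround in this thread (an assumption
    # for other environments).
    printf 'oc -n openshift tag quay.io/openshift-release-dev/ocp-v4.0-art-dev@%s driver-toolkit:%s\n' \
        "$digest" "$ostree_version"
}

# Assumed usage (RELEASEIMAGE and OSTREE_VERSION_VAL are placeholders):
#   oc adm release info RELEASEIMAGE | dtk_tag_command OSTREE_VERSION_VAL
```

Printing instead of executing keeps the sketch safe to try; once the output looks right, it can be run by hand or piped to `sh`.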