gpu-operator

entitlement free system documentation out of date to work with RHCOS on OCP 4.11 (need to do manual tag)

Open relyt0925 opened this issue 2 years ago • 12 comments

When following the steps defined at: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/install-gpu-ocp.html#create-the-clusterpolicy-instance

I don't end up with an entitlement-free build system. I do see the driver-toolkit imagestream:

Tylers-MacBook-Pro:~ tylerlisowski$ oc get imagestream -n openshift driver-toolkit
NAME             IMAGE REPOSITORY                                                            TAGS                           UPDATED
driver-toolkit   image-registry.openshift-image-registry.svc:5000/openshift/driver-toolkit   411.86.202210032349-0,latest   55 minutes ago

I use NVIDIA GPU Operator 22.9.

I then follow the ClusterPolicy steps for 4.11:

oc get csv -n nvidia-gpu-operator gpu-operator-certified.v22.9.0 -ojsonpath={.metadata.annotations.alm-examples} | jq .[0] > clusterpolicy.json
oc apply -f clusterpolicy.json 
clusterpolicy.nvidia.com/gpu-cluster-policy configured

However, the driver container still falls back to the entitlement-based path and tries to pull from yum repos:

error: a container name must be specified for pod nvidia-driver-daemonset-411.86.202210072320-0-lvwjl, choose one of: [nvidia-driver-ctr openshift-driver-toolkit-ctr] or one of the init containers: [k8s-driver-manager]
Tylers-MacBook-Pro:~ tylerlisowski$ oc logs -n nvidia-gpu-operator nvidia-driver-daemonset-411.86.202210072320-0-lvwjl -c nvidia-driver-ctr -f
Running nv-ctr-run-with-dtk
+ [[ true == \t\r\u\e ]]
+ echo 'WARNING: RHCOS '\''411.86.202210072320-0'\'' imagetag missing, using entitlement-based fallback'
WARNING: RHCOS '411.86.202210072320-0' imagetag missing, using entitlement-based fallback
+ exec bash -x nvidia-driver init
+ set -eu

Does a manual image tag need to be created to use this path?
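To see quickly which RHCOS version the driver container could not find a tag for, the warning line can be pulled out of the logs. This is a minimal sketch; the function name `missing_dtk_tag` is hypothetical, and it assumes the warning text has exactly the form shown in the log above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the GPU Operator): reads driver container
# logs on stdin and prints each RHCOS version named in the
# "imagetag missing" warning, one per line, without duplicates.
missing_dtk_tag() {
  # Only the plain WARNING line matches; the "+ echo ..." trace line
  # produced by `bash -x` starts with "+" and is skipped.
  sed -n "s/^WARNING: RHCOS '\([^']*\)' imagetag missing.*/\1/p" | sort -u
}

# Usage (the pod name is a placeholder):
#   oc logs -n nvidia-gpu-operator <driver-pod> -c nvidia-driver-ctr | missing_dtk_tag
```

If the function prints a version, that is the tag the driver-toolkit imagestream is expected to carry.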

relyt0925 avatar Oct 27 '22 01:10 relyt0925

I fixed it by manually creating the tag:

oc -n openshift tag quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1f5f1ae25db67aa82707e1b1dc96c8a53ef7094f320b7eeaef12be9a13fa251d driver-toolkit:411.86.202210072320-0

I got the image digest by looking at the existing driver-toolkit image tags:

Tylers-MacBook-Pro:~ tylerlisowski$ oc -n openshift get imagetag | grep driver
driver-toolkit:411.86.202210032349-0                        Scheduled   image/sha256:1f5f1ae25db67aa82707e1b1dc96c8a53ef7094f320b7eeaef12be9a13fa251d   1         About an hour ago
driver-toolkit:411.86.202210072320-0                        Tag         image/sha256:1f5f1ae25db67aa82707e1b1dc96c8a53ef7094f320b7eeaef12be9a13fa251d   1         3 minutes ago
driver-toolkit:latest                                       Scheduled   image/sha256:1f5f1ae25db67aa82707e1b1dc96c8a53ef7094f320b7eeaef12be9a13fa251d   1         About an hour ago

relyt0925 avatar Oct 27 '22 01:10 relyt0925

Thanks for reporting this @relyt0925. Working with RH to understand why the tag was missing from the imagestream. This version is picked from the NFD label feature.node.kubernetes.io/system-os_release.OSTREE_VERSION and we expect this tag to be present to match the current running RHCOS version.

shivamerla avatar Oct 27 '22 16:10 shivamerla

Thanks @shivamerla . Yeah we figured as much from a past conversation, but couldn't tell if we'd managed to hit some degenerate case bc of a particular version of either the operator or OpenShift itself. For our mutual clients, what would we want to offer as guidance, ie "if you see ^^, then ___." Would @relyt0925 's workaround be a reasonable interim fix?

mikehollinger avatar Oct 27 '22 19:10 mikehollinger

@mikehollinger yes, the workaround of tagging manually seems reasonable until the root cause is identified. @fabiendupont Can you help identify why there is a version mismatch here? Maybe NFD didn't label the version correctly?

shivamerla avatar Oct 27 '22 21:10 shivamerla

@relyt0925 @mikehollinger we have tried to reproduce this and see that tag 411.86.202210072320-0 was present by default.

$  oc get imagestream -n openshift driver-toolkit
NAME             IMAGE REPOSITORY   TAGS                           UPDATED
driver-toolkit                      411.86.202210072320-0,latest   3 days ago

also, label:

feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202210072320-0

We need to understand why the DTK tag is missing in your case. Do you see this happening with other clusters too?
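The comparison above (NFD label against imagestream tags) could be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and the jsonpath expressions assume the imagestream exposes its tags under `.status.tags[*].tag` and that NFD has labeled the nodes.

```shell
#!/usr/bin/env bash
# tag_present VERSION TAG...: succeeds if VERSION is one of the given tags.
tag_present() {
  local want=$1; shift
  local t
  for t in "$@"; do
    [ "$t" = "$want" ] && return 0
  done
  return 1
}

# check_dtk_tags: for every distinct OSTREE_VERSION label on the nodes,
# report whether the driver-toolkit imagestream has a matching tag.
# Requires a logged-in `oc`; nodes without the label are ignored.
check_dtk_tags() {
  local tags versions v
  tags=$(oc get imagestream driver-toolkit -n openshift \
    -o jsonpath='{.status.tags[*].tag}')
  versions=$(oc get nodes \
    -o jsonpath='{range .items[*]}{.metadata.labels.feature\.node\.kubernetes\.io/system-os_release\.OSTREE_VERSION}{"\n"}{end}' \
    | sort -u)
  for v in $versions; do
    # $tags is intentionally unquoted so each tag becomes its own argument.
    if tag_present "$v" $tags; then
      echo "$v: present"
    else
      echo "$v: MISSING"
    fi
  done
}
```

Running `check_dtk_tags` on an affected cluster should list the worker's RHCOS version as MISSING.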

shivamerla avatar Nov 01 '22 23:11 shivamerla

@relyt0925 ^^ ?

mikehollinger avatar Nov 02 '22 00:11 mikehollinger

I hit the same issue while installing the GPU Operator on IBM RH Openshift 4.10

Running nv-ctr-run-with-dtk
+ [[ true == \t\r\u\e ]]
+ echo 'WARNING: RHCOS '\''410.84.202303060052-0'\'' imagetag missing, using entitlement-based fallback'
+ exec bash -x nvidia-driver init
WARNING: RHCOS '410.84.202303060052-0' imagetag missing, using entitlement-based fallback
de213022@Rainers-MBP ~ % oc get imagestream -n openshift driver-toolkit
NAME             IMAGE REPOSITORY                                                            TAGS                           UPDATED
driver-toolkit   image-registry.openshift-image-registry.svc:5000/openshift/driver-toolkit   410.84.202302090253-0,latest   2 hours ago

Creating the tag manually fixed it.

rhocheck avatar Mar 30 '23 15:03 rhocheck

@fabiendupont can you please check why the DTK image tags can be missing in this case?

shivamerla avatar Mar 30 '23 19:03 shivamerla

@shivamerla this is expected in the HyperShift flavor of OpenShift (https://github.com/openshift/hypershift/tree/main/hypershift-operator/controllers), where masters and workers can be at different patch versions. Hence the difference in tags and the current need to do the additional tagging. We do have clusters that can be looked at to see this.

Ultimately I believe the operator has to have a bit more logic to appropriately choose the tags in these environments (and in the interim this workaround can be done)

relyt0925 avatar Apr 26 '23 14:04 relyt0925

Note: the algorithm below is something that I believe should be handled long term by the NVIDIA GPU Operator in HyperShift environments.

relyt0925 avatar Feb 13 '24 03:02 relyt0925

With NFD deployed, each node is labeled with the appropriate OSTree version:

      feature.node.kubernetes.io/system-os_release.OSTREE_VERSION: 412.86.202310141028-0

(note that this is the last part of the driver tag)

Then, if the release image associated with the RHCOS node can be looked up, the following can be run: oc adm release info RELEASEIMAGE | grep driver-toolkit. For example:

oc adm release info us.icr.io/armada-master/ocp-release:4.12.47-x86_64 | grep driver-toolkit
     driver-toolkit                sha256:5e4c83a34f34bbb8d07891afa5090539aed9c0a6511c3be5655e18f1a32f90ab

With that data, everything needed to create the tag is available:

oc -n openshift tag quay.io/openshift-release-dev/ocp-v4.0-art-dev@DRIVER_TOOLKIT_SHA_VAL driver-toolkit:OSTREE_VERSION_VAL

From the example above:

oc -n openshift tag quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5e4c83a34f34bbb8d07891afa5090539aed9c0a6511c3be5655e18f1a32f90ab driver-toolkit:412.86.202310141028-0
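The steps above can be sketched as two small shell helpers. The names `dtk_digest` and `dtk_tag_command` are hypothetical, and the sketch assumes the `oc adm release info` output lists `driver-toolkit` followed by its digest as in the example. It only prints the tag command, so it can be reviewed before being run.

```shell
#!/usr/bin/env bash
# dtk_digest: reads `oc adm release info` output on stdin and prints the
# digest from the first line whose first column is "driver-toolkit".
dtk_digest() {
  awk '$1 == "driver-toolkit" { print $2; exit }'
}

# dtk_tag_command DIGEST OSTREE_VERSION: prints (does not run) the
# `oc tag` command that adds the missing driver-toolkit tag.
dtk_tag_command() {
  local digest=$1 version=$2
  echo "oc -n openshift tag quay.io/openshift-release-dev/ocp-v4.0-art-dev@${digest} driver-toolkit:${version}"
}

# Usage (RELEASE_IMAGE and OSTREE_VERSION are values you look up yourself):
#   digest=$(oc adm release info "$RELEASE_IMAGE" | dtk_digest)
#   dtk_tag_command "$digest" "$OSTREE_VERSION"
```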

relyt0925 avatar Feb 13 '24 03:02 relyt0925