GPU Operator on disconnected OpenShift 4.15
Hello, we installed GPU Operator v24.9.0, provided through the Red Hat Marketplace OLM catalog, in our disconnected OpenShift 4.15.28 cluster. nvidia-driver-ctr reports a complete installation and a quick vector-add application terminated correctly, so I don't think this issue is blocking. However, we have a few questions regarding nvidia-driver-ctr:
Default UBI Repositories / UBI Base images
The nvidia-driver-ctr container uses ubi8 repositories, while the openshift-driver-toolkit-ctr container uses ubi9 repositories. This is due to their underlying base image versions.
$ oc -n ocp-nvidia-gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-415.92.202408100433-0-2gcjc -- cat /etc/redhat-release
Red Hat Enterprise Linux release 8.10 (Ootpa)
$ oc -n ocp-nvidia-gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-415.92.202408100433-0-kcpkp -- cat /etc/yum.repos.d/ubi.repo
[ubi-8-baseos-rpms]
name = Red Hat Universal Base Image 8 (RPMs) - BaseOS
baseurl = https://cdn-ubi.redhat.com/content/public/ubi/dist/ubi8/8/$basearch/baseos/os
enabled = 1
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
gpgcheck = 1
....
$ oc -n ocp-nvidia-gpu-operator exec -c openshift-driver-toolkit-ctr nvidia-driver-daemonset-415.92.202408100433-0-2gcjc -- cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 (Plow)
$ oc -n ocp-nvidia-gpu-operator exec -c openshift-driver-toolkit-ctr nvidia-driver-daemonset-415.92.202408100433-0-kcpkp -- cat /etc/yum.repos.d/ubi.repo
[ubi-9-baseos]
name = Red Hat Universal Base Image 9 (RPMs) - BaseOS
baseurl = https://cdn-ubi.redhat.com/content/public/ubi/dist/ubi9/9/$basearch/baseos/os
enabled = 1
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
gpgcheck = 1
...
Our questions here:
- Does this difference in UBI repositories potentially break compatibility for the kernel modules?
- Will a newer version of nvidia-driver-ctr use UBI 9 soon?
Entitlement Key Mounting
As we are using Satellite to provide repositories in our disconnected environment, we need to entitle pods to access the different repositories.
$ oc get cm -n ocp-nvidia-gpu-operator yum-repos -o yaml
apiVersion: v1
data:
  redhat.repo: |
    [rhel-9-for-x86_64-baseos-rpms]
    name = Red Hat Enterprise Linux 9 for x86_64 - BaseOS (RPMs)
    baseurl = https://<9.2 baseos repo>
    enabled = 1
    gpgcheck = 0
    repo_gpgcheck = 0
    sslverify = 0
    module_hotfixes=true
    sslclientkey = /run/secrets/etc-pki-entitlement/<entitlement_key_name>-key.pem
    sslclientcert = /run/secrets/etc-pki-entitlement/<entitlement_key_name>.pem
    [rhel-9-for-x86_64-appstream-rpms]
    name = Red Hat Enterprise Linux 9 for x86_64 - AppStream (RPMs)
    baseurl = https://<9.2 appstream repo>
    enabled = 1
    gpgcheck = 0
    sslverify = 0
    repo_gpgcheck = 0
    module_hotfixes=true
    sslclientkey = /run/secrets/etc-pki-entitlement/<entitlement_key_name>-key.pem
    sslclientcert = /run/secrets/etc-pki-entitlement/<entitlement_key_name>.pem
kind: ConfigMap
metadata:
  name: yum-repos
  namespace: ocp-nvidia-gpu-operator
$ oc get clusterpolicies.nvidia.com gpu-cluster-policy -o json | jq '.spec.driver.repoConfig'
{
  "configMapName": "yum-repos"
}
The relevant ConfigMap (yum-repos) is mounted only to the nvidia-driver-ctr container, not the openshift-driver-toolkit-ctr container.
$ oc -n ocp-nvidia-gpu-operator exec -c openshift-driver-toolkit-ctr nvidia-driver-daemonset-415.92.202408100433-0-kcpkp -- cat /etc/yum.repos.d/redhat.repo
#
# Certificate-Based Repositories
# Managed by (rhsm) subscription-manager
#
# *** This file is auto-generated. Changes made here will be over-written. ***
# *** Use "subscription-manager repo-override --help" if you wish to make changes. ***
#
# If this file is empty and this system is subscribed consider
# a "yum repolist" to refresh available repos
#
$ oc -n ocp-nvidia-gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-415.92.202408100433-0-kcpkp -- cat /etc/yum.repos.d/redhat.repo
[rhel-9-for-x86_64-baseos-rpms]
name = Red Hat Enterprise Linux 9 for x86_64 - BaseOS (RPMs)
baseurl = https://<9.2 baseos repo>
enabled = 1
gpgcheck = 0
repo_gpgcheck = 0
sslverify = 0
module_hotfixes=true
sslclientkey = /run/secrets/etc-pki-entitlement/<entitlement_key_name>-key.pem
sslclientcert = /run/secrets/etc-pki-entitlement/<entitlement_key_name>.pem
.....
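For anyone who wants to confirm the same mount difference on their own cluster, here is a minimal sketch that prints each container of a pod followed by its volume mounts. The helper name container_mounts is ours (hypothetical, not part of the GPU Operator); it only wraps `oc get pod` with a jsonpath template, and we make no claim about what the mount for yum-repos is named.

```shell
# Hypothetical helper (not part of the GPU Operator): print every container
# of a pod, followed by the name and mount path of each of its volume mounts.
container_mounts() {
  ns="$1"    # namespace of the pod
  pod="$2"   # pod name
  oc -n "$ns" get pod "$pod" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\n"}{range .volumeMounts[*]}{"  "}{.name}{" -> "}{.mountPath}{"\n"}{end}{end}'
}

# Usage (pod name taken from the commands above):
# container_mounts ocp-nvidia-gpu-operator nvidia-driver-daemonset-415.92.202408100433-0-kcpkp
```

Comparing the two containers' mount lists side by side should show whether the repo ConfigMap is present in one and absent in the other.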
Our questions here:
- Is there a way to inject the yum-repos ConfigMap into the openshift-driver-toolkit-ctr container as well?
- If not, should we consider building a custom ImageStream for the GPU Operator that includes a modified driver-toolkit image with the necessary repositories pre-configured? Currently, the Operator injects a sidecar with openshift/istag/driver-toolkit:${RHCOS_VERSION}.
openshift-driver-toolkit-ctr Container Logs
The openshift-driver-toolkit-ctr container logs show attempts to enable additional repositories:
$ oc -n ocp-nvidia-gpu-operator logs -c openshift-driver-toolkit-ctr nvidia-driver-daemonset-415.92.202408100433-0-kcpkp | grep dnf
+ ln -s /usr/bin/true /mnt/shared-nvidia-driver-toolkit/bin/dnf --force
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
+ dnf config-manager --set-enabled rhocp-4.15-for-rhel-8-x86_64-rpms
+ dnf makecache --releasever=9.2
+ dnf config-manager --set-enabled rhel-8-for-x86_64-baseos-eus-rpms
+ dnf makecache --releasever=9.2
+ dnf makecache --releasever=9.2
+ dnf -q -y --releasever=9.2 install kernel-headers-5.14.0-284.79.1.el9_2.x86_64 kernel-devel-5.14.0-284.79.1.el9_2.x86_64
+ dnf -q -y --releasever=9.2 install kernel-core-5.14.0-284.79.1.el9_2.x86_64
+ dnf install -q -y --releasever=9.2 gcc-
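One thing that stands out in this trace is that the repo ids passed to --set-enabled carry rhel-8 in their names, while every dnf call runs with --releasever=9.2. A minimal sketch for pulling just those repo ids out of the log, assuming the trace format shown above; the function name enabled_repos is ours and not something the operator provides:

```shell
# Hypothetical helper: read a container log on stdin and print the unique
# repo ids that the entrypoint tried to enable via "dnf config-manager".
enabled_repos() {
  # The repo id is the last whitespace-separated field of each matching line.
  awk '/dnf config-manager --set-enabled/ { print $NF }' | LC_ALL=C sort -u
}

# Usage (pod name taken from the command above):
# oc -n ocp-nvidia-gpu-operator logs -c openshift-driver-toolkit-ctr \
#   nvidia-driver-daemonset-415.92.202408100433-0-kcpkp | enabled_repos
```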
Question:
What specific repositories are truly needed for the openshift-driver-toolkit-ctr container to function correctly?
Need for openshift-driver-toolkit-ctr container
Since nvidia-driver-ctr reports a complete installation despite the problems described above, do we really need the openshift-driver-toolkit-ctr sidecar container?
Thank you in advance for your help!