
Install On Air-Gapped OKD 4.15.0-0 FCOS

Open jvincze84 opened this issue 7 months ago • 1 comment

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug

We are trying to install the operator on OKD, but we get this error:

{"level":"error","ts":"2025-05-19T13:49:04Z","msg":"Reconciler error","controller":"clusterpolicy-controller","object":{"name":"gpu-cluster-policy"},"namespace":"","name":"gpu-cluster-policy","reconcileID":"a535d3ea-ebc7-4c22-9e62-7c372c6814c0","error":"failed to handle OpenShift Driver Toolkit Daemonset for version 39.20240210.3.0: ERROR: failed to get destination directory for custom repo config: distribution not supported"}

We have an air-gapped environment, so we are trying to use the repoConfig option:

    repoConfig:
      configMapName: repo-config
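
For context, the repo-config ConfigMap just wraps a yum-style .repo file pointing at our internal mirror. A minimal sketch, with placeholder names and URLs rather than our real mirror (created in the operator namespace, e.g. with kubectl create configmap repo-config --from-file=custom-mirror.repo):

apiVersion: v1
kind: ConfigMap
metadata:
  name: repo-config
data:
  custom-mirror.repo: |
    # Placeholder repo definition for an internal package mirror
    [internal-mirror]
    name=Internal package mirror (placeholder)
    baseurl=https://mirror.example.internal/fedora/39/x86_64/
    enabled=1
    gpgcheck=0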

But we noticed that "fedora" is missing from the map of supported distributions:

https://github.com/NVIDIA/gpu-operator/blob/349cf4ff779d98401cf923bda34498b48adb68e4/internal/state/driver_volumes.go#L33-L39
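
For what it is worth, the lookup seems to be a plain map from the host distribution ID (from /etc/os-release) to the directory where the repo config should be written in the driver container. The sketch below only illustrates that pattern and the error we hit; the keys and paths are assumptions, not a copy of the linked file:

package main

import "fmt"

// Illustrative only: keys and destination paths are assumptions, not copied
// from internal/state/driver_volumes.go.
var repoConfigPathMap = map[string]string{
    "ubuntu": "/etc/apt/sources.list.d",
    "rhel":   "/etc/yum.repos.d",
    "rhcos":  "/etc/yum.repos.d",
    // no "fedora" key, which is what OKD on Fedora CoreOS reports (ID=fedora)
}

func destinationDir(osID string) (string, error) {
    dir, ok := repoConfigPathMap[osID]
    if !ok {
        // Matches the shape of the operator log we see:
        // "failed to get destination directory for custom repo config: distribution not supported"
        return "", fmt.Errorf("distribution not supported: %s", osID)
    }
    return dir, nil
}

func main() {
    if _, err := destinationDir("fedora"); err != nil {
        fmt.Println(err) // distribution not supported: fedora
    }
}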

Details:

[gpu-operator@gpu-operator-6ffdc677f6-92828 /]$ cat /host-etc/os-release
NAME="Fedora Linux"
VERSION="39.20240210.3.0 (CoreOS)"
ID=fedora
VERSION_ID=39
VERSION_CODENAME=""
PLATFORM_ID="platform:f39"
PRETTY_NAME="Fedora CoreOS 39.20240210.3.0"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:39"
HOME_URL="https://getfedora.org/coreos/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora-coreos/"
SUPPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
BUG_REPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=39
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=39
SUPPORT_END=2024-11-12
VARIANT="CoreOS"
VARIANT_ID=coreos
OSTREE_VERSION='39.20240210.3.0'

Note that ID=fedora.

Is this a bug, or is it intentional? Is OKD on Fedora CoreOS supported?

To Reproduce

Install the operator in an air-gapped environment with a custom repo config.

Expected behavior

Successful install on air-gapped OKD.

Environment (please provide the following information):

  • GPU Operator Version: 25.3.0
  • OS: Fedora CoreOS 39.20240210.3.0
  • Kernel Version: 6.7.4-200.fc39.x86_64
  • Container Runtime Version: v1.28.7+6e2789b (crio)
  • Kubernetes Distro and Version: OKD - Cluster version is 4.15.0-0.okd-2024-03-10-010116

Information to attach (optional if deemed irrelevant)

  • [X] kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
NAME                                                     READY   STATUS             RESTARTS         AGE
gpu-feature-discovery-4xc4h                              0/1     Init:0/1           0                4m4s
gpu-operator-6ffdc677f6-92828                            1/1     Running            0                53m
nvidia-container-toolkit-daemonset-mn5mj                 0/1     Init:0/1           0                4m5s
nvidia-dcgm-exporter-86njk                               0/1     Init:0/1           0                4m5s
nvidia-dcgm-qbjfc                                        0/1     Init:0/1           0                4m5s
nvidia-device-plugin-daemonset-cv24p                     0/1     Init:0/1           0                4m5s
nvidia-driver-daemonset-39.20240210.3.0-88f4n            1/2     CrashLoopBackOff   24 (4m11s ago)   108m
nvidia-driver-daemonset-392024021030-88f4n-debug-mrq2c   2/2     Running            0                4m37s
nvidia-node-status-exporter-8kk4n                        1/1     Running            0                108m
nvidia-operator-validator-mvr4n                          0/1     Init:0/4           0                4m5s
  • [ ] kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                   AGE
gpu-feature-discovery                     1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                                                                3d3h
nvidia-container-toolkit-daemonset        1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true                                                                    3d3h
nvidia-dcgm                               1         1         0       1            0           nvidia.com/gpu.deploy.dcgm=true                                                                                 3d3h
nvidia-dcgm-exporter                      1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true                                                                        3d3h
nvidia-device-plugin-daemonset            1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true                                                                        3d3h
nvidia-device-plugin-mps-control-daemon   0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true                                            3d3h
nvidia-driver-daemonset-39.20240210.3.0   1         1         0       1            0           feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=39.20240210.3.0,nvidia.com/gpu.deploy.driver=true   3d3h
nvidia-mig-manager                        0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                                                          3d3h
nvidia-node-status-exporter               1         1         1       1            1           nvidia.com/gpu.deploy.node-status-exporter=true                                                                 3d3h
nvidia-operator-validator                 1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true                                                                   3d3h
  • [ ] If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • [ ] If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • [ ] Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • [ ] containerd logs journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

Thanks a lot.

jvincze84 • May 19 '25 13:05