Heterogeneous air-gapped cluster fails to detect custom repo ConfigMap
1. Quick Debug Information
- OS/Version: master node and non-GPU worker node: Rocky Linux 8.8; GPU worker node: RHEL 8.8
- Kernel Version: 4.18.0-477.15.1.el8_8.x86_64
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): k8s (v1.24.12)
- GPU Operator Version: 23.3.2
2. Issue or feature description
The Kubernetes cluster runs in an air-gapped environment, with the master node and the non-GPU worker node on Rocky Linux 8.8 and the GPU worker node on RHEL 8.8. The gpu-operator pod logs report that the custom repo ConfigMap injected into the driver DaemonSet is not supported. An image of the gpu-operator pod logs was attached showing this message. As a result, only the gpu-operator and gpu-node-feature-discovery pods come up; the remaining pods, such as driver, container-toolkit and dcgm-exporter, are missing, and their DaemonSets are not present either.
3. Steps to reproduce the issue
- Create a Kubernetes cluster with 1 master node (Rocky Linux 8.8) and 2 worker nodes (one on Rocky Linux 8.8, the other a GPU node on RHEL 8.8)
- Install GPU Operator version 23.3.2 through Helm, passing a custom ConfigMap for the driver in values.yaml:
  driver:
    repoConfig:
      configMapName: "repo-config"
4. Information to attach (optional if deemed irrelevant)
- [ ] kubernetes all resource status:
kubectl get all -n gpu-operator
5. Our debug analysis
- We found that the gpu-operator pod is always scheduled on the master node because of its nodeAffinity.
- We edited the gpu-operator Deployment and set nodeName so that the pod is scheduled on the GPU node. Once the gpu-operator pod was running on the GPU node, all the pods (driver DaemonSet, toolkit, dcgm-exporter, etc.) came up.
- If heterogeneous clusters were simply not supported, we would expect the same failure without the custom ConfigMap. However, when we do not pass the custom ConfigMap (non-air-gapped scenario), there is no error in the gpu-operator pod logs and all the pods (driver, container-toolkit, dcgm-exporter, etc.) are up.
- We want to understand why the custom ConfigMap check was added and, given that it exists, why the distribution is reported as not supported for a heterogeneous cluster in an air-gapped environment.
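The nodeName workaround from the second bullet can also be applied as a merge patch instead of a manual edit. The following is only a sketch: the node name gpu-worker-1 is a placeholder, and the Deployment name and namespace mentioned in the usage note are assumptions about a default Helm install, not details taken from this report.

```python
import json


def node_pin_patch(node_name: str) -> str:
    """Build a JSON merge patch that sets spec.template.spec.nodeName."""
    patch = {"spec": {"template": {"spec": {"nodeName": node_name}}}}
    return json.dumps(patch)


# "gpu-worker-1" is a hypothetical node name; substitute the real GPU node.
print(node_pin_patch("gpu-worker-1"))
```

The printed JSON could then be passed to something like `kubectl patch deployment gpu-operator -n gpu-operator --type merge -p '<json>'`. Setting nodeName bypasses the scheduler, so the existing nodeAffinity no longer decides where the pod lands.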
@alloydm this is a bug that we need to address. For the GPU stack we only check the OS version labels on GPU nodes, but when injecting the custom ConfigMap we read /etc/os-release on the node where the operator is currently running. We can change this to use the OS version label on the GPU node as well. We will fix this in an upcoming patch; until then, your workaround of letting the operator pod run on worker nodes is a good option.
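To make the mismatch concrete, here is a small sketch of what happens when the distribution is taken from the operator's own node rather than from the GPU node. The operator itself is written in Go; Python is used here only for brevity. The lookup table is an illustrative subset, assumed for this example, and the os-release contents are abbreviated samples rather than files captured from the cluster.

```python
# Illustrative subset of distribution IDs with a known repo directory.
# Assumption: "rocky" has no entry, which is what yields "not supported".
REPO_CONFIG_PATHS = {
    "rhel": "/etc/yum.repos.d",
    "centos": "/etc/yum.repos.d",
    "ubuntu": "/etc/apt/sources.list.d",
}


def parse_os_release_id(text: str) -> str:
    """Return the ID field of an os-release file, unquoted and lowercased."""
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "ID":
            return value.strip().strip("\"'").lower()
    raise ValueError("no ID field found")


# Abbreviated sample contents (not copied from the affected cluster).
operator_node = 'NAME="Rocky Linux"\nID="rocky"\nVERSION_ID="8.8"\n'
gpu_node = 'NAME="Red Hat Enterprise Linux"\nID="rhel"\nVERSION_ID="8.8"\n'

for name, content in (("operator node", operator_node), ("GPU node", gpu_node)):
    os_id = parse_os_release_id(content)
    print(name, os_id, REPO_CONFIG_PATHS.get(os_id, "not supported"))
```

With the operator pod on the Rocky master, the lookup misses even though the driver would only ever run on the RHEL node, which matches the behaviour described in the report.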
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.
This has not been addressed. Removing the lifecycle/stale label and marking this as a bug.
The getRepoConfig() method should be updated to use the OS version labels from GPU worker nodes (added by NFD) when determining what OS-specific paths to use for repository configuration files.
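A rough sketch of the label-based selection proposed above, written in Python for brevity (the operator is Go, and this is not its actual code). It assumes NFD publishes the distribution under the label feature.node.kubernetes.io/system-os_release.ID and that GPU nodes carry nvidia.com/gpu.present=true. The path table, node names and the choice to reject GPU nodes with mixed distributions are assumptions made for the example.

```python
GPU_LABEL = "nvidia.com/gpu.present"
OS_ID_LABEL = "feature.node.kubernetes.io/system-os_release.ID"

# Illustrative subset of supported distributions (assumed for this sketch).
REPO_CONFIG_PATHS = {
    "rhel": "/etc/yum.repos.d",
    "centos": "/etc/yum.repos.d",
    "ubuntu": "/etc/apt/sources.list.d",
}


def repo_config_path(nodes: dict[str, dict[str, str]]) -> str:
    """Choose the repo directory from the OS label of GPU nodes only."""
    os_ids = {
        labels.get(OS_ID_LABEL, "")
        for labels in nodes.values()
        if labels.get(GPU_LABEL) == "true"
    }
    if not os_ids:
        raise LookupError("no GPU nodes found")
    if len(os_ids) > 1:
        # Simplification: this sketch does not handle mixed GPU node OSes.
        raise LookupError(f"GPU nodes run different distributions: {sorted(os_ids)}")
    os_id = os_ids.pop()
    if os_id not in REPO_CONFIG_PATHS:
        raise LookupError(f"distribution {os_id!r} not supported")
    return REPO_CONFIG_PATHS[os_id]


# Hypothetical node names mirroring the cluster layout in this issue.
cluster = {
    "master": {OS_ID_LABEL: "rocky"},
    "worker-1": {OS_ID_LABEL: "rocky"},
    "gpu-worker-1": {OS_ID_LABEL: "rhel", GPU_LABEL: "true"},
}
print(repo_config_path(cluster))
```

Because only labelled GPU nodes are consulted, the result no longer depends on which node the operator pod happens to be scheduled on.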