Node Selector Missing + Image reference invalid
We installed GPU operator version 1.7.1 in one of our OCP 4.7.16 clusters. We are facing two issues with our current deployment:
- the driver daemonset does not include a node selector, so driver pods are also scheduled on nodes without GPUs installed. This leads to CrashLooping driver pods.
- the GPU operator logs include an error message regarding the DCGM exporter: Failed to apply transformation 'nvidia-dcgm-exporter' with error: 'Invalid values for building container image path provided, please update the ClusterPolicy instance' {"Daemonset": "nvidia-dcgm-exporter"}
Can you tell us what is wrong with our current ClusterPolicy configuration?
Also, for reference, we are using the latest 4.7 version of the Node Feature Discovery operator with the image registry.redhat.io/openshift4/ose-node-feature-discovery:v4.7.0-202106090743.p0.git.5b1bc4f.
Our ClusterPolicy configuration:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  migManager:
    image: k8s-mig-manager
    repository: dockerregistry.site.example.com/nvidia
    version: v0.1.0-ubi8
  operator:
    defaultRuntime: crio
  gfd:
    image: gpu-feature-discovery
    repository: dockerregistry.site.example.com/nvidia
    version: v0.4.1
  dcgmExporter:
    image: dcgm-exporter
    repository: dockerregistry.site.example.com/nvidia
    version: 2.1.8-2.4.0-rc.2-ubi8
  driver:
    image: driver
    repository: dockerregistry.site.example.com/nvidia
    version: 460.73.01
  devicePlugin:
    args:
      - '--mig-strategy=mixed'
      - '--pass-device-specs=true'
      - '--fail-on-init-error=true'
      - '--device-list-strategy=volume-mounts'
      - '--nvidia-driver-root=/run/nvidia/driver'
    image: k8s-device-plugin
    repository: dockerregistry.site.example.com/nvidia
    version: v0.9.0-ubi8
  mig:
    strategy: mixed
  validator:
    image: gpu-operator-validator
    repository: dockerregistry.site.example.com/nvidia
    version: v1.7.0
  toolkit:
    image: container-toolkit
    repository: dockerregistry.site.example.com/nvidia
    version: 1.5.0-ubi8
@koflerm The initContainer spec is missing here. It is required for some of the init containers we use (dcgm etc.). Please refer here and add it along with nodeSelector fields for all components. We are removing nodeSelector as a configurable field in the next version, but with 1.7.1 it has to be explicitly specified in the ClusterPolicy instance.
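For illustration, a minimal sketch of the two missing pieces; the cuda image coordinates below are only an example of the pattern (mirror and tag are assumptions), and the driver block stands in for the other components, each of which needs its own matching nodeSelector:

spec:
  operator:
    initContainer:
      image: cuda                                      # example image name, adjust to your mirror
      repository: dockerregistry.site.example.com/nvidia
      version: 11.4.0-base-ubi8                        # example tag
  driver:
    nodeSelector:
      nvidia.com/gpu.deploy.driver: 'true'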
Hi @shivamerla. 1.6.x applied nodeSelectors (nvidia.com/gpu.present: 'true') automatically to all DaemonSets. Did this change in 1.7.x?
@geoberle Yes, each component now has its own nodeSelector label, as listed below. The GPU operator will automatically add these labels to nodes with NVIDIA GPUs. The reason for this granularity is to be able to evict individual component pods when required (for example, a MIG config change requires the device-plugin and GFD pods to be evicted on the node and started again).
nvidia.com/gpu.deploy.mig-manager=true
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=true
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.operator-validator=true
As we don't expect users to change these values, they should have been added by default, but that was missed in 1.7.x. So they have to be specified in the ClusterPolicy instance as defined here.
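To verify which of these labels the operator has set on a given GPU node, something like the following should work (NODE_NAME is a placeholder):

oc describe node ${NODE_NAME} | grep 'nvidia.com/gpu.deploy'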
Oh understood. Makes absolute sense. Nice design! Thank you for clarifying.
@shivamerla This was already very useful; I was now able to deploy all components. The only problem I see now: after the initial installation, if I re-deploy the whole ClusterPolicy / NVIDIA setup, the driver pods keep crashing with the error "driver in use", presumably because another NVIDIA component is quicker to claim the GPUs.
Do you also have a fix for that? This is my updated version of the ClusterPolicy:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  dcgmExporter:
    repository: dockerregistry.site.example.com/nvidia
    version: "2.1.8-2.4.0-rc.2-ubi8"
    image: dcgm-exporter
    nodeSelector:
      nvidia.com/gpu.deploy.dcgm-exporter: 'true'
  devicePlugin:
    repository: dockerregistry.site.example.com/nvidia
    version: v0.9.0-ubi8
    image: k8s-device-plugin
    nodeSelector:
      nvidia.com/gpu.deploy.device-plugin: 'true'
    env:
      - name: PASS_DEVICE_SPECS
        value: "true"
      - name: FAIL_ON_INIT_ERROR
        value: "true"
      - name: DEVICE_LIST_STRATEGY
        value: "envvar"
      - name: DEVICE_ID_STRATEGY
        value: "uuid"
      - name: NVIDIA_VISIBLE_DEVICES
        value: "all"
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: "all"
  driver:
    repository: dockerregistry.site.example.com/nvidia
    version: 460.73.01
    image: driver
    nodeSelector:
      nvidia.com/gpu.deploy.driver: 'true'
    # attention - the version field for the driver must be defined without the -rhcos4.6 suffix
  gfd:
    repository: dockerregistry.site.example.com/nvidia
    version: v0.4.1
    image: gpu-feature-discovery
    nodeSelector:
      nvidia.com/gpu.deploy.gpu-feature-discovery: 'true'
    env:
      - name: GFD_SLEEP_INTERVAL
        value: "60s"
      - name: FAIL_ON_INIT_ERROR
        value: "true"
  operator:
    defaultRuntime: crio
    deployGFD: true
    initContainer:
      image: cuda
      repository: dockerregistry.site.example.com/nvidia
      version: 11.4.0-base-ubi8
  validator:
    image: gpu-operator-validator
    repository: dockerregistry.site.example.com/nvidia
    version: v1.7.0
    nodeSelector:
      nvidia.com/gpu.deploy.operator-validator: 'true'
    env:
      - name: WITH_WORKLOAD
        value: "true"
  mig:
    strategy: mixed
  migManager:
    repository: dockerregistry.site.example.com/nvidia
    version: v0.1.0-ubi8
    image: k8s-mig-manager
    nodeSelector:
      nvidia.com/gpu.deploy.mig-manager: 'true'
    env:
      - name: WITH_REBOOT
        value: "false"
  toolkit:
    repository: dockerregistry.site.example.com/nvidia
    version: 1.5.0-ubi8
    image: container-toolkit
    nodeSelector:
      nvidia.com/gpu.deploy.container-toolkit: 'true'
@koflerm Unloading an existing driver is a little involved when the driver container restarts. We are automating this in the next release. Meanwhile, you will need to evict all other GPU operator pods, one node at a time, with the command below.
oc label --overwrite \
node ${NODE_NAME} \
nvidia.com/gpu.deploy.operator-validator=false \
nvidia.com/gpu.deploy.container-toolkit=false \
nvidia.com/gpu.deploy.device-plugin=false \
nvidia.com/gpu.deploy.gpu-feature-discovery=false \
nvidia.com/gpu.deploy.dcgm-exporter=false
Once they are all evicted, the driver pod can be restarted to let it cleanly rmmod the existing driver and install it again. Once the driver is loaded, you can bring all these pods back by setting the labels to true. The same has to be repeated on each node.
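For example, once the driver pod is running again, the re-enable step mirrors the command above (same NODE_NAME placeholder):

oc label --overwrite \
node ${NODE_NAME} \
nvidia.com/gpu.deploy.operator-validator=true \
nvidia.com/gpu.deploy.container-toolkit=true \
nvidia.com/gpu.deploy.device-plugin=true \
nvidia.com/gpu.deploy.gpu-feature-discovery=true \
nvidia.com/gpu.deploy.dcgm-exporter=true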
A reboot fixed the problem. And again I come with another one: I changed the device list strategy of the device-plugin component to "volume-mounts" instead of "envvar" by setting the corresponding value in the DEVICE_LIST_STRATEGY env variable. Now the nvidia-device-plugin-validator pod cannot be initialized (CreateContainerError) with the following error: nvidia-container-cli.real: device error: /var/run/nvidia-container-devices: unknown device. According to some research I need to set the property "accept-nvidia-visible-devices-as-volume-mounts" to true (https://docs.google.com/document/d/1uXVF-NWZQXgP1MLb87_kMkQvidpnkNWicdpO2l9g-fw/edit#), but I found no easy way to set this via the operator. Do you know how to set this, and whether it will fix the problem?
@koflerm Unfortunately we don't have a way to set this through the GPU operator yet. You need to edit the file /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml manually on each GPU node.
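For reference, a minimal sketch of the change in that file, assuming the flag sits at the top level of the TOML as in the upstream container-toolkit default config:

# /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml (excerpt)
# only this flag changes; leave the rest of the file as generated by the toolkit
accept-nvidia-visible-devices-as-volume-mounts = true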
@shivamerla I have fixed this now by creating a modified version of this file as a ConfigMap and mounting it into the container-toolkit daemonset at /etc/nvidia-container-runtime/config.toml. This works, and for now it is fine as the operator does not seem to reconcile the created resources. Is there any option planned in the future to configure this via the operator?
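Roughly, the workaround looks like this; the ConfigMap name, namespace, and container name are placeholders, and the config.toml content is abbreviated to the one changed flag:

apiVersion: v1
kind: ConfigMap
metadata:
  name: toolkit-config                     # placeholder name
  namespace: gpu-operator-resources        # adjust to the operator's namespace
data:
  config.toml: |
    # full nvidia-container-runtime config with the flag flipped
    accept-nvidia-visible-devices-as-volume-mounts = true
---
# added to the container-toolkit daemonset pod spec (sketch)
volumes:
  - name: toolkit-config
    configMap:
      name: toolkit-config
containers:
  - name: nvidia-container-toolkit-ctr     # placeholder, match the existing container name
    volumeMounts:
      - name: toolkit-config
        mountPath: /etc/nvidia-container-runtime/config.toml
        subPath: config.toml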
@koflerm I have created a ticket to track this request. We will post an update here as soon as we have a decision / timeline.
Yes, we will have reconciliation with the upcoming release (the code is already merged). So these settings have to be plumbed through GPU operator variables/env. I think Evan already created a tracking ticket for this.