
Node Selector Missing + Image reference invalid

Open koflerm opened this issue 4 years ago • 11 comments

We installed the GPU operator version 1.7.1 in one of our OCP 4.7.16 clusters. We are facing 2 issues with our current deployment:

  • the driver daemonset does not include a node selector, which means its pods also run on nodes with no GPUs installed. This leads to crash-looping driver pods.
  • the GPU operator logs include an error message regarding the DCGM exporter: Failed to apply transformation 'nvidia-dcgm-exporter' with error: 'Invalid values for building container image path provided, please update the ClusterPolicy instance' {"Daemonset": "nvidia-dcgm-exporter"}

Can you tell us what is wrong with our current ClusterPolicy configuration?

Also, for reference, we are using the latest 4.7 version of the Node Feature Discovery operator with the image registry.redhat.io/openshift4/ose-node-feature-discovery:v4.7.0-202106090743.p0.git.5b1bc4f.

Our ClusterPolicy configuration:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  migManager:
    image: k8s-mig-manager
    repository: dockerregistry.site.example.com/nvidia
    version: v0.1.0-ubi8
  operator:
    defaultRuntime: crio
  gfd:
    image: gpu-feature-discovery
    repository: dockerregistry.site.example.com/nvidia
    version: v0.4.1
  dcgmExporter:
    image: dcgm-exporter
    repository: dockerregistry.site.example.com/nvidia
    version: 2.1.8-2.4.0-rc.2-ubi8
  driver:
    image: driver
    repository: dockerregistry.site.example.com/nvidia
    version: 460.73.01
  devicePlugin:
    args:
      - '--mig-strategy=mixed'
      - '--pass-device-specs=true'
      - '--fail-on-init-error=true'
      - '--device-list-strategy=volume-mounts'
      - '--nvidia-driver-root=/run/nvidia/driver'
    image: k8s-device-plugin
    repository: dockerregistry.site.example.com/nvidia
    version: v0.9.0-ubi8
  mig:
    strategy: mixed
  validator:
    image: gpu-operator-validator
    repository: dockerregistry.site.example.com/nvidia
    version: v1.7.0
  toolkit:
    image: container-toolkit
    repository: dockerregistry.site.example.com/nvidia
    version: 1.5.0-ubi8

koflerm avatar Jul 05 '21 14:07 koflerm

@koflerm The initContainer spec is missing here. It is required for some of the init containers we use (dcgm, etc.). Please refer here and add it, along with nodeSelector fields for all components. We are removing nodeSelector as a configurable field in the next version, but with 1.7.1 it has to be specified explicitly in the ClusterPolicy instance.
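
For reference, a minimal sketch of the two missing pieces, reusing the repository from the ClusterPolicy above (the cuda tag is only an example; adjust it to what is mirrored in your registry):

    # fragment to merge under spec:
    operator:
      defaultRuntime: crio
      initContainer:
        image: cuda
        repository: dockerregistry.site.example.com/nvidia
        version: 11.4.0-base-ubi8   # example tag
    dcgmExporter:
      image: dcgm-exporter
      repository: dockerregistry.site.example.com/nvidia
      version: 2.1.8-2.4.0-rc.2-ubi8
      nodeSelector:
        nvidia.com/gpu.deploy.dcgm-exporter: 'true'
    # ...add the matching nvidia.com/gpu.deploy.<component> nodeSelector to every other component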

shivamerla avatar Jul 05 '21 17:07 shivamerla

Hi @shivamerla. 1.6.x applied nodeSelectors (nvidia.com/gpu.present: 'true') automatically to all DaemonSets. Did this change in 1.7.x?

geoberle avatar Jul 05 '21 17:07 geoberle

@geoberle yes, each component now has a nodeSelector label as below. The GPU operator will automatically add these labels to nodes with NVIDIA GPUs. The reason for this granularity is to be able to evict individual component pods when required (e.g. a MIG config change requires the device-plugin and GFD pods to be evicted on the node and started again).

    nvidia.com/gpu.deploy.mig-manager=true
    nvidia.com/gpu.deploy.container-toolkit=true
    nvidia.com/gpu.deploy.dcgm-exporter=true
    nvidia.com/gpu.deploy.device-plugin=true
    nvidia.com/gpu.deploy.driver=true
    nvidia.com/gpu.deploy.gpu-feature-discovery=true
    nvidia.com/gpu.deploy.operator-validator=true

As we don't expect users to change these values, they should have been added by default, but that was missed in 1.7.x. So they have to be specified in the ClusterPolicy instance as defined here.
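
For reference, a quick way to check that these labels landed on a GPU node (the node name here is a placeholder):

    # list GPU nodes detected by NFD/GFD, then inspect the per-component deploy labels on one of them
    oc get nodes -l nvidia.com/gpu.present=true -o name
    oc describe node <gpu-node-name> | grep 'nvidia.com/gpu.deploy'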

shivamerla avatar Jul 05 '21 18:07 shivamerla

Oh understood. Makes absolute sense. Nice design! Thank you for clarifying.

geoberle avatar Jul 05 '21 18:07 geoberle

@shivamerla this was already very useful; I was now able to deploy all components. The only problem I see now is that after the initial installation, if I re-deploy the whole ClusterPolicy / NVIDIA setup, the driver pods keep crashing with the error "driver in use", I guess because another NVIDIA component is quicker to claim the GPUs.

Do you also have a fix for that? Here is my updated version of the ClusterPolicy:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  dcgmExporter:
    repository: dockerregistry.site.example.com/nvidia
    version: "2.1.8-2.4.0-rc.2-ubi8"
    image: dcgm-exporter
    nodeSelector:
      nvidia.com/gpu.deploy.dcgm-exporter: 'true'
  devicePlugin:
    repository: dockerregistry.site.example.com/nvidia
    version: v0.9.0-ubi8
    image: k8s-device-plugin
    nodeSelector:
      nvidia.com/gpu.deploy.device-plugin: 'true'
    env:
      - name: PASS_DEVICE_SPECS
        value: "true"
      - name: FAIL_ON_INIT_ERROR
        value: "true"
      - name: DEVICE_LIST_STRATEGY
        value: "envvar"
      - name: DEVICE_ID_STRATEGY
        value: "uuid"
      - name: NVIDIA_VISIBLE_DEVICES
        value: "all"
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: "all"
  driver:
    repository: dockerregistry.site.example.com/nvidia
    version: 460.73.01
    image: driver
    nodeSelector:
      nvidia.com/gpu.deploy.driver: 'true'
    # attention - the version field for the driver must be defined without the -rhcos4.6 suffix
  gfd:
    repository: dockerregistry.site.example.com/nvidia
    version: v0.4.1
    image: gpu-feature-discovery
    nodeSelector:
      nvidia.com/gpu.deploy.gpu-feature-discovery: 'true'
    env:
      - name: GFD_SLEEP_INTERVAL
        value: "60s" 
      - name: FAIL_ON_INIT_ERROR
        value: "true"       
  operator:
    defaultRuntime: crio
    deployGFD: true
    initContainer:
      image: cuda
      repository: dockerregistry.site.example.com/nvidia
      version: 11.4.0-base-ubi8
  validator:
    image: gpu-operator-validator
    repository: dockerregistry.site.example.com/nvidia
    version: v1.7.0
    nodeSelector:
      nvidia.com/gpu.deploy.operator-validator: 'true'
    env:
      - name: WITH_WORKLOAD
        value: "true"
  mig:
    strategy: mixed
  migManager:
    repository: dockerregistry.site.example.com/nvidia
    version: v0.1.0-ubi8
    image: k8s-mig-manager
    nodeSelector:
      nvidia.com/gpu.deploy.mig-manager: 'true'
    env:
      - name: WITH_REBOOT
        value: "false"
  toolkit:
    repository: dockerregistry.site.example.com/nvidia
    version: 1.5.0-ubi8
    image: container-toolkit
    nodeSelector:
      nvidia.com/gpu.deploy.container-toolkit: 'true'
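
As a quick sanity check after applying this, all component daemonsets should be scheduled and running on the GPU nodes only; something like the following shows that (assuming the gpu-operator-resources operand namespace used on OpenShift):

    # verify all operator-managed daemonsets/pods are up and land only on GPU nodes
    oc get ds,pods -n gpu-operator-resources -o wide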

koflerm avatar Jul 05 '21 19:07 koflerm

@koflerm Unloading an existing driver is a little involved when the driver container restarts. We are automating this in the next release. Meanwhile, you will need to evict all other GPU operator pods, one node at a time, with the command below.

    oc label --overwrite \
        node ${NODE_NAME} \
        nvidia.com/gpu.deploy.operator-validator=false \
        nvidia.com/gpu.deploy.container-toolkit=false \
        nvidia.com/gpu.deploy.device-plugin=false \
        nvidia.com/gpu.deploy.gpu-feature-discovery=false \
        nvidia.com/gpu.deploy.dcgm-exporter=false

Once all of them are evicted, the driver pod can be restarted so that it can cleanly rmmod the existing driver and install it again. Once the driver is loaded, you can bring all of these pods back by setting the labels to true. The same has to be repeated on each node.
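
For completeness, the re-enable step is the same command with the labels flipped back:

    # once the driver pod is Running again on that node, restore the component labels
    oc label --overwrite \
        node ${NODE_NAME} \
        nvidia.com/gpu.deploy.operator-validator=true \
        nvidia.com/gpu.deploy.container-toolkit=true \
        nvidia.com/gpu.deploy.device-plugin=true \
        nvidia.com/gpu.deploy.gpu-feature-discovery=true \
        nvidia.com/gpu.deploy.dcgm-exporter=true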

shivamerla avatar Jul 05 '21 23:07 shivamerla

A reboot fixed the problem. And now I have another one: I changed the device list strategy of the device plugin component to "volume-mounts" instead of "envvar" by setting the corresponding value in the DEVICE_LIST_STRATEGY env variable. Now the nvidia-device-plugin-validator pod cannot be initialized (CreateContainerError) with the following error: nvidia-container-cli.real: device error: /var/run/nvidia-container-devices: unknown device. According to some research I need to set the property "accept-nvidia-visible-devices-as-volume-mounts" to true (https://docs.google.com/document/d/1uXVF-NWZQXgP1MLb87_kMkQvidpnkNWicdpO2l9g-fw/edit#), but I found no easy way to set this via the operator. Do you know how to set this, and whether it will fix the problem?

koflerm avatar Jul 06 '21 12:07 koflerm

@koflerm unfortunately we don't have a way to set this through the GPU operator yet. You will need to manually edit the file /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml on each GPU node.
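
A rough per-node sketch of that manual edit, assuming oc debug access to the node and that the key is already present in the toolkit-generated file (the node name is a placeholder):

    NODE_NAME=<gpu-node-name>
    oc debug node/${NODE_NAME} -- chroot /host \
        sed -i 's/^accept-nvidia-visible-devices-as-volume-mounts.*/accept-nvidia-visible-devices-as-volume-mounts = true/' \
        /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml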

shivamerla avatar Jul 09 '21 03:07 shivamerla

@shivamerla I have fixed this now by creating a modified version of that file as a ConfigMap and mounting it into the container-toolkit daemonset at /etc/nvidia-container-runtime/config.toml. This works, and I guess it is fine for now, since the operator does not seem to reconcile the created resources. Is there any option planned to configure this via the operator in the future?
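
In case it helps others, a rough sketch of that workaround; the ConfigMap name, namespace, and container name below are assumptions, and the config.toml content is the toolkit-generated file copied from a GPU node with the one flag flipped:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nvidia-container-runtime-config        # hypothetical name
      namespace: gpu-operator-resources            # operand namespace (assumption)
    data:
      config.toml: |
        accept-nvidia-visible-devices-as-volume-mounts = true
        # ...remaining content copied from the config.toml on a GPU node...

    # volume + mount added to the container-toolkit daemonset pod spec
    volumes:
      - name: nvidia-container-runtime-config
        configMap:
          name: nvidia-container-runtime-config
    containers:
      - name: nvidia-container-toolkit-ctr         # container name may differ in your daemonset
        volumeMounts:
          - name: nvidia-container-runtime-config
            mountPath: /etc/nvidia-container-runtime/config.toml
            subPath: config.toml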

koflerm avatar Jul 09 '21 06:07 koflerm

@koflerm I have created a ticket to track this request. We will post an update here as soon as we have a decision / timeline.

elezar avatar Jul 09 '21 09:07 elezar

@shivamerla I have fixed this now by creating a modified version of that file as a ConfigMap and mounting it into the container-toolkit daemonset at /etc/nvidia-container-runtime/config.toml. This works, and I guess it is fine for now, since the operator does not seem to reconcile the created resources. Is there any option planned to configure this via the operator in the future?

Yes, we will have reconciliation in the upcoming release (the code is already merged), so these settings will have to be plumbed through GPU operator variables/env. I think Evan has already created a tracking ticket for this.

shivamerla avatar Jul 09 '21 14:07 shivamerla