
gpu-operator daemonsets are terminated when the NFD device labels include PCI device IDs

Open LaVLaS opened this issue 6 months ago • 1 comment

Currently the gpu-operator relies on the Node Feature Discovery operator to apply the feature.node.kubernetes.io/pci-10de.present=true node label to any node that contains NVIDIA hardware.
If the gpu-operator daemonsets are already deployed and the cluster admin modifies the sources.pci.deviceLabelFields property in the NodeFeatureDiscovery CR to add the PCI device ID, the gpu-operator stack is terminated immediately. NFD composes the PCI label name from the deviceLabelFields entries, so adding "device" replaces the feature.node.kubernetes.io/pci-10de.present label with a per-device variant that the gpu-operator's node selector no longer matches.

apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
      sources:
        pci:
          deviceClassWhitelist:
            - "0200"
            - "03"
            - "12"
          deviceLabelFields:
            - "vendor"
            - "device"         # <----This blocks OR terminates the NVIDIA gpu-operators stack

Since the NodeFeatureDiscovery CR is a shared, cluster-wide resource, it would be a best practice to decouple the gpu-operator driver rollout from it by keying on a custom node label applied by a NodeFeatureRule.

You already apply the node label nvidia.com/gpu.present=true after feature.node.kubernetes.io/pci-10de.present=true is detected. Could nvidia.com/gpu.present=true be elevated to a NodeFeatureRule that the gpu-operator applies automatically after installation?
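
For reference, a minimal sketch of what such a rule could look like, assuming NFD's NodeFeatureRule API (apiVersion nfd.k8s.io/v1alpha1); the metadata name is hypothetical:

apiVersion: nfd.k8s.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: nvidia-gpu-present   # hypothetical name; the gpu-operator would own this object
spec:
  rules:
    - name: "nvidia.com/gpu.present"
      labels:
        nvidia.com/gpu.present: "true"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["10de"]}

Because the rule matches on the raw pci.device feature rather than on a generated pci-* label, it would keep working no matter how deviceLabelFields is configured.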

LaVLaS · Jun 04 '25 14:06

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] · Nov 04 '25 22:11