NVIDIA driver DaemonSet pod is recreated whenever there is an NFD restart
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): RHEL 8.10
- Kernel Version: 4.18.0-553.el8_10.x86_64
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): CRI-O
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8s
- GPU Operator Version: gpu-operator-v24-3-0, driver version: 535.183.01
2. Issue or feature description
We are trying to separate NFD out of the GPU Operator namespace and deploy it separately. We installed the GPU Operator with precompiled set to false, and when the NFD pod is restarted, the driver DaemonSet is terminated and recreated. When this happens, the node label nvidia.com/gpu-driver-upgrade-state still says upgrade-done, so the pods are not evicted from the node on which the driver must be installed, and the driver pod stays in Init:CrashLoopBackOff waiting for the pods to be evicted.
I tried setting various env parameters such as ENABLE_AUTO_DRAIN and DRAIN_USE_FORCE on k8s-driver-manager, but had no luck; a sketch of what I tried is below.
NFD version: 0.15.4, driver version: 535.183.01
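In case it helps, this is roughly how I was setting those variables. It is only a sketch: it assumes the default ClusterPolicy name cluster-policy and the driver.manager.env field of the GPU Operator CRD, so treat it as an illustration rather than a verified fix.
$ kubectl patch clusterpolicy/cluster-policy --type merge -p '
  {"spec": {"driver": {"manager": {"env": [
    {"name": "ENABLE_AUTO_DRAIN", "value": "true"},
    {"name": "DRAIN_USE_FORCE", "value": "true"}
  ]}}}}'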
3. Steps to reproduce the issue
- Install the GPU Operator with usePrecompiled set to false
- Restart NFD on a node
- The driver DaemonSet pod gets stuck in Init:CrashLoopBackOff (reproduction sketch below)
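A rough sketch of the reproduction commands; the namespaces, the nfd-worker DaemonSet name, and the driver pod label are assumptions based on a default-style install, so adjust them to your deployment.
$ kubectl -n node-feature-discovery rollout restart ds/nfd-worker
# watch the driver pod get recreated and stall in its init containers
$ kubectl -n gpu-operator get pods -l app=nvidia-driver-daemonset -w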
@charanteja333 can you clarify what you mean by "restart NFD on a node"? Do you mean that the nfd-worker pod is restarted on a GPU node?
> When this happens, the node label nvidia.com/gpu-driver-upgrade-state still says upgrade-done, so the pods are not evicted from the node on which the driver must be installed, and the driver pod stays in Init:CrashLoopBackOff waiting for the pods to be evicted.
As a manual workaround, can you try labeling the node with nvidia.com/gpu-driver-upgrade-state=upgrade-required? This should trigger our upgrade controller to evict all the GPU pods on the node and allow the driver to come back to a running state.
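Something along these lines should do it (replace <node-name> with the affected GPU node):
$ kubectl label node <node-name> \
    nvidia.com/gpu-driver-upgrade-state=upgrade-required --overwrite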
@cdesiniotis Yes, the nfd-worker pod is restarted on the node. The problem is that this can happen frequently because the lifecycles of NFD and the driver are now separate, and we have to work with the sysadmin team to edit the labels, which might cause downtime.
Maybe a related issue: NFD will remove and re-add node labels if the nfd-worker pod is deleted (and re-created by the nfd-worker DaemonSet)
@cdesiniotis When NFD removes the labels, will the GPU Operator recreate the driver DaemonSet? Because NFD still marks the node as upgrade-done (it re-applies the same values as before), the driver is unable to install since GPU pods are still present and running on the node.
Thanks @age9990 for providing the relevant issue.
@charanteja333 Until a fix is available in NFD, I would advise downgrading NFD to a version not affected by this bug, e.g. <= v0.14.6, or disabling the garbage collector in the NFD helm values: https://github.com/kubernetes-sigs/node-feature-discovery/blob/master/deployment/helm/node-feature-discovery/values.yaml#L514
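For example, a sketch of disabling the garbage collector with the chart value referenced above; the release name, namespace, and repo alias are assumptions from a typical standalone NFD install.
$ helm upgrade node-feature-discovery nfd/node-feature-discovery \
    -n node-feature-discovery --reuse-values --set gc.enable=false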
NFD v0.16.2 has been released which addresses this issue: https://github.com/kubernetes-sigs/node-feature-discovery/releases/tag/v0.16.2
We have upgraded to NFD v0.16.2 in main: https://github.com/NVIDIA/gpu-operator/pull/735
Closing this issue as GPU Operator 24.6.0 has been released and is using NFD v0.16.3 which does not contain this bug. Please re-open if you are still encountering issues.
Is there any way to find out what version of NFD I have? I think I'm seeing this on an up-to-date OpenShift system which has the nfd.4.18.0-202506092139 operator installed. But I don't know what version of NFD is included in this operator:
$ oc -n openshift-nfd exec ds/nfd-worker -- nfd-worker --version
nfd-worker undefined
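One rough way to infer it (an assumption on my side, not an official method) is to look at the image tag the worker DaemonSet is running:
$ oc -n openshift-nfd get ds/nfd-worker \
    -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'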
I did a manual restart of nfd-worker while running oc get node -w --show-labels, and I can see that the operator removed and re-applied all the feature.node.kubernetes.io labels, so this isn't fixed in that version yet.
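Roughly what I ran, in case someone wants to reproduce it; the app=nfd-worker pod label is an assumption about how the OpenShift NFD operator labels its worker pods.
$ oc get node -w --show-labels &
$ oc -n openshift-nfd delete pod -l app=nfd-worker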
I also upgraded the operator to nfd.4.18.0-202506230505 (the latest available in the stable channel) and observed the same: all NFD labels removed and re-applied when nfd-worker is restarted.