NVIDIA driver DaemonSet pod is recreated whenever there is an NFD restart
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): RHEL 8.10
- Kernel Version: 4.18.0-553.el8_10.x86_64
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): CRI-O
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8s
- GPU Operator Version: gpu-operator-v24-3-0, driver version: 535.183.01
2. Issue or feature description
We are trying to separate NFD out of the GPU Operator namespace and deploy it separately. We installed the GPU Operator with precompiled set to false, and when the NFD pod is restarted, the driver DaemonSet is terminated and recreated. When this happens, the node label nvidia.com/gpu-driver-upgrade-state still says upgrade-done, so the pods are not evicted from the node on which the driver must be installed, and the driver pod stays in Init:CrashLoopBackOff waiting for the pods to be evicted.
I tried setting various env parameters such as ENABLE_AUTO_DRAIN and DRAIN_USE_FORCE on k8s-driver-manager, but had no luck; a sketch of what I tried is below.
NFD version: 0.15.4, driver version: 535.183.01
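In case it helps, this is roughly how I was setting those variables. It is only a sketch: it assumes the default ClusterPolicy name cluster-policy and the driver.manager.env field of the GPU Operator CRD, so treat it as an illustration rather than a verified fix.
$ kubectl patch clusterpolicy/cluster-policy --type merge -p '
  {"spec": {"driver": {"manager": {"env": [
    {"name": "ENABLE_AUTO_DRAIN", "value": "true"},
    {"name": "DRAIN_USE_FORCE", "value": "true"}
  ]}}}}'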
3. Steps to reproduce the issue
- Install the GPU Operator with usePrecompiled set to false
- Restart NFD on a node
- The driver DaemonSet pod gets stuck in Init:CrashLoopBackOff (reproduction sketch below)
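A rough sketch of the reproduction commands; the namespaces, the nfd-worker DaemonSet name, and the driver pod label are assumptions based on a default-style install, so adjust them to your deployment.
$ kubectl -n node-feature-discovery rollout restart ds/nfd-worker
# watch the driver pod get recreated and stall in its init containers
$ kubectl -n gpu-operator get pods -l app=nvidia-driver-daemonset -w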
@charanteja333 can you clarify what you mean by "restart NFD on a node"? Do you mean that the nfd-worker pod is restarted on a GPU node?
> When this happens, the node label nvidia.com/gpu-driver-upgrade-state still says upgrade-done, so the pods are not evicted from the node on which the driver must be installed, and the driver pod stays in Init:CrashLoopBackOff waiting for the pods to be evicted.
As a manual workaround, can you try labeling the node with nvidia.com/gpu-driver-upgrade-state=upgrade-required? This should trigger our upgrade controller to evict all the GPU pods on the node and allow the driver to come back to a running state.
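Something along these lines should do it (replace <node-name> with the affected GPU node):
$ kubectl label node <node-name> \
    nvidia.com/gpu-driver-upgrade-state=upgrade-required --overwrite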
@cdesiniotis Yes, the nfd-worker pod is restarted on the node. The problem is that this can happen frequently because the lifecycles of NFD and the driver are now separate, and we have to work with the sysadmin team to edit the labels, which might cause downtime.
Maybe a related issue: NFD will remove and re-add node labels if the nfd-worker pod is deleted (and re-created by the nfd-worker DaemonSet)
@cdesiniotis When NFD removes the labels, will the GPU Operator recreate the driver DaemonSet? Because NFD still marks the node as upgrade-done (it re-applies the same values as before), the driver is unable to install since GPU pods are still present and running on the node.
Thanks @age9990 for providing the relevant issue.
@charanteja333 Until a fix is available in NFD, I would advise downgrading NFD to a version not affected by this bug, e.g. <= v0.14.6, or disabling the garbage collector in the NFD helm values: https://github.com/kubernetes-sigs/node-feature-discovery/blob/master/deployment/helm/node-feature-discovery/values.yaml#L514
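For example, a sketch of disabling the garbage collector with the chart value referenced above; the release name, namespace, and repo alias are assumptions from a typical standalone NFD install.
$ helm upgrade node-feature-discovery nfd/node-feature-discovery \
    -n node-feature-discovery --reuse-values --set gc.enable=false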
NFD v0.16.2 has been released which addresses this issue: https://github.com/kubernetes-sigs/node-feature-discovery/releases/tag/v0.16.2
We have upgraded to NFD v0.16.2 in main: https://github.com/NVIDIA/gpu-operator/pull/735
Closing this issue as GPU Operator 24.6.0 has been released and is using NFD v0.16.3 which does not contain this bug. Please re-open if you are still encountering issues.
Is there any way to find out what version of NFD I have? I think I'm seeing this on an up-to-date OpenShift system which has the nfd.4.18.0-202506092139 operator installed. But I don't know what version of NFD is included in this operator:
$ oc -n openshift-nfd exec ds/nfd-worker -- nfd-worker --version
nfd-worker undefined
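One rough way to infer it (an assumption on my side, not an official method) is to look at the image tag the worker DaemonSet is running:
$ oc -n openshift-nfd get ds/nfd-worker \
    -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'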
I did a manual restart of nfd-worker while running oc get node -w --show-labels, and I can see that the operator removed and re-applied all the feature.node.kubernetes.io labels, so this isn't fixed in that version yet.
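Roughly what I ran, in case someone wants to reproduce it; the app=nfd-worker pod label is an assumption about how the OpenShift NFD operator labels its worker pods.
$ oc get node -w --show-labels &
$ oc -n openshift-nfd delete pod -l app=nfd-worker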
I also upgraded the operator to nfd.4.18.0-202506230505 (the latest available in the stable channel) and observed the same: all NFD labels removed and re-applied when nfd-worker is restarted.