node-feature-discovery
                        NFD will remove and re-add node labels if nfd-worker pod is deleted (and re-created by the nfd-worker DS)
What happened:
NFD removes any node labels associated with the NodeFeature CR of a node when that node's nfd-worker pod is deleted. After deletion, the pod is re-created by the nfd-worker DaemonSet, which re-creates the NodeFeature CR for the node, and the labels come back (the same applies to annotations and extendedResources).
Workloads that rely on such labels in their nodeSelector/affinity get disrupted: they are removed and re-scheduled while the labels are missing.
This happens because nfd-worker creates the NodeFeature CR with an OwnerReference pointing to its own pod [1], so the Kubernetes garbage collector deletes the CR together with the pod.
[1] https://github.com/kubernetes-sigs/node-feature-discovery/blob/0418e7ddf33424b150c68ca8fe71fcfc98440039/pkg/nfd-worker/nfd-worker.go#L716
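The garbage-collection effect can be illustrated with a minimal sketch. The types below are local stand-ins mirroring only the fields of metav1.OwnerReference and the NodeFeature CR that matter here; they are not the real client-go or NFD APIs:

```go
package main

import "fmt"

// OwnerReference is a local stand-in mirroring the metav1.OwnerReference
// fields relevant to garbage collection.
type OwnerReference struct {
	Kind string
	Name string
}

// NodeFeature is a local stand-in for the NFD NodeFeature CR.
type NodeFeature struct {
	Name   string
	Owners []OwnerReference
}

// isGarbageCollected reports whether the Kubernetes garbage collector
// would delete the object: it is removed once all of its owners are gone.
func isGarbageCollected(nf NodeFeature, liveOwners map[string]bool) bool {
	for _, o := range nf.Owners {
		if liveOwners[o.Kind+"/"+o.Name] {
			return false // at least one owner still exists
		}
	}
	return len(nf.Owners) > 0
}

func main() {
	// Today nfd-worker sets the owner to its own pod (names hypothetical).
	nf := NodeFeature{
		Name:   "worker-node-1",
		Owners: []OwnerReference{{Kind: "Pod", Name: "nfd-worker-abc12"}},
	}
	live := map[string]bool{"Pod/nfd-worker-abc12": true}
	fmt.Println(isGarbageCollected(nf, live)) // false: pod alive, CR kept

	// Deleting the pod (the DS re-creates it under a new name) orphans the
	// CR, so the GC deletes it and the node labels disappear with it.
	delete(live, "Pod/nfd-worker-abc12")
	fmt.Println(isGarbageCollected(nf, live)) // true: CR gets deleted
}
```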
What you expected to happen:
I'd expect labels not to be removed when the nfd-worker pod is restarted. More specifically, I'd expect the NodeFeature CR not to be deleted when the pod is deleted.
This can be achieved by setting the owner reference to the nfd-worker DaemonSet, which is not as ephemeral as the pods it creates. In addition, to handle redeploying the DaemonSet with different selectors/affinity/tolerations, the gc component could be extended to clean up NodeFeature objects for nodes that are no longer intended to run nfd-worker pods.
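Since pods created by a DaemonSet already carry a controller ownerReference to that DaemonSet, the worker could simply copy its own pod's controller reference onto the NodeFeature CR. A hedged sketch of that selection logic (daemonSetOwner and the stand-in OwnerReference type are illustrative, not the actual NFD code):

```go
package main

import "fmt"

// OwnerReference mirrors the metav1.OwnerReference fields used here;
// a local stand-in, not the real client-go type.
type OwnerReference struct {
	Kind       string
	Name       string
	Controller bool
}

// daemonSetOwner picks the controlling DaemonSet reference from a pod's
// ownerReferences, falling back to the current behavior (owning the pod
// itself) if no DaemonSet controller is found.
func daemonSetOwner(podName string, podOwners []OwnerReference) OwnerReference {
	for _, o := range podOwners {
		if o.Controller && o.Kind == "DaemonSet" {
			return o
		}
	}
	return OwnerReference{Kind: "Pod", Name: podName, Controller: true}
}

func main() {
	// A DaemonSet-managed pod carries this reference (names hypothetical).
	podOwners := []OwnerReference{
		{Kind: "DaemonSet", Name: "nfd-worker", Controller: true},
	}
	owner := daemonSetOwner("nfd-worker-abc12", podOwners)
	fmt.Printf("%s/%s\n", owner.Kind, owner.Name) // DaemonSet/nfd-worker
}
```

With the DaemonSet as owner, deleting an individual pod no longer orphans the NodeFeature CR, so labels survive pod restarts.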
How to reproduce it (as minimally and precisely as possible):
- Deploy NFD v0.15.0 or newer (I used master) with NodeFeatureAPI enabled.
- Delete one of the nfd-worker pods.
- Watch the NodeFeature get deleted and re-created (kubectl get nodefeatures -A -w).
- Get the node's labels in a loop and watch them get deleted and re-created.
 
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): 1.30 (but will reproduce on any)
- Cloud provider or hardware configuration: local setup
- OS (e.g. cat /etc/os-release): N/A
- Kernel (e.g. uname -a): N/A
- Install tools: N/A
- Network plugin and version (if this is a network-related bug): N/A
- Others: N/A