
NFD will remove and re-add node labels if nfd-worker pod is deleted (and re-created by the nfd-worker DS)

Open adrianchiris opened this issue 1 year ago • 13 comments

What happened:

NFD removes all node labels associated with a node's NodeFeature when the nfd-worker pod on that node is deleted. After the pod is deleted, the DaemonSet re-creates it; the new pod then re-creates the NodeFeature CR for the node and the labels come back (the same goes for annotations and extended resources).

Workloads that rely on such labels in their nodeSelector/affinity get disrupted, as they are removed and rescheduled.

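This happens because nfd-worker creates the NodeFeature CR with an OwnerReference pointing to its own pod [1]. When that pod is deleted, the Kubernetes garbage collector deletes the NodeFeature CR that it owns.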

[1] https://github.com/kubernetes-sigs/node-feature-discovery/blob/0418e7ddf33424b150c68ca8fe71fcfc98440039/pkg/nfd-worker/nfd-worker.go#L716
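The cascade can be illustrated with a small, self-contained Go sketch. It does not use the real Kubernetes API types: `object`, `collectGarbage`, `deleteByUID` and `hasKind` are hypothetical helpers, and each object is assumed to have at most one owner (real objects may list several). It only models the rule that a dependent is removed once its owner no longer exists.

```go
package main

import "fmt"

// object is a simplified, hypothetical stand-in for a Kubernetes object.
// It is NOT the real metav1.ObjectMeta / OwnerReference type.
type object struct {
	Kind     string
	Name     string
	UID      string
	OwnerUID string // empty when the object has no owner
}

// collectGarbage drops every object whose owner no longer exists and
// repeats until nothing changes, so chained ownership is followed too.
func collectGarbage(objs []object) []object {
	for {
		live := make(map[string]bool, len(objs))
		for _, o := range objs {
			live[o.UID] = true
		}
		kept := make([]object, 0, len(objs))
		for _, o := range objs {
			if o.OwnerUID == "" || live[o.OwnerUID] {
				kept = append(kept, o)
			}
		}
		if len(kept) == len(objs) {
			return kept
		}
		objs = kept
	}
}

// deleteByUID returns a copy of objs without the object that has the given UID.
func deleteByUID(objs []object, uid string) []object {
	kept := make([]object, 0, len(objs))
	for _, o := range objs {
		if o.UID != uid {
			kept = append(kept, o)
		}
	}
	return kept
}

// hasKind reports whether any object of the given kind is present.
func hasKind(objs []object, kind string) bool {
	for _, o := range objs {
		if o.Kind == kind {
			return true
		}
	}
	return false
}

func main() {
	// Object names and UIDs below are made up for illustration.
	cluster := []object{
		{Kind: "DaemonSet", Name: "nfd-worker", UID: "ds-1"},
		{Kind: "Pod", Name: "nfd-worker-abc", UID: "pod-1", OwnerUID: "ds-1"},
		{Kind: "NodeFeature", Name: "node-a", UID: "nf-1", OwnerUID: "pod-1"},
	}

	// NodeFeature owned by the pod: deleting the pod takes the CR with it.
	after := collectGarbage(deleteByUID(cluster, "pod-1"))
	fmt.Println("pod-owned NodeFeature survives:", hasKind(after, "NodeFeature"))

	// NodeFeature owned by the DaemonSet: deleting the pod leaves the CR alone.
	cluster[2].OwnerUID = "ds-1"
	after = collectGarbage(deleteByUID(cluster, "pod-1"))
	fmt.Println("DaemonSet-owned NodeFeature survives:", hasKind(after, "NodeFeature"))
}
```

In this model the window between the pod being deleted and its replacement re-creating the CR is where the labels disappear from the node.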

What you expected to happen:

Ultimately, I'd expect labels not to be removed when the nfd-worker pod is restarted. More specifically, I'd expect the NodeFeature CR not to be deleted when the pod is deleted.

This can be achieved by setting the owner reference to the nfd-worker DaemonSet, which is not as ephemeral as the pods it creates. In addition, to handle the DaemonSet being redeployed with different selectors/affinity/tolerations, the gc component can be extended to clean up NodeFeature objects for nodes that are not intended to run nfd-worker pods.
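A rough sketch of the proposed gc extension, in plain Go without the Kubernetes client libraries. Everything here is an assumption made for illustration: the `node` type, `matchesSelector` and `staleNodeFeatures` are hypothetical, each NodeFeature is assumed to be identified by the name of its node, and eligibility is reduced to an exact-match node selector (affinity and tolerations are ignored).

```go
package main

import (
	"fmt"
	"sort"
)

// node is a simplified, hypothetical view of a cluster node.
type node struct {
	Name   string
	Labels map[string]string
}

// matchesSelector reports whether labels contain every key/value pair of
// selector. An empty selector matches every node.
func matchesSelector(labels, selector map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// staleNodeFeatures returns, sorted, the node names from featureNodes whose
// node either no longer exists or is not selected to run nfd-worker.
func staleNodeFeatures(featureNodes []string, nodes []node, selector map[string]string) []string {
	eligible := make(map[string]bool, len(nodes))
	for _, n := range nodes {
		if matchesSelector(n.Labels, selector) {
			eligible[n.Name] = true
		}
	}
	var stale []string
	for _, name := range featureNodes {
		if !eligible[name] {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	// Node names, labels and the selector are made up for illustration.
	nodes := []node{
		{Name: "node-a", Labels: map[string]string{"pool": "gpu"}},
		{Name: "node-b", Labels: map[string]string{"pool": "cpu"}},
	}
	selector := map[string]string{"pool": "gpu"} // hypothetical DaemonSet nodeSelector
	features := []string{"node-a", "node-b", "node-gone"}

	fmt.Println("NodeFeatures to clean up:", staleNodeFeatures(features, nodes, selector))
}
```

With the DaemonSet as owner, a cleanup pass of this kind would be what removes NodeFeature objects left behind after the DaemonSet's scheduling constraints change, since deleting individual pods would no longer do so.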

How to reproduce it (as minimally and precisely as possible):

  • Deploy NFD v0.15.0 or newer (I used master) with NodeFeatureAPI enabled.
  • Delete one of the nfd-worker pods.
  • Watch the NodeFeature get deleted and re-created (kubectl get nodefeatures -A -w).
  • Get the node labels in a loop and watch the labels get deleted and re-created.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.30 (but should reproduce on any version)
  • Cloud provider or hardware configuration: local setup
  • OS (e.g: cat /etc/os-release): N/A
  • Kernel (e.g. uname -a): N/A
  • Install tools: N/A
  • Network plugin and version (if this is a network-related bug): N/A
  • Others: N/A

adrianchiris, Jul 01 '24 15:07