
[v0.16.0] All node labels removed while informer cache failed to sync

Open ahmetb opened this issue 1 year ago • 10 comments

What happened:

While migrating from 0.10 to 0.16.0:

  • all node feature labels got removed
  • kubectl get nodefeatures -n node-feature-discovery was unresponsive at the time (likely because the cluster has 4000 nodes and each NodeFeature CR object is about 130 KB by default)

What you expected to happen:

How to reproduce it (as minimally and precisely as possible): Install the nfd Helm chart with default values on a 4000-node cluster.

Anything else we need to know?:

The nfd-master logs showed extensive informer sync errors (apparently timing out after 60s). This is likely because LIST on NodeFeatures is a very expensive call (each object is very large, and the cluster has a lot of Nodes).
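For a rough sense of scale, here is a back-of-the-envelope estimate of what a single unpaginated LIST would have to return, using the figures from this report (4000 nodes, roughly 130 KB per NodeFeature object). The helper name `listPayloadBytes` is made up for illustration, and the estimate ignores compression, pagination and encoding overhead, so treat it as an order-of-magnitude number only.

```go
package main

import "fmt"

// listPayloadBytes is a hypothetical helper: it estimates the size of a full
// LIST response as object count times average object size. It ignores
// compression, pagination and encoding overhead.
func listPayloadBytes(objects, avgObjectBytes int64) int64 {
	return objects * avgObjectBytes
}

func main() {
	const nodes = 4000          // cluster size from this report
	const objBytes = 130 * 1000 // ~130 KB per NodeFeature object (decimal KB assumed)

	total := listPayloadBytes(nodes, objBytes)
	fmt.Printf("approx. LIST payload: %d MB\n", total/1_000_000) // prints "approx. LIST payload: 520 MB"
}
```

Roughly half a gigabyte for one LIST would be consistent with the request not completing within a 60s timeout.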

Attaching logs: nfd-master.log

My suspicion is that nfd-master somehow does not wait for the informer cache to sync (the first informer sync error occurs exactly 60s after the process starts), treats the lack of a response as an "empty set of labels", and clears the labels. (But I'm not familiar with the inner workings of the codebase; it's just a theory.)
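To illustrate the behavior I would have expected, here is a minimal, stdlib-only sketch. It is not taken from the nfd codebase: `featureCache`, `desiredLabels` and `reconcile` are hypothetical names, and the `synced` flag merely stands in for whatever an informer-based implementation would check (for example the result of client-go's `cache.WaitForCacheSync`). The point is only that "cache not synced" is reported as an error and is never interpreted as "this node has no labels".

```go
package main

import (
	"errors"
	"fmt"
)

// featureCache is a hypothetical stand-in for an informer-backed cache of
// per-node labels. It is NOT the real nfd-master data structure.
type featureCache struct {
	synced bool
	labels map[string]map[string]string // node name -> desired labels
}

var errCacheNotSynced = errors.New("feature cache not synced")

// desiredLabels returns the labels for a node, or an error when the cache
// has not finished its initial sync and therefore cannot be trusted.
func (c *featureCache) desiredLabels(node string) (map[string]string, error) {
	if !c.synced {
		return nil, errCacheNotSynced
	}
	return c.labels[node], nil
}

// reconcile returns the labels the node should carry. If the cache is not
// synced it keeps the current labels untouched and surfaces the error,
// instead of treating missing data as an empty label set.
func reconcile(c *featureCache, node string, current map[string]string) (map[string]string, error) {
	desired, err := c.desiredLabels(node)
	if err != nil {
		return current, err
	}
	return desired, nil
}

func main() {
	current := map[string]string{"feature.node.kubernetes.io/cpu-cpuid.AVX2": "true"}
	cache := &featureCache{synced: false} // simulates the failed informer sync

	labels, err := reconcile(cache, "node-1", current)
	fmt.Printf("labels kept: %d, err: %v\n", len(labels), err)
}
```

With a guard like this, a LIST that times out would delay label updates rather than wipe the labels from every node.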

💡 We don't see the issue on much smaller clusters.

💡 We have not yet tried v0.16.2 (release notes mention it fixes a node removal issue, but it's not clear what the root cause was there).

Environment:

  • Kubernetes version (use kubectl version): v1.23.17
  • Cloud provider or hardware configuration: Bare-metal
  • OS (e.g: cat /etc/os-release): Not applicable
  • Kernel (e.g. uname -a): Not applicable
  • Install tools: installed via Helm
  • Network plugin and version (if this is a network-related bug): Not applicable
  • Others: Not applicable

ahmetb • Jul 19 '24 18:07