
core.sleepInterval ignored; kubelet sync interval causes relabeling every 60 seconds

Open mac-chaffee opened this issue 3 years ago • 9 comments

What happened:

Even if you raise core.sleepInterval above the default of 60 seconds, NFD will still relabel nodes every 60 seconds. This is because the sleepInterval is ignored if the configmap "changes". NFD believes the configmap is being "changed" every 60 seconds because that is kubelet's default --sync-frequency.

Kubelet doesn't actually modify the file; if you just run inotifyd inside an alpine pod with a mounted configmap, you see that kubelet opens, accesses, and closes the file at least every 60 seconds.
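For reference, the same directory can be observed with a standalone Go sketch (not NFD code) using fsnotify, the library NFD's own watcher is built on; it simply logs whichever events kubelet's periodic sync generates at that level. Note that fsnotify does not surface the raw open/access/close events that inotifyd shows, only ops such as Create, Write, Remove, Rename and Chmod:

```go
// Standalone sketch: log every fsnotify event on the mounted configmap
// directory to see what kubelet's sync loop actually triggers there.
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Watch the directory rather than the file: kubelet projects configmap
	// volumes through a ..data symlink, so updates show up as directory events.
	if err := watcher.Add("/etc/kubernetes/node-feature-discovery"); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case ev := <-watcher.Events:
			log.Printf("event: %s on %s", ev.Op, ev.Name)
		case err := <-watcher.Errors:
			log.Printf("watch error: %v", err)
		}
	}
}
```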

What you expected to happen:

NFD should only relabel nodes when the configmap content is actually modified, not merely because kubelet opened/accessed/closed the file.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy nfd (via helm) and set core.sleepInterval: 300s
  2. Observe relabeling still happens every 60 seconds or so, with these logs (notice the "reloading configuration" line)
I0413 19:20:22.577223       1 nfd-worker.go:212] reloading configuration
I0413 19:20:22.577798       1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0413 19:20:22.577941       1 nfd-worker.go:461] worker (re-)configuration successfully completed
I0413 19:20:22.890711       1 nfd-worker.go:472] starting feature discovery...
I0413 19:20:22.975555       1 nfd-worker.go:484] feature discovery completed
I0413 19:20:22.975582       1 nfd-worker.go:565] sending labeling request to nfd-master

I0413 19:21:33.649158       1 nfd-worker.go:212] reloading configuration
I0413 19:21:33.649538       1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0413 19:21:33.649604       1 nfd-worker.go:461] worker (re-)configuration successfully completed
I0413 19:21:33.976521       1 nfd-worker.go:472] starting feature discovery...
I0413 19:21:33.976964       1 nfd-worker.go:484] feature discovery completed
I0413 19:21:33.976977       1 nfd-worker.go:565] sending labeling request to nfd-master

I0413 19:23:00.630216       1 nfd-worker.go:212] reloading configuration
I0413 19:23:00.630564       1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0413 19:23:00.630616       1 nfd-worker.go:461] worker (re-)configuration successfully completed
I0413 19:23:00.884607       1 nfd-worker.go:472] starting feature discovery...
I0413 19:23:00.885465       1 nfd-worker.go:484] feature discovery completed
I0413 19:23:00.885501       1 nfd-worker.go:565] sending labeling request to nfd-master

Anything else we need to know?:

I'm working on a PR to add a filter arg to utils.CreateFsWatcher which will only return specific fsnotify.Ops.
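For illustration only, the filter could be as simple as wrapping the watcher's event channel and dropping anything whose Op is not in a caller-supplied mask; the function name and channel-based shape below are assumptions for the sketch, not the actual CreateFsWatcher signature:

```go
package utils

import "github.com/fsnotify/fsnotify"

// filterEvents is a hypothetical sketch of the proposed filtering (not the
// actual PR code): it forwards only the events whose Op matches the caller's
// mask, e.g. fsnotify.Write|fsnotify.Create, and silently drops the rest.
func filterEvents(in <-chan fsnotify.Event, mask fsnotify.Op) <-chan fsnotify.Event {
	out := make(chan fsnotify.Event)
	go func() {
		defer close(out)
		for ev := range in {
			if ev.Op&mask != 0 {
				out <- ev
			}
		}
	}()
	return out
}
```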

Environment:

  • Kubernetes version (use kubectl version): 1.21
  • Cloud provider or hardware configuration: baremetal
  • OS (e.g: cat /etc/os-release): RHEL8
  • Install tools: Helm

mac-chaffee avatar Apr 13 '22 19:04 mac-chaffee

Indeed, we should prevent this. But as I mentioned in #806, I think filtering out inotify events is not the right thing to do, as it breaks a lot of corner cases. My suggestion is to check whether the raw config data (file content) has changed, either by checksumming or by a simple bytewise comparison.
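Roughly something like this (a sketch with illustrative names, not actual NFD code): hash the file on every watcher event and only proceed with re-configuration and relabeling when the hash differs from the last applied config:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
)

// lastConfigHash holds the hash of the config content that was last applied.
var lastConfigHash [sha256.Size]byte

// configChanged re-reads the config file and reports whether its bytes differ
// from the previously applied version; if not, the inotify event can be
// ignored and sleepInterval stays in effect.
func configChanged(path string) (bool, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return false, err
	}
	sum := sha256.Sum256(data)
	if sum == lastConfigHash {
		return false, nil
	}
	lastConfigHash = sum
	return true, nil
}

func main() {
	changed, err := configChanged("/etc/kubernetes/node-feature-discovery/nfd-worker.conf")
	fmt.Println(changed, err)
}
```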

marquiz avatar Apr 14 '22 06:04 marquiz

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 13 '22 07:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Aug 12 '22 07:08 k8s-triage-robot

I'm not actually able to reproduce this issue. I see kubelet read/access the file but NFD behaves as expected and doesn't experience any spurious re-labeling because of that. I don't have RHEL 8 though so maybe it's something specific to its kernel or the underlying fs 🧐

marquiz avatar Sep 06 '22 11:09 marquiz

ping @mac-chaffee are you still seeing this? Could it be something that has been fixed after k8s v1.21?

/cc @fmuyassarov

marquiz avatar Sep 30 '22 07:09 marquiz

I've since just disabled NFD, but I'm 99% certain this isn't some ephemeral k8s bug. Kubelet still uses this utility to update configmaps periodically: https://github.com/kubernetes/kubernetes/blob/3ffdfbe286ebcea5d75617da6accaf67f815e0cf/pkg/volume/util/atomic_writer.go

And unless something's changed in NFD's logic for detecting file changes, it still thinks the file changes when that AtomicWriter code executes.

mac-chaffee avatar Sep 30 '22 17:09 mac-chaffee

Also I did test-driven development for #806, so if that test still fails when run against the latest version of nfd, then the bug still exists since the test simulates AtomicWriter.
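For context, the pattern such a simulation boils down to (a standalone sketch with illustrative names, not the actual test code) is AtomicWriter's write path: the payload goes into a fresh timestamped directory and the ..data symlink is then atomically repointed at it, which is what a watcher on the directory reacts to:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// simulateAtomicUpdate mimics the symlink-swap pattern kubelet's AtomicWriter
// uses when it writes a configmap volume: put the payload into a fresh
// timestamped directory, then atomically repoint the ..data symlink at it.
func simulateAtomicUpdate(dir string, payload []byte) error {
	tsDir := fmt.Sprintf("..%d", time.Now().UnixNano())
	if err := os.MkdirAll(filepath.Join(dir, tsDir), 0o755); err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(dir, tsDir, "nfd-worker.conf"), payload, 0o644); err != nil {
		return err
	}
	// Create the new symlink under a temporary name, then rename it over
	// ..data; rename(2) is atomic, so readers never see a partial update.
	tmpLink := filepath.Join(dir, "..data_tmp")
	_ = os.Remove(tmpLink)
	if err := os.Symlink(tsDir, tmpLink); err != nil {
		return err
	}
	return os.Rename(tmpLink, filepath.Join(dir, "..data"))
}

func main() {
	dir, err := os.MkdirTemp("", "fake-configmap")
	if err != nil {
		panic(err)
	}
	// Run the update twice with identical content; point a watcher at dir
	// to see which events each pass generates.
	for i := 0; i < 2; i++ {
		if err := simulateAtomicUpdate(dir, []byte("core:\n  sleepInterval: 300s\n")); err != nil {
			panic(err)
		}
		time.Sleep(time.Second)
	}
	fmt.Println("wrote to", dir)
}
```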

mac-chaffee avatar Sep 30 '22 17:09 mac-chaffee

Could it be something filesystem- or kernel-specific then 🤨 since I haven't really been able to reproduce this. What fs and kernel are you using?

marquiz avatar Oct 03 '22 07:10 marquiz

/remove-lifecycle rotten

fmuyassarov avatar Oct 10 '22 07:10 fmuyassarov

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 08 '23 08:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 07 '23 09:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 09 '23 09:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 09 '23 09:03 k8s-ci-robot