
core.sleepInterval ignored; kubelet sync interval causes relabeling every 60 seconds

Open mac-chaffee opened this issue 3 years ago • 9 comments

What happened:

Even if you raise core.sleepInterval above the default of 60 seconds, NFD will still relabel nodes every 60 seconds. This is because the sleepInterval is ignored if the configmap "changes". NFD believes the configmap is being "changed" every 60 seconds because that is kubelet's default --sync-frequency.

Kubelet doesn't actually modify the file; if you just run inotifyd inside an alpine pod with a mounted configmap, you see that kubelet opens, accesses, and closes the file at least every 60 seconds.
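For reference, the same directory can be observed with a standalone Go sketch (not NFD code) using fsnotify, the library NFD's own watcher is built on; it simply logs whichever events kubelet's periodic sync generates at that level. Note that fsnotify does not surface the raw open/access/close events that inotifyd shows, only ops such as Create, Write, Remove, Rename and Chmod:

```go
// Standalone sketch: log every fsnotify event on the mounted configmap
// directory to see what kubelet's sync loop actually triggers there.
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Watch the directory rather than the file: kubelet projects configmap
	// volumes through a ..data symlink, so updates show up as directory events.
	if err := watcher.Add("/etc/kubernetes/node-feature-discovery"); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case ev := <-watcher.Events:
			log.Printf("event: %s on %s", ev.Op, ev.Name)
		case err := <-watcher.Errors:
			log.Printf("watch error: %v", err)
		}
	}
}
```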

What you expected to happen:

NFD should only relabel nodes when the configmap content is actually modified, not merely because kubelet opened/accessed/closed the file.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy nfd (via helm) and set core.sleepInterval: 300s
  2. Observe relabeling still happens every 60 seconds or so, with these logs (notice the "reloading configuration" line)
I0413 19:20:22.577223       1 nfd-worker.go:212] reloading configuration
I0413 19:20:22.577798       1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0413 19:20:22.577941       1 nfd-worker.go:461] worker (re-)configuration successfully completed
I0413 19:20:22.890711       1 nfd-worker.go:472] starting feature discovery...
I0413 19:20:22.975555       1 nfd-worker.go:484] feature discovery completed
I0413 19:20:22.975582       1 nfd-worker.go:565] sending labeling request to nfd-master

I0413 19:21:33.649158       1 nfd-worker.go:212] reloading configuration
I0413 19:21:33.649538       1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0413 19:21:33.649604       1 nfd-worker.go:461] worker (re-)configuration successfully completed
I0413 19:21:33.976521       1 nfd-worker.go:472] starting feature discovery...
I0413 19:21:33.976964       1 nfd-worker.go:484] feature discovery completed
I0413 19:21:33.976977       1 nfd-worker.go:565] sending labeling request to nfd-master

I0413 19:23:00.630216       1 nfd-worker.go:212] reloading configuration
I0413 19:23:00.630564       1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0413 19:23:00.630616       1 nfd-worker.go:461] worker (re-)configuration successfully completed
I0413 19:23:00.884607       1 nfd-worker.go:472] starting feature discovery...
I0413 19:23:00.885465       1 nfd-worker.go:484] feature discovery completed
I0413 19:23:00.885501       1 nfd-worker.go:565] sending labeling request to nfd-master

Anything else we need to know?:

I'm working on a PR to add a filter arg to utils.CreateFsWatcher which will only return specific fsnotify.Ops.
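For illustration only, the filter could be as simple as wrapping the watcher's event channel and dropping anything whose Op is not in a caller-supplied mask; the function name and channel-based shape below are assumptions for the sketch, not the actual CreateFsWatcher signature:

```go
package utils

import "github.com/fsnotify/fsnotify"

// filterEvents is a hypothetical sketch of the proposed filtering (not the
// actual PR code): it forwards only the events whose Op matches the caller's
// mask, e.g. fsnotify.Write|fsnotify.Create, and silently drops the rest.
func filterEvents(in <-chan fsnotify.Event, mask fsnotify.Op) <-chan fsnotify.Event {
	out := make(chan fsnotify.Event)
	go func() {
		defer close(out)
		for ev := range in {
			if ev.Op&mask != 0 {
				out <- ev
			}
		}
	}()
	return out
}
```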

Environment:

  • Kubernetes version (use kubectl version): 1.21
  • Cloud provider or hardware configuration: baremetal
  • OS (e.g: cat /etc/os-release): RHEL8
  • Install tools: Helm

mac-chaffee avatar Apr 13 '22 19:04 mac-chaffee

Indeed, we should prevent this. But as I mentioned in #806, I think filtering out inotify events is not the right thing to do, as it breaks a lot of corner cases. My suggestion is to check whether the raw config data (file content) has changed, either by checksumming or by a simple bytewise comparison.
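Roughly something like this (a sketch with illustrative names, not actual NFD code): hash the file on every watcher event and only proceed with re-configuration and relabeling when the hash differs from the last applied config:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
)

// lastConfigHash holds the hash of the config content that was last applied.
var lastConfigHash [sha256.Size]byte

// configChanged re-reads the config file and reports whether its bytes differ
// from the previously applied version; if not, the inotify event can be
// ignored and sleepInterval stays in effect.
func configChanged(path string) (bool, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return false, err
	}
	sum := sha256.Sum256(data)
	if sum == lastConfigHash {
		return false, nil
	}
	lastConfigHash = sum
	return true, nil
}

func main() {
	changed, err := configChanged("/etc/kubernetes/node-feature-discovery/nfd-worker.conf")
	fmt.Println(changed, err)
}
```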

marquiz avatar Apr 14 '22 06:04 marquiz

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 13 '22 07:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Aug 12 '22 07:08 k8s-triage-robot

I'm not actually able to reproduce this issue. I see kubelet read/access the file but NFD behaves as expected and doesn't experience any spurious re-labeling because of that. I don't have RHEL 8 though so maybe it's something specific to its kernel or the underlying fs 🧐

marquiz avatar Sep 06 '22 11:09 marquiz

ping @mac-chaffee are you still seeing this? Could it be something that has been fixed after k8s v1.21?

/cc @fmuyassarov

marquiz avatar Sep 30 '22 07:09 marquiz

I've since just disabled NFD, but I'm 99% certain this isn't some ephemeral k8s bug. Kubelet still uses this utility to update configmaps periodically: https://github.com/kubernetes/kubernetes/blob/3ffdfbe286ebcea5d75617da6accaf67f815e0cf/pkg/volume/util/atomic_writer.go

And unless something's changed in NFD's logic for detecting file changes, it still thinks the file changes when that AtomicWriter code executes.

mac-chaffee avatar Sep 30 '22 17:09 mac-chaffee

Also I did test-driven development for #806, so if that test still fails when run against the latest version of nfd, then the bug still exists since the test simulates AtomicWriter.
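For context, the pattern such a simulation boils down to (a standalone sketch with illustrative names, not the actual test code) is AtomicWriter's write path: the payload goes into a fresh timestamped directory and the ..data symlink is then atomically repointed at it, which is what a watcher on the directory reacts to:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// simulateAtomicUpdate mimics the symlink-swap pattern kubelet's AtomicWriter
// uses when it writes a configmap volume: put the payload into a fresh
// timestamped directory, then atomically repoint the ..data symlink at it.
func simulateAtomicUpdate(dir string, payload []byte) error {
	tsDir := fmt.Sprintf("..%d", time.Now().UnixNano())
	if err := os.MkdirAll(filepath.Join(dir, tsDir), 0o755); err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(dir, tsDir, "nfd-worker.conf"), payload, 0o644); err != nil {
		return err
	}
	// Create the new symlink under a temporary name, then rename it over
	// ..data; rename(2) is atomic, so readers never see a partial update.
	tmpLink := filepath.Join(dir, "..data_tmp")
	_ = os.Remove(tmpLink)
	if err := os.Symlink(tsDir, tmpLink); err != nil {
		return err
	}
	return os.Rename(tmpLink, filepath.Join(dir, "..data"))
}

func main() {
	dir, err := os.MkdirTemp("", "fake-configmap")
	if err != nil {
		panic(err)
	}
	// Run the update twice with identical content; point a watcher at dir
	// to see which events each pass generates.
	for i := 0; i < 2; i++ {
		if err := simulateAtomicUpdate(dir, []byte("core:\n  sleepInterval: 300s\n")); err != nil {
			panic(err)
		}
		time.Sleep(time.Second)
	}
	fmt.Println("wrote to", dir)
}
```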

mac-chaffee avatar Sep 30 '22 17:09 mac-chaffee

Could it be something filesystem- or kernel-specific then 🤨 since I haven't really been able to reproduce this. What fs and kernel are you using?

marquiz avatar Oct 03 '22 07:10 marquiz

/remove-lifecycle rotten

fmuyassarov avatar Oct 10 '22 07:10 fmuyassarov

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 08 '23 08:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 07 '23 09:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 09 '23 09:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 09 '23 09:03 k8s-ci-robot