                        core.sleepInterval ignored; kubelet sync interval causes relabeling every 60 seconds
What happened:
Even if you raise core.sleepInterval above the default of 60 seconds, NFD will still relabel nodes every 60 seconds. This is because the sleepInterval is ignored if the configmap "changes". NFD believes the configmap is being "changed" every 60 seconds because that is kubelet's default --sync-frequency.
Kubelet doesn't actually modify the file; if you run inotifyd inside an Alpine pod with a mounted configmap, you can see that kubelet opens, accesses, and closes the file at least every 60 seconds.
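For reference, here is a rough Go equivalent of that inotifyd experiment (the configmap mount path is an assumption, and the event decoding is only a sketch): it watches the mounted directory with a raw inotify descriptor and logs every event, including the open/access/close events that fsnotify-style watchers normally don't report.

```go
// observe.go: log all raw inotify events on the configmap mount directory,
// similar to running inotifyd in a pod. Sketch only; the path is an assumption.
package main

import (
	"bytes"
	"encoding/binary"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	dir := "/etc/kubernetes/node-feature-discovery" // assumed configmap mount path

	fd, err := unix.InotifyInit1(0)
	if err != nil {
		log.Fatalf("inotify_init1: %v", err)
	}
	// Subscribe to everything, including IN_OPEN/IN_ACCESS/IN_CLOSE_*, which
	// higher-level watchers typically filter out.
	if _, err := unix.InotifyAddWatch(fd, dir, unix.IN_ALL_EVENTS); err != nil {
		log.Fatalf("inotify_add_watch: %v", err)
	}

	buf := make([]byte, 4096)
	for {
		n, err := unix.Read(fd, buf)
		if err != nil {
			log.Fatalf("read: %v", err)
		}
		// Each record: wd(4) mask(4) cookie(4) len(4) name(len); little-endian
		// assumed here (x86_64/aarch64 hosts).
		for off := 0; off < n; {
			mask := binary.LittleEndian.Uint32(buf[off+4 : off+8])
			nameLen := int(binary.LittleEndian.Uint32(buf[off+12 : off+16]))
			name := string(bytes.TrimRight(buf[off+16:off+16+nameLen], "\x00"))
			log.Printf("mask=%#08x open=%t access=%t modify=%t name=%q",
				mask, mask&unix.IN_OPEN != 0, mask&unix.IN_ACCESS != 0,
				mask&unix.IN_MODIFY != 0, name)
			off += 16 + nameLen
		}
	}
}
```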
What you expected to happen:
NFD should only relabel nodes when the configmap content was actually modified, not whenever the file is merely opened, accessed, and closed by kubelet.
How to reproduce it (as minimally and precisely as possible):
- Deploy NFD (via Helm) and set core.sleepInterval: 300s
- Observe that relabeling still happens every 60 seconds or so, with logs like the following (note the "reloading configuration" lines):
 
I0413 19:20:22.577223       1 nfd-worker.go:212] reloading configuration
I0413 19:20:22.577798       1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0413 19:20:22.577941       1 nfd-worker.go:461] worker (re-)configuration successfully completed
I0413 19:20:22.890711       1 nfd-worker.go:472] starting feature discovery...
I0413 19:20:22.975555       1 nfd-worker.go:484] feature discovery completed
I0413 19:20:22.975582       1 nfd-worker.go:565] sending labeling request to nfd-master
I0413 19:21:33.649158       1 nfd-worker.go:212] reloading configuration
I0413 19:21:33.649538       1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0413 19:21:33.649604       1 nfd-worker.go:461] worker (re-)configuration successfully completed
I0413 19:21:33.976521       1 nfd-worker.go:472] starting feature discovery...
I0413 19:21:33.976964       1 nfd-worker.go:484] feature discovery completed
I0413 19:21:33.976977       1 nfd-worker.go:565] sending labeling request to nfd-master
I0413 19:23:00.630216       1 nfd-worker.go:212] reloading configuration
I0413 19:23:00.630564       1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0413 19:23:00.630616       1 nfd-worker.go:461] worker (re-)configuration successfully completed
I0413 19:23:00.884607       1 nfd-worker.go:472] starting feature discovery...
I0413 19:23:00.885465       1 nfd-worker.go:484] feature discovery completed
I0413 19:23:00.885501       1 nfd-worker.go:565] sending labeling request to nfd-master
Anything else we need to know?:
I'm working on a PR to add a filter arg to utils.CreateFsWatcher which will only return specific fsnotify.Ops.
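For illustration, here is a minimal sketch of that idea (the function name and signature are assumptions, not the actual utils.CreateFsWatcher API): an fsnotify wrapper that only forwards events whose Op matches a caller-supplied mask, so e.g. Chmod-only noise never reaches the config-reload path.

```go
// Sketch of an op-filtered fsnotify watcher; not NFD's actual implementation.
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// createFilteredWatcher watches the given paths and forwards only events whose
// Op intersects opFilter (a bitmask of fsnotify.Op values).
func createFilteredWatcher(opFilter fsnotify.Op, paths ...string) (chan fsnotify.Event, error) {
	w, err := fsnotify.NewWatcher()
	if err != nil {
		return nil, err
	}
	for _, p := range paths {
		if err := w.Add(p); err != nil {
			w.Close()
			return nil, err
		}
	}

	out := make(chan fsnotify.Event)
	go func() {
		for {
			select {
			case ev, ok := <-w.Events:
				if !ok {
					close(out)
					return
				}
				if ev.Op&opFilter != 0 { // drop events that were not asked for
					out <- ev
				}
			case err, ok := <-w.Errors:
				if !ok {
					return
				}
				log.Printf("fsnotify error: %v", err)
			}
		}
	}()
	return out, nil
}

func main() {
	// React to writes, creates, renames and removes; ignore Chmod.
	events, err := createFilteredWatcher(
		fsnotify.Write|fsnotify.Create|fsnotify.Rename|fsnotify.Remove,
		"/etc/kubernetes/node-feature-discovery")
	if err != nil {
		log.Fatal(err)
	}
	for ev := range events {
		log.Printf("config event: %s %s", ev.Op, ev.Name)
	}
}
```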
Environment:
- Kubernetes version (use kubectl version): 1.21
- Cloud provider or hardware configuration: bare metal
- OS (e.g. cat /etc/os-release): RHEL 8
- Install tools: Helm
 
Indeed, we should prevent this. But as I mentioned in #806, I think filtering out inotify events is not the right thing to do, as it breaks a lot of corner cases. My suggestion is to check whether the raw config data (the file content) has actually changed, either by checksumming it or by a simple bytewise comparison.
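For illustration, a minimal sketch of that content-comparison approach (the names and types here are assumptions, not NFD's actual code): hash the raw config file whenever a watch event fires and only trigger a reload when the bytes have actually changed.

```go
// Sketch: reload the config only when its content hash changes; not NFD's actual code.
package main

import (
	"crypto/sha256"
	"log"
	"os"
)

type configWatcher struct {
	path     string
	lastHash [sha256.Size]byte
}

// changed re-reads the config file and reports whether its content differs from
// what was last seen, updating the stored hash when it does.
func (c *configWatcher) changed() (bool, error) {
	data, err := os.ReadFile(c.path)
	if err != nil {
		return false, err
	}
	sum := sha256.Sum256(data)
	if sum == c.lastHash {
		return false, nil // same bytes: ignore the fsnotify event, keep sleepInterval timing
	}
	c.lastHash = sum
	return true, nil
}

func main() {
	w := &configWatcher{path: "/etc/kubernetes/node-feature-discovery/nfd-worker.conf"}
	// In the real worker loop this would run on every fsnotify event, and the
	// config would only be re-parsed (and nodes relabeled) when it returns true.
	if ok, err := w.changed(); err != nil {
		log.Fatal(err)
	} else {
		log.Printf("config content changed: %v", ok)
	}
}
```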
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
 
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
 
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
I'm not actually able to reproduce this issue. I see kubelet read/access the file, but NFD behaves as expected and doesn't do any spurious re-labeling because of that. I don't have RHEL 8, though, so maybe it's something specific to its kernel or the underlying fs 🧐
ping @mac-chaffee, are you still seeing this? Could it be something that has been fixed after k8s v1.21?
/cc @fmuyassarov
I've since just disabled NFD, but I'm 99% certain this isn't some ephemeral k8s bug. Kubelet still uses this utility to update configmaps periodically: https://github.com/kubernetes/kubernetes/blob/3ffdfbe286ebcea5d75617da6accaf67f815e0cf/pkg/volume/util/atomic_writer.go
And unless something's changed in NFD's logic for detecting file changes, it still thinks the file changes when that AtomicWriter code executes.
Also, I did test-driven development for #806, so if that test still fails when run against the latest version of NFD, the bug still exists, since the test simulates AtomicWriter.
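For context, AtomicWriter publishes configmap payloads by writing them into a new timestamped directory and atomically swapping a ..data symlink that the user-visible file names point through, so a watcher on the mount directory can see create/rename events even when the bytes behind the symlink are identical. A rough, self-contained simulation of that pattern (an approximation for illustration, not the real kubelet code):

```go
// Simulate kubelet's AtomicWriter-style symlink swap; approximation only.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// atomicUpdate writes the payload into a fresh timestamped directory and then
// atomically repoints the ..data symlink at it, mimicking atomic_writer.go.
func atomicUpdate(volumeDir string, files map[string][]byte) error {
	tsDir := filepath.Join(volumeDir, "..ts-"+time.Now().Format("20060102_150405.000000000"))
	if err := os.Mkdir(tsDir, 0o755); err != nil {
		return err
	}
	for name, data := range files {
		if err := os.WriteFile(filepath.Join(tsDir, name), data, 0o644); err != nil {
			return err
		}
	}
	tmpLink := filepath.Join(volumeDir, "..data_tmp")
	if err := os.Symlink(filepath.Base(tsDir), tmpLink); err != nil {
		return err
	}
	// rename(2) atomically replaces any existing ..data symlink.
	return os.Rename(tmpLink, filepath.Join(volumeDir, "..data"))
}

func main() {
	dir, _ := os.MkdirTemp("", "cm")
	defer os.RemoveAll(dir)

	payload := map[string][]byte{"nfd-worker.conf": []byte("core:\n  sleepInterval: 300s\n")}

	// First publish, plus the user-visible symlink that a configmap mount exposes.
	_ = atomicUpdate(dir, payload)
	_ = os.Symlink(filepath.Join("..data", "nfd-worker.conf"), filepath.Join(dir, "nfd-worker.conf"))

	// A second publish with identical content still produces create/rename
	// events in the directory, even though the readable bytes are unchanged.
	_ = atomicUpdate(dir, payload)

	data, _ := os.ReadFile(filepath.Join(dir, "nfd-worker.conf"))
	fmt.Printf("content after updates: %q\n", data)
}
```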
Could it be something filesystem- or kernel-specific, then? 🤨 I haven't really been able to reproduce this. What fs and kernel are you using?
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
 
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
 
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
 
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.