node-feature-discovery
node-feature-discovery copied to clipboard
Only reload config when files change (according to modtime)
Fixes #805 by updating utils.FsWatcher
to only return events to the caller if the file actually changed, according to the modification time of the file.
We still use fsnotify
so we don't poll the filesystem for updates, but now when FsWatcher
detects an fsnotify event, it will check the modification time of the file. If the modification time didn't change, the event is ignored.
Using modification times allows us to support mounting the config file via a k8s ConfigMap, since ConfigMaps get re-synced every minute which triggers a lot of fsnotify events, but doesn't actually change the modtime.
Hi @mac-chaffee. Thanks for your PR.
I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test
on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test
label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: mac-chaffee
To complete the pull request process, please assign marquiz after the PR has been reviewed.
You can assign the PR to them by writing /assign @marquiz
in a comment when ready.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve
in a comment
Approvers can cancel approval by writing /approve cancel
in a comment
Hi @mac-chaffee and thanks for reporting this.
I understand the issue. I'm just don't think this is the right fix 🧐 At a quick glance (on my weary eyes and brain) it looks like you're filtering out the write event but reacting to others. I think we should be interested in write in the general case (think about echo " " >> config.yaml
).
Dunno what would be the correct fix. Calculate a hash over the raw data to see if it really changed(?)
Yeah I got the conditional backwards (copy/paste error). Was wondering why it wasn't working when I tested it haha
Although now that I flipped the conditional, editing the configmap seems to not do anything...
I'll do more testing
/hold
After sleeping over this I don't think this kind of filtering is the right way to solve the problem. It will just open up corner cases that may bite us later. We wouldn't detect e.g. things like
$ mv new.conf /etc/kubernetes/node-feature-discovery/nfd-worker.conf
or
$ mv /etc/kubernetes/node-feature-discovery/nfd-worker.conf disabled.conf
or
$ chmod 000 /etc/kubernetes/node-feature-discovery/nfd-worker.conf
I think you should diff the raw config and check if it actually changed. Either cache the raw data and do bytewise comparison or calculate (and cache) a checksum
Viper seems to be able to detect file changes: https://github.com/spf13/viper/blob/65293ecec2f2d7237de9261f637c42d451f94a3d/viper.go#L431
I had just forgotten that configmaps are actually mounted as symlinks and renamed instead of re-written, that was why my testing didn't work.
I can try to fiddle around with this a little more to hopefully save us the CPU of checksumming every loop, but will resort to that if it gets too complicated
i'm a bit disinclined to this kind of filtering hack as it breaks corner cases. For example, in principle the configuration could be something else than a configmap, a direct hostpath mount or smth
/unhold
I changed the implementation to check the modification time of files instead of trying to filter fsnotify events, which proved to be a little too complex.
I figure checking the modification times is more CPU-friendly (just a stat
and a time compare) than checksumming (open/read/close + checksum
).
I figure checking the modification times is more CPU-friendly (just a stat and a time compare) than checksumming
Yeah, I think this should actually work. Just briefly looked at this and had one comment on the code. I need to take another pass later on /ok-to-test
/retest
The old code fails the test because it just logs a warning and doesn't trigger a reload:
I don't think it takes that code path. IIRC nfd-worker actually exits if the config file exists but is not readable so that is prolly the reason for the test failing in this case (I'm not sure if that's the right thing to do but that's how it works now I think). Soo... with this in mind caching mtime and checking the readability of the file might work...
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity,
lifecycle/stale
is applied - After 30d of inactivity since
lifecycle/stale
was applied,lifecycle/rotten
is applied - After 30d of inactivity since
lifecycle/rotten
was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with
/remove-lifecycle stale
- Mark this issue or PR as rotten with
/lifecycle rotten
- Close this issue or PR with
/close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
I figure checking the modification times is more CPU-friendly (just a stat and a time compare) than checksumming
Checking modification time is not robust, because user can make multiple modifications without changing modification time, due to granularity at which time is stored on some of the file systems. You would need to query underlying FS time granularity and check whether update was within that or not.
At least there should be a check for whether file size changed, if modtime did not.
Checking modification time is not robust, because user can make multiple modifications without changing modification time, due to granularity at which time is stored on some of the file systems.
Good point. Even if this would be a virtually non-existent corner case in our context, seems like another reason I think we should just resort to simple and foolproof checksumming of the file contents 🧐
@mac-chaffee: PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@mac-chaffee: The following tests failed, say /retest
to rerun all failed tests or /retest-required
to rerun all mandatory failed tests:
Test name | Commit | Details | Required | Rerun command |
---|---|---|---|---|
pull-node-feature-discovery-build-image-generic | d2735f8ed80dd30c25b593940ef097b462ad99b3 | link | true | /test pull-node-feature-discovery-build-image-generic |
pull-node-feature-discovery-build-gh-pages-generic | d2735f8ed80dd30c25b593940ef097b462ad99b3 | link | true | /test pull-node-feature-discovery-build-gh-pages-generic |
pull-node-feature-discovery-verify-master | d2735f8ed80dd30c25b593940ef097b462ad99b3 | link | true | /test pull-node-feature-discovery-verify-master |
pull-node-feature-discovery-verify-docs-master | d2735f8ed80dd30c25b593940ef097b462ad99b3 | link | true | /test pull-node-feature-discovery-verify-docs-master |
pull-node-feature-discovery-build-image-cross-generic | d2735f8ed80dd30c25b593940ef097b462ad99b3 | link | true | /test pull-node-feature-discovery-build-image-cross-generic |
pull-node-feature-discovery-build-master | d2735f8ed80dd30c25b593940ef097b462ad99b3 | link | true | /test pull-node-feature-discovery-build-master |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
If you'll only accept a checksum-based solution, then we can close this modtime-based PR. Feel free to reuse some or all of this code.