node-feature-discovery icon indicating copy to clipboard operation
node-feature-discovery copied to clipboard

Only reload config when files change (according to modtime)

Open mac-chaffee opened this issue 2 years ago • 13 comments

Fixes #805 by updating utils.FsWatcher to only return events to the caller if the file actually changed, according to the modification time of the file.

We still use fsnotify so we don't poll the filesystem for updates, but now when FsWatcher detects an fsnotify event, it will check the modification time of the file. If the modification time didn't change, the event is ignored.

Using modification times allows us to support mounting the config file via a k8s ConfigMap, since ConfigMaps get re-synced every minute which triggers a lot of fsnotify events, but doesn't actually change the modtime.

mac-chaffee avatar Apr 13 '22 19:04 mac-chaffee

Hi @mac-chaffee. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 13 '22 19:04 k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mac-chaffee To complete the pull request process, please assign marquiz after the PR has been reviewed. You can assign the PR to them by writing /assign @marquiz in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Apr 13 '22 19:04 k8s-ci-robot

Hi @mac-chaffee and thanks for reporting this.

I understand the issue. I'm just don't think this is the right fix 🧐 At a quick glance (on my weary eyes and brain) it looks like you're filtering out the write event but reacting to others. I think we should be interested in write in the general case (think about echo " " >> config.yaml).

Dunno what would be the correct fix. Calculate a hash over the raw data to see if it really changed(?)

marquiz avatar Apr 13 '22 20:04 marquiz

Yeah I got the conditional backwards (copy/paste error). Was wondering why it wasn't working when I tested it haha

Although now that I flipped the conditional, editing the configmap seems to not do anything...

I'll do more testing

/hold

mac-chaffee avatar Apr 13 '22 20:04 mac-chaffee

After sleeping over this I don't think this kind of filtering is the right way to solve the problem. It will just open up corner cases that may bite us later. We wouldn't detect e.g. things like

$ mv new.conf /etc/kubernetes/node-feature-discovery/nfd-worker.conf

or

$ mv /etc/kubernetes/node-feature-discovery/nfd-worker.conf disabled.conf

or

$ chmod 000 /etc/kubernetes/node-feature-discovery/nfd-worker.conf

I think you should diff the raw config and check if it actually changed. Either cache the raw data and do bytewise comparison or calculate (and cache) a checksum

marquiz avatar Apr 14 '22 05:04 marquiz

Viper seems to be able to detect file changes: https://github.com/spf13/viper/blob/65293ecec2f2d7237de9261f637c42d451f94a3d/viper.go#L431

I had just forgotten that configmaps are actually mounted as symlinks and renamed instead of re-written, that was why my testing didn't work.

mac-chaffee avatar Apr 14 '22 19:04 mac-chaffee

I can try to fiddle around with this a little more to hopefully save us the CPU of checksumming every loop, but will resort to that if it gets too complicated

mac-chaffee avatar Apr 14 '22 19:04 mac-chaffee

i'm a bit disinclined to this kind of filtering hack as it breaks corner cases. For example, in principle the configuration could be something else than a configmap, a direct hostpath mount or smth

marquiz avatar Apr 21 '22 11:04 marquiz

/unhold

I changed the implementation to check the modification time of files instead of trying to filter fsnotify events, which proved to be a little too complex.

I figure checking the modification times is more CPU-friendly (just a stat and a time compare) than checksumming (open/read/close + checksum).

mac-chaffee avatar Apr 25 '22 20:04 mac-chaffee

I figure checking the modification times is more CPU-friendly (just a stat and a time compare) than checksumming

Yeah, I think this should actually work. Just briefly looked at this and had one comment on the code. I need to take another pass later on /ok-to-test

marquiz avatar Apr 26 '22 09:04 marquiz

/retest

mac-chaffee avatar May 05 '22 22:05 mac-chaffee

The old code fails the test because it just logs a warning and doesn't trigger a reload:

I don't think it takes that code path. IIRC nfd-worker actually exits if the config file exists but is not readable so that is prolly the reason for the test failing in this case (I'm not sure if that's the right thing to do but that's how it works now I think). Soo... with this in mind caching mtime and checking the readability of the file might work...

marquiz avatar May 09 '22 10:05 marquiz

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 07 '22 10:08 k8s-triage-robot

I figure checking the modification times is more CPU-friendly (just a stat and a time compare) than checksumming

Checking modification time is not robust, because user can make multiple modifications without changing modification time, due to granularity at which time is stored on some of the file systems. You would need to query underlying FS time granularity and check whether update was within that or not.

At least there should be a check for whether file size changed, if modtime did not.

eero-t avatar Aug 24 '22 10:08 eero-t

Checking modification time is not robust, because user can make multiple modifications without changing modification time, due to granularity at which time is stored on some of the file systems.

Good point. Even if this would be a virtually non-existent corner case in our context, seems like another reason I think we should just resort to simple and foolproof checksumming of the file contents 🧐

marquiz avatar Aug 24 '22 15:08 marquiz

@mac-chaffee: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 08 '22 20:09 k8s-ci-robot

@mac-chaffee: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-node-feature-discovery-build-image-generic d2735f8ed80dd30c25b593940ef097b462ad99b3 link true /test pull-node-feature-discovery-build-image-generic
pull-node-feature-discovery-build-gh-pages-generic d2735f8ed80dd30c25b593940ef097b462ad99b3 link true /test pull-node-feature-discovery-build-gh-pages-generic
pull-node-feature-discovery-verify-master d2735f8ed80dd30c25b593940ef097b462ad99b3 link true /test pull-node-feature-discovery-verify-master
pull-node-feature-discovery-verify-docs-master d2735f8ed80dd30c25b593940ef097b462ad99b3 link true /test pull-node-feature-discovery-verify-docs-master
pull-node-feature-discovery-build-image-cross-generic d2735f8ed80dd30c25b593940ef097b462ad99b3 link true /test pull-node-feature-discovery-build-image-cross-generic
pull-node-feature-discovery-build-master d2735f8ed80dd30c25b593940ef097b462ad99b3 link true /test pull-node-feature-discovery-build-master

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot avatar Sep 19 '22 19:09 k8s-ci-robot

If you'll only accept a checksum-based solution, then we can close this modtime-based PR. Feel free to reuse some or all of this code.

mac-chaffee avatar Sep 30 '22 17:09 mac-chaffee