self-node-remediation
[WIP] Dynamically set safe time to assume node rebooted seconds
Why we need this PR
Safe Time to reboot is a configured value; however, it can't be lower than a minimum calculated value. The calculated value may differ between clusters, so we need a mechanism to override the configured Safe Time to reboot when it's lower than the calculated value. This conflict can cause the agents to crash-loop when the operator is installed, since the agents will not run with an invalid (lower than the calculated value) Safe Time to reboot.
Changes made
- The minimum time to assume the node rebooted is calculated when the SNR agent is initialized and stored in an annotation on the configuration.
- In case the SafeTime in the configuration spec is lower than the calculated value, the value in the configuration is overridden (this can happen when another field that affects the calculation is changed).
- When SafeTime is changed in the configuration, a webhook verifies the new value against the calculated minimum stored in the config annotation (rough sketch below).
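A rough sketch of the webhook check described above, using a minimal placeholder config type; the annotation key and field names are illustrative, not necessarily the ones used in this PR:

```go
package main

import (
	"fmt"
	"strconv"
)

// Assumed annotation key; the actual key used by SNR may differ.
const minSafeTimeAnnotation = "self-node-remediation.medik8s.io/min-safe-time-seconds"

// Placeholder shape of the configuration object, just enough for the sketch.
type Config struct {
	Annotations                         map[string]string
	SafeTimeToAssumeNodeRebootedSeconds int
}

// validateSafeTime is the check a validating webhook could run on update:
// reject a configured SafeTime that is lower than the calculated minimum
// recorded in the annotation.
func validateSafeTime(c Config) error {
	minStr, ok := c.Annotations[minSafeTimeAnnotation]
	if !ok {
		return nil // no calculated minimum recorded yet, nothing to validate against
	}
	minSeconds, err := strconv.Atoi(minStr)
	if err != nil {
		return fmt.Errorf("invalid %s annotation %q: %w", minSafeTimeAnnotation, minStr, err)
	}
	if c.SafeTimeToAssumeNodeRebootedSeconds < minSeconds {
		return fmt.Errorf("safeTimeToAssumeNodeRebootedSeconds %d is lower than the calculated minimum %d",
			c.SafeTimeToAssumeNodeRebootedSeconds, minSeconds)
	}
	return nil
}

func main() {
	cfg := Config{
		Annotations:                         map[string]string{minSafeTimeAnnotation: "315"},
		SafeTimeToAssumeNodeRebootedSeconds: 300,
	}
	fmt.Println(validateSafeTime(cfg)) // rejected: 300 < 315
}
```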
Which issue(s) this PR fixes
Test plan
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mshitrit
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [mshitrit]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/test 4.15-openshift-e2e
/test 4.15-openshift-e2e
/lgtm
Giving a chance to get more reviews
/hold
- I don't think it's a good idea to set the spec in a controller, even less so when it's overwriting a configuration value which can be set by users
I agree it's not ideal; however, we will override the value only in 2 use cases:
- When SNR starts and the default value would cause the SNR agent to crash. At that stage the user hasn't had a chance to set any value yet, and any value set by the user later will be verified by the webhook.
- When the user changes another configuration field that causes SafeTime to become invalid because the minimum safe time changes (which would also lead to the SNR agent crashing).
- Even more, the value we want to set is dynamic: it depends on node-specific values (watchdog timeout) and cluster size. So we can potentially have many agents fighting for the "right" value.
In case several agents have different values which make SafeTime invalid, the highest (safest) one will be applied. The annotation used by the webhook may still have a too-low value though (we can discuss whether to address this later in case it's relevant).
IMO this is the best among the alternatives, but I think it's worth sharing them in case you disagree or have something else in mind:
- Removing SafeTime from the Spec (major API change)
- Removing the default value, so the initial value will always be set dynamically
- Doing nothing (keep the crash-loop bug)
- Setting a fixed higher default value for SafeTime (will reduce the occurrence of this bug without really fixing it, while increasing overall remediation time for small clusters)
> we will override the value only in 2 use cases
and when users set a too low value, not?
> - Removing SafeTime from the Spec (major API change)
> - Removing the default value, so the initial value will always be set dynamically
> - Doing nothing (keep the crash-loop bug)
> - Setting a fixed higher default value for SafeTime (will reduce the occurrence of this bug without really fixing it, while increasing overall remediation time for small clusters)
just do not modify the spec but use the status instead, for reporting problems and the values which are actually used?
Spec is always the DESIRED state. When it can't be reached, modifying the spec isn't the solution.
> and when users set a too low value, not?
In this case it'll be rejected by the webhook.
> just do not modify the spec but use the status instead, for reporting problems and the values which are actually used?
Hmm, I don't quite follow, maybe you can give an example?
I'm with Marc on this one, operators should not modify the spec. If the value is missing or too low, either:
- refuse to start (taking advantage of crash-loop-backoff)
- refuse to make progress and report an error via the status
- use a calculated value and report a warning via the status
> I'm with Marc on this one, operators should not modify the spec. If the value is missing or too low, either:
> - refuse to start (taking advantage of crash-loop-backoff)
> - refuse to make progress and report an error via the status
> - use a calculated value and report a warning via the status
I think option 1 isn't really viable - having the operator crash-loop from the get-go with the default configuration seems to me like an awful user experience.
Options 2 and 3 are better in that respect, but still not great IMO.
How about removing the default value of SafeTime and making it optional?
- In case it's not assigned - we can use the calculated value
- In case the user tries to assign an invalid value, we can use the webhook to reject it.
- In case the user manipulates another field, making minSafeTime higher than SafeTime, we can fall back to option 2 or 3 (see the sketch after this list).
- Same as above for an upgrade use case where the value is already invalid.
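To make the proposal concrete, a minimal sketch of the fallback, assuming SafeTime becomes an optional pointer field with no default; names are illustrative:

```go
package main

import "fmt"

// effectiveSafeTimeSeconds sketches the resolution order for an optional SafeTime:
// nil means "not set by the user", so the calculated minimum is used; an existing
// value below the calculated minimum (e.g. after an upgrade, or after another field
// changed) also falls back to the calculated minimum and should be reported.
func effectiveSafeTimeSeconds(specSafeTime *int, calculatedMin int) (int, bool) {
	if specSafeTime == nil {
		return calculatedMin, false // unset: use the calculated value silently
	}
	if *specSafeTime < calculatedMin {
		return calculatedMin, true // invalid: fall back and warn via status/event
	}
	return *specSafeTime, false
}

func main() {
	userValue := 300
	value, warn := effectiveSafeTimeSeconds(&userValue, 315)
	fmt.Println(value, warn) // 315 true: calculated minimum wins, warning expected

	value, warn = effectiveSafeTimeSeconds(nil, 315)
	fmt.Println(value, warn) // 315 false: nothing configured, calculated minimum used
}
```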
No objection, but if you need to fall back to 2 or 3 anyway... why not always use that mechanism?
> No objection, but if you need to fall back to 2 or 3 anyway... why not always use that mechanism?
I'm using the fallback to cover edge cases which are rare; using this mechanism for all use cases is a terrible user experience IMO.
For example, let's say that we always use option 2 and consider the following scenario:
- User installs SNR for the first time without modifying any of the defaults
- SNR default safe time is 300s
- For that specific cluster the calculated min Safe Time is 315s
- As a result SNR will not work (and report problem in the status)
I think that as a user it's reasonable to expect the operator to work with the default configuration, which will not be the case here. In the best case the user notices it after installation; in the worst case the user assumes that SNR is working. Either way I think it's a bad user experience.
> As a result SNR will not work (and report problem in the status)
It can work, just report in the status what you are doing, e.g. using the calculated value. However, I agree that the user experience isn't great when the default value doesn't work...
> How about removing the default value of SafeTime and making it optional?
Wfm. Plus always report the calculated value in the status, plus a warning in case it's higher than the spec's value (if that is filled).
> In case several agents have different values which make SafeTime invalid, the highest (safest) one will be applied
that will make fencing slower than needed on nodes with lower watchdog timeout, but I have no better idea without bigger changes to the process... wfm
> Plus always report the calculated value in the status
ATM it's stored in an annotation, is that good enough or do you think it's mandatory to have it in the status?
> plus a warning in case it's higher than the spec's value (if that is filled)
Using an example to clarify and make sure we mean the same thing:
- in case min Safe Time changes (due to a change of another field that affects it)
- if the new value is higher than Safe Time
- then add a warning to the Config status (rough sketch after this list)
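A sketch of what such a warning could look like as a status condition; the condition type and reason are made-up names, not taken from the SNR API, and the apimachinery helper is just one way to do it:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setSafeTimeWarning records a warning condition on the config status when the
// calculated minimum exceeds the configured SafeTime. "SafeTimeTooLow" and the
// reason string are illustrative placeholders.
func setSafeTimeWarning(conditions *[]metav1.Condition, configured, calculatedMin int) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:   "SafeTimeTooLow",
		Status: metav1.ConditionTrue,
		Reason: "CalculatedMinimumExceedsConfigured",
		Message: fmt.Sprintf("calculated minimum %ds is higher than the configured %ds; the calculated value is used",
			calculatedMin, configured),
	})
}

func main() {
	var conditions []metav1.Condition
	setSafeTimeWarning(&conditions, 300, 315)
	fmt.Println(conditions[0].Type, conditions[0].Message)
}
```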
> do you think it's mandatory to have it in the status
Status has better visibility IMHO, and its fields have docs automatically.
> in case min Safe Time changes (due to a change of another field that affects it)
This sounds too complicated and it's unclear to me which field you mean 🤔 I just mean: when the calculated value in the status is bigger than the configured value in the spec, add a warning in the status (besides a log and an event) that the calculated value overrides the configured one. No matter if it's on creation, update or deletion of any values 🤷🏼‍♂️
> when the calculated value in the status is bigger than the configured value in the spec
That was my intention.
> This sounds too complicated and it's unclear to me which field you mean
My intention was that the above scenario will occur because of a change of one of the fields which affect min Safe Time (i.e. apiCheckInterval, apiServerTimeout, maxErrorThreshold etc.), because other conditions will be prevented by webhook validation (rough illustration below).
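To illustrate the kind of dependency I mean (this is NOT the actual SNR formula, just a sketch of how those fields could feed into a calculated minimum):

```go
package main

import (
	"fmt"
	"time"
)

// illustrativeMinSafeTime sketches how a calculated minimum could depend on the
// API-check parameters, the node's watchdog timeout and the cluster size.
// The real SNR calculation may combine these differently.
func illustrativeMinSafeTime(apiCheckInterval, apiServerTimeout time.Duration,
	maxErrorThreshold int, watchdogTimeout time.Duration, numNodes int) time.Duration {

	// time to decide the API server is unreachable
	apiFailureDetection := time.Duration(maxErrorThreshold) * (apiCheckInterval + apiServerTimeout)
	// placeholder for peer-communication overhead, growing with cluster size
	peerOverhead := time.Duration(numNodes) * apiServerTimeout
	// the watchdog must have a chance to fire before peers assume the node rebooted
	return apiFailureDetection + peerOverhead + watchdogTimeout
}

func main() {
	min := illustrativeMinSafeTime(15*time.Second, 5*time.Second, 3, 60*time.Second, 10)
	fmt.Println(min) // changing any input changes the minimum the webhook must enforce
}
```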
/test 4.15-openshift-e2e
/test 4.15-openshift-e2e
/test 4.15-openshift-e2e
/test 4.16-openshift-e2e
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@mshitrit: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/4.12-ci-bundle-self-node-remediation-bundle | 04a6964998e806ac4fda07e921c9a29ea1e5e0b9 | link | true | /test 4.12-ci-bundle-self-node-remediation-bundle |
| ci/prow/4.13-ci-bundle-self-node-remediation-bundle | 04a6964998e806ac4fda07e921c9a29ea1e5e0b9 | link | true | /test 4.13-ci-bundle-self-node-remediation-bundle |
| ci/prow/4.14-ci-bundle-self-node-remediation-bundle | 04a6964998e806ac4fda07e921c9a29ea1e5e0b9 | link | true | /test 4.14-ci-bundle-self-node-remediation-bundle |
| ci/prow/4.15-ci-bundle-self-node-remediation-bundle | 04a6964998e806ac4fda07e921c9a29ea1e5e0b9 | link | true | /test 4.15-ci-bundle-self-node-remediation-bundle |
| ci/prow/4.16-ci-bundle-self-node-remediation-bundle | 04a6964998e806ac4fda07e921c9a29ea1e5e0b9 | link | true | /test 4.16-ci-bundle-self-node-remediation-bundle |
Full PR test history. Your PR dashboard.
closing in favor of #214