self-node-remediation

[WIP] Dynamically set safe time to assume node rebooted seconds

Open mshitrit opened this issue 1 year ago • 19 comments

Why we need this PR

Safe Time to reboot is a configured value; however, it can't be lower than a minimum calculated value. The calculated value may differ between clusters, so we need a mechanism to override the configured Safe Time to reboot in case it's lower than the calculated value. This conflict can cause the agents to crash-loop when the operator is installed, since the agents will not run with an invalid (lower than calculated) Safe Time to reboot value.

Changes made

  • Min time to reboot is calculated when the SNR agent is initialized and is set in an annotation on the configuration.
  • In case the SafeTime in the configuration spec is lower than the calculated value, the value in the configuration is overridden (this can happen when another field that affects the calculation is changed).
  • When SafeTime is changed in the configuration, a webhook is used to verify the new value against the calculated min value that was set in the config annotation (a sketch of this check follows below).
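For illustration, the webhook check in the last bullet could work roughly like the sketch below. This is a minimal sketch only: the annotation key and error wording are assumptions, and it presumes the project's existing `SelfNodeRemediationConfig` type with its `SafeTimeToAssumeNodeRebootedSeconds` spec field.

```go
// Sketch of the validating-webhook check: reject a SafeTime value that is
// lower than the minimum the agents calculated and stored in an annotation
// on the configuration. The annotation key below is an assumption.
package v1alpha1

import (
	"fmt"
	"strconv"
)

const minSafeTimeAnnotation = "self-node-remediation.medik8s.io/min-safe-time-sec" // assumed key

func (c *SelfNodeRemediationConfig) validateSafeTime() error {
	raw, ok := c.GetAnnotations()[minSafeTimeAnnotation]
	if !ok {
		return nil // no calculated minimum recorded yet, nothing to check against
	}
	minSafe, err := strconv.Atoi(raw)
	if err != nil {
		return fmt.Errorf("invalid %s annotation: %v", minSafeTimeAnnotation, err)
	}
	if c.Spec.SafeTimeToAssumeNodeRebootedSeconds < minSafe {
		return fmt.Errorf("safeTimeToAssumeNodeRebootedSeconds (%d) is lower than the calculated minimum (%d)",
			c.Spec.SafeTimeToAssumeNodeRebootedSeconds, minSafe)
	}
	return nil
}
```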

Which issue(s) this PR fixes

ECOPROJECT-1875

Test plan

mshitrit avatar Apr 18 '24 12:04 mshitrit

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

openshift-ci[bot] avatar Apr 18 '24 12:04 openshift-ci[bot]

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mshitrit

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar Apr 18 '24 12:04 openshift-ci[bot]

/test 4.15-openshift-e2e

mshitrit avatar Apr 19 '24 15:04 mshitrit

/test 4.15-openshift-e2e

mshitrit avatar Apr 25 '24 13:04 mshitrit

/lgtm
Giving a chance to get more reviews
/hold

clobrano avatar Apr 28 '24 09:04 clobrano

  • I don't think it's a good idea to set the spec in a controller, even less when it's overwriting a configuration value which can be set by users

I agree it's not ideal; however, we will override the value only in 2 use cases:

  • When SNR starts and the default value would cause the SNR agent to crash; at that stage the user hasn't had a chance to set any value yet, and any value set later by the user will be verified by the webhook.
  • When the user changes another configuration field whose change causes SafeTime to become invalid because the min safe time changed (which would also lead to the SNR agent crashing).
  • Even more, the value we want to set is dynamic; it depends on node-specific values (watchdog timeout) and cluster size. So we can potentially have many agents fighting for the "right" value

In case several agents have different values that make SafeTime invalid, the highest (safest) one will be applied; the annotation used by the webhook may still hold a too-low value, though (we can discuss whether to address this later in case it's relevant).
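A minimal sketch of that "highest value wins" behavior (the helper name and annotation key are hypothetical, the project's `SelfNodeRemediationConfig` type is assumed, and the caller is expected to push the update and retry on conflict):

```go
package controllers

import "strconv"

const minSafeTimeAnnotation = "self-node-remediation.medik8s.io/min-safe-time-sec" // assumed key

// raiseMinSafeTimeAnnotation only ever raises the recorded minimum, never
// lowers it, so the highest (safest) value calculated across all agents ends
// up in the annotation. Returns true if the object needs to be updated.
func raiseMinSafeTimeAnnotation(config *SelfNodeRemediationConfig, calculatedMin int) bool {
	anns := config.GetAnnotations()
	if anns == nil {
		anns = map[string]string{}
	}
	if raw, ok := anns[minSafeTimeAnnotation]; ok {
		if current, err := strconv.Atoi(raw); err == nil && current >= calculatedMin {
			return false // an equal or safer value is already recorded
		}
	}
	anns[minSafeTimeAnnotation] = strconv.Itoa(calculatedMin)
	config.SetAnnotations(anns)
	return true
}
```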

IMO, among the alternatives this is the best solution, but I think it's worth sharing the alternatives in case you disagree or had something else in mind:

  • Removing SafeTime from the Spec (major API change)
  • Removing the default value, so the initial value will always be set dynamically
  • Doing nothing (keeping the crash-loop bug)
  • Setting a fixed higher default value for SafeTime (this would reduce the occurrence of this bug without really fixing it, while increasing overall remediation time for small clusters)

mshitrit avatar Apr 30 '24 10:04 mshitrit

we will override the value only in 2 use cases

and when users set a too-low value, no?

  • Removing SafeTime from the Spec (major API change)
  • Removing the default value, so the initial value will always be set dynamically
  • Doing nothing (keeping the crash-loop bug)
  • Setting a fixed higher default value for SafeTime (this would reduce the occurrence of this bug without really fixing it, while increasing overall remediation time for small clusters)

just do not modify the spec, but use the status instead for reporting problems and the values which are actually used?

Spec is always the DESIRED state. When it can't be reached, modifying the spec isn't the solution.

slintes avatar Apr 30 '24 11:04 slintes

and when users set a too-low value, no?

In this case it'll be rejected by the webhook.

just do not modify the spec, but use the status instead for reporting problems and the values which are actually used?

Hmm, I don't quite follow; maybe you can give an example?

mshitrit avatar Apr 30 '24 15:04 mshitrit

I'm with Marc on this one: operators should not modify the spec. If the value is missing or too low, either:

  1. refuse to start (taking advantage of crash-loop-backoff)
  2. refuse to make progress and report an error via the status
  3. use a calculated value and report a warning via the status

beekhof avatar May 01 '24 01:05 beekhof
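For illustration, options 2 and 3 above boil down to setting a status condition; a minimal sketch, where the condition type and reason are made up:

```go
package controllers

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// reportSafeTimeTooLow sketches option 3: keep running with the calculated
// value, but surface a warning through the status. The "SafeTimeValid"
// condition type and its reason are hypothetical.
func reportSafeTimeTooLow(conditions *[]metav1.Condition, configured, calculated int) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:   "SafeTimeValid",
		Status: metav1.ConditionFalse,
		Reason: "ConfiguredValueTooLow",
		Message: fmt.Sprintf("configured safe time %ds is below the calculated minimum %ds; using %ds instead",
			configured, calculated, calculated),
	})
}
```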

I'm with Marc on this one: operators should not modify the spec. If the value is missing or too low, either:

  1. refuse to start (taking advantage of crash-loop-backoff)
  2. refuse to make progress and report an error via the status
  3. use a calculated value and report a warning via the status

I think option 1 isn't really viable - having the operator crash-loop from the get-go with the default configuration seems to me like an awful user experience.

Numbers 2 and 3 are better in that respect, but still not great IMO.

How about removing the default value of SafeTime and making it optional?

  • In case it's not assigned, we can use the calculated value.
  • In case the user tries to assign an invalid value, we can use the webhook to reject it.
  • In case the user manipulates another field, making minSafeTime higher than SafeTime, we can fall back to option 2 or 3.
  • Same as above for an upgrade use case where the value is already invalid.

mshitrit avatar May 01 '24 06:05 mshitrit
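A minimal sketch of that proposal, assuming a pointer field so "unset" is distinguishable; the field shape and helper are hypothetical, not the project's actual API:

```go
package v1alpha1

// SelfNodeRemediationConfigSpec sketch: a nil SafeTime means "use the value
// the agents calculated".
type SelfNodeRemediationConfigSpec struct {
	// +optional
	SafeTimeToAssumeNodeRebootedSeconds *int `json:"safeTimeToAssumeNodeRebootedSeconds,omitempty"`
	// ... other fields elided
}

// effectiveSafeTime resolves the value the agents should actually use.
func effectiveSafeTime(spec SelfNodeRemediationConfigSpec, calculatedMin int) int {
	if spec.SafeTimeToAssumeNodeRebootedSeconds == nil {
		return calculatedMin // not set: use the calculated value
	}
	if *spec.SafeTimeToAssumeNodeRebootedSeconds < calculatedMin {
		return calculatedMin // became invalid: fall back and report via the status
	}
	return *spec.SafeTimeToAssumeNodeRebootedSeconds
}
```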

No objection, but if you need to fall back to 2 or 3 anyway... why not always use that mechanism?

beekhof avatar May 02 '24 02:05 beekhof

No objection, but if you need to fall back to 2 or 3 anyway... why not always use that mechanism?

I'm using the fallback to cover edge cases, which are rare; using this mechanism for all use cases is a terrible user experience IMO.

For example, let's say that we always use option 2, and consider the following scenario:

  • User installs SNR for the first time without modifying any of the defaults
  • SNR default safe time is 300s
  • For that specific cluster the calculated min safe time is 315s
  • As a result SNR will not work (and will report the problem in the status)

I think that as a user it's a reasonable expectation to have the operator working with the default configuration, which will not be the case here. Best case, the user notices it after installation; worst case, the user assumes that SNR is working. In any case I think it's a bad user experience.

mshitrit avatar May 02 '24 05:05 mshitrit

As a result SNR will not work (and will report the problem in the status)

It can work; just report in the status what you are doing, e.g. that you're using the calculated value. However, I agree that the user experience isn't great when the default value doesn't work...

How about removing the default value of SafeTime and making it optional?

Wfm. Plus always report the calculated value in the status, plus a warning in case it's higher than the spec's value, if that one is filled.

In case several agents have different values that make SafeTime invalid, the highest (safest) one will be applied

That will make fencing slower than needed on nodes with a lower watchdog timeout, but I have no better idea without bigger changes to the process... wfm

slintes avatar May 02 '24 08:05 slintes
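A sketch of what reporting through the status could look like; the field name is illustrative, not the project's actual API:

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// SelfNodeRemediationConfigStatus sketch: expose the calculated minimum in
// the status, where it is visible in `kubectl describe` and gets generated
// API documentation, unlike an annotation.
type SelfNodeRemediationConfigStatus struct {
	// MinSafeTimeToAssumeNodeRebootedSeconds is the minimum safe time derived
	// from node-specific values (watchdog timeout) and cluster size.
	// +optional
	MinSafeTimeToAssumeNodeRebootedSeconds int `json:"minSafeTimeToAssumeNodeRebootedSeconds,omitempty"`

	// Conditions carry warnings, e.g. when the configured spec value is lower
	// than the calculated minimum.
	// +optional
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```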

Plus always report the calculated value in the status

ATM it's stored in an annotation; is that good enough, or do you think it's mandatory to have it in the status?

plus a warning in case it's higher than the spec's value, if that one is filled.

Using an example to clarify, to make sure we mean the same thing:

  • in case the min safe time changed (due to a change of another field that affects it)
  • if the new value is higher than SafeTime
  • then add a warning to the Config status

mshitrit avatar May 02 '24 11:05 mshitrit

do you think it's mandatory to have it in the status

status has better visibility IMHO, and its fields are documented automatically

in case the min safe time changed (due to a change of another field that affects it)

This sounds too complicated and it's unclear to me which field you mean 🤔 I just mean: when the calculated value in the status is bigger than the configured value in the spec, add a warning in the status (besides a log and an event) that the calculated value overrides the configured one. No matter if it's on creation, update, or deletion of any values 🤷🏼‍♂️

slintes avatar May 02 '24 12:05 slintes

when the calculated value in the status is bigger than the configured value in the spec

That was my intention.

This sounds too complicated and it's unclear to me which field you mean

My intention was that the above scenario would occur because of a change to one of the fields that affect the min safe time (e.g. apiCheckInterval, apiServerTimeout, maxErrorThreshold, etc.), since other conditions will be prevented by webhook validation.

mshitrit avatar May 05 '24 11:05 mshitrit
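For context, the calculated minimum depends on exactly those fields; the formula below is only an assumed shape to show the dependency, not the operator's actual arithmetic:

```go
package controllers

// calcMinSafeTimeSeconds sketches the calculation: time to detect an
// unreachable API server, plus peer confirmation scaled by cluster size, plus
// the node's watchdog timeout. The real formula in the operator may differ.
func calcMinSafeTimeSeconds(apiCheckIntervalSec, apiServerTimeoutSec, maxErrorThreshold,
	peerRequestTimeoutSec, numNodes, watchdogTimeoutSec int) int {
	apiFailureDetection := (apiCheckIntervalSec + apiServerTimeoutSec) * maxErrorThreshold
	peerConfirmation := peerRequestTimeoutSec * numNodes
	return apiFailureDetection + peerConfirmation + watchdogTimeoutSec
}
```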

/test 4.15-openshift-e2e

mshitrit avatar May 06 '24 16:05 mshitrit

/test 4.15-openshift-e2e

mshitrit avatar May 06 '24 18:05 mshitrit

/test 4.15-openshift-e2e

mshitrit avatar May 08 '24 13:05 mshitrit

/test 4.16-openshift-e2e

mshitrit avatar May 28 '24 10:05 mshitrit

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-merge-robot avatar Jun 05 '24 21:06 openshift-merge-robot

@mshitrit: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/4.12-ci-bundle-self-node-remediation-bundle | 04a6964998e806ac4fda07e921c9a29ea1e5e0b9 | link | true | /test 4.12-ci-bundle-self-node-remediation-bundle
ci/prow/4.13-ci-bundle-self-node-remediation-bundle | 04a6964998e806ac4fda07e921c9a29ea1e5e0b9 | link | true | /test 4.13-ci-bundle-self-node-remediation-bundle
ci/prow/4.14-ci-bundle-self-node-remediation-bundle | 04a6964998e806ac4fda07e921c9a29ea1e5e0b9 | link | true | /test 4.14-ci-bundle-self-node-remediation-bundle
ci/prow/4.15-ci-bundle-self-node-remediation-bundle | 04a6964998e806ac4fda07e921c9a29ea1e5e0b9 | link | true | /test 4.15-ci-bundle-self-node-remediation-bundle
ci/prow/4.16-ci-bundle-self-node-remediation-bundle | 04a6964998e806ac4fda07e921c9a29ea1e5e0b9 | link | true | /test 4.16-ci-bundle-self-node-remediation-bundle

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Jun 06 '24 10:06 openshift-ci[bot]

closing in favor of #214

mshitrit avatar Jun 24 '24 10:06 mshitrit