
Allow configuring "crash toleration" of the BRUPOP controller

Open mikn opened this issue 7 months ago • 6 comments

Image I'm using: 1.5.0

Issue or Feature Request: We recently pushed a bad release that didn't boot. Despite the first node crashing and the controller registering this, it continued updating the next node, and so on. The controller will keep crashing nodes ad infinitum unless you stop it, which seems unintuitive. I would propose a new configuration setting that lets you set a ceiling on "allowed crashes across the cluster"; once that threshold is reached, the controller would pause further updates. It could be called CRASH_TOLERANCE or similar.

Wouldn't mind implementing this if you find this palatable.

mikn • May 15 '25 08:05

Hello @mikn, I agree that Brupop shouldn't allow this situation to happen. I wanted to clarify: are the nodes coming back up Ready with just pods crashing, or are the nodes themselves crashing? Can you share any logs that showed it detecting the crash and proceeding?

I think a CRASH_TOLERANCE setting seems like a good mechanism to prevent this, and we would be open to a PR that adds this functionality.

yeazelm • Jun 06 '25 16:06

As mentioned, the node didn't even boot on the new image and rebooted back onto the old image. (I didn't capture the logs, but what had happened was that we tried changing the keys used to sign the artefacts to enable Secure Boot, and missed that the shim and grub verify the embedded signatures even with Secure Boot off.) I assume the operator detects a node that reboots onto the new image and comes back up on the old version as a crash?

The crash count in the BRS resource incremented on each node every cycle.

mikn • Jun 06 '25 17:06

Do you have any thoughts on an acceptable default?

mikn • Jun 06 '25 17:06

Do you have any thoughts on an acceptable default?

The only other setting that is similar is MAX_CONCURRENT_UPDATE, which defaults to 1. Setting this one to 1 as well feels like a safe default. It does mean that a single failure will halt updates by default, which also feels like the safest behavior. It might annoy users that a single problematic node can stop an entire cluster from updating, but I feel like that is a better alternative to the behavior you saw in this issue.

yeazelm • Jun 06 '25 18:06

I like this crash toleration idea, though it's not immediately clear to me where the operator will persist this. The controller itself doesn't have a custom resource -- it leans on the "shadow" resources created for each node to manage cluster state. An "untrusted" release should clearly be persisted at the cluster/controller level though, rather than stored on a per-node basis.

In order to "color" a release, we'd need some persistent storage for the controller to keep track of versions that it no longer trusts. I just found this issue about creating a "singleton" custom resource. Perhaps this would work to allow the controller to persist some form of state?
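For illustration, a singleton could be as small as a cluster-scoped CRD defined with kube-rs; the kind, group, version, and field names below are guesses for the sketch, not existing brupop types:

```rust
// Sketch of a cluster-scoped "singleton" resource the controller could use to
// persist releases it no longer trusts. The kind, group, version, and field
// names are hypothetical, not existing brupop types.
use kube::CustomResource;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

#[derive(CustomResource, Clone, Debug, Default, Deserialize, Serialize, JsonSchema)]
#[kube(group = "brupop.bottlerocket.aws", version = "v1", kind = "UpdateTrust")]
pub struct UpdateTrustSpec {
    /// Bottlerocket versions the controller has stopped trusting.
    pub untrusted_versions: Vec<String>,
    /// Crashes observed while rolling out the current target version.
    pub observed_crashes: u32,
}
```

The controller could read and update that single object before deciding whether to move on to the next node, instead of scattering this state across the per-node shadows.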

Regarding the default value, I am a little concerned about being too aggressive in ceasing to tolerate a given release -- I'm not yet convinced that 1 is the right default, since flakiness in existing clusters could effectively prevent those clusters from ever completing an update.

Consider a heterogeneous cluster where some specific node struggles to accept updates because of an unfortunate hardware or software configuration. It would be unfortunate to starve the rest of the cluster of updates on that single node's behalf.

Perhaps expressing the tolerance as a ruleset, where we evaluate the last WINDOW_SIZE hosts and tolerate some percentage of them failing to update, would allow folks to have dynamically tolerant rules, while still making it easy to not tolerate any nonsense by setting WINDOW_SIZE to 1.

If we implemented it this way, I think I'd like to advocate for a default ruleset which is more forgiving than 1 to avoid unintentionally harming existing clusters.
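To make that concrete, here is a rough Rust sketch of how such a windowed rule could be evaluated; all of the names (UpdateOutcome, CrashPolicy, should_continue) are hypothetical, not existing brupop code:

```rust
// Sketch of a windowed tolerance rule. Names are hypothetical, not brupop code.

/// Outcome of the most recent update attempt on a host.
#[derive(Clone, Copy, PartialEq)]
enum UpdateOutcome {
    Succeeded,
    Crashed,
}

/// Tolerance expressed over the last `window_size` hosts that attempted an update.
struct CrashPolicy {
    window_size: usize,
    /// Fraction of the window allowed to crash before updates pause, e.g. 0.25.
    max_crash_fraction: f64,
}

impl CrashPolicy {
    /// Returns true if the controller should keep updating, false if it should pause.
    fn should_continue(&self, outcomes: &[UpdateOutcome]) -> bool {
        // Only look at the most recent `window_size` outcomes
        // (the slice is assumed to be ordered oldest-to-newest).
        let start = outcomes.len().saturating_sub(self.window_size);
        let window = &outcomes[start..];
        if window.is_empty() {
            return true;
        }
        let crashes = window.iter().filter(|&&o| o == UpdateOutcome::Crashed).count();
        (crashes as f64) / (window.len() as f64) <= self.max_crash_fraction
    }
}
```

Setting window_size to 1 with a max_crash_fraction of 0.0 reproduces the strict "halt on the first crash" behavior, while larger windows give the forgiveness described above.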

cbgbt avatar Jun 06 '25 22:06 cbgbt

The implementation I had in mind was just to retrieve all BRS resources and sum the crash counts across them before moving ahead with the next node in the group. Could do it with a filter where crash count != 0 to reduce the load on the K8s state store?
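For illustration, that check could be roughly this small; BrsSummary and should_pause are hypothetical names, and the exact crash-count field on the real BottlerocketShadow status may differ:

```rust
// Sketch of the check described above. BrsSummary and should_pause are
// hypothetical names; the real BottlerocketShadow status field may differ.

/// Minimal per-node summary pulled from a BottlerocketShadow (BRS) resource.
struct BrsSummary {
    crash_count: u32,
}

/// Returns true if the controller should pause before updating the next node.
fn should_pause(shadows: &[BrsSummary], crash_tolerance: u32) -> bool {
    // Sum crash counts across the cluster; filtering out zero counts mirrors the
    // "crash count != 0" optimization suggested above.
    let total: u32 = shadows
        .iter()
        .map(|s| s.crash_count)
        .filter(|&c| c != 0)
        .sum();
    total >= crash_tolerance
}
```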

I was thinking you'd perhaps want a policy as well, but given the simplicity of a count and the fact that it is a tunable that you'd set per cluster anyway, I don't really see a huge upside to being able to express more complex policies.

The theoretical scenario I was thinking of was that a few nodes in the middle of a set have unforeseen state in their settings databases and the migrations fail. In that case you'd probably still want to continue, and a simple count would not let you be very defensive at the beginning of a rollout while still allowing that to happen.

With the implementation I had in mind, it would be very easy to just delete the BRS resources to continue the rollout - something I would like to be able to do as an admin.

If you have more details or thoughts on the policy language (I understand the idea of WINDOW_SIZE, but I'm unsure how it would be reflected in the larger POLICY) and on the bigger value of the singleton CRD (I guess you may need it to track the window?), that would be great.

The alternative to a singleton CRD is to use annotations on the operator deployment, which isn't too bad :)
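A rough sketch of that annotation route with kube-rs, just to show the shape; the annotation key, namespace, and deployment name are illustrative, not necessarily brupop's real values:

```rust
// Sketch of the annotation route with kube-rs. The annotation key, namespace,
// and deployment name are illustrative, not necessarily brupop's real values.
use k8s_openapi::api::apps::v1::Deployment;
use kube::{
    api::{Api, Patch, PatchParams},
    Client,
};
use serde_json::json;

async fn record_untrusted_version(client: Client, version: &str) -> Result<(), kube::Error> {
    let deployments: Api<Deployment> = Api::namespaced(client, "brupop-bottlerocket-aws");
    let patch = json!({
        "metadata": {
            "annotations": {
                // Hypothetical annotation key.
                "brupop.bottlerocket.aws/untrusted-version": version,
            }
        }
    });
    deployments
        .patch(
            "brupop-controller-deployment",
            &PatchParams::default(),
            &Patch::Merge(&patch),
        )
        .await?;
    Ok(())
}
```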

mikn • Jun 07 '25 05:06