bottlerocket-update-operator
Allow configuring the "crash tolerance" of the BRUPOP controller
Image I'm using: 1.5.0
Issue or Feature Request: We recently pushed a bad release that didn't boot. Even though the first node crashed and the controller registered this, it went on to update the next node, and so on. Unless you stop it manually, the controller will keep crashing nodes ad infinitum, which seems unintuitive. I propose a new configuration setting that puts a ceiling on "allowed crashes across the cluster": once that threshold is reached, the controller would pause and perform no further updates.
Could be called CRASH_TOLERANCE or similar.
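To make the idea concrete, here is a rough sketch of the gating logic I have in mind. Everything in it is hypothetical: `CrashGate`, `UpdateDecision`, and the way the value is parsed are names I made up for illustration, not existing BRUPOP code. It assumes the setting is read as an optional string, that leaving it unset keeps today's behavior (no ceiling), and that updates pause once the crash count reaches the configured value.

```rust
use std::num::ParseIntError;

/// Hypothetical outcome of the check the controller would run before
/// starting an update on the next node.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpdateDecision {
    Proceed,
    Paused,
}

/// Hypothetical gate tracking crashed nodes against a configured ceiling.
#[derive(Debug)]
pub struct CrashGate {
    /// `None` means no ceiling (assumed to match the current behavior).
    crash_tolerance: Option<u32>,
    /// Number of nodes observed to have crashed across the cluster.
    crashed_nodes: u32,
}

impl CrashGate {
    pub fn new(crash_tolerance: Option<u32>) -> Self {
        CrashGate {
            crash_tolerance,
            crashed_nodes: 0,
        }
    }

    /// Builds the gate from the raw value of a setting such as
    /// CRASH_TOLERANCE (assumed name). An unset value disables the ceiling;
    /// a value that is not a non-negative integer is reported as an error.
    pub fn from_env_value(raw: Option<&str>) -> Result<Self, ParseIntError> {
        let ceiling = raw.map(|s| s.trim().parse::<u32>()).transpose()?;
        Ok(CrashGate::new(ceiling))
    }

    /// Called whenever the controller registers a crashed node.
    pub fn record_crash(&mut self) {
        self.crashed_nodes = self.crashed_nodes.saturating_add(1);
    }

    /// Pauses once the crash count reaches the ceiling, if one is set.
    pub fn decide(&self) -> UpdateDecision {
        match self.crash_tolerance {
            Some(ceiling) if self.crashed_nodes >= ceiling => UpdateDecision::Paused,
            _ => UpdateDecision::Proceed,
        }
    }
}
```

With a value of `1`, the first registered crash would already pause further updates, which is the behavior we would have wanted with the bad release. Whether the count should reset after a successful update, and how the controller gets un-paused, are open questions I have left out of the sketch.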
Wouldn't mind implementing this if you find this palatable.