
[RFE] Allow for specifying the Flatcar OS version or to extend validation steps

Open ewassef opened this issue 11 months ago • 2 comments

Current situation

The latest version (3815.2.0) had some significant changes with upstream systems that caused outages. The auto-update process validates that the OS update completed successfully, but it doesn't easily allow for subsequent checks (or at least not through k8s). This causes outages, and the only way to fix them is to get into each node, manually roll back to the previous version, and then pause updates.

Impact

This causes outages, and the only way to fix them is to get into each node, manually roll back to the previous version, and then pause updates.

Ideal future situation

Allow for OS version pinning via an annotation.

OR

Allow for a CM with additional scripts that can be used to verify successful updates.

Implementation options

An annotation with a pinned version that is passed via DBus to the update-agent, which then calls flatcar-update

OR

Some mechanism (maybe also over DBus) to send down a script that can update the after-reboot checks and trigger a rollback if they fail. This is not the same as the post-install update hook (https://www.flatcar.org/docs/latest/setup/releases/update-strategies/#configure-a-post-install-update-hook), as this will keep the node in a bad state (although it is a good final catch).

Additional information

ewassef, Mar 07 '24 20:03

Allow for a CM with additional scripts that can be used to verify successful updates

Have you seen the after-reboot checks? Docs here https://github.com/flatcar/flatcar-linux-update-operator/blob/030e43574c229eeb5a8858f03bdcc997f38131d9/doc/before-after-reboot-checks.md and example daemonset here: https://github.com/flatcar/flatcar-linux-update-operator/blob/030e43574c229eeb5a8858f03bdcc997f38131d9/examples/reboot-annotations/after-reboot-daemonset.yaml.

You also have the option of defining a health check on the node level as a systemd service and making it a dependency of update_engine (and kubelet) at the systemd level https://www.flatcar.org/docs/latest/setup/debug/manual-rollbacks/#automated-rollbacks. That way the node automatically performs a rollback when you reboot it from a failed update.

I'm also interested in finding out more about the issues you faced:

The latest version (3815.2.0) had some significant changes with upstream systems that caused outages.

If I recall correctly, you had issues with containerd not launching correctly. Were there others?

Here's an example of what could have worked in this case (you would need to evaluate the level of dependency required, Requires= or BindsTo=). If you defined containerd as a dependency of kubelet and both kubelet and containerd as a dependency of update_engine:

  • update_engine would not mark the update as successful and the node would reboot into the previous Flatcar version on failure
  • kubelet would not start and the node would not show up as having completed the update in FLUO. FLUO would have prevented more nodes from rebooting because it uses a default max of 1 node rebooting at a time.
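
A rough sketch of the dependency chain described above, written as systemd drop-ins. The drop-in file names are made up for illustration, and the unit names (containerd.service, kubelet.service, update-engine.service) are assumptions about how the services are named on your nodes, so verify them with `systemctl list-units` first. Requires= is used here; BindsTo= would be the stricter variant.

```ini
# Hypothetical file: /etc/systemd/system/kubelet.service.d/10-require-containerd.conf
# Assumption: the kubelet runs as kubelet.service on this node.
[Unit]
# kubelet is only started after containerd has started successfully.
Requires=containerd.service
After=containerd.service

# Hypothetical file: /etc/systemd/system/update-engine.service.d/10-require-kubelet.conf
# Assumption: update_engine runs as update-engine.service.
[Unit]
# If either dependency fails to start, update-engine is not started,
# so it cannot mark the freshly booted version as good.
Requires=containerd.service kubelet.service
After=containerd.service kubelet.service
```

Drop-ins added to a running system take effect after `systemctl daemon-reload`. How well this works depends on the dependencies actually ending up in a failed state, so restart settings on those units are worth checking before relying on it.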

jepio, Mar 08 '24 14:03

Thanks for helping out, @jepio. I was going to suggest a custom dependency on the update_engine service too, since as far as I know this is the official mechanism for extending self-update validation on Flatcar. @ewassef and other upvoters, could you try it out and let us know whether it solves your issue?

invidian, Mar 24 '24 16:03