container-linux-update-operator

operator: pause reboots when active alerts are detected

Open lucab opened this issue 8 years ago • 0 comments

Currently, update-operator reboots nodes as soon as updates are available. https://github.com/coreos/container-linux-update-operator/issues/82 tracks adding support for a user-configured maintenance window. On top of that, even inside a maintenance window there may be situations where reboots should be temporarily paused (e.g. while a critical or unplanned outage is in progress).

This can currently be done by setting a reboot-paused annotation on specific nodes; however, this is a manual operation and doesn't scale well cluster-wide.

It would be nice to let CLUO know about any existing AlertManager in the cluster and check for specific active alerts before proceeding. @brancz suggested that we could:

  • take a ConfigMap listing the critical alerts that should pause reboots cluster-wide (and inotify-watch it for hot-reloading)
  • reach the AlertManager (AM) on its in-cluster public read-only endpoint and check for non-silenced critical alerts before setting reboot-ok
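
As a rough illustration of the second point, the check could look something like the sketch below. It is not CLUO code. It assumes the Alertmanager v2 HTTP API (`GET /api/v2/alerts`, where each alert carries `labels` and a `status` with `state` and `silencedBy`), which this issue does not specify. The names `alert`, `parseAlerts`, `shouldPauseReboots` and `fetchAlerts` are hypothetical, and the `watched` set stands in for whatever the ConfigMap would provide.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// alert holds the subset of an Alertmanager alert that this sketch inspects.
// Field names assume the v2 API response shape (an assumption, see above).
type alert struct {
	Labels map[string]string `json:"labels"`
	Status struct {
		State      string   `json:"state"`
		SilencedBy []string `json:"silencedBy"`
	} `json:"status"`
}

// parseAlerts decodes a JSON array of alerts.
func parseAlerts(body []byte) ([]alert, error) {
	var alerts []alert
	if err := json.Unmarshal(body, &alerts); err != nil {
		return nil, fmt.Errorf("decoding alerts: %w", err)
	}
	return alerts, nil
}

// shouldPauseReboots reports whether any active, non-silenced alert has an
// alertname listed in watched (hypothetically loaded from the ConfigMap).
func shouldPauseReboots(alerts []alert, watched map[string]bool) bool {
	for _, a := range alerts {
		if a.Status.State != "active" || len(a.Status.SilencedBy) > 0 {
			continue // skip silenced, inhibited, or unprocessed alerts
		}
		if watched[a.Labels["alertname"]] {
			return true
		}
	}
	return false
}

// fetchAlerts queries an Alertmanager instance at baseURL.
// The path is the assumed v2 endpoint; baseURL is supplied by the caller.
func fetchAlerts(baseURL string) ([]alert, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(baseURL + "/api/v2/alerts")
	if err != nil {
		return nil, fmt.Errorf("querying alertmanager: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("reading response: %w", err)
	}
	return parseAlerts(body)
}
```

The operator would call `fetchAlerts` followed by `shouldPauseReboots` just before marking a node reboot-ok. What to do when the AM is unreachable (pause or proceed) is a policy decision this sketch leaves open.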

For clarity, this should be completely orthogonal to maintenance window configuration.

lucab, Nov 10 '17 15:11