
k8s: implement safe rolling upgrade logic in operator

Open dotnwat opened this issue 4 years ago • 8 comments

The following is the generic upgrade procedure assumed in this document; it can be executed manually or automatically by a process such as a k8s operator:

  1. Wait for healthy cluster state via the health monitor service
  2. Select a non-upgraded node and place into maintenance mode
    • This may take some time to complete
    • If a cluster issue occurs
      • Revert maintenance mode
      • Goto (1)
  3. Once a node is in maintenance mode it may be shutdown
  4. Execute node upgrade process
  5. Restart node
  6. Wait for healthy cluster state via the health monitor service
  7. Take node out of maintenance mode
  8. Goto (1)

Additional notes

  • RFC https://docs.google.com/document/d/1FcHekvM62RYRZqZaIvHbKucIQSGuSU2uuP6RW5r-oko/edit#
  • PRD https://docs.google.com/document/d/140T_4Rg9cfIaTuPNlsEW-nfZ3INiLBHUxFy4-HsB3CY/edit#

dotnwat avatar Nov 19 '21 04:11 dotnwat

@joejulian @dotnwat is this ticket still relevant?

jcsp avatar Nov 03 '22 15:11 jcsp

@joejulian is this covered now in the operator?

dotnwat avatar Nov 16 '22 23:11 dotnwat

It doesn't exactly follow those steps. It doesn't wait for anything external before taking it out of maintenance mode. If "Wait for healthy cluster state" could be that it can call its own admin api, then it's probably close enough.

joejulian avatar Nov 17 '22 03:11 joejulian

It sounds like there is still work to do here: checking cluster health before proceeding with upgrades is important for robustness.

Scenario A: unexpected bug

A hypothetical Redpanda version has a bug that causes it to send RPCs that make its peers fall over. The upgrade procedure upgrades one node, the upgraded node comes up quite happily, but other nodes start crashing. That should be the signal for the operator to stop the upgrade and roll back.

Scenario B: data recovery

The cluster is under write load. When upgrading (restarting) node 1, node 1 naturally falls behind on writes. Nodes 2+3 are still able to service writes. Then node 2 gets restarted while node 1 is still behind. While node 2 is offline, nodes 1+3 can form a quorum, but cannot service acks=-1 writes yet, because node 1 is behind: it can't service new writes at the tip of the log until it is back online. This manifests as a timeout to producers during the upgrade.

jcsp avatar Nov 21 '22 08:11 jcsp

A: rollback

This would have to roll back the state of the Cluster resource.

  • I'm not sure the previous state is actually saved, so this would need to be added.
  • When the Cluster gets reverted to a last-known-good state, how does the reconciler know to override the cluster-health check? (Cluster condition?)
  • What do we do if the previous configuration doesn't fix it? We should probably throw an event and trigger an alert from that event.

A: Press-on!

What if we didn't revert and, instead, pressed on if the first pod to roll came up healthy but the rest of the cluster fell down? Could we just panic flip all the rest of the pods?

joejulian avatar Nov 22 '22 00:11 joejulian

For B: What signal should we be looking for? When node 1 is rolled, comes back, takes itself out of maintenance mode - what signal does Redpanda give that the operator should be checking before moving on to node 2? It does check v1/status/ready. Is that sufficient?

joejulian avatar Nov 22 '22 01:11 joejulian

> For B: What signal should we be looking for? When node 1 is rolled, comes back, takes itself out of maintenance mode - what signal does Redpanda give that the operator should be checking before moving on to node 2? It does check v1/status/ready. Is that sufficient?

The node readiness endpoint is not sufficient. /v1/status/ready is only telling you that the node you touched is up (it's internally just a bool that gets set after the node opens its kafka listener). For a safe upgrade, the essential check is that the overall cluster health is good: this includes things like:

  • Are the other nodes up? (i.e. did something fall over as in scenario A)
  • Are any partitions behind on replication? (i.e. do we need to wait to avoid scenario B)

/v1/cluster/health_overview is what gives you that cluster-wide status. It's not perfect (it's always possible for something to go wrong between the health GET and the actual upgrade), but it gives an excellent chance of backing off if something has gone dramatically wrong. Currently the main things it reports on are whether any nodes are down and whether any partitions are leaderless, but it will be the place in the future that we can extend to give that strong "scenario B" check that data replication is up to date.

@dotnwat please keep me honest: does this line up with recent discussions on local disk storage etc?

jcsp avatar Nov 22 '22 09:11 jcsp

A: Press-on!

I don't recommend this. On a major version upgrade, if you give up on a rolling upgrade and flash forward to updating all the nodes, then new feature flags will activate, and the cluster will start writing new-format data to disk. At this point the door slams shut for rolling back.

jcsp avatar Nov 22 '22 09:11 jcsp

> @dotnwat please keep me honest: does this line up with recent discussions on local disk storage etc?

@jcsp yes. /v1/status/ready is certainly not sufficient. in fact, we didn't have a single endpoint that would be sufficient for the ephemeral disk scenario, so the proposal was a function of a couple of endpoints until core could enhance the existing health endpoint to be sufficient. this was written down somewhere in the context of the tt-local-disk channel on slack. i can't seem to find it right now, but I will look in the AM.

dotnwat avatar Nov 23 '22 05:11 dotnwat

It's done by:

  • https://github.com/redpanda-data/redpanda/pull/7530
  • https://github.com/redpanda-data/redpanda/pull/7594
  • https://github.com/redpanda-data/redpanda/pull/7528
  • and the previous implementation of the rolling update/upgrade

RafalKorepta avatar Jan 08 '23 18:01 RafalKorepta