
k8s: implement safe rolling upgrade logic in operator

Open dotnwat opened this issue 4 years ago • 8 comments

The following is the generic upgrade procedure assumed in this document; it can be executed manually or automatically by a process such as a k8s operator:

  1. Wait for healthy cluster state via the health monitor service
  2. Select a non-upgraded node and place into maintenance mode
    • This may take some time to complete
    • If a cluster issue occurs
      • Revert maintenance mode
      • Goto (1)
  3. Once a node is in maintenance mode it may be shutdown
  4. Execute node upgrade process
  5. Restart node
  6. Wait for healthy cluster state via the health monitor service
  7. Take node out of maintenance mode
  8. Goto (1)

Additional notes

  • RFC https://docs.google.com/document/d/1FcHekvM62RYRZqZaIvHbKucIQSGuSU2uuP6RW5r-oko/edit#
  • PRD https://docs.google.com/document/d/140T_4Rg9cfIaTuPNlsEW-nfZ3INiLBHUxFy4-HsB3CY/edit#

dotnwat avatar Nov 19 '21 04:11 dotnwat

@joejulian @dotnwat is this ticket still relevant?

jcsp avatar Nov 03 '22 15:11 jcsp

@joejulian is this covered now in the operator?

dotnwat avatar Nov 16 '22 23:11 dotnwat

It doesn't exactly follow those steps. It doesn't wait for anything external before taking it out of maintenance mode. If "Wait for healthy cluster state" could be that it can call its own admin api, then it's probably close enough.

joejulian avatar Nov 17 '22 03:11 joejulian

It sounds like there is still work to do here: checking cluster health before proceeding with upgrades is important for robustness.

Scenario A: unexpected bug

A hypothetical Redpanda version has a bug that causes it to send RPCs that make its peers fall over. The upgrade procedure upgrades one node, the upgraded node comes up quite happily, but other nodes start crashing. That should be the signal for the operator to stop the upgrade and roll back.

Scenario B: data recovery

The cluster is under write load. When upgrading (restarting) node 1, node 1 naturally falls behind on writes. Nodes 2+3 are still able to service writes. Then node 2 gets restarted while node 1 is still behind. While node 2 is offline, nodes 1+3 can form a quorum, but cannot service acks=-1 writes yet, because node 1 is behind: it can't service new writes at the tip of the log until it is back online. This manifests as a timeout to producers during the upgrade.

jcsp avatar Nov 21 '22 08:11 jcsp

A: rollback

This would have to roll back the state of the Cluster resource.

  • I'm not sure the previous state is actually saved, so this would need to be added.
  • When the Cluster gets reverted to a last-known-good state, how does the reconciler know to override the cluster-health check? (Cluster condition?)
  • What do we do if the previous configuration doesn't fix it? We should probably throw an event and trigger an alert from that event.

A: Press-on!

What if we didn't revert and, instead, pressed on if the first pod to roll came up healthy but the rest of the cluster fell down? Could we just panic flip all the rest of the pods?

joejulian avatar Nov 22 '22 00:11 joejulian

For B: What signal should we be looking for? When node 1 is rolled, comes back, takes itself out of maintenance mode - what signal does Redpanda give that the operator should be checking before moving on to node 2? It does check v1/status/ready. Is that sufficient?

joejulian avatar Nov 22 '22 01:11 joejulian

> For B: What signal should we be looking for? When node 1 is rolled, comes back, takes itself out of maintenance mode - what signal does Redpanda give that the operator should be checking before moving on to node 2? It does check v1/status/ready. Is that sufficient?

The node readiness endpoint is not sufficient. /v1/status/ready is only telling you that the node you touched is up (it's internally just a bool that gets set after the node opens its kafka listener). For a safe upgrade, the essential check is that the overall cluster health is good: this includes things like:

  • Are the other nodes up? (i.e. did something fall over as in scenario A)
  • Are any partitions behind on replication? (i.e. do we need to wait to avoid scenario B)

/v1/cluster/health_overview is what gives you that cluster-wide status. It's not perfect (it's always possible for something to go wrong between the health GET and the actual upgrade), but it gives an excellent chance of backing off if something has gone dramatically wrong. Currently the main things it reports on are whether any nodes are down and whether any partitions are leaderless, but it will be the place in the future that we can extend to give that strong "scenario B" check that data replication is up to date.

@dotnwat please keep me honest: does this line up with recent discussions on local disk storage etc?

jcsp avatar Nov 22 '22 09:11 jcsp

A: Press-on!

I don't recommend this. On a major version upgrade, if you give up on a rolling upgrade and flash forward to updating all the nodes, then new feature flags will activate, and the cluster will start writing new-format data to disk. At this point the door slams shut for rolling back.

jcsp avatar Nov 22 '22 09:11 jcsp

> @dotnwat please keep me honest: does this line up with recent discussions on local disk storage etc?

@jcsp yes. /v1/status/ready is certainly not sufficient. in fact, we didn't have a single endpoint that would be sufficient for the ephemeral disk scenario, so the proposal was a function of a couple of endpoints until core could enhance the existing health endpoint to be sufficient. this was written down somewhere in the context of the tt-local-disk channel on slack. i can't seem to find it right now, but I will look in the AM.

dotnwat avatar Nov 23 '22 05:11 dotnwat

It's done by:

  • https://github.com/redpanda-data/redpanda/pull/7530
  • https://github.com/redpanda-data/redpanda/pull/7594
  • https://github.com/redpanda-data/redpanda/pull/7528
  • and the previous implementation of the rolling update/upgrade

RafalKorepta avatar Jan 08 '23 18:01 RafalKorepta