Version upgrade might not finish if v6 nodes leave the cluster
When a network partition isolates the upgraded nodes, they can elect a new master and complete the switch to the new coordination mechanism, and the non-upgraded nodes can no longer rejoin the cluster. The cluster might never become green after that. Since a network partition hitting exactly during a version upgrade is probably a rare event, we might consider this something to be resolved manually.
The following corner case can happen when upgrading a cluster from v6 to v7:
- 3/5 nodes are upgraded
- the 2 remaining v6 nodes are disconnected from the cluster for whatever reason (e.g. blame the network)
- as a result, a node in v7 becomes the new master
- the cluster is now using zen2, and not zen1
- the 2 v6 nodes will attempt to join the cluster again when they're back online, but will fail
- the rolling upgrade will be stuck at this point, since the cluster may never become green (one of our pre-conditions to move on with the upgrade; see the sketch after this list)
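
For illustration, here is a minimal sketch of what that green-health pre-condition check can look like, calling the `_cluster/health` API directly. `isClusterGreen` and `esURL` are names chosen here for the example and are not the operator's actual code:

```go
// Minimal sketch of a "cluster must be green" pre-condition check,
// assuming the Elasticsearch HTTP endpoint is reachable at esURL.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type clusterHealth struct {
	Status string `json:"status"` // "green", "yellow" or "red"
}

func isClusterGreen(esURL string) (bool, error) {
	resp, err := http.Get(esURL + "/_cluster/health")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var health clusterHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		return false, err
	}
	return health.Status == "green", nil
}

func main() {
	green, err := isClusterGreen("http://localhost:9200")
	if err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	fmt.Println("cluster is green:", green)
}
```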
To handle that case, we probably still need to force-upgrade the remaining v6 nodes if we detect that the cluster is already using zen2.
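
One possible heuristic for that detection is to check whether the elected master is already running 7.x, at which point zen2 is in charge. A minimal sketch against the `_cat/nodes` API; `masterIsV7` is a hypothetical helper, not the operator's implementation:

```go
// Sketch of a zen2 detection heuristic: check whether the elected master
// already runs 7.x, using the _cat/nodes API.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

type catNode struct {
	Name    string `json:"name"`
	Version string `json:"version"`
	Master  string `json:"master"` // "*" marks the elected master
}

func masterIsV7(esURL string) (bool, error) {
	resp, err := http.Get(esURL + "/_cat/nodes?format=json&h=name,version,master")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var nodes []catNode
	if err := json.NewDecoder(resp.Body).Decode(&nodes); err != nil {
		return false, err
	}
	for _, n := range nodes {
		if n.Master == "*" {
			return strings.HasPrefix(n.Version, "7."), nil
		}
	}
	return false, fmt.Errorf("no elected master found")
}

func main() {
	v7, err := masterIsV7("http://localhost:9200")
	if err != nil {
		fmt.Println("detection failed:", err)
		return
	}
	fmt.Println("master is on 7.x (zen2 in charge):", v7)
}
```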
See the original discussion in https://github.com/elastic/cloud-on-k8s/issues/1281#issuecomment-524768463.
The current workaround, if this is encountered, is to manually delete the remaining v6 pods so that they are recreated automatically with the latest revision (v7).
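
For completeness, an automated equivalent of that manual deletion, sketched with client-go; the namespace and pod names are placeholders and would need to match the actual remaining v6 pods in your cluster:

```go
// Sketch of the workaround: delete the remaining v6 pods so the
// StatefulSet controller recreates them with the latest (v7) revision.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	namespace := "default"                                                  // placeholder
	v6Pods := []string{"quickstart-es-default-3", "quickstart-es-default-4"} // placeholders

	for _, name := range v6Pods {
		err := clientset.CoreV1().Pods(namespace).Delete(context.TODO(), name, metav1.DeleteOptions{})
		if err != nil {
			fmt.Printf("failed to delete %s: %v\n", name, err)
			continue
		}
		fmt.Printf("deleted %s; it will be recreated with the latest revision\n", name)
	}
}
```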
Closing this, as it is not a bug in the ECK operator but a limitation of how Elasticsearch upgrades work, and there is no automated remediation for the described situation.