Version upgrade might not finish if v6 nodes leave the cluster
When a network partition isolates the upgraded nodes, they can elect a new master and complete the switch to the new coordination mechanism, and the non-upgraded nodes can no longer rejoin the cluster. The cluster might never become green after that. Since a network partition hitting exactly during a version upgrade is probably a rare event, we might consider this something to be resolved manually.
The following corner case can happen when upgrading a cluster from v6 to v7:
- 3/5 nodes are upgraded
- the 2 remaining v6 nodes are disconnected from the cluster for whatever reason (e.g. blame the network)
- as a result, a node in v7 becomes the new master
- the cluster is now using zen2, and not zen1
- the 2 v6 nodes will attempt to join the cluster again when they're back online, but will fail
- the rolling upgrade will be stuck at this point, since the cluster may never become green (one of our pre-conditions to move on with the upgrade; see the sketch after this list)
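
For illustration, here is a minimal sketch of what that green-health pre-condition check can look like, calling the `_cluster/health` API directly. `isClusterGreen` and `esURL` are names chosen here for the example and are not the operator's actual code:

```go
// Minimal sketch of a "cluster must be green" pre-condition check,
// assuming the Elasticsearch HTTP endpoint is reachable at esURL.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type clusterHealth struct {
	Status string `json:"status"` // "green", "yellow" or "red"
}

func isClusterGreen(esURL string) (bool, error) {
	resp, err := http.Get(esURL + "/_cluster/health")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var health clusterHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		return false, err
	}
	return health.Status == "green", nil
}

func main() {
	green, err := isClusterGreen("http://localhost:9200")
	if err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	fmt.Println("cluster is green:", green)
}
```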
To handle that case, we probably still need to force-upgrade the remaining v6 nodes if we detect that the cluster is already using zen2.
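
One possible heuristic for that detection is to check whether the elected master is already running 7.x, at which point zen2 is in charge. A minimal sketch against the `_cat/nodes` API; `masterIsV7` is a hypothetical helper, not the operator's implementation:

```go
// Sketch of a zen2 detection heuristic: check whether the elected master
// already runs 7.x, using the _cat/nodes API.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

type catNode struct {
	Name    string `json:"name"`
	Version string `json:"version"`
	Master  string `json:"master"` // "*" marks the elected master
}

func masterIsV7(esURL string) (bool, error) {
	resp, err := http.Get(esURL + "/_cat/nodes?format=json&h=name,version,master")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var nodes []catNode
	if err := json.NewDecoder(resp.Body).Decode(&nodes); err != nil {
		return false, err
	}
	for _, n := range nodes {
		if n.Master == "*" {
			return strings.HasPrefix(n.Version, "7."), nil
		}
	}
	return false, fmt.Errorf("no elected master found")
}

func main() {
	v7, err := masterIsV7("http://localhost:9200")
	if err != nil {
		fmt.Println("detection failed:", err)
		return
	}
	fmt.Println("master is on 7.x (zen2 in charge):", v7)
}
```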
See the original discussion in https://github.com/elastic/cloud-on-k8s/issues/1281#issuecomment-524768463.
The current workaround, if this is encountered, is to manually delete the remaining v6 pods so that they are recreated automatically with the latest revision (v7).
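
For completeness, an automated equivalent of that manual deletion, sketched with client-go; the namespace and pod names are placeholders and would need to match the actual remaining v6 pods in your cluster:

```go
// Sketch of the workaround: delete the remaining v6 pods so the
// StatefulSet controller recreates them with the latest (v7) revision.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	namespace := "default"                                                  // placeholder
	v6Pods := []string{"quickstart-es-default-3", "quickstart-es-default-4"} // placeholders

	for _, name := range v6Pods {
		err := clientset.CoreV1().Pods(namespace).Delete(context.TODO(), name, metav1.DeleteOptions{})
		if err != nil {
			fmt.Printf("failed to delete %s: %v\n", name, err)
			continue
		}
		fmt.Printf("deleted %s; it will be recreated with the latest revision\n", name)
	}
}
```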
Closing this, as it is not a bug in the ECK operator but a limitation of how Elasticsearch upgrades work, and there is no automated remediation for the described situation.