cloud-on-k8s
Support smoother k8s nodes rotation when using local volumes
When using local volumes, it can be quite complicated to handle Kubernetes node upgrades. One common way to upgrade a k8s node is to take it out of the cluster and replace it with a fresh new one, in which case the local volume is lost and the corresponding Elasticsearch Pod stays Pending forever.
When that happens, the only way out is to manually remove both Pod and PVC, so a new Pod gets created with a new volume.
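For reference, the manual workaround looks roughly like this (the Pod and PVC names are hypothetical, following the default ECK naming scheme):

```sh
# Hypothetical names for a cluster named "quickstart" with a "default" nodeSet.
# Delete the PVC first (it stays Terminating while the Pod still uses it),
# then delete the Pending Pod; the StatefulSet recreates both Pod and PVC.
kubectl delete pvc elasticsearch-data-quickstart-es-default-0 --wait=false
kubectl delete pod quickstart-es-default-0
```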
Ideally, to simplify this, we would like to:
- migrate data away from the ES node that will be removed, e.g. with shard allocation filtering as sketched after this list (the k8s node is probably being drained at the k8s level already)
- once that node is removed, and the corresponding Pod becomes Pending, ECK would delete both Pod and PVC so they are recreated elsewhere
- this is a mode of operation the user would probably have to indicate somewhere (in the Elasticsearch spec?). Doing it automatically feels complicated (how long should we wait? will the node come back?) and dangerous.
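A minimal sketch of the data migration step, assuming shard allocation filtering is used and the Elasticsearch HTTP endpoint is reachable locally (the node name and credentials are illustrative):

```sh
# Exclude the ES node running on the k8s node about to be removed, so that
# shards are moved to the remaining nodes before the local volume disappears.
curl -k -u "elastic:$PASSWORD" -X PUT "https://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.exclude._name": "cluster-es-default-0"
  }
}'
```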
Related discuss issue: https://discuss.elastic.co/t/does-eck-support-local-persistent-disks-and-is-it-a-good-idea/223515/3
I am not sure whether this is helpful at all because it's at such a high level, but I think the ECK operator could watch the ES data nodes and their corresponding Kubernetes nodes.
The moment, say, data-node-0 is no longer scheduled to k8s-node-abc but to another node (for whatever reason), you can assume that this Elastic node has lost its data. If that is the case, the ECK operator can delete/recreate the PVC so that the Pod is no longer Pending.
Does that make sense or am I missing something?
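A rough sketch of that check, done outside the operator with kubectl (the cluster label value, the volume name, and the local PV nodeAffinity layout are assumptions based on a typical ECK/local-volume setup):

```sh
# For each Pending ES Pod, look up the k8s node its local PV is pinned to;
# if that node no longer exists, the data is gone and the PVC could be deleted.
for pod in $(kubectl get pods -l elasticsearch.k8s.elastic.co/cluster-name=cluster \
    --field-selector=status.phase=Pending -o name); do
  pvc=$(kubectl get "$pod" -o jsonpath='{.spec.volumes[?(@.name=="elasticsearch-data")].persistentVolumeClaim.claimName}')
  pv=$(kubectl get pvc "$pvc" -o jsonpath='{.spec.volumeName}')
  node=$(kubectl get pv "$pv" -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}')
  kubectl get node "$node" >/dev/null 2>&1 || echo "node $node is gone: PVC $pvc can be deleted"
done
```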
this simple script seems to work, but it has not been proven in prod:
# stop new Pods from being scheduled on the node
kubectl cordon k8s-node-abc
# delete the PVC bound to the local volume on that node (name truncated here)
kubectl delete pvc -es-xxx --force --grace-period=0
# evict the remaining Pods, discarding local data
kubectl drain k8s-node-abc --delete-local-data --ignore-daemonsets
# make the node schedulable again once it has been upgraded or replaced
kubectl uncordon k8s-node-abc
Relates to https://github.com/elastic/cloud-on-k8s/issues/2448.
We've run into this exact issue twice now. When we try to upgrade the k8s version in our node pool, we lose all our data and the cluster ends up in a completely broken state.
I don't know how it works with other providers, but I can speak for GKE. We have a cluster with 3 nodes and an index with 2 shards and 1 replica per shard.
What I believe happens is the following:
- GKE initiates a node pool version upgrade
- A node is drained and its pod is deleted along with local data
- A new node spins up with a new pod
- The new pod starts receiving data from replica shards stored on the other nodes
- GKE respects the Pod disruption budget of max 1 unavailable Pod, but only for up to 1 hour; after that it continues the upgrade with the next node ("Note: During automatic or manual node upgrades, PDBs are respected for a maximum of 1 hour. If Pods running on a node cannot be scheduled onto new nodes within 1 hour, the upgrade is initiated, regardless.", from here)
- The next node is drained (now 2/3 of the nodes are unhealthy) and everything is a mess
The logs from GKE show that almost exactly one hour passes between each node teardown.
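One way to see whether an hour is actually enough is to watch shard recovery while the upgrade is running; a sketch, assuming the ECK default HTTP service name and the elastic user (both illustrative here):

```sh
# Forward the ES HTTP service locally, then watch cluster health and the
# recoveries that are still in flight while GKE waits on the PDB.
kubectl port-forward service/cluster-es-http 9200 &
curl -k -u "elastic:$PASSWORD" "https://localhost:9200/_cluster/health?pretty"
curl -k -u "elastic:$PASSWORD" "https://localhost:9200/_cat/recovery?active_only=true&v"
```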
Hello,
Here's our approach to upgrading k8s on local-storage node groups:
- create an upgraded k8s node group with a dedicated label (let's say group: beta)
- change the name and the nodeSelector label of the nodeSet to upgrade, and patch the updateStrategy as follows:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: cluster
spec:
  version: 8.3.3
  # Add before removing, ensuring no data is ever lost
  updateStrategy:
    changeBudget:
      maxSurge: 1
      maxUnavailable: 0
  nodeSets:
    # Change name to beta (required)
    - name: alpha
      count: 2
      podTemplate:
        spec:
          # Pin the Pods to one node group
          # Change to beta
          nodeSelector:
            group: alpha
- delete the old node group once all shards have been migrated (a quick way to check this is sketched below), and revert the updateStrategy
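A possible check before deleting the old node group, assuming the default ECK HTTP service name and that the old Pods still carry alpha in their names (both assumptions):

```sh
# List shard counts and disk usage per ES node; no "alpha" nodes should hold
# data anymore before the old node group is deleted.
kubectl port-forward service/cluster-es-http 9200 &
curl -k -u "elastic:$PASSWORD" "https://localhost:9200/_cat/allocation?v" | grep alpha
```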