etcd-operator

Reuse the data volume on pod failure.

Open kghost opened this issue 6 years ago • 0 comments

Problem

Currently, when a pod fails, the operator creates a new pod, lets the new node join the cluster, and syncs all data to the new node.

But this is fragile: in a 3-node cluster, if another node fails while the new node is still joining and syncing, the cluster loses quorum and becomes unrecoverable.

We have encountered this problem under heavy load: the master node may stop working for a short period, and when that happens there is a chance that the pod will be killed by the k8s liveness check. If 2 nodes are killed back to back, the cluster won't recover.
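To make the failure mode concrete, here is a minimal sketch of the majority arithmetic that etcd (via Raft) relies on. The function names `quorum` and `can_serve` are hypothetical and exist only for illustration; they are not part of etcd or etcd-operator.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def can_serve(members: int, healthy: int) -> bool:
    """True if enough members are healthy to commit writes.

    Membership changes (adding a replacement member) are writes too,
    so they also need a majority.
    """
    return healthy >= quorum(members)


if __name__ == "__main__":
    # 3-node cluster, one pod killed: 2 of 3 are left, still a majority.
    print(can_serve(3, 2))
    # Second pod killed before the replacement has caught up: 1 of 3.
    print(can_serve(3, 1))
```

With 3 members the majority is 2, so one failure is tolerated but a second one during recovery is not. Once the majority is gone, a replacement member cannot be added through the normal API either, which matches the "won't recover" behaviour described above.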

Possible solution: restartPolicy of K8S

Using restartPolicy would restart the container when etcd fails. I know there are some problems to solve when using restartPolicy.

When the container is restarted by k8s, it uses the original arguments, but restarting the node requires different arguments. Maybe we can solve the problem by using the discovery service. I don't know if the discovery service still works after the cluster has been negotiated. But it is pretty certain that the operator must change the discovery key when adding/removing nodes.
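One way to sidestep the "different arguments" problem, sketched below, is a small wrapper that picks the arguments at start time based on whether the data directory already holds member data. This is only an illustration under assumptions: `build_etcd_args` is a hypothetical helper, not something etcd-operator provides, it assumes the data directory survives the container restart, and it leaves out the peer/client URL flags that a real command line would also need.

```python
import os


def build_etcd_args(name, data_dir, initial_cluster, cluster_state):
    """Return an etcd command line for a first start or a restart.

    Hypothetical helper for illustration only.
    cluster_state is "new" or "existing", as accepted by
    etcd's --initial-cluster-state flag.
    """
    args = ["etcd", "--name", name, "--data-dir", data_dir]
    # etcd keeps its WAL and snapshots under <data-dir>/member.
    has_member_data = os.path.isdir(os.path.join(data_dir, "member"))
    if not has_member_data:
        # First start: bootstrap flags are needed to form or join a cluster.
        args += [
            "--initial-cluster", initial_cluster,
            "--initial-cluster-state", cluster_state,
        ]
    # On a restart with existing data the member already knows its
    # cluster, so the bootstrap flags are left out.
    return args
```

The wrapper would be called from the container entrypoint, so the pod spec itself never has to change between the first start and later restarts.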

Possible solution: use PV

Sometimes a PV is slow, and the extra latency may reduce performance greatly. So I list it as a solution with some trade-offs.
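To judge whether a given PV is too slow, one rough check is to time small synchronous writes on the mounted volume, since etcd syncs its write-ahead log to disk before acknowledging writes. The sketch below is an assumption-laden probe, not a benchmark: `fsync_latencies` and `percentile` are hypothetical helpers, and the sample count and payload size are arbitrary.

```python
import os
import tempfile
import time


def fsync_latencies(directory, samples=50, payload_size=2048):
    """Time os.fsync after small appends to a temp file in `directory`.

    Returns the measured latencies in seconds, sorted ascending.
    """
    payload = b"\0" * payload_size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the write down to the storage device
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)
    return sorted(latencies)


def percentile(sorted_values, fraction):
    """Pick the value at the given fraction of an ascending list."""
    index = min(len(sorted_values) - 1, int(fraction * len(sorted_values)))
    return sorted_values[index]
```

Running `percentile(fsync_latencies("/path/to/pv/mount"), 0.99)` on both the PV mount and local disk (the path here is a placeholder) gives a first impression of how much latency the PV adds.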

kghost • Aug 08 '19 12:08