
[bitnami/etcd] member can't join cluster


Name and Version

bitnami/etcd 8.8.0

What architecture are you using?

amd64

What steps will reproduce the bug?

We appear to have run into this issue.

There's not much in the way of good information in that bug, so in an attempt to work around the issue, I deleted the node & wiped its PVC/PV; I figured a clean slate should do it, and it could replicate from the other two.

Now, it cannot start with:

etcd-0 etcd etcd 18:06:20.81 INFO  ==> Member ID wasn't properly stored, the member will try to join the cluster by it's own
[… later …]
etcd-0 etcd {"level":"warn","ts":"2023-04-14T18:06:23.387Z","caller":"etcdserver/server.go:1127","msg":"server error","error":"the member has been permanently removed from the cluster"}

This mirrors this issue. (Also supposedly fixed.)

How can I get this etcd member to rejoin the cluster?
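
For reference, the manual recovery I'd expect to need looks roughly like this (an untested sketch, not the chart's documented procedure; it assumes the release is named etcd in namespace dev, RBAC root auth, and the root password in $ETCD_ROOT_PASSWORD):

# Run from a healthy member, e.g. etcd-1: find the stale registration for the
# broken member, remove it, then re-add it so it can rejoin with an empty data dir.
export ETCDCTL_API=3
ENDPOINT=http://etcd-1.etcd-headless.dev.svc.cluster.local:2379

etcdctl --user root:"$ETCD_ROOT_PASSWORD" --endpoints="$ENDPOINT" member list

# <old-member-id> is taken from the member list output above.
etcdctl --user root:"$ETCD_ROOT_PASSWORD" --endpoints="$ENDPOINT" \
  member remove <old-member-id>

etcdctl --user root:"$ETCD_ROOT_PASSWORD" --endpoints="$ENDPOINT" \
  member add etcd-0 \
  --peer-urls=http://etcd-0.etcd-headless.dev.svc.cluster.local:2380

# Finally, wipe the broken member's PVC and delete the pod so it restarts with
# an empty data dir and ETCD_INITIAL_CLUSTER_STATE=existing.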

Are you using any custom parameters or values?

No response

What is the expected behavior?

Brand-new StatefulSet replicas should just join the existing cluster.

What do you see instead?

They cannot, as they were "permanently removed".

Additional information

No response

roy-work avatar Apr 14 '23 18:04 roy-work

Hi,

Did you change any of the default values? Which is the Kubernetes platform you are using to deploy the solution?

javsalgar avatar Apr 17 '23 07:04 javsalgar

A few:

replicaCount: 3
metrics:
  enabled: true
persistence:
  enabled: true
  size: 1Gi
auth:
  rbac:
    allowNoneAuthentication: false
    existingSecret: "etcd-password"
    existingSecretPasswordKey: "etcd-password"
  token:
    type: simple

This is deployed on GKE.

roy-work avatar Apr 17 '23 14:04 roy-work

I've managed to do this yet again, with a different instance of this chart.

I think what's happening here is that whatever logic this chart uses to decide whether to init a new cluster or to join an existing one is buggy. At the moment, I cannot tell what logic it uses to make that decision. The lack of any manual steps on my part, or a ConfigMap to hold that state, seems like a smell.

There also seems to be logic to attempt membership changes on pod termination, which also smells like a bug.
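
To make the suspicion concrete, here's the rough pattern I think is in play (purely illustrative; the path, file name, and function names are my assumptions, not the chart's actual scripts):

# Illustrative sketch only; names and paths are guesses, not the chart's code.
DATA_DIR=/bitnami/etcd/data

decide_startup_mode() {
    if [ -f "$DATA_DIR/member_id" ]; then
        echo "member ID found: restart as an existing member"
    else
        echo "no member ID stored: try to add ourselves to the existing cluster"
    fi
}

prestop_remove_member() {
    # If something like this runs on every termination (evictions included),
    # the member gets deregistered, and the restarted pod then hits the
    # "permanently removed from the cluster" error.
    etcdctl member remove "$(cat "$DATA_DIR/member_id")"
}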

But in the second instance of this, here is what I've got on etcd-0 (which is crashlooping, so I've manually run etcd there):

{"level":"info","ts":"2023-04-24T22:40:46.430Z","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"bfe80d27d57bd882","local-member-id":"76b45053198ab5ad","recovered-remote-peer-id":"76b45053198ab5ad","recovered-remote-peer-urls":["http://etcd-0.etcd-headless.dev.svc.cluster.local:2380"]}

And in etcd-1:

{"level":"info","ts":"2023-04-24T21:25:02.734Z","caller":"etcdserver/raft.go:529","msg":"restarting local member","cluster-id":"4dc26cfc1861673f","local-member-id":"e996d86b4a53601d","commit-index":4518}

And in etcd-2:

{"level":"info","ts":"2023-04-24T21:25:16.824Z","caller":"etcdserver/raft.go:529","msg":"restarting local member","cluster-id":"4dc26cfc1861673f","local-member-id":"9b369686ec9d586c","commit-index":4518}

My guess here, then, is that etcd-0 is crashlooping / can't join because it's simply a "different" cluster (note the differing cluster-id values above), and safeguards in etcd prevent it from joining.

I need to see if the first instance of this (above) is the same way, but if it is … well, that's the problem.

roy-work avatar Apr 24 '23 23:04 roy-work

Eugh, the theory only works out for my second case. In my first case, all the nodes claim to be in the same cluster, yet the member cannot join (due to that "permanently removed" error).

roy-work avatar Apr 26 '23 00:04 roy-work

Hi @roy-work, sorry for my very late response.

I am trying to reproduce your issue. I'll share the steps I followed:

$ # Create a local cluster with 3 nodes.
$ k3d cluster create --agents 3
...
# Install etcd setting initialClusterState to new, as mentioned in https://github.com/bitnami/charts/issues/6251
$ helm install etcd bitnami/etcd --version 8.8.0 --set replicaCount=3 --set initialClusterState=new
NAME: etcd
LAST DEPLOYED: Wed May  3 14:15:59 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
CHART NAME: etcd
CHART VERSION: 8.8.0
APP VERSION: 3.5.7
...
$ kubectl get po -o wide
NAME          READY   STATUS    RESTARTS   AGE   IP           NODE                       NOMINATED NODE   READINESS GATES
etcd-1   1/1     Running   0          88s   10.42.3.9    k3d-k3s-default-agent-1    <none>           <none>
etcd-0   1/1     Running   0          88s   10.42.2.8    k3d-k3s-default-agent-0    <none>           <none>
etcd-2   1/1     Running   0          88s   10.42.0.11   k3d-k3s-default-server-0   <none>           <none>
$ # Drain node k3d-k3s-default-agent-1  
$ k drain k3d-k3s-default-agent-1 --ignore-daemonsets --delete-emptydir-data
node/k3d-k3s-default-agent-1 already cordoned
Warning: ignoring DaemonSet-managed Pods: kube-system/svclb-traefik-8e0fdab3-hwgcn
evicting pod default/etcd-1
pod/etcd-1 evicted
node/k3d-k3s-default-agent-1 drained
$ # Now the etcd-1 pod is 'pending', waiting for the volume data-etcd-1
$ k describe pod etcd-1
...
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  4m15s  default-scheduler  0/4 nodes are available: 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.

$ # Removing pvc as described in this issue and force a pod restart.
$ k delete pvc data-etcd-1
...
$ k delete po etcd-1
...
$ k get po
NAME          READY   STATUS             RESTARTS        AGE
etcd-0   1/1     Running            0               21m
etcd-2   1/1     Running            0               21m
etcd-1   0/1     CrashLoopBackOff   6 (2m22s ago)   9m3s

I am not sure if I am facing the same issue because the error messages are different. ~Could you share the values you are setting? Are you setting initialClusterState to new?~ Do you have any steps to reproduce your issue?

fmulero avatar May 03 '23 14:05 fmulero

I think this issue is similar to #15990. A temporary solution could be setting removeMemberOnContainerTermination: false.
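
For example (untested; it assumes the release is named etcd and keeps the rest of your values):

helm upgrade etcd bitnami/etcd --reuse-values \
  --set removeMemberOnContainerTermination=false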

fmulero avatar May 05 '23 12:05 fmulero

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] avatar May 21 '23 01:05 github-actions[bot]

Yeah, I think it is similar to #15990. We had another instance of this today, where two pods are wedged in this state after eviction. It seems like there's a PreStop hook that removes the member from the cluster on pod termination …? But then, when the pod starts back up, it's stuck like this.

I'm having a hard time fathoming what the point of the PreStop hook is. What's the "happy path" scenario that it is trying to account for?

roy-work avatar May 23 '23 16:05 roy-work

We had another instance of this crop up, where 2 out of 3 nodes have "permanently removed" themselves from the cluster. The error message recommends deleting the data directory, but that doesn't work; it causes:

etcd-1 etcd etcd 18:03:06.76 INFO  ==> Adding new member to existing cluster
etcd-1 etcd etcd 18:03:09.05 WARN  ==> Cluster not healthy, not adding self to cluster for now, keeping trying...

This is rather frustrating: the cluster is healthy; it's just down to a single node. If you consider all 3 StatefulSet replicas, it's still frustrating: yes, only 1/3 are healthy, but we'd have 2/3 if this node were allowed to join.

The script appears to call is_healthy_etcd_cluster first, to check whether the cluster is healthy; AFAICT, this function looks at all members of the StatefulSet and concludes … something. In my case, it decides the cluster isn't healthy, even though AFAICT it is.

There's also setup_etcd_active_endpoints, which also seems buggy. On the only remaining node, it returns:

I have no name!@etcd-2:/opt/bitnami/etcd$ setup_etcd_active_endpoints
0 2
I have no name!@etcd-2:/opt/bitnami/etcd$ echo $ETCD_ACTIVE_ENDPOINTS

Which seems wrong: both services have endpoints, don't they? (I have no idea what "active" is supposed to imply/mean; I presume it means "ready", in which case this is wrong.) If I attempt to dump endpoints_array, that also seems wrong:

I have no name!@etcd-2:/opt/bitnami/etcd$ echo "${endpoints_array[@]}"
etcd-0.etcd-headless.dev.svc.cluster.local:2379 etcd-1.etcd-headless.dev.svc.cluster.local:2379

… where is the obvious etcd-2, as seen in the prompt?
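
For comparison, this is how I'd check each member by hand (assuming root auth and the service names from my deployment):

# Ask every member for its health directly, rather than relying on the script's
# notion of "active" endpoints.
for i in 0 1 2; do
    etcdctl --user root:"$ETCD_ROOT_PASSWORD" \
        --endpoints="http://etcd-$i.etcd-headless.dev.svc.cluster.local:2379" \
        endpoint health
done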

roy-work avatar May 24 '23 01:05 roy-work

I understand your frustration. We are receiving a lot of issues like this one; IMHO there is something weird here. I am going to open an internal discussion. I hope to come back with news soon.

fmulero avatar May 24 '23 10:05 fmulero

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] avatar Jun 09 '23 01:06 github-actions[bot]

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.

github-actions[bot] avatar Jun 15 '23 01:06 github-actions[bot]

I seem to be facing the same issue as well. Are there any workarounds?

Chart version: 9.4.1, 3-node cluster

I have initialClusterState: existing and removeMemberOnContainerTermination: false set.

etcd-0 and etcd-1 are being removed from the cluster and get stuck in a crash loop. However, etcd-2 seems to be working fine.

My issue came up on separate occasions after these events:

  1. When the underlying K8s infra went down, then re-deploying etcd with the existing PVCs.
  2. When there is a sync problem on ArgoCD, taking down all my pods. Then re-enabling it.

I found that one way to replicate this issue might be to start up a 3-node cluster, disable it, then enable it again.
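
By "disable it, then enable it again" I mean roughly this (a sketch; the label selector is assumed from the chart's defaults):

# Bring the whole StatefulSet down and back up while keeping the PVCs.
kubectl scale statefulset etcd --replicas=0
kubectl wait --for=delete pod -l app.kubernetes.io/name=etcd --timeout=120s
kubectl scale statefulset etcd --replicas=3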

zirota avatar Sep 06 '23 03:09 zirota

Thank you for bringing this issue to our attention. We appreciate your involvement!

Unfortunately, we didn't have time to check it, the internal task is still on our backlog.

If you're interested in contributing a solution, we welcome you to create a pull request. The Bitnami team is excited to review your submission and offer feedback. You can find the contributing guidelines here. The source code of the container is present at https://github.com/bitnami/containers/tree/main/bitnami/etcd and the Helm chart logic can be found at https://github.com/bitnami/charts/tree/main/bitnami/etcd.

Your contribution will greatly benefit the community. Feel free to reach out if you have any questions or need assistance.

carrodher avatar Sep 06 '23 07:09 carrodher

Still seeing this

derekperkins avatar Feb 24 '24 14:02 derekperkins

Seeing this as well. Any update? Also, does anyone have a workaround?

gespi1 avatar Apr 11 '24 16:04 gespi1

Faced this issue myself; to me, the solution was quite obvious.

It should be:

export ETCD_INITIAL_CLUSTER_STATE="existing"
export ETCD_NAME="milvus-etcd-0"
export ETCD_INITIAL_ADVERTISE_PEER_URLS="..."
export ETCD_INITIAL_CLUSTER="..."
etcd

instead of:

ETCD_INITIAL_CLUSTER_STATE="existing"
ETCD_NAME="milvus-etcd-0"
ETCD_INITIAL_ADVERTISE_PEER_URLS="..."
ETCD_INITIAL_CLUSTER="..."
etcd

This is to ensure these env vars are available to the sub-processes as well.
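
A trivial illustration of the difference (the variable name is just an example):

# A plain assignment stays in the current shell; only exported variables are
# inherited by child processes such as the etcd binary.
FOO=bar
bash -c 'echo "unexported: ${FOO:-unset}"'   # prints "unexported: unset"

export FOO=bar
bash -c 'echo "exported: ${FOO:-unset}"'     # prints "exported: bar"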

punkerpunker avatar Apr 19 '24 16:04 punkerpunker

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] avatar May 09 '24 01:05 github-actions[bot]

Hi @punkerpunker, thanks for your comment. Where is that code expected to be added? The containers are currently exporting those env variables.

fmulero avatar May 09 '24 10:05 fmulero

Sorry about my very late response. I've just jumped into this issue again and I am trying to reproduce it. Do you have clear steps to reproduce the problem? I tried following the instructions in this comment. In that case I'm not sure whether removeMemberOnContainerTermination was set (it is enabled by default); if it was, the pod is removed from the cluster on termination, so it cannot join again. I think this configuration is not very clear, and it could be the reason for most of these issues.

fmulero avatar May 09 '24 11:05 fmulero

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] avatar May 25 '24 01:05 github-actions[bot]

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.

github-actions[bot] avatar May 30 '24 01:05 github-actions[bot]