etcd-issues etcd crashes in EKS cluster

Following your article

helm search repo bitnami  | grep etcd
bitnami/etcd                                	8.5.11       	3.5.6        	etcd is a distributed key-value store designed ...

I found that Helm chart 8.5.11 provides etcd version 3.5.6, so I upgraded my existing APISIX installation by updating the etcd dependency version in Chart.yaml:

helm dependency list ./charts/apisix
NAME                     	VERSION	REPOSITORY                        	STATUS
etcd                     	8.5.11 	https://charts.bitnami.com/bitnami	ok    
apisix-dashboard         	0.6.1  	https://charts.apiseven.com/       	ok    
apisix-ingress-controller	0.11.1 	https://charts.apiseven.com/       	ok

helm upgrade apisix ./charts/apisix --set gateway.type=LoadBalancer --set allow.ipList="{0.0.0.0/0}" --set ingress-controller.enabled=true --namespace ingress-apisix --set ingress-controller.config.apisix.serviceNamespace=ingress-apisix --set gateway.tls.enabled=true --set ingress-controller.config.apisix.adminKey=x --set admin.credentials.admin=xxxxx --set xxxx admin.credentials.viewer=xxxxx --set ingressController.config.apisix.baseURL=http://apisix-admin:9180/apisix/admin --set dashboard.enabled=true

However, etcd still crashes:

mk logs -f apisix-etcd-0
etcd 04:02:06.39 
etcd 04:02:06.39 Welcome to the Bitnami etcd container
etcd 04:02:06.39 Subscribe to project updates by watching https://github.com/bitnami/containers
etcd 04:02:06.39 Submit issues and feature requests at https://github.com/bitnami/containers/issues
etcd 04:02:06.39 
etcd 04:02:06.39 INFO  ==> ** Starting etcd setup **
etcd 04:02:06.41 INFO  ==> Validating settings in ETCD_* env vars..
etcd 04:02:06.41 WARN  ==> You set the environment variable ALLOW_NONE_AUTHENTICATION=yes. For safety reasons, do not use this flag in a production environment.
etcd 04:02:06.41 INFO  ==> Initializing etcd
etcd 04:02:06.41 INFO  ==> Generating etcd config file using env variables
etcd 04:02:06.43 INFO  ==> There is no data from previous deployments
etcd 04:02:06.44 INFO  ==> Adding new member to existing cluster
etcd 04:02:16.59 WARN  ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:02:36.68 WARN  ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:02:56.76 WARN  ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:03:16.84 WARN  ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:03:36.91 WARN  ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:03:57.00 WARN  ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:04:17.08 WARN  ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:04:37.15 WARN  ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:04:57.27 WARN  ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:05:17.33 WARN  ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:05:37.43 WARN  ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:05:57.53 WARN  ==> Cluster not healthy, not adding self to cluster for now, keeping trying...

These are the events from the Kubernetes cluster:

21m         Warning   FailedPreStopHook                 pod/apisix-etcd-0                              Exec lifecycle hook ([/opt/bitnami/scripts/etcd/prestop.sh]) for Container "etcd" in Pod "apisix-etcd-0_ingress-apisix(a9c0fe68-6cec-4934-9934-43678e34977f)" failed - error: command '/opt/bitnami/scripts/etcd/prestop.sh' exited with 137: , message: ""
21m         Warning   FailedPreStopHook                 pod/apisix-etcd-1                              Exec lifecycle hook ([/opt/bitnami/scripts/etcd/prestop.sh]) for Container "etcd" in Pod "apisix-etcd-1_ingress-apisix(0e9be7b4-c34a-4992-97cf-a8426766534a)" failed - error: command '/opt/bitnami/scripts/etcd/prestop.sh' exited with 137: , message: ""
20m         Warning   FailedMount                       pod/apisix-etcd-2                              Unable to attach or mount volumes: unmounted volumes=[data etcd-jwt-token kube-api-access-jtccd], unattached volumes=[data etcd-jwt-token kube-api-access-jtccd]: timed out waiting for the condition
20m         Normal    NoPods                            poddisruptionbudget/apisix-etcd                No matching pods found
20m         Normal    SuccessfulCreate                  statefulset/apisix-etcd                        create Claim data-apisix-etcd-1 Pod apisix-etcd-1 in StatefulSet apisix-etcd success
20m         Normal    WaitForFirstConsumer              persistentvolumeclaim/data-apisix-etcd-0       waiting for first consumer to be created before binding
20m         Normal    WaitForFirstConsumer              persistentvolumeclaim/data-apisix-etcd-1       waiting for first consumer to be created before binding
20m         Normal    SuccessfulCreate                  statefulset/apisix-etcd                        create Pod apisix-etcd-1 in StatefulSet apisix-etcd successful
20m         Normal    SuccessfulCreate                  statefulset/apisix-etcd                        create Pod apisix-etcd-0 in StatefulSet apisix-etcd successful
20m         Normal    SuccessfulCreate                  statefulset/apisix-etcd                        create Claim data-apisix-etcd-0 Pod apisix-etcd-0 in StatefulSet apisix-etcd success
20m         Normal    SuccessfulCreate                  statefulset/apisix-etcd                        create Pod apisix-etcd-2 in StatefulSet apisix-etcd successful
20m         Normal    WaitForFirstConsumer              persistentvolumeclaim/data-apisix-etcd-2       waiting for first consumer to be created before binding
20m         Normal    SuccessfulCreate                  statefulset/apisix-etcd                        create Claim data-apisix-etcd-2 Pod apisix-etcd-2 in StatefulSet apisix-etcd success
20m         Normal    ProvisioningSucceeded             persistentvolumeclaim/data-apisix-etcd-0       Successfully provisioned volume pvc-6b3d1b5c-b1a6-4bc0-9ff2-32de868e4cc7 using kubernetes.io/aws-ebs
20m         Normal    ProvisioningSucceeded             persistentvolumeclaim/data-apisix-etcd-1       Successfully provisioned volume pvc-fbc18bc2-7e9a-4a2e-9311-bbd446a846de using kubernetes.io/aws-ebs
20m         Normal    ProvisioningSucceeded             persistentvolumeclaim/data-apisix-etcd-2       Successfully provisioned volume pvc-3537fab5-57c6-4386-9d4a-3f623b2b4db3 using kubernetes.io/aws-ebs
20m         Normal    Scheduled                         pod/apisix-etcd-0                              Successfully assigned ingress-apisix/apisix-etcd-0 to ip-172-31-110-110.ap-south-1.compute.internal
20m         Normal    Scheduled                         pod/apisix-etcd-1                              Successfully assigned ingress-apisix/apisix-etcd-1 to ip-172-31-118-166.ap-south-1.compute.internal
20m         Normal    Scheduled                         pod/apisix-etcd-2                              Successfully assigned ingress-apisix/apisix-etcd-2 to ip-172-31-102-32.ap-south-1.compute.internal
20m         Normal    SuccessfulAttachVolume            pod/apisix-etcd-2                              AttachVolume.Attach succeeded for volume "pvc-3537fab5-57c6-4386-9d4a-3f623b2b4db3"
20m         Normal    SuccessfulAttachVolume            pod/apisix-etcd-0                              AttachVolume.Attach succeeded for volume "pvc-6b3d1b5c-b1a6-4bc0-9ff2-32de868e4cc7"
20m         Normal    SuccessfulAttachVolume            pod/apisix-etcd-1                              AttachVolume.Attach succeeded for volume "pvc-fbc18bc2-7e9a-4a2e-9311-bbd446a846de"
20m         Normal    Started                           pod/apisix-etcd-0                              Started container etcd
20m         Normal    Started                           pod/apisix-etcd-1                              Started container etcd
20m         Normal    Pulled                            pod/apisix-etcd-1                              Container image "docker.io/bitnami/etcd:3.5.6-debian-11-r10" already present on machine
20m         Normal    Created                           pod/apisix-etcd-1                              Created container etcd
20m         Normal    Pulled                            pod/apisix-etcd-0                              Container image "docker.io/bitnami/etcd:3.5.6-debian-11-r10" already present on machine
20m         Normal    Created                           pod/apisix-etcd-0                              Created container etcd
20m         Normal    Pulled                            pod/apisix-etcd-2                              Container image "docker.io/bitnami/etcd:3.5.6-debian-11-r10" already present on machine
20m         Normal    Created                           pod/apisix-etcd-2                              Created container etcd
20m         Normal    Started                           pod/apisix-etcd-2                              Started container etcd
5m1s        Warning   Unhealthy                         pod/apisix-etcd-2                              Readiness probe failed:
5m2s        Warning   Unhealthy                         pod/apisix-etcd-0                              Readiness probe failed:
5m1s        Warning   Unhealthy                         pod/apisix-etcd-1                              Readiness probe failed:
16m         Warning   Unhealthy                         pod/apisix-etcd-2                              Liveness probe failed:
16m         Warning   Unhealthy                         pod/apisix-etcd-1                              Liveness probe failed:
16m         Warning   Unhealthy                         pod/apisix-etcd-0                              Liveness probe failed:
16m         Warning   FailedPreStopHook                 pod/apisix-etcd-0                              Exec lifecycle hook ([/opt/bitnami/scripts/etcd/prestop.sh]) for Container "etcd" in Pod "apisix-etcd-0_ingress-apisix(3ed8842f-ec58-47d4-b315-ce8a79328578)" failed - error: command '/opt/bitnami/scripts/etcd/prestop.sh' exited with 128: Error: bad member ID arg (strconv.ParseUint: parsing "": invalid syntax), expecting ID in Hex...
16m         Normal    Killing                           pod/apisix-etcd-2                              Container etcd failed liveness probe, will be restarted
16m         Warning   FailedPreStopHook                 pod/apisix-etcd-2                              Exec lifecycle hook ([/opt/bitnami/scripts/etcd/prestop.sh]) for Container "etcd" in Pod "apisix-etcd-2_ingress-apisix(9ee4b468-5674-4dfb-8bf4-dca0b964bbe0)" failed - error: command '/opt/bitnami/scripts/etcd/prestop.sh' exited with 128: Error: bad member ID arg (strconv.ParseUint: parsing "": invalid syntax), expecting ID in Hex...
16m         Warning   FailedPreStopHook                 pod/apisix-etcd-1                              Exec lifecycle hook ([/opt/bitnami/scripts/etcd/prestop.sh]) for Container "etcd" in Pod "apisix-etcd-1_ingress-apisix(6fb9dc93-aabc-4f3a-99d3-6ba10bf3e040)" failed - error: command '/opt/bitnami/scripts/etcd/prestop.sh' exited with 128: Error: bad member ID arg (strconv.ParseUint: parsing "": invalid syntax), expecting ID in Hex...
16m         Normal    Killing                           pod/apisix-etcd-1                              Container etcd failed liveness probe, will be restarted
16m         Normal    Killing                           pod/apisix-etcd-0                              Container etcd failed liveness probe, will be restarted
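
The events above report prestop.sh exiting with both 137 and 128. As a minimal sketch for reading such codes: by shell convention, a status above 128 usually means the process was terminated by signal (status - 128), so 137 corresponds to signal 9 (SIGKILL), while 128 itself is not signal-derived under that convention. The function name describe_exit_code is hypothetical and not part of the Bitnami scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: describe an exit code as reported by kubelet.
# Assumption: codes above 128 follow the "128 + signal number" convention.
describe_exit_code() {
  local code="$1"
  if [ "$code" -gt 128 ]; then
    local sig=$((code - 128))
    # kill -l N prints the signal name for signal number N (e.g. KILL for 9).
    printf 'killed by signal %d (%s)\n' "$sig" "$(kill -l "$sig" 2>/dev/null || echo unknown)"
  else
    printf 'exited with status %d\n' "$code"
  fi
}

describe_exit_code 137
describe_exit_code 128
```

Under that reading, the 137 exits would be consistent with the hook or container being killed, while the 128 exits come with their own error text ("bad member ID arg"), which points at the command's arguments rather than at a signal.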

Pod status:

apisix-etcd-0                                0/1     Running   5 (2m53s ago)   23m
apisix-etcd-1                                0/1     Running   5 (2m53s ago)   23m
apisix-etcd-2                                0/1     Running   5 (2m53s ago)   23m

mk describe pod/apisix-etcd-0
Name:         apisix-etcd-0
Namespace:    ingress-apisix
Priority:     0
Node:         ip-172-31-110-110.ap-south-1.compute.internal/172.31.110.110
Start Time:   Tue, 10 Jan 2023 09:28:00 +0530
Labels:       app.kubernetes.io/instance=apisix
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=etcd
              controller-revision-hash=apisix-etcd-648486db84
              helm.sh/chart=etcd-8.5.11
              statefulset.kubernetes.io/pod-name=apisix-etcd-0
Annotations:  checksum/token-secret: f0fcd4104dce3cb310d3f003076edf43dc81011716f2cdcc405202be9ceb3434
              kubernetes.io/psp: eks.privileged
Status:       Running
IP:           172.31.110.254
IPs:
  IP:           172.31.110.254
Controlled By:  StatefulSet/apisix-etcd
Containers:
  etcd:
    Container ID:   docker://937cac1eaa863b75423d0f2ecf21f0e22a9dd9cbe9cd1f6ea708bda9606ade57
    Image:          docker.io/bitnami/etcd:3.5.6-debian-11-r10
    Image ID:       docker-pullable://bitnami/etcd@sha256:2d7b831769734bb97a5c1cfd2fe46e29f422b70b5ba9f9aedfd91300839ac3ee
    Ports:          2379/TCP, 2380/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Tue, 10 Jan 2023 09:52:06 +0530
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Tue, 10 Jan 2023 09:48:06 +0530
      Finished:     Tue, 10 Jan 2023 09:52:06 +0530
    Ready:          False
    Restart Count:  6
    Liveness:       exec [/opt/bitnami/scripts/etcd/healthcheck.sh] delay=60s timeout=5s period=30s #success=1 #failure=5
    Readiness:      exec [/opt/bitnami/scripts/etcd/healthcheck.sh] delay=60s timeout=5s period=10s #success=1 #failure=5
    Environment:
      BITNAMI_DEBUG:                     false
      MY_POD_IP:                          (v1:status.podIP)
      MY_POD_NAME:                       apisix-etcd-0 (v1:metadata.name)
      MY_STS_NAME:                       apisix-etcd
      ETCDCTL_API:                       3
      ETCD_ON_K8S:                       yes
      ETCD_START_FROM_SNAPSHOT:          no
      ETCD_DISASTER_RECOVERY:            no
      ETCD_NAME:                         $(MY_POD_NAME)
      ETCD_DATA_DIR:                     /bitnami/etcd/data
      ETCD_LOG_LEVEL:                    info
      ALLOW_NONE_AUTHENTICATION:         yes
      ETCD_AUTH_TOKEN:                   jwt,priv-key=/opt/bitnami/etcd/certs/token/jwt-token.pem,sign-method=RS256,ttl=10m
      ETCD_ADVERTISE_CLIENT_URLS:        http://$(MY_POD_NAME).apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd.ingress-apisix.svc.cluster.local:2379
      ETCD_LISTEN_CLIENT_URLS:           http://0.0.0.0:2379
      ETCD_INITIAL_ADVERTISE_PEER_URLS:  http://$(MY_POD_NAME).apisix-etcd-headless.ingress-apisix.svc.cluster.local:2380
      ETCD_LISTEN_PEER_URLS:             http://0.0.0.0:2380
      ETCD_INITIAL_CLUSTER_TOKEN:        etcd-cluster-k8s
      ETCD_INITIAL_CLUSTER_STATE:        existing
      ETCD_INITIAL_CLUSTER:              apisix-etcd-0=http://apisix-etcd-0.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2380,apisix-etcd-1=http://apisix-etcd-1.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2380,apisix-etcd-2=http://apisix-etcd-2.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2380
      ETCD_CLUSTER_DOMAIN:               apisix-etcd-headless.ingress-apisix.svc.cluster.local
    Mounts:
      /bitnami/etcd from data (rw)
      /opt/bitnami/etcd/certs/token/ from etcd-jwt-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h9wdm (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-apisix-etcd-0
    ReadOnly:   false
  etcd-jwt-token:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  apisix-etcd-jwt-token
    Optional:    false
  kube-api-access-h9wdm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Normal   Scheduled               25m                default-scheduler        Successfully assigned ingress-apisix/apisix-etcd-0 to ip-172-31-110-110.ap-south-1.compute.internal
  Normal   SuccessfulAttachVolume  24m                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-6b3d1b5c-b1a6-4bc0-9ff2-32de868e4cc7"
  Normal   Pulled                  24m                kubelet                  Container image "docker.io/bitnami/etcd:3.5.6-debian-11-r10" already present on machine
  Normal   Created                 24m                kubelet                  Created container etcd
  Normal   Started                 24m                kubelet                  Started container etcd
  Warning  Unhealthy               21m (x5 over 23m)  kubelet                  Liveness probe failed:
  Normal   Killing                 21m                kubelet                  Container etcd failed liveness probe, will be restarted
  Warning  FailedPreStopHook       21m                kubelet                  Exec lifecycle hook ([/opt/bitnami/scripts/etcd/prestop.sh]) for Container "etcd" in Pod "apisix-etcd-0_ingress-apisix(3ed8842f-ec58-47d4-b315-ce8a79328578)" failed - error: command '/opt/bitnami/scripts/etcd/prestop.sh' exited with 128: Error: bad member ID arg (strconv.ParseUint: parsing "": invalid syntax), expecting ID in Hex
, message: "Error: bad member ID arg (strconv.ParseUint: parsing \"\": invalid syntax), expecting ID in Hex\n"
  Warning  Unhealthy  3m45s (x91 over 23m)  kubelet  Readiness probe failed:
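
The ETCD_INITIAL_CLUSTER value in the describe output above is a comma-separated list of name=peer-URL pairs. A minimal sketch for splitting it into one peer per line, which may make it easier to check each peer address individually, follows. The function name list_initial_cluster is hypothetical; the sample value is copied from the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one "name url" pair per line from an
# ETCD_INITIAL_CLUSTER-style string (comma-separated name=url entries).
list_initial_cluster() {
  local cluster="$1" entry
  local IFS=','
  for entry in $cluster; do
    [ -n "$entry" ] || continue
    # Text before the first '=' is the member name, the rest is the peer URL.
    printf '%s %s\n' "${entry%%=*}" "${entry#*=}"
  done
}

list_initial_cluster "apisix-etcd-0=http://apisix-etcd-0.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2380,apisix-etcd-1=http://apisix-etcd-1.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2380,apisix-etcd-2=http://apisix-etcd-2.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2380"
```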

Jan 10 '23 04:01 jishaashokan

How can we get the etcd cluster to work?

mk exec apisix-etcd-0 -it /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
I have no name!@apisix-etcd-0:/opt/bitnami/etcd$ etcdctl member list -w table
{"level":"warn","ts":"2023-01-10T04:25:14.832Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000362700/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Error: context deadline exceeded
I have no name!@apisix-etcd-0:/opt/bitnami/etcd$ etcdctl  endpoint status -w table --cluster
{"level":"warn","ts":"2023-01-10T04:25:27.685Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003a08c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded
I have no name!@apisix-etcd-0:/opt/bitnami/etcd$

Jan 10 '23 04:01 jishaashokan

Based on the error message, it looks like the member ID used by the preStop hook (which runs after the liveness probe fails) isn't correct: it is being parsed from an empty string.

  Normal   Killing                 21m                kubelet                  Container etcd failed liveness probe, will be restarted
  Warning  FailedPreStopHook       21m                kubelet                  Exec lifecycle hook ([/opt/bitnami/scripts/etcd/prestop.sh]) for Container "etcd" in Pod "apisix-etcd-0_ingress-apisix(3ed8842f-ec58-47d4-b315-ce8a79328578)" failed - error: command '/opt/bitnami/scripts/etcd/prestop.sh' exited with 128: Error: bad member ID arg (strconv.ParseUint: parsing "": invalid syntax), expecting ID in Hex
, message: "Error: bad member ID arg (strconv.ParseUint: parsing \"\": invalid syntax), expecting ID in Hex\n"
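
The error quoted above says an empty string was parsed where a hexadecimal member ID was expected. As a minimal sketch of guarding against that case, the check below rejects empty or non-hex IDs before calling etcdctl member remove. Both function names are hypothetical, and this is not the actual content of the Bitnami prestop.sh script.

```shell
#!/usr/bin/env bash
# Hypothetical guard: succeed only for a non-empty, purely hexadecimal ID.
is_hex_member_id() {
  case "$1" in
    ''|*[!0-9a-fA-F]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical wrapper: only call etcdctl when the ID passes the check,
# so an empty ID is skipped instead of producing "bad member ID arg".
safe_member_remove() {
  local member_id="$1"
  if is_hex_member_id "$member_id"; then
    etcdctl member remove "$member_id"
  else
    echo "skipping member remove: empty or non-hex member ID '$member_id'" >&2
    return 0
  fi
}
```

A guard like this only avoids the hook failure. It does not explain why the member ID is empty, which in the output above coincides with etcdctl being unable to reach 127.0.0.1:2379.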

Jan 13 '23 01:01 ahrtr

Yes, it happens automatically, and this is why etcd crashes.

Jan 13 '23 05:01 jishaashokan
