etcd-issues
etcd crashes in EKS cluster
Following your article
helm search repo bitnami | grep etcd
bitnami/etcd 8.5.11 3.5.6 etcd is a distributed key-value store designed ...
I found that Helm chart 8.5.11 provides etcd version 3.5.6. I upgraded my existing APISIX installation by updating the etcd chart version in charts.yaml:
helm dependency list ./charts/apisix
NAME VERSION REPOSITORY STATUS
etcd 8.5.11 https://charts.bitnami.com/bitnami ok
apisix-dashboard 0.6.1 https://charts.apiseven.com/ ok
apisix-ingress-controller 0.11.1 https://charts.apiseven.com/ ok
helm upgrade apisix ./charts/apisix --set gateway.type=LoadBalancer --set allow.ipList="{0.0.0.0/0}" --set ingress-controller.enabled=true --namespace ingress-apisix --set ingress-controller.config.apisix.serviceNamespace=ingress-apisix --set gateway.tls.enabled=true --set ingress-controller.config.apisix.adminKey=x --set admin.credentials.admin=xxxxx --set xxxx admin.credentials.viewer=xxxxx --set ingressController.config.apisix.baseURL=http://apisix-admin:9180/apisix/admin --set dashboard.enabled=true
However, etcd still crashes:
mk logs -f apisix-etcd-0
etcd 04:02:06.39
etcd 04:02:06.39 Welcome to the Bitnami etcd container
etcd 04:02:06.39 Subscribe to project updates by watching https://github.com/bitnami/containers
etcd 04:02:06.39 Submit issues and feature requests at https://github.com/bitnami/containers/issues
etcd 04:02:06.39
etcd 04:02:06.39 INFO ==> ** Starting etcd setup **
etcd 04:02:06.41 INFO ==> Validating settings in ETCD_* env vars..
etcd 04:02:06.41 WARN ==> You set the environment variable ALLOW_NONE_AUTHENTICATION=yes. For safety reasons, do not use this flag in a production environment.
etcd 04:02:06.41 INFO ==> Initializing etcd
etcd 04:02:06.41 INFO ==> Generating etcd config file using env variables
etcd 04:02:06.43 INFO ==> There is no data from previous deployments
etcd 04:02:06.44 INFO ==> Adding new member to existing cluster
etcd 04:02:16.59 WARN ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:02:36.68 WARN ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:02:56.76 WARN ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:03:16.84 WARN ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:03:36.91 WARN ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:03:57.00 WARN ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:04:17.08 WARN ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:04:37.15 WARN ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:04:57.27 WARN ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:05:17.33 WARN ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:05:37.43 WARN ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
etcd 04:05:57.53 WARN ==> Cluster not healthy, not adding self to cluster for now, keeping trying...
These are the events from the Kubernetes cluster:
21m Warning FailedPreStopHook pod/apisix-etcd-0 Exec lifecycle hook ([/opt/bitnami/scripts/etcd/prestop.sh]) for Container "etcd" in Pod "apisix-etcd-0_ingress-apisix(a9c0fe68-6cec-4934-9934-43678e34977f)" failed - error: command '/opt/bitnami/scripts/etcd/prestop.sh' exited with 137: , message: ""
21m Warning FailedPreStopHook pod/apisix-etcd-1 Exec lifecycle hook ([/opt/bitnami/scripts/etcd/prestop.sh]) for Container "etcd" in Pod "apisix-etcd-1_ingress-apisix(0e9be7b4-c34a-4992-97cf-a8426766534a)" failed - error: command '/opt/bitnami/scripts/etcd/prestop.sh' exited with 137: , message: ""
20m Warning FailedMount pod/apisix-etcd-2 Unable to attach or mount volumes: unmounted volumes=[data etcd-jwt-token kube-api-access-jtccd], unattached volumes=[data etcd-jwt-token kube-api-access-jtccd]: timed out waiting for the condition
20m Normal NoPods poddisruptionbudget/apisix-etcd No matching pods found
20m Normal SuccessfulCreate statefulset/apisix-etcd create Claim data-apisix-etcd-1 Pod apisix-etcd-1 in StatefulSet apisix-etcd success
20m Normal WaitForFirstConsumer persistentvolumeclaim/data-apisix-etcd-0 waiting for first consumer to be created before binding
20m Normal WaitForFirstConsumer persistentvolumeclaim/data-apisix-etcd-1 waiting for first consumer to be created before binding
20m Normal SuccessfulCreate statefulset/apisix-etcd create Pod apisix-etcd-1 in StatefulSet apisix-etcd successful
20m Normal SuccessfulCreate statefulset/apisix-etcd create Pod apisix-etcd-0 in StatefulSet apisix-etcd successful
20m Normal SuccessfulCreate statefulset/apisix-etcd create Claim data-apisix-etcd-0 Pod apisix-etcd-0 in StatefulSet apisix-etcd success
20m Normal SuccessfulCreate statefulset/apisix-etcd create Pod apisix-etcd-2 in StatefulSet apisix-etcd successful
20m Normal WaitForFirstConsumer persistentvolumeclaim/data-apisix-etcd-2 waiting for first consumer to be created before binding
20m Normal SuccessfulCreate statefulset/apisix-etcd create Claim data-apisix-etcd-2 Pod apisix-etcd-2 in StatefulSet apisix-etcd success
20m Normal ProvisioningSucceeded persistentvolumeclaim/data-apisix-etcd-0 Successfully provisioned volume pvc-6b3d1b5c-b1a6-4bc0-9ff2-32de868e4cc7 using kubernetes.io/aws-ebs
20m Normal ProvisioningSucceeded persistentvolumeclaim/data-apisix-etcd-1 Successfully provisioned volume pvc-fbc18bc2-7e9a-4a2e-9311-bbd446a846de using kubernetes.io/aws-ebs
20m Normal ProvisioningSucceeded persistentvolumeclaim/data-apisix-etcd-2 Successfully provisioned volume pvc-3537fab5-57c6-4386-9d4a-3f623b2b4db3 using kubernetes.io/aws-ebs
20m Normal Scheduled pod/apisix-etcd-0 Successfully assigned ingress-apisix/apisix-etcd-0 to ip-172-31-110-110.ap-south-1.compute.internal
20m Normal Scheduled pod/apisix-etcd-1 Successfully assigned ingress-apisix/apisix-etcd-1 to ip-172-31-118-166.ap-south-1.compute.internal
20m Normal Scheduled pod/apisix-etcd-2 Successfully assigned ingress-apisix/apisix-etcd-2 to ip-172-31-102-32.ap-south-1.compute.internal
20m Normal SuccessfulAttachVolume pod/apisix-etcd-2 AttachVolume.Attach succeeded for volume "pvc-3537fab5-57c6-4386-9d4a-3f623b2b4db3"
20m Normal SuccessfulAttachVolume pod/apisix-etcd-0 AttachVolume.Attach succeeded for volume "pvc-6b3d1b5c-b1a6-4bc0-9ff2-32de868e4cc7"
20m Normal SuccessfulAttachVolume pod/apisix-etcd-1 AttachVolume.Attach succeeded for volume "pvc-fbc18bc2-7e9a-4a2e-9311-bbd446a846de"
20m Normal Started pod/apisix-etcd-0 Started container etcd
20m Normal Started pod/apisix-etcd-1 Started container etcd
20m Normal Pulled pod/apisix-etcd-1 Container image "docker.io/bitnami/etcd:3.5.6-debian-11-r10" already present on machine
20m Normal Created pod/apisix-etcd-1 Created container etcd
20m Normal Pulled pod/apisix-etcd-0 Container image "docker.io/bitnami/etcd:3.5.6-debian-11-r10" already present on machine
20m Normal Created pod/apisix-etcd-0 Created container etcd
20m Normal Pulled pod/apisix-etcd-2 Container image "docker.io/bitnami/etcd:3.5.6-debian-11-r10" already present on machine
20m Normal Created pod/apisix-etcd-2 Created container etcd
20m Normal Started pod/apisix-etcd-2 Started container etcd
5m1s Warning Unhealthy pod/apisix-etcd-2 Readiness probe failed:
5m2s Warning Unhealthy pod/apisix-etcd-0 Readiness probe failed:
5m1s Warning Unhealthy pod/apisix-etcd-1 Readiness probe failed:
16m Warning Unhealthy pod/apisix-etcd-2 Liveness probe failed:
16m Warning Unhealthy pod/apisix-etcd-1 Liveness probe failed:
16m Warning Unhealthy pod/apisix-etcd-0 Liveness probe failed:
16m Warning FailedPreStopHook pod/apisix-etcd-0 Exec lifecycle hook ([/opt/bitnami/scripts/etcd/prestop.sh]) for Container "etcd" in Pod "apisix-etcd-0_ingress-apisix(3ed8842f-ec58-47d4-b315-ce8a79328578)" failed - error: command '/opt/bitnami/scripts/etcd/prestop.sh' exited with 128: Error: bad member ID arg (strconv.ParseUint: parsing "": invalid syntax), expecting ID in Hex...
16m Normal Killing pod/apisix-etcd-2 Container etcd failed liveness probe, will be restarted
16m Warning FailedPreStopHook pod/apisix-etcd-2 Exec lifecycle hook ([/opt/bitnami/scripts/etcd/prestop.sh]) for Container "etcd" in Pod "apisix-etcd-2_ingress-apisix(9ee4b468-5674-4dfb-8bf4-dca0b964bbe0)" failed - error: command '/opt/bitnami/scripts/etcd/prestop.sh' exited with 128: Error: bad member ID arg (strconv.ParseUint: parsing "": invalid syntax), expecting ID in Hex...
16m Warning FailedPreStopHook pod/apisix-etcd-1 Exec lifecycle hook ([/opt/bitnami/scripts/etcd/prestop.sh]) for Container "etcd" in Pod "apisix-etcd-1_ingress-apisix(6fb9dc93-aabc-4f3a-99d3-6ba10bf3e040)" failed - error: command '/opt/bitnami/scripts/etcd/prestop.sh' exited with 128: Error: bad member ID arg (strconv.ParseUint: parsing "": invalid syntax), expecting ID in Hex...
16m Normal Killing pod/apisix-etcd-1 Container etcd failed liveness probe, will be restarted
16m Normal Killing pod/apisix-etcd-0 Container etcd failed liveness probe, will be restarted
Pod status:
apisix-etcd-0 0/1 Running 5 (2m53s ago) 23m
apisix-etcd-1 0/1 Running 5 (2m53s ago) 23m
apisix-etcd-2 0/1 Running 5 (2m53s ago) 23m
mk describe pod/apisix-etcd-0
Name: apisix-etcd-0
Namespace: ingress-apisix
Priority: 0
Node: ip-172-31-110-110.ap-south-1.compute.internal/172.31.110.110
Start Time: Tue, 10 Jan 2023 09:28:00 +0530
Labels: app.kubernetes.io/instance=apisix
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=etcd
controller-revision-hash=apisix-etcd-648486db84
helm.sh/chart=etcd-8.5.11
statefulset.kubernetes.io/pod-name=apisix-etcd-0
Annotations: checksum/token-secret: f0fcd4104dce3cb310d3f003076edf43dc81011716f2cdcc405202be9ceb3434
kubernetes.io/psp: eks.privileged
Status: Running
IP: 172.31.110.254
IPs:
IP: 172.31.110.254
Controlled By: StatefulSet/apisix-etcd
Containers:
etcd:
Container ID: docker://937cac1eaa863b75423d0f2ecf21f0e22a9dd9cbe9cd1f6ea708bda9606ade57
Image: docker.io/bitnami/etcd:3.5.6-debian-11-r10
Image ID: docker-pullable://bitnami/etcd@sha256:2d7b831769734bb97a5c1cfd2fe46e29f422b70b5ba9f9aedfd91300839ac3ee
Ports: 2379/TCP, 2380/TCP
Host Ports: 0/TCP, 0/TCP
State: Running
Started: Tue, 10 Jan 2023 09:52:06 +0530
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Tue, 10 Jan 2023 09:48:06 +0530
Finished: Tue, 10 Jan 2023 09:52:06 +0530
Ready: False
Restart Count: 6
Liveness: exec [/opt/bitnami/scripts/etcd/healthcheck.sh] delay=60s timeout=5s period=30s #success=1 #failure=5
Readiness: exec [/opt/bitnami/scripts/etcd/healthcheck.sh] delay=60s timeout=5s period=10s #success=1 #failure=5
Environment:
BITNAMI_DEBUG: false
MY_POD_IP: (v1:status.podIP)
MY_POD_NAME: apisix-etcd-0 (v1:metadata.name)
MY_STS_NAME: apisix-etcd
ETCDCTL_API: 3
ETCD_ON_K8S: yes
ETCD_START_FROM_SNAPSHOT: no
ETCD_DISASTER_RECOVERY: no
ETCD_NAME: $(MY_POD_NAME)
ETCD_DATA_DIR: /bitnami/etcd/data
ETCD_LOG_LEVEL: info
ALLOW_NONE_AUTHENTICATION: yes
ETCD_AUTH_TOKEN: jwt,priv-key=/opt/bitnami/etcd/certs/token/jwt-token.pem,sign-method=RS256,ttl=10m
ETCD_ADVERTISE_CLIENT_URLS: http://$(MY_POD_NAME).apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd.ingress-apisix.svc.cluster.local:2379
ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS: http://$(MY_POD_NAME).apisix-etcd-headless.ingress-apisix.svc.cluster.local:2380
ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
ETCD_INITIAL_CLUSTER_TOKEN: etcd-cluster-k8s
ETCD_INITIAL_CLUSTER_STATE: existing
ETCD_INITIAL_CLUSTER: apisix-etcd-0=http://apisix-etcd-0.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2380,apisix-etcd-1=http://apisix-etcd-1.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2380,apisix-etcd-2=http://apisix-etcd-2.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2380
ETCD_CLUSTER_DOMAIN: apisix-etcd-headless.ingress-apisix.svc.cluster.local
Mounts:
/bitnami/etcd from data (rw)
/opt/bitnami/etcd/certs/token/ from etcd-jwt-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h9wdm (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-apisix-etcd-0
ReadOnly: false
etcd-jwt-token:
Type: Secret (a volume populated by a Secret)
SecretName: apisix-etcd-jwt-token
Optional: false
kube-api-access-h9wdm:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 25m default-scheduler Successfully assigned ingress-apisix/apisix-etcd-0 to ip-172-31-110-110.ap-south-1.compute.internal
Normal SuccessfulAttachVolume 24m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-6b3d1b5c-b1a6-4bc0-9ff2-32de868e4cc7"
Normal Pulled 24m kubelet Container image "docker.io/bitnami/etcd:3.5.6-debian-11-r10" already present on machine
Normal Created 24m kubelet Created container etcd
Normal Started 24m kubelet Started container etcd
Warning Unhealthy 21m (x5 over 23m) kubelet Liveness probe failed:
Normal Killing 21m kubelet Container etcd failed liveness probe, will be restarted
Warning FailedPreStopHook 21m kubelet Exec lifecycle hook ([/opt/bitnami/scripts/etcd/prestop.sh]) for Container "etcd" in Pod "apisix-etcd-0_ingress-apisix(3ed8842f-ec58-47d4-b315-ce8a79328578)" failed - error: command '/opt/bitnami/scripts/etcd/prestop.sh' exited with 128: Error: bad member ID arg (strconv.ParseUint: parsing "": invalid syntax), expecting ID in Hex
, message: "Error: bad member ID arg (strconv.ParseUint: parsing \"\": invalid syntax), expecting ID in Hex\n"
Warning Unhealthy 3m45s (x91 over 23m) kubelet Readiness probe failed:
How can we get the etcd cluster to work?
mk exec apisix-etcd-0 -it /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
I have no name!@apisix-etcd-0:/opt/bitnami/etcd$ etcdctl member list -w table
{"level":"warn","ts":"2023-01-10T04:25:14.832Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000362700/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Error: context deadline exceeded
I have no name!@apisix-etcd-0:/opt/bitnami/etcd$ etcdctl endpoint status -w table --cluster
{"level":"warn","ts":"2023-01-10T04:25:27.685Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003a08c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded
I have no name!@apisix-etcd-0:/opt/bitnami/etcd$
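Both etcdctl calls above dial the default endpoint 127.0.0.1:2379, which refuses connections because the local etcd process has not started serving yet. To see whether any other member answers, the peers' client URLs can be passed through etcdctl's `--endpoints` flag. The following is a minimal sketch, not something taken from the chart: the helper name `etcd_client_endpoints` is hypothetical, and the StatefulSet name, headless domain, and replica count are copied from the pod description and events above.

```shell
# Build a comma-separated list of client URLs for an etcd StatefulSet.
# Arguments: StatefulSet name, headless service domain, replica count.
# (Hypothetical helper; not part of the Bitnami scripts.)
etcd_client_endpoints() {
  sts_name="$1"
  domain="$2"
  replicas="$3"
  endpoints=""
  i=0
  while [ "$i" -lt "$replicas" ]; do
    url="http://${sts_name}-${i}.${domain}:2379"
    if [ -z "$endpoints" ]; then
      endpoints="$url"
    else
      endpoints="${endpoints},${url}"
    fi
    i=$((i + 1))
  done
  printf '%s\n' "$endpoints"
}

# Values taken from the pod description above.
ENDPOINTS="$(etcd_client_endpoints apisix-etcd apisix-etcd-headless.ingress-apisix.svc.cluster.local 3)"
echo "$ENDPOINTS"

# Inside a pod, the list could then be passed to etcdctl, for example:
#   etcdctl --endpoints="$ENDPOINTS" endpoint health
```

If every endpoint in the list also times out, no member of the cluster is serving, which would be consistent with the repeated "Cluster not healthy" warnings in the container log.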
Based on the error message, it looks like the member ID used by the preStop hook, which runs after the liveness probe fails, is empty rather than a valid hex ID.
Normal Killing 21m kubelet Container etcd failed liveness probe, will be restarted
Warning FailedPreStopHook 21m kubelet Exec lifecycle hook ([/opt/bitnami/scripts/etcd/prestop.sh]) for Container "etcd" in Pod "apisix-etcd-0_ingress-apisix(3ed8842f-ec58-47d4-b315-ce8a79328578)" failed - error: command '/opt/bitnami/scripts/etcd/prestop.sh' exited with 128: Error: bad member ID arg (strconv.ParseUint: parsing "": invalid syntax), expecting ID in Hex
, message: "Error: bad member ID arg (strconv.ParseUint: parsing \"\": invalid syntax), expecting ID in Hex\n"
Yes, it happens automatically, and this is why etcd crashes.
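The `parsing ""` part of the FailedPreStopHook message indicates that an empty string was passed where a hex member ID was expected. As an illustration only, a guard like the one below would catch that case before calling `etcdctl member remove`. This is a sketch under assumptions: the contents of prestop.sh are not shown in this thread, and both the function `is_valid_member_id` and the variable `MEMBER_ID` are hypothetical names.

```shell
# Return 0 if the argument is a non-empty string of hex digits, 1 otherwise.
# (Hypothetical helper; not taken from the Bitnami prestop.sh script.)
is_valid_member_id() {
  case "$1" in
    '' | *[!0-9a-fA-F]*) return 1 ;;
    *) return 0 ;;
  esac
}

# MEMBER_ID is a placeholder; here it is empty, as in the failing hook.
MEMBER_ID=""
if is_valid_member_id "$MEMBER_ID"; then
  echo "would run: etcdctl member remove $MEMBER_ID"
else
  echo "member ID is empty or not hex; skipping removal"
fi
```

Such a guard would only silence the hook error. It would not explain why the pod never joined the cluster and therefore has no member ID in the first place.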