apisix help request: apisix unusable because etcd doesn't start

Description

Hi all, I have a K3S cluster with 3 worker nodes (plus 1 master) and Apisix 2.15.1 installed as a LoadBalancer service using the helm chart.

Every node is a KVM virtual machine on the same host.

After a host crash, the three etcd pods never come back online.

Looking at the logs of the first etcd pod (apisix-etcd-0), I see:

{"level":"warn","ts":"2022-12-23T17:15:46.357Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_DAEMON_USER=etcd"}
{"level":"info","ts":"2022-12-23T17:15:46.357Z","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["etcd"]}
{"level":"warn","ts":"2022-12-23T17:15:46.357Z","caller":"etcdmain/etcd.go:446","msg":"found invalid file under data directory","filename":"member_id","data-dir":"/bitnami/etcd/data"}
{"level":"info","ts":"2022-12-23T17:15:46.357Z","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/bitnami/etcd/data","dir-type":"member"}
{"level":"info","ts":"2022-12-23T17:15:46.357Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["http://0.0.0.0:2380"]}
{"level":"info","ts":"2022-12-23T17:15:46.357Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["http://0.0.0.0:2379"]}
{"level":"info","ts":"2022-12-23T17:15:46.358Z","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.4","git-sha":"08407ff76","go-version":"go1.16.15","go-os":"linux","go-arch":"amd64","max-cpu-set":6,"max-cpu-available":6,"member-initialized":true,"name":"apisix-etcd-0","data-dir":"/bitnami/etcd/data","wal-dir":"","wal-dir-dedicated":"","member-dir":"/bitnami/etcd/data/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380"],"listen-peer-urls":["http://0.0.0.0:2380"],"advertise-client-urls":["http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2379","http://apisix-etcd.apisix.svc.cluster.local:2379"],"listen-client-urls":["http://0.0.0.0:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
{"level":"info","ts":"2022-12-23T17:15:46.358Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/bitnami/etcd/data/member/snap/db","took":"159.119µs"}
{"level":"info","ts":"2022-12-23T17:15:46.473Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":200002,"snapshot-size":"26 kB"}
{"level":"warn","ts":"2022-12-23T17:15:46.474Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":200002,"snapshot-file-path":"/bitnami/etcd/data/member/snap/0000000000030d42.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2022-12-23T17:15:46.474Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot
goroutine 1 [running]:
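
For anyone reading the log above: the file named in the "failed to find [SNAPSHOT-INDEX].snap.db" warning appears to be just the snapshot index (200002) written as 16 zero-padded hex digits. A minimal sketch to reproduce the name, so you can check whether that file exists under /bitnami/etcd/data/member/snap (the helper name `snap_db_name` is made up for illustration and is not part of etcd):

```shell
#!/bin/sh
# Hypothetical helper (not an etcd tool): print the snapshot DB file name
# that corresponds to a decimal snapshot index from the log.
snap_db_name() {
  # 16 hex digits, zero padded, followed by ".snap.db"
  printf '%016x.snap.db\n' "$1"
}

# snapshot-index taken from the log in this issue
snap_db_name 200002
```

With the index from this log it prints 0000000000030d42.snap.db, the same name as in the warning.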

How can I recover etcd?

UPDATE: workaround

To get rid of the corrupted etcd filesystem I followed these steps (in my use case):

uninstall apisix with

helm uninstall apisix -n apisix (this removes all apisix resources but not the corrupted filesystem)

remove the etcd Persistent Volumes (PVs)
remove the etcd Persistent Volume Claims (PVCs)
install apisix again with (in my case)

helm install apisix apisix/apisix -f apisix-values.yaml \
--set ingress-controller.config.apisix.serviceNamespace=apisix \
--set ingress-controller.config.apisix.serviceName=apisix-admin \
--set ingress-controller.config.kubernetes.apisixRouteVersion=apisix.apache.org/v2beta3 \
--namespace apisix

With this I get a working apisix again, but all definitions are lost.
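
The cleanup part of the steps above could be sketched roughly as follows. This is only a sketch under assumptions, not a tested procedure: the PVC names (`data-apisix-etcd-0` and so on) are my guess at what the chart creates, so check them with `kubectl get pvc -n apisix` first. PV names are generated, so they are not handled here; if your PVs remain after the PVCs are gone, delete them by the names shown by `kubectl get pv`. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the cleanup steps. By default it only PRINTS the commands;
# set DRY_RUN=0 to really run them (this destroys all etcd data).
NAMESPACE="${NAMESPACE:-apisix}"   # namespace used in this issue
RELEASE="${RELEASE:-apisix}"       # helm release name used in this issue
REPLICAS="${REPLICAS:-3}"          # three etcd pods in this cluster

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

wipe_etcd_volumes() {
  # Removes the apisix resources but not the corrupted volumes.
  run helm uninstall "$RELEASE" -n "$NAMESPACE"
  i=0
  while [ "$i" -lt "$REPLICAS" ]; do
    # ASSUMPTION: PVC names follow data-<release>-etcd-<ordinal>.
    run kubectl delete pvc "data-$RELEASE-etcd-$i" -n "$NAMESPACE"
    i=$((i + 1))
  done
}

wipe_etcd_volumes
```

Run it once as-is to review the printed commands, then again with DRY_RUN=0 if they match your cluster. After that, the helm install command above recreates apisix with empty etcd volumes.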

Environment

APISIX version (run apisix version):

root@apisix-64fffcfb4c-55vhw:/usr/local/apisix# apisix version
/usr/local/openresty/luajit/bin/luajit ./apisix/cli/apisix.lua version
2.15.1
root@apisix-64fffcfb4c-55vhw:/usr/local/apisix#

Operating system (run uname -a):

root@apisix-64fffcfb4c-55vhw:/usr/local/apisix# uname -a
Linux apisix-64fffcfb4c-55vhw 5.15.0-53-generic #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 x86_64 GNU/Linux
root@apisix-64fffcfb4c-55vhw:/usr/local/apisix#

OpenResty / Nginx version (run openresty -V or nginx -V):
etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info): 8.3.4 (from the helm chart; the etcd pod log above reports etcd-version 3.5.4)
APISIX Dashboard version, if relevant: 2.13.0
Plugin runner version, for issues related to plugin runners:
LuaRocks version, for installation issues (run luarocks --version):

Dec 23 '22 17:12 MirtoBusico

Isn't this a completely unrelated generic problem? I mean, host crashes can happen anytime anywhere.

Sep 06 '23 11:09 shreemaan-abhishek

Hi @shreemaan-abhishek, the crashes are "normal", but in this case the corrupted etcd required a complete wipe of the etcd volumes.

BTW, moving the virtual machines' vdisks from a hard drive to an SSD solved the problem, and I have had no more etcd corruption since.

Moreover, I'm now using Apisix 3.X, so I don't know whether the problem is still present.

Sep 06 '23 18:09 MirtoBusico

I don't think apisix can do anything to avoid data corruption. 🤔

Sep 07 '23 02:09 shreemaan-abhishek

This issue has been marked as stale due to 350 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@apisix.apache.org list. Thank you for your contributions.

Aug 22 '24 10:08 github-actions[bot]

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

Sep 06 '24 10:09 github-actions[bot]