help request: apisix unusable because etcd doesn't start
Description
Hi all, I have a K3s cluster with 3 worker nodes (plus 1 master) and APISIX 2.15.1 installed as a LoadBalancer using the Helm chart.
Every node is a KVM virtual machine on the same host.
After a host crash, the three etcd pods never come back online.
Looking at the logs of the first etcd pod (apisix-etcd-0), I see:
{"level":"warn","ts":"2022-12-23T17:15:46.357Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_DAEMON_USER=etcd"}
{"level":"info","ts":"2022-12-23T17:15:46.357Z","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["etcd"]}
{"level":"warn","ts":"2022-12-23T17:15:46.357Z","caller":"etcdmain/etcd.go:446","msg":"found invalid file under data directory","filename":"member_id","data-dir":"/bitnami/etcd/data"}
{"level":"info","ts":"2022-12-23T17:15:46.357Z","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/bitnami/etcd/data","dir-type":"member"}
{"level":"info","ts":"2022-12-23T17:15:46.357Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["http://0.0.0.0:2380"]}
{"level":"info","ts":"2022-12-23T17:15:46.357Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["http://0.0.0.0:2379"]}
{"level":"info","ts":"2022-12-23T17:15:46.358Z","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.4","git-sha":"08407ff76","go-version":"go1.16.15","go-os":"linux","go-arch":"amd64","max-cpu-set":6,"max-cpu-available":6,"member-initialized":true,"name":"apisix-etcd-0","data-dir":"/bitnami/etcd/data","wal-dir":"","wal-dir-dedicated":"","member-dir":"/bitnami/etcd/data/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380"],"listen-peer-urls":["http://0.0.0.0:2380"],"advertise-client-urls":["http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2379","http://apisix-etcd.apisix.svc.cluster.local:2379"],"listen-client-urls":["http://0.0.0.0:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
{"level":"info","ts":"2022-12-23T17:15:46.358Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/bitnami/etcd/data/member/snap/db","took":"159.119µs"}
{"level":"info","ts":"2022-12-23T17:15:46.473Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":200002,"snapshot-size":"26 kB"}
{"level":"warn","ts":"2022-12-23T17:15:46.474Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":200002,"snapshot-file-path":"/bitnami/etcd/data/member/snap/0000000000030d42.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2022-12-23T17:15:46.474Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot
goroutine 1 [running]:
How can I recover etcd?
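For anyone reading the log above: the file name etcd complains about appears to be the snapshot index written as 16 zero-padded hexadecimal digits. This is inferred from the log line itself (index 200002, file 0000000000030d42.snap.db), and the helper name `snap_db_name` is only illustrative. A minimal sketch to work out which file a given index expects:

```shell
#!/bin/sh
# Illustrative helper (not part of etcd): map a snapshot index to the
# <hex>.snap.db file name that appears in the "failed to find" warning.
snap_db_name() {
  # 16 hex digits, zero-padded, followed by the .snap.db suffix
  printf '%016x.snap.db\n' "$1"
}

# Index taken from the "recovered v2 store from snapshot" log line
snap_db_name 200002
```

The result can be compared with the contents of /bitnami/etcd/data/member/snap on the volume to confirm that the file is really missing.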
UPDATE: workaround
To get rid of the corrupted etcd filesystem I followed these steps (in my use case):
- uninstall apisix with
helm uninstall apisix -n apisix (this removes all apisix resources but not the corrupted filesystem)
- remove the etcd Persistent Volumes
- remove the etcd Persistent Volume Claims
- install apisix again with (in my case)
helm install apisix apisix/apisix -f apisix-values.yaml \
--set ingress-controller.config.apisix.serviceNamespace=apisix \
--set ingress-controller.config.apisix.serviceName=apisix-admin \
--set ingress-controller.config.kubernetes.apisixRouteVersion=apisix.apache.org/v2beta3 \
--namespace apisix
With this I get a working apisix again, but all route and plugin definitions are lost.
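The cleanup part of the steps above could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: the claim names `data-apisix-etcd-0` to `data-apisix-etcd-2` are a guess based on the pod names, so check `kubectl get pvc -n apisix` first. It deletes the claims before looking at the volumes, because whether a volume disappears together with its claim depends on the reclaim policy of the storage class. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch only: prints the commands by default; DRY_RUN=0 really runs them.
NAMESPACE="${NAMESPACE:-apisix}"   # namespace used in this issue
RELEASE="${RELEASE:-apisix}"       # Helm release name used in this issue
DRY_RUN="${DRY_RUN:-1}"

# Echo the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

reset_etcd_storage() {
  # 1. Remove the release; the etcd volumes are left behind.
  run helm uninstall "$RELEASE" -n "$NAMESPACE"

  # 2. Delete the etcd claims. ASSUMPTION: claims are named
  #    data-<release>-etcd-<n>; adjust to what `kubectl get pvc` shows.
  for i in 0 1 2; do
    run kubectl delete pvc "data-$RELEASE-etcd-$i" -n "$NAMESPACE"
  done

  # 3. List the volumes that are still there, so leftovers can be
  #    removed by hand with `kubectl delete pv <name>`.
  run kubectl get pv
}

reset_etcd_storage
```

After that, the helm install command shown above recreates the release with empty etcd volumes.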
Environment
- APISIX version (run apisix version):
root@apisix-64fffcfb4c-55vhw:/usr/local/apisix# apisix version
/usr/local/openresty/luajit/bin/luajit ./apisix/cli/apisix.lua version
2.15.1
- Operating system (run uname -a):
root@apisix-64fffcfb4c-55vhw:/usr/local/apisix# uname -a
Linux apisix-64fffcfb4c-55vhw 5.15.0-53-generic #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 x86_64 GNU/Linux
- OpenResty / Nginx version (run openresty -V or nginx -V):
- etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info): 8.3.4 (from the helm chart)
- APISIX Dashboard version, if relevant: 2.13.0
- Plugin runner version, for issues related to plugin runners:
- LuaRocks version, for installation issues (run luarocks --version):
Isn't this a completely unrelated generic problem? I mean, host crashes can happen anytime anywhere.
Hi @shreemaan-abhishek, the crashes are "normal", but in this case the corrupted etcd required a complete wipe of the etcd volumes.
BTW, moving the virtual machines' virtual disks from a hard drive to an SSD solved the problem, and I have had no more etcd corruption since.
Moreover, I'm currently using APISIX 3.x, so I don't know whether the problem still occurs.
I don't think apisix can do anything to avoid data corruption. 🤔
This issue has been marked as stale due to 350 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@apisix.apache.org list. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.