etcd alarms seem to persist beyond the removal of the member they belong to; additionally, the alarms cannot be cleared
What happened?
A cluster ran out of space and was "rebuilt": one instance was started with --force-new-cluster, the others were joined, and that first node was then itself deleted and rejoined to the 3-member etcd cluster. However, we frequently see the following log line, and as a result kubelet will eventually kill the etcd instance (the error shows up on all nodes):
{"level":"warn","ts":"2022-08-25T00:57:44.423Z","caller":"etcdhttp/metrics.go:165","msg":"serving /health false due to an alarm","alarm":"memberID:<redacted, but mapped to a member that no longer exists> alarm:NOSPACE "}
Attempting to list or clear the alarms times out:
ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --endpoints=https://[
{"level":"warn","ts":"2022-08-25T03:08:36.185+0200","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-736942c2-272c-41e8-9268-398c832b8752/[
Other commands work as expected -- e.g. member list returns the expected list of members, none of which match the member the alarms are raised for.
I tried setting the timeouts a fair bit higher, but no luck. The cluster seems stuck in a state where I can't create any keys even with a long timeout, yet endpoint health returns healthy for all members in the list in <5ms. We see no issues with disk performance: the etcd_disk_wal_fsync_duration_seconds_bucket and etcd_disk_backend_commit_duration_seconds_bucket metrics from etcd both report ~800us on every node.
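For reference, a minimal Go sketch of the same list/disarm operations attempted via etcdctl above, using the clientv3 Maintenance API; the endpoint is an assumption and TLS configuration is omitted for brevity (the real cluster uses the kubeadm-managed certificates passed to etcdctl):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"}, // assumed endpoint; TLS config omitted
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// List active alarms -- this is the call that hangs in the report above.
	resp, err := cli.AlarmList(ctx)
	if err != nil {
		log.Fatalf("alarm list failed: %v", err)
	}
	for _, a := range resp.Alarms {
		fmt.Printf("member %x has alarm %v\n", a.MemberID, a.Alarm)
	}

	// Disarm all alarms (the clientv3 equivalent of `etcdctl alarm disarm`).
	if _, err := cli.AlarmDisarm(ctx, &clientv3.AlarmMember{}); err != nil {
		log.Fatalf("alarm disarm failed: %v", err)
	}
}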
What did you expect to happen?
Ability to clear the alarms and not see alarms for etcd members that no longer exist.
How can we reproduce it (as minimally and precisely as possible)?
Anything else we need to know?
No response
Etcd version (please run commands below)
3.5.4
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
$ etcdctl member list -w table
# paste output here
$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here
Relevant log output
No response
Another weird thing we've noticed: a lot of actions seem to time out at 7 seconds, but it's unclear exactly how that timeout is set --
{"level":"warn","ts":"2022-08-25T01:23:24.880Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-08-25T01:23:17.880Z","time spent":"7.000622172s","remote":"[::1]:33658","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""} {"level":"warn","ts":"2022-08-25T01:23:26.110Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-08-25T01:23:19.109Z","time spent":"7.001092567s","remote":"[::1]:33658","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""} {"level":"warn","ts":"2022-08-25T01:23:26.805Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-08-25T01:23:19.804Z","time spent":"7.000708634s","remote":"[::1]:60342","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""}
@gpl Could you provide detailed steps so that others can try to reproduce it locally?
I was able to fix it after poking around in bbolt for a while, although it seems that restoring from snapshots also worked.
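For anyone who ends up in the same state, a minimal sketch of the kind of bbolt inspection described above, assuming the stuck alarms live in the backend's "alarm" bucket and a typical kubeadm data directory (both assumptions; only open the file while etcd is stopped, or work on a copy):

package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Path is an assumption based on a typical kubeadm layout; adjust to your data dir.
	db, err := bolt.Open("/var/lib/etcd/member/snap/db", 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("alarm")) // bucket name assumed from etcd's alarm store
		if b == nil {
			fmt.Println("no alarm bucket found")
			return nil
		}
		return b.ForEach(func(k, v []byte) error {
			// Keys are protobuf-encoded AlarmMember entries; dump the raw bytes here.
			fmt.Printf("alarm entry: key=%x value=%x\n", k, v)
			return nil
		})
	}); err != nil {
		log.Fatal(err)
	}
}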
This seems to be either the exact issue or closely related: https://github.com/etcd-io/etcd/issues/14382
Did you see an error log and panic similar to the one below (copied from #14382)? I suggest you provide detailed reproduction steps so that we can double-check.
{"level":"panic","ts":"2022-07-23T18:21:36.261Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot
goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000506300, 0xc000532f40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000010fa0, 0x1234726, 0x2a, 0xc000532f40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/[email protected]/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0xc0000460ca, 0xd, 0x0, 0x0, 0x0, 0x0, 0xc0005150e0, 0x2, 0x2, 0xc00053c480, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc000104000, 0xc000104600, 0x0, 0x0)
/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc000104000, 0x12089be, 0x6, 0xc0004b7201, 0x2)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a1f0, 0x1, 0x1)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a1f0, 0x1, 0x1)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40 +0x13f
The issue may already be resolved in https://github.com/etcd-io/etcd/pull/14419. Please reopen or raise a new ticket if you still see this issue in 3.5.5 or the main branch.