etcd alarms seem to persist beyond the removal of the member they belong to; additionally cannot clear alarms

gpl opened this issue 3 years ago · 4 comments

What happened?

A cluster ran out of space and was "rebuilt": one instance was started with --force-new-cluster, the other members were joined to it, and that node itself was then removed and rejoined to the resulting 3-member etcd cluster. However, the following log line now appears frequently on all nodes, and as a result kubelet eventually kills the etcd instance:

{"level":"warn","ts":"2022-08-25T00:57:44.423Z","caller":"etcdhttp/metrics.go:165","msg":"serving /health false due to an alarm","alarm":"memberID:<redacted, but mapped to a member that no longer exists> alarm:NOSPACE "}

Attempting to list or clear the alarms ends up timing out.

ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --endpoints=https://[]:2379 alarm list

{"level":"warn","ts":"2022-08-25T03:08:36.185+0200","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-736942c2-272c-41e8-9268-398c832b8752/[]:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} Error: context deadline exceeded

Other commands work as expected -- e.g. member list returns the expected list of members, none of which match the member the alarm is raised for.

We tried setting the timeouts a fair bit higher, but no luck. The cluster seems to be stuck in a state where we can't create any keys even with a long timeout, yet endpoint health reports every member healthy in under 5ms. We also see no issues with disk performance: the etcd_disk_wal_fsync_duration_seconds_bucket and etcd_disk_backend_commit_duration_seconds_bucket metrics from etcd both report ~800us on every node.

What did you expect to happen?

Ability to clear the alarms and not see alarms for etcd members that no longer exist.
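
For context, the documented way to clear a NOSPACE alarm (after reclaiming space with compaction and defragmentation) is alarm disarm. Below is a minimal clientv3 sketch of the same calls etcdctl makes -- the endpoint is illustrative, and a real cluster would also need the TLS material from the command above:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"}, // illustrative endpoint; add TLS for a real cluster
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// List raised alarms; this is the same RPC that times out above.
	resp, err := cli.AlarmList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, a := range resp.Alarms {
		fmt.Printf("member %x: %s\n", a.MemberID, a.Alarm)
	}

	// An empty AlarmMember disarms every alarm in the cluster,
	// equivalent to `etcdctl alarm disarm`.
	if _, err := cli.AlarmDisarm(ctx, &clientv3.AlarmMember{}); err != nil {
		log.Fatal(err)
	}
}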

How can we reproduce it (as minimally and precisely as possible)?

Anything else we need to know?

No response

Etcd version (please run commands below)

3.5.4

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

gpl avatar Aug 25 '22 01:08 gpl

Another weird thing we've noticed -- a lot of actions seem to time out at exactly 7 seconds, but it's unclear where that timeout comes from:

{"level":"warn","ts":"2022-08-25T01:23:24.880Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-08-25T01:23:17.880Z","time spent":"7.000622172s","remote":"[::1]:33658","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""} {"level":"warn","ts":"2022-08-25T01:23:26.110Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-08-25T01:23:19.109Z","time spent":"7.001092567s","remote":"[::1]:33658","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""} {"level":"warn","ts":"2022-08-25T01:23:26.805Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-08-25T01:23:19.804Z","time spent":"7.000708634s","remote":"[::1]:60342","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""}

gpl avatar Aug 25 '22 01:08 gpl

@gpl Could you provide detailed steps so that others can try to reproduce it locally?

ahrtr avatar Sep 07 '22 08:09 ahrtr

@gpl Could you provide detailed steps so that others can try to reproduce it locally?

I was able to fix it after poking around in bbolt for a while, although restoring from a snapshot also seems to have worked.
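
In case it helps others: etcd persists raised alarms in the alarm bucket of the bolt backend (member/snap/db), so with etcd stopped you can check whether an entry for the removed member is still present. A minimal read-only sketch, assuming the default data dir:

package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Stop etcd first; the path assumes the default data dir.
	db, err := bolt.Open("/var/lib/etcd/member/snap/db", 0600,
		&bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("alarm"))
		if b == nil {
			fmt.Println("no alarm bucket")
			return nil
		}
		// Keys are protobuf-encoded AlarmMember records; even the raw
		// bytes are enough to spot an entry for a long-gone member ID.
		return b.ForEach(func(k, v []byte) error {
			fmt.Printf("alarm key: %x\n", k)
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}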

This seems to be either the exact issue or closely related: https://github.com/etcd-io/etcd/issues/14382

gpl avatar Sep 20 '22 19:09 gpl

This seems to be either the exact issue or closely related: #14382

Did you see an error log and panic similar to the one below (copied from #14382)? I suggest you provide detailed reproduction steps so that we can confirm whether it's the same issue.

{"level":"panic","ts":"2022-07-23T18:21:36.261Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot
goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000506300, 0xc000532f40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000010fa0, 0x1234726, 0x2a, 0xc000532f40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/[email protected]/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0xc0000460ca, 0xd, 0x0, 0x0, 0x0, 0x0, 0xc0005150e0, 0x2, 0x2, 0xc00053c480, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc000104000, 0xc000104600, 0x0, 0x0)
/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc000104000, 0x12089be, 0x6, 0xc0004b7201, 0x2)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a1f0, 0x1, 0x1)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a1f0, 0x1, 0x1)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40 +0x13f

ahrtr avatar Sep 20 '22 21:09 ahrtr

This issue may already be resolved by https://github.com/etcd-io/etcd/pull/14419. Please reopen this ticket or raise a new one if you still see the issue in 3.5.5 or the main branch.

ahrtr avatar Sep 30 '22 21:09 ahrtr