bbolt Index out of range on etcd single node start-up

Currently running into an index out of range while starting up etcd after a restart of my node. etcd dies after this and the issue persists after it restarts. Any advice on this?

docker logs 763136fa1953
[WARNING] Deprecated '--debug' flag is set to true (use '--log-level=debug' instead
[WARNING] Deprecated '--debug' flag is set to true with inconsistent '--log-level=info' flag
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2020-07-22 16:50:06.917341 I | etcdmain: etcd Version: 3.4.3
2020-07-22 16:50:06.917370 I | etcdmain: Git SHA: 3cf2f69b5
2020-07-22 16:50:06.917373 I | etcdmain: Go Version: go1.12.12
2020-07-22 16:50:06.917375 I | etcdmain: Go OS/Arch: linux/amd64
2020-07-22 16:50:06.917378 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2020-07-22 16:50:06.917419 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--debug' flag is set to true (use '--log-level=debug' instead
[WARNING] Deprecated '--debug' flag is set to true with inconsistent '--log-level=info' flag
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2020-07-22 16:50:06.917448 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file =
2020-07-22 16:50:06.917986 I | embed: name = mini-master
2020-07-22 16:50:06.917993 I | embed: data dir = /var/lib/etcd
2020-07-22 16:50:06.917997 I | embed: member dir = /var/lib/etcd/member
2020-07-22 16:50:06.917999 I | embed: heartbeat = 100ms
2020-07-22 16:50:06.918002 I | embed: election = 1000ms
2020-07-22 16:50:06.918004 I | embed: snapshot count = 10000
2020-07-22 16:50:06.918011 I | embed: advertise client URLs = https://192.168.20.23:2379
2020-07-22 16:50:06.918015 I | embed: initial advertise peer URLs = https://192.168.20.23:2380
2020-07-22 16:50:06.918019 I | embed: initial cluster =
panic: runtime error: index out of range

goroutine 160 [running]:
go.etcd.io/bbolt.(*DB).page(...)
	/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/db.go:886
go.etcd.io/bbolt.(*Tx).page(...)
	/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/tx.go:620
go.etcd.io/bbolt.(*Tx).forEachPage(0xc0000ea380, 0x80000000df, 0x1, 0xc00024f1e8)
	/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/tx.go:625 +0x13a
go.etcd.io/bbolt.(*Tx).forEachPage(0xc0000ea380, 0x7e, 0x0, 0xc00024f1e8)
	/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/tx.go:634 +0xd4
go.etcd.io/bbolt.(*Tx).checkBucket(0xc0000ea380, 0xc00025d900, 0xc00024f4b8, 0xc00024f488, 0xc0000847e0)
	/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/tx.go:465 +0xd4
go.etcd.io/bbolt.(*Tx).checkBucket.func2(0x7fe59c2ba1b9, 0x3, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0)
	/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/tx.go:490 +0xc0
go.etcd.io/bbolt.(*Bucket).ForEach(0xc0000ea398, 0xc00024f370, 0x0, 0xc00024f340)
	/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/bucket.go:388 +0xfe
go.etcd.io/bbolt.(*Tx).checkBucket(0xc0000ea380, 0xc0000ea398, 0xc00024f4b8, 0xc00024f488, 0xc0000847e0)
	/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/tx.go:488 +0x146
go.etcd.io/bbolt.(*DB).freepages(0xc0000e4600, 0x0, 0x0, 0x0)
	/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/db.go:1006 +0x1f8
go.etcd.io/bbolt.(*DB).loadFreelist.func1()
	/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/db.go:313 +0x10e
sync.(*Once).Do(0xc0000e4768, 0xc00024f600)
	/usr/local/go/src/sync/once.go:44 +0xb3
go.etcd.io/bbolt.(*DB).loadFreelist(0xc0000e4600)
	/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/db.go:309 +0x56
go.etcd.io/bbolt.Open(0xc00029a340, 0x1c, 0x180, 0xc0000948c8, 0x44402c, 0x100000000000020, 0x0)
	/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/db.go:286 +0x326
go.etcd.io/etcd/mvcc/backend.newBackend(0xc00029a340, 0x1c, 0x5f5e100, 0x2710, 0x1082f11, 0x5, 0x280000000, 0x0, 0x0)
	/tmp/etcd-release-3.4.3/etcd/release/etcd/mvcc/backend/backend.go:154 +0xfe
go.etcd.io/etcd/mvcc/backend.New(...)
	/tmp/etcd-release-3.4.3/etcd/release/etcd/mvcc/backend/backend.go:137
go.etcd.io/etcd/etcdserver.newBackend(0x7ffee5344e67, 0xb, 0x0, 0x0, 0x0, 0x0, 0xc0000ef280, 0x1, 0x1, 0xc0000ef400, ...)
	/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdserver/backend.go:52 +0x1a0
go.etcd.io/etcd/etcdserver.openBackend.func1(0xc000084780, 0xc00029c2c0)
	/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdserver/backend.go:73 +0x95
created by go.etcd.io/etcd/etcdserver.openBackend
	/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdserver/backend.go:72 +0x110

Jul 22 '20 17:07 azye

Could you please patch this PR into your local client of bbolt:
https://github.com/etcd-io/bbolt/pull/225

And run ./cmd/main check on the /member/snap/db file.

I created the expanded check because I was debugging issue with the same syndromes. In my case it was corrupted btree invariant.

Jul 22 '20 18:07 ptabor

do you know what would cause the etcd data to get corrupted?

Jul 22 '20 23:07 azye

I didn't managed to repro it or root reason (despite pretty deep analysis of bbolt code) . As it was one-off at large scale I assumed "cosmic rays" that caused some memory corruption.

In my case there were 2 symptoms:

change of key index on branch page. The page used to have proper index in parent page but at random transaction it got the one of previous page.
the file size has grown for ~14MB without good reason (many empty pages)

It's why I got interested whether your corruption is of the same type.

Jul 23 '20 07:07 ptabor

@ptabor how would i go about running your patch on a running kubernetes cluster?

Jul 23 '20 18:07 azye

So it failed once and latter kept running ? In my case it was permanent corruption, so it's likely a different case.

You can copy the file 'live' from the etcd directory or use etcdctl snapschot command to download the snapshot. But if it's live and not crashing I don't expect the checker to find anything.

Jul 23 '20 18:07 ptabor

sorry, i meant the cluster is not "running". I'm unable to reach the kubernetes cluster since etcd is crashing, but some k8s containers are still running on docker. @ptabor

Jul 23 '20 18:07 azye

panic: runtime error: index out of range

goroutine 160 [running]: go.etcd.io/bbolt.(*DB).page(...) /home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/db.go:886 go.etcd.io/bbolt.(*Tx).page(...) /home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/tx.go:620

The reason should be that the pageId was somehow corrupted ( should be a huge value in this case). db.go#L884

May 26 '23 06:05 ahrtr