Index out of range on etcd single node start-up
Currently running into an index out of range while starting up etcd after a restart of my node. etcd dies after this and the issue persists after it restarts. Any advice on this?
docker logs 763136fa1953
[WARNING] Deprecated '--debug' flag is set to true (use '--log-level=debug' instead
[WARNING] Deprecated '--debug' flag is set to true with inconsistent '--log-level=info' flag
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2020-07-22 16:50:06.917341 I | etcdmain: etcd Version: 3.4.3
2020-07-22 16:50:06.917370 I | etcdmain: Git SHA: 3cf2f69b5
2020-07-22 16:50:06.917373 I | etcdmain: Go Version: go1.12.12
2020-07-22 16:50:06.917375 I | etcdmain: Go OS/Arch: linux/amd64
2020-07-22 16:50:06.917378 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2020-07-22 16:50:06.917419 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--debug' flag is set to true (use '--log-level=debug' instead
[WARNING] Deprecated '--debug' flag is set to true with inconsistent '--log-level=info' flag
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2020-07-22 16:50:06.917448 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file =
2020-07-22 16:50:06.917986 I | embed: name = mini-master
2020-07-22 16:50:06.917993 I | embed: data dir = /var/lib/etcd
2020-07-22 16:50:06.917997 I | embed: member dir = /var/lib/etcd/member
2020-07-22 16:50:06.917999 I | embed: heartbeat = 100ms
2020-07-22 16:50:06.918002 I | embed: election = 1000ms
2020-07-22 16:50:06.918004 I | embed: snapshot count = 10000
2020-07-22 16:50:06.918011 I | embed: advertise client URLs = https://192.168.20.23:2379
2020-07-22 16:50:06.918015 I | embed: initial advertise peer URLs = https://192.168.20.23:2380
2020-07-22 16:50:06.918019 I | embed: initial cluster =
panic: runtime error: index out of range
goroutine 160 [running]:
go.etcd.io/bbolt.(*DB).page(...)
/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/db.go:886
go.etcd.io/bbolt.(*Tx).page(...)
/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/tx.go:620
go.etcd.io/bbolt.(*Tx).forEachPage(0xc0000ea380, 0x80000000df, 0x1, 0xc00024f1e8)
/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/tx.go:625 +0x13a
go.etcd.io/bbolt.(*Tx).forEachPage(0xc0000ea380, 0x7e, 0x0, 0xc00024f1e8)
/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/tx.go:634 +0xd4
go.etcd.io/bbolt.(*Tx).checkBucket(0xc0000ea380, 0xc00025d900, 0xc00024f4b8, 0xc00024f488, 0xc0000847e0)
/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/tx.go:465 +0xd4
go.etcd.io/bbolt.(*Tx).checkBucket.func2(0x7fe59c2ba1b9, 0x3, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/tx.go:490 +0xc0
go.etcd.io/bbolt.(*Bucket).ForEach(0xc0000ea398, 0xc00024f370, 0x0, 0xc00024f340)
/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/bucket.go:388 +0xfe
go.etcd.io/bbolt.(*Tx).checkBucket(0xc0000ea380, 0xc0000ea398, 0xc00024f4b8, 0xc00024f488, 0xc0000847e0)
/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/tx.go:488 +0x146
go.etcd.io/bbolt.(*DB).freepages(0xc0000e4600, 0x0, 0x0, 0x0)
/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/db.go:1006 +0x1f8
go.etcd.io/bbolt.(*DB).loadFreelist.func1()
/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/db.go:313 +0x10e
sync.(*Once).Do(0xc0000e4768, 0xc00024f600)
/usr/local/go/src/sync/once.go:44 +0xb3
go.etcd.io/bbolt.(*DB).loadFreelist(0xc0000e4600)
/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/db.go:309 +0x56
go.etcd.io/bbolt.Open(0xc00029a340, 0x1c, 0x180, 0xc0000948c8, 0x44402c, 0x100000000000020, 0x0)
/home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/db.go:286 +0x326
go.etcd.io/etcd/mvcc/backend.newBackend(0xc00029a340, 0x1c, 0x5f5e100, 0x2710, 0x1082f11, 0x5, 0x280000000, 0x0, 0x0)
/tmp/etcd-release-3.4.3/etcd/release/etcd/mvcc/backend/backend.go:154 +0xfe
go.etcd.io/etcd/mvcc/backend.New(...)
/tmp/etcd-release-3.4.3/etcd/release/etcd/mvcc/backend/backend.go:137
go.etcd.io/etcd/etcdserver.newBackend(0x7ffee5344e67, 0xb, 0x0, 0x0, 0x0, 0x0, 0xc0000ef280, 0x1, 0x1, 0xc0000ef400, ...)
/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdserver/backend.go:52 +0x1a0
go.etcd.io/etcd/etcdserver.openBackend.func1(0xc000084780, 0xc00029c2c0)
/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdserver/backend.go:73 +0x95
created by go.etcd.io/etcd/etcdserver.openBackend
/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdserver/backend.go:72 +0x110
Could you please patch this PR into your local client of bbolt:
https://github.com/etcd-io/bbolt/pull/225
And run ./cmd/main check on the /member/snap/db file.
I created the expanded check because I was debugging issue with the same syndromes. In my case it was corrupted btree invariant.
do you know what would cause the etcd data to get corrupted?
I didn't managed to repro it or root reason (despite pretty deep analysis of bbolt code) . As it was one-off at large scale I assumed "cosmic rays" that caused some memory corruption.
In my case there were 2 symptoms:
- change of key index on branch page. The page used to have proper index in parent page but at random transaction it got the one of previous page.
- the file size has grown for ~14MB without good reason (many empty pages)
It's why I got interested whether your corruption is of the same type.
@ptabor how would i go about running your patch on a running kubernetes cluster?
So it failed once and latter kept running ? In my case it was permanent corruption, so it's likely a different case.
You can copy the file 'live' from the etcd directory or use etcdctl snapschot command to download the snapshot. But if it's live and not crashing I don't expect the checker to find anything.
sorry, i meant the cluster is not "running". I'm unable to reach the kubernetes cluster since etcd is crashing, but some k8s containers are still running on docker. @ptabor
panic: runtime error: index out of range
goroutine 160 [running]: go.etcd.io/bbolt.(*DB).page(...) /home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/db.go:886 go.etcd.io/bbolt.(*Tx).page(...) /home/ec2-user/go/pkg/mod/go.etcd.io/[email protected]/tx.go:620
The reason should be that the pageId was somehow corrupted ( should be a huge value in this case). db.go#L884