etcd
etcd copied to clipboard
Handle disk corruption
What happened?
With etcd 3.5.3, running with --experimental-initial-corrupt-check
, starting etcd with disk files that have had a few (p(flip) =~ 0.001) random bit flips can cause all kinds of exciting behavior. Sometimes the corruption check detects the checksum mismatch and panics properly:
{"level":"panic","ts":"2022-06-09T13:27:38.646-0400","caller":"backend/backend.go:189","msg":"failed to open database","path":"n1.etcd/member/snap/db","error":"checksum error","stacktrace":"go.etcd.io/etcd/server/v3/mvcc/backend.newBackend\n\t/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/backend/backend.go:189\ngo.etcd.io/etcd/server/v3/mvcc/backend.New\n\t/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/backend/backend.go:163\ngo.etcd.io/etcd/server/v3/etcdserver.newBackend\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/backend.go:55\ngo.etcd.io/etcd/server/v3/etcdserver.openBackend.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/backend.go:76"}
panic: failed to open database
Or
{"level":"fatal","ts":"2022-06-09T13:37:43.837-0400","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"walpb: crc mismatch","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
However, we can also induce SIGSEGV and SIGBUS crashes. Here's an example Jepsen run with full logs and tarballs of the data dir available if you'd like to see for yourself. Here's a segfault:
{"level":"info","ts":"2022-06-09T13:38:32.345-0400","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.3","git-sha":"0452feec7","go-version":"go1.16.15","go-os":"linux","go-arch":"amd64","max-cpu-set":48,"max-cpu-available":48,"member-initialized":true,"name":"n2","data-dir":"n2.etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"n2.etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://192.168.122.102:2380"],"listen-peer-urls":["http://192.168.122.102:2380"],"advertise-client-urls":["http://192.168.122.102:2379"],"listen-client-urls":["http://192.168.122.102:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"existing","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
unexpected fault address 0x7f73f56db000
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x7f73f56db000 pc=0x9df702]
goroutine 197 [running]:
runtime.throw(0x120762d, 0x5)
/go/gos/go1.16.15/src/runtime/panic.go:1117 +0x72 fp=0xc00001ff80 sp=0xc00001ff50 pc=0x4385b2
runtime.sigpanic()
/go/gos/go1.16.15/src/runtime/signal_unix.go:741 +0x268 fp=0xc00001ffb8 sp=0xc00001ff80 pc=0x44fdc8
go.etcd.io/bbolt.(*Tx).checkBucket.func1(0x7f73f56db000, 0x0)
/go/pkg/mod/go.etcd.io/[email protected]/tx.go:465 +0x62 fp=0xc000020080 sp=0xc00001ffb8 pc=0x9df702
go.etcd.io/bbolt.(*Tx).forEachPage(0xc0004c41c0, 0x2000000, 0x0, 0xc000020118)
/go/pkg/mod/go.etcd.io/[email protected]/tx.go:625 +0x89 fp=0xc0000200c8 sp=0xc000020080 pc=0x9ddf89
go.etcd.io/bbolt.(*Tx).checkBucket(0xc0004c41c0, 0xc0000e8480, 0xc0000203c0, 0xc000020390, 0xc0006ac180)
/go/pkg/mod/go.etcd.io/[email protected]/tx.go:464 +0xd4 fp=0xc000020158 sp=0xc0000200c8 pc=0x9dd6b4
go.etcd.io/bbolt.(*Tx).checkBucket.func2(0x7f53f56df0d5, 0x4, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/pkg/mod/go.etcd.io/[email protected]/tx.go:489 +0xc5 fp=0xc0000201b0 sp=0xc000020158 pc=0x9dfc45
go.etcd.io/bbolt.(*Bucket).ForEach(0xc0004c41d8, 0xc000020240, 0x0, 0xc000020270)
/go/pkg/mod/go.etcd.io/[email protected]/bucket.go:390 +0x103 fp=0xc000020220 sp=0xc0000201b0 pc=0x9ca883
go.etcd.io/bbolt.(*Tx).checkBucket(0xc0004c41c0, 0xc0004c41d8, 0xc0000203c0, 0xc000020390, 0xc0006ac180)
/go/pkg/mod/go.etcd.io/[email protected]/tx.go:487 +0x146 fp=0xc0000202b0 sp=0xc000020220 pc=0x9dd726
go.etcd.io/bbolt.(*DB).freepages(0xc0000fe480, 0x0, 0x0, 0x0)
/go/pkg/mod/go.etcd.io/[email protected]/db.go:1059 +0x210 fp=0xc0000204a8 sp=0xc0000202b0 pc=0x9d2390
go.etcd.io/bbolt.(*DB).loadFreelist.func1()
/go/pkg/mod/go.etcd.io/[email protected]/db.go:320 +0x114 fp=0xc0000204e0 sp=0xc0000204a8 pc=0x9dee14
sync.(*Once).doSlow(0xc0000fe5f0, 0xc000020540)
/go/gos/go1.16.15/src/sync/once.go:68 +0xec fp=0xc000020530 sp=0xc0000204e0 pc=0x47d94c
sync.(*Once).Do(...)
/go/gos/go1.16.15/src/sync/once.go:59
go.etcd.io/bbolt.(*DB).loadFreelist(0xc0000fe480)
/go/pkg/mod/go.etcd.io/[email protected]/db.go:316 +0x6a fp=0xc000020560 sp=0xc000020530 pc=0x9cefaa
go.etcd.io/bbolt.Open(0xc000045098, 0x16, 0x180, 0xc000804540, 0x7f569c7ac108, 0x18, 0xc000045098)
/go/pkg/mod/go.etcd.io/[email protected]/db.go:293 +0x3af fp=0xc000021620 sp=0xc000020560 pc=0x9cea4f
go.etcd.io/etcd/server/v3/mvcc/backend.newBackend(0xc000045098, 0x16, 0x5f5e100, 0x2710, 0x1209b04, 0x7, 0x280000000, 0xc000806000, 0x0, 0x1386c00, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/backend/backend.go:187 +0x117 fp=0xc000021828 sp=0xc000021620 pc=0xb78e57
go.etcd.io/etcd/server/v3/mvcc/backend.New(...)
/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/backend/backend.go:163
go.etcd.io/etcd/server/v3/etcdserver.newBackend(0x7ffeb005f9ee, 0x2, 0x0, 0x0, 0x0, 0x0, 0xc000562510, 0x1, 0x1, 0xc000562750, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/backend.go:55 +0x1f8 fp=0xc0000219c0 sp=0xc000021828 pc=0xd035d8
go.etcd.io/etcd/server/v3/etcdserver.openBackend.func1(0xc0006ac120, 0xc000152600, 0x1386c00, 0xc0006a82d0)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/backend.go:76 +0x98 fp=0xc000021fc0 sp=0xc0000219c0 pc=0xd46f98
runtime.goexit()
/go/gos/go1.16.15/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc000021fc8 sp=0xc000021fc0 pc=0x4722a1
created by go.etcd.io/etcd/server/v3/etcdserver.openBackend
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/backend.go:75 +0x12b
And a SIGBUS:
{"level":"info","ts":"2022-06-09T13:40:58.556-0400","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.3","git-sha":"0452feec7","go-version":"go1.16.15","go-os":"linux","go-arch":"amd64","max-cpu-set":48,"max-cpu-available":48,"member-initialized":true,"name":"n2","data-dir":"n2.etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"n2.etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://192.168.122.102:2380"],"listen-peer-urls":["http://192.168.122.102:2380"],"advertise-client-urls":["http://192.168.122.102:2379"],"listen-client-urls":["http://192.168.122.102:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"existing","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
unexpected fault address 0x7fbbc9e400d5
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7fbbc9e400d5 pc=0x40256c]
goroutine 188 [running]:
runtime.throw(0x120762d, 0x5)
/go/gos/go1.16.15/src/runtime/panic.go:1117 +0x72 fp=0xc000179e28 sp=0xc000179df8 pc=0x4385b2
runtime.sigpanic()
/go/gos/go1.16.15/src/runtime/signal_unix.go:731 +0x2c8 fp=0xc000179e60 sp=0xc000179e28 pc=0x44fe28
cmpbody()
/go/gos/go1.16.15/src/internal/bytealg/compare_amd64.s:115 +0xec fp=0xc000179e68 sp=0xc000179e60 pc=0x40256c
bytes.Compare(...)
/go/gos/go1.16.15/src/bytes/bytes.go:27
go.etcd.io/bbolt.(*Cursor).nsearch.func2(0x1, 0x1)
/go/pkg/mod/go.etcd.io/[email protected]/cursor.go:328 +0x9a fp=0xc000179eb0 sp=0xc000179e68 pc=0x9dec1a
sort.Search(0xa, 0xc000179f50, 0x18)
/go/gos/go1.16.15/src/sort/search.go:66 +0x58 fp=0xc000179ee8 sp=0xc000179eb0 pc=0x49f698
go.etcd.io/bbolt.(*Cursor).nsearch(0xc00017a128, 0x7fbbc7e400b0, 0x5, 0x5)
/go/pkg/mod/go.etcd.io/[email protected]/cursor.go:327 +0x105 fp=0xc000179f98 sp=0xc000179ee8 pc=0x9cdec5
go.etcd.io/bbolt.(*Cursor).search(0xc00017a128, 0x7fbbc7e400b0, 0x5, 0x5, 0x4)
/go/pkg/mod/go.etcd.io/[email protected]/cursor.go:257 +0x1d4 fp=0xc00017a030 sp=0xc000179f98 pc=0x9cd8f4
go.etcd.io/bbolt.(*Cursor).seek(0xc00017a128, 0x7fbbc7e400b0, 0x5, 0x5, 0x4511e5, 0xc00000c660, 0x1ab0ff8, 0x0, 0xc00000c660, 0x0, ...)
/go/pkg/mod/go.etcd.io/[email protected]/cursor.go:159 +0x7d fp=0xc00017a0a0 sp=0xc00017a030 pc=0x9cd01d
go.etcd.io/bbolt.(*Bucket).Bucket(0xc000152398, 0x7fbbc7e400b0, 0x5, 0x5, 0x450fac)
/go/pkg/mod/go.etcd.io/[email protected]/bucket.go:105 +0xda fp=0xc00017a158 sp=0xc00017a0a0 pc=0x9c8eda
go.etcd.io/bbolt.(*Tx).checkBucket.func2(0x7fbbc7e400b0, 0x5, 0x5, 0x0, 0x0, 0x0, 0x0, 0x60)
/go/pkg/mod/go.etcd.io/[email protected]/tx.go:488 +0x70 fp=0xc00017a1b0 sp=0xc00017a158 pc=0x9dfbf0
go.etcd.io/bbolt.(*Bucket).ForEach(0xc000152398, 0xc00017a240, 0x0, 0xc00017a270)
/go/pkg/mod/go.etcd.io/[email protected]/bucket.go:390 +0x103 fp=0xc00017a220 sp=0xc00017a1b0 pc=0x9ca883
go.etcd.io/bbolt.(*Tx).checkBucket(0xc000152380, 0xc000152398, 0xc00017a3c0, 0xc00017a390, 0xc000598240)
/go/pkg/mod/go.etcd.io/[email protected]/tx.go:487 +0x146 fp=0xc00017a2b0 sp=0xc00017a220 pc=0x9dd726
go.etcd.io/bbolt.(*DB).freepages(0xc00002a000, 0x0, 0x0, 0x0)
/go/pkg/mod/go.etcd.io/[email protected]/db.go:1059 +0x210 fp=0xc00017a4a8 sp=0xc00017a2b0 pc=0x9d2390
go.etcd.io/bbolt.(*DB).loadFreelist.func1()
/go/pkg/mod/go.etcd.io/[email protected]/db.go:320 +0x114 fp=0xc00017a4e0 sp=0xc00017a4a8 pc=0x9dee14
sync.(*Once).doSlow(0xc00002a170, 0xc00017a540)
/go/gos/go1.16.15/src/sync/once.go:68 +0xec fp=0xc00017a530 sp=0xc00017a4e0 pc=0x47d94c
sync.(*Once).Do(...)
/go/gos/go1.16.15/src/sync/once.go:59
go.etcd.io/bbolt.(*DB).loadFreelist(0xc00002a000)
/go/pkg/mod/go.etcd.io/[email protected]/db.go:316 +0x6a fp=0xc00017a560 sp=0xc00017a530 pc=0x9cefaa
go.etcd.io/bbolt.Open(0xc000045470, 0x16, 0x180, 0xc000114b40, 0x7fbe6ef0d108, 0x18, 0xc000045470)
/go/pkg/mod/go.etcd.io/[email protected]/db.go:293 +0x3af fp=0xc00017b620 sp=0xc00017a560 pc=0x9cea4f
go.etcd.io/etcd/server/v3/mvcc/backend.newBackend(0xc000045470, 0x16, 0x5f5e100, 0x2710, 0x1209b04, 0x7, 0x280000000, 0xc000550000, 0x0, 0x1386c00, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/backend/backend.go:187 +0x117 fp=0xc00017b828 sp=0xc00017b620 pc=0xb78e57
go.etcd.io/etcd/server/v3/mvcc/backend.New(...)
/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/backend/backend.go:163
go.etcd.io/etcd/server/v3/etcdserver.newBackend(0x7ffecd3bb9ee, 0x2, 0x0, 0x0, 0x0, 0x0, 0xc000573dd0, 0x1, 0x1, 0xc00057e090, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/backend.go:55 +0x1f8 fp=0xc00017b9c0 sp=0xc00017b828 pc=0xd035d8
go.etcd.io/etcd/server/v3/etcdserver.openBackend.func1(0xc0005981e0, 0xc0001a0600, 0x1386c00, 0xc00057e2d0)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/backend.go:76 +0x98 fp=0xc00017bfc0 sp=0xc00017b9c0 pc=0xd46f98
runtime.goexit()
/go/gos/go1.16.15/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc00017bfc8 sp=0xc00017bfc0 pc=0x4722a1
created by go.etcd.io/etcd/server/v3/etcdserver.openBackend
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/backend.go:75 +0x12b
Other times we get a panic because of an index out of range:
{"level":"info","ts":"2022-06-09T13:38:01.213-0400","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.3","git-sha":"0452feec7","go-version":"go1.16.15","go-os":"linux","go-arch":"amd64","max-cpu-set":48,"max-cpu-available":48,"member-initialized":true,"name":"n1","data-dir":"n1.etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"n1.etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://192.168.122.101:2380"],"listen-peer-urls":["http://192.168.122.101:2380"],"advertise-client-urls":["http://192.168.122.101:2379"],"listen-client-urls":["http://192.168.122.101:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"existing","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
panic: freepages: failed to get all reachable pages (page 536870914: out of bounds: 5)
goroutine 186 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2(0xc0001a01e0)
/go/pkg/mod/go.etcd.io/[email protected]/db.go:1056 +0xe9
created by go.etcd.io/bbolt.(*DB).freepages
/go/pkg/mod/go.etcd.io/[email protected]/db.go:1054 +0x1cd
panic: runtime error: index out of range [9007199254740992] with length 281474976710655
goroutine 185 [running]:
go.etcd.io/bbolt.(*DB).page(...)
/go/pkg/mod/go.etcd.io/[email protected]/db.go:933
go.etcd.io/bbolt.(*Tx).page(...)
/go/pkg/mod/go.etcd.io/[email protected]/tx.go:617
go.etcd.io/bbolt.(*Tx).forEachPage(0xc0002fe460, 0x20000000000, 0x0, 0xc0004e2118)
/go/pkg/mod/go.etcd.io/[email protected]/tx.go:622 +0x14a
go.etcd.io/bbolt.(*Tx).checkBucket(0xc0002fe460, 0xc0001a2380, 0xc0004e23c0, 0xc0004e2390, 0xc0001a01e0)
/go/pkg/mod/go.etcd.io/[email protected]/tx.go:464 +0xd4
go.etcd.io/bbolt.(*Tx).checkBucket.func2(0x7fed5f876200, 0x3, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/pkg/mod/go.etcd.io/[email protected]/tx.go:489 +0xc5
go.etcd.io/bbolt.(*Bucket).ForEach(0xc0002fe478, 0xc0004e2240, 0x0, 0xc0004e2270)
/go/pkg/mod/go.etcd.io/[email protected]/bucket.go:390 +0x103
go.etcd.io/bbolt.(*Tx).checkBucket(0xc0002fe460, 0xc0002fe478, 0xc0004e23c0, 0xc0004e2390, 0xc0001a01e0)
/go/pkg/mod/go.etcd.io/[email protected]/tx.go:487 +0x146
go.etcd.io/bbolt.(*DB).freepages(0xc000168240, 0x0, 0x0, 0x0)
/go/pkg/mod/go.etcd.io/[email protected]/db.go:1059 +0x210
go.etcd.io/bbolt.(*DB).loadFreelist.func1()
/go/pkg/mod/go.etcd.io/[email protected]/db.go:320 +0x114
sync.(*Once).doSlow(0xc0001683b0, 0xc0004e2540)
/go/gos/go1.16.15/src/sync/once.go:68 +0xec
sync.(*Once).Do(...)
/go/gos/go1.16.15/src/sync/once.go:59
go.etcd.io/bbolt.(*DB).loadFreelist(0xc000168240)
/go/pkg/mod/go.etcd.io/[email protected]/db.go:316 +0x6a
go.etcd.io/bbolt.Open(0xc0005131e8, 0x16, 0x180, 0xc000680600, 0x7ff006945108, 0x18, 0xc0005131e8)
/go/pkg/mod/go.etcd.io/[email protected]/db.go:293 +0x3af
go.etcd.io/etcd/server/v3/mvcc/backend.newBackend(0xc0005131e8, 0x16, 0x5f5e100, 0x2710, 0x1209b04, 0x7, 0x280000000, 0xc00011a0a0, 0x0, 0x1386c00, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/backend/backend.go:187 +0x117
go.etcd.io/etcd/server/v3/mvcc/backend.New(...)
/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/backend/backend.go:163
go.etcd.io/etcd/server/v3/etcdserver.newBackend(0x7ffc6e5419ee, 0x2, 0x0, 0x0, 0x0, 0x0, 0xc0004b4120, 0x1, 0x1, 0xc0004b4360, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/backend.go:55 +0x1f8
go.etcd.io/etcd/server/v3/etcdserver.openBackend.func1(0xc0001a0180, 0xc000154600, 0x1386c00, 0xc0001945a0)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/backend.go:76 +0x98
created by go.etcd.io/etcd/server/v3/etcdserver.openBackend
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/backend.go:75 +0x12b
Or, in this test, a panic involving "invalid type: meta":
{"level":"info","ts":"2022-06-09T14:51:34.601-0400","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.3","git-sha":"0452feec7","go-version":"go1.16.15","go-os":"linux","go-arch":"amd64","max-cpu-set":48,"max-cpu-available":48,"member-initialized":true,"name":"n2","data-dir":"n2.etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"n2.etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://192.168.122.102:2380"],"listen-peer-urls":["http://192.168.122.102:2380"],"advertise-client-urls":["http://192.168.122.102:2379"],"listen-client-urls":["http://192.168.122.102:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"existing","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
unexpected fault address 0x7f72695ab008
fatal error: fault
panic: freepages: failed to get all reachable pages (page 0: invalid type: meta)
goroutine 201 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2(0xc00047e1e0)
/go/pkg/mod/go.etcd.io/[email protected]/db.go:1056 +0xe9
created by go.etcd.io/bbolt.(*DB).freepages
/go/pkg/mod/go.etcd.io/[email protected]/db.go:1054 +0x1cd
[signal SIGBUS: bus error code=0x2 addr=0x7f72695ab008 pc=0x9cd2f7]
goroutine 200 [running]:
runtime.throw(0x120762d, 0x5)
/go/gos/go1.16.15/src/runtime/panic.go:1117 +0x72 fp=0xc000191f40 sp=0xc000191f10 pc=0x4385b2
runtime.sigpanic()
/go/gos/go1.16.15/src/runtime/signal_unix.go:731 +0x2c8 fp=0xc000191f78 sp=0xc000191f40 pc=0x44fe28
go.etcd.io/bbolt.(*elemRef).isLeaf(...)
/go/pkg/mod/go.etcd.io/[email protected]/cursor.go:387
go.etcd.io/bbolt.(*Cursor).first(0xc000192098)
/go/pkg/mod/go.etcd.io/[email protected]/cursor.go:170 +0x1b7 fp=0xc000191fd8 sp=0xc000191f78 pc=0x9cd2f7
go.etcd.io/bbolt.(*Cursor).First(0xc000192098, 0x0, 0x0, 0x1ab0ff8, 0x9ddf89, 0x7f72685ab000, 0x0)
/go/pkg/mod/go.etcd.io/[email protected]/cursor.go:36 +0xc9 fp=0xc000192058 sp=0xc000191fd8 pc=0x9cc3c9
go.etcd.io/bbolt.(*Bucket).ForEach(0xc000018440, 0xc0001920e8, 0x0, 0xc000192118)
/go/pkg/mod/go.etcd.io/[email protected]/bucket.go:389 +0x86 fp=0xc0001920c8 sp=0xc000192058 pc=0x9ca806
go.etcd.io/bbolt.(*Tx).checkBucket(0xc00053c1c0, 0xc000018440, 0xc0001923c0, 0xc000192390, 0xc00047e1e0)
/go/pkg/mod/go.etcd.io/[email protected]/tx.go:487 +0x146 fp=0xc000192158 sp=0xc0001920c8 pc=0x9dd726
go.etcd.io/bbolt.(*Tx).checkBucket.func2(0x7f72685af16f, 0x7, 0x7, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/pkg/mod/go.etcd.io/[email protected]/tx.go:489 +0xc5 fp=0xc0001921b0 sp=0xc000192158 pc=0x9dfc45
go.etcd.io/bbolt.(*Bucket).ForEach(0xc00053c1d8, 0xc000192240, 0x0, 0xc000192270)
/go/pkg/mod/go.etcd.io/[email protected]/bucket.go:390 +0x103 fp=0xc000192220 sp=0xc0001921b0 pc=0x9ca883
go.etcd.io/bbolt.(*Tx).checkBucket(0xc00053c1c0, 0xc00053c1d8, 0xc0001923c0, 0xc000192390, 0xc00047e1e0)
/go/pkg/mod/go.etcd.io/[email protected]/tx.go:487 +0x146 fp=0xc0001922b0 sp=0xc000192220 pc=0x9dd726
go.etcd.io/bbolt.(*DB).freepages(0xc000166240, 0x0, 0x0, 0x0)
/go/pkg/mod/go.etcd.io/[email protected]/db.go:1059 +0x210 fp=0xc0001924a8 sp=0xc0001922b0 pc=0x9d2390
go.etcd.io/bbolt.(*DB).loadFreelist.func1()
/go/pkg/mod/go.etcd.io/[email protected]/db.go:320 +0x114 fp=0xc0001924e0 sp=0xc0001924a8 pc=0x9dee14
sync.(*Once).doSlow(0xc0001663b0, 0xc000192540)
/go/gos/go1.16.15/src/sync/once.go:68 +0xec fp=0xc000192530 sp=0xc0001924e0 pc=0x47d94c
sync.(*Once).Do(...)
/go/gos/go1.16.15/src/sync/once.go:59
go.etcd.io/bbolt.(*DB).loadFreelist(0xc000166240)
/go/pkg/mod/go.etcd.io/[email protected]/db.go:316 +0x6a fp=0xc000192560 sp=0xc000192530 pc=0x9cefaa
go.etcd.io/bbolt.Open(0xc000045110, 0x16, 0x180, 0xc0000ea660, 0x7f750f63c108, 0x18, 0xc000045110)
/go/pkg/mod/go.etcd.io/[email protected]/db.go:293 +0x3af fp=0xc000193620 sp=0xc000192560 pc=0x9cea4f
go.etcd.io/etcd/server/v3/mvcc/backend.newBackend(0xc000045110, 0x16, 0x5f5e100, 0x2710, 0x1209b04, 0x7, 0x280000000, 0xc0000f4050, 0x0, 0x1386c00, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/backend/backend.go:187 +0x117 fp=0xc000193828 sp=0xc000193620 pc=0xb78e57
go.etcd.io/etcd/server/v3/mvcc/backend.New(...)
/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/backend/backend.go:163
go.etcd.io/etcd/server/v3/etcdserver.newBackend(0x7ffebdfbd9ee, 0x2, 0x0, 0x0, 0x0, 0x0, 0xc0002745a0, 0x1, 0x1, 0xc0002747e0, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/backend.go:55 +0x1f8 fp=0xc0001939c0 sp=0xc000193828 pc=0xd035d8
go.etcd.io/etcd/server/v3/etcdserver.openBackend.func1(0xc00047e180, 0xc000152600, 0x1386c00, 0xc00001a2d0)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/backend.go:76 +0x98 fp=0xc000193fc0 sp=0xc0001939c0 pc=0xd46f98
runtime.goexit()
/go/gos/go1.16.15/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc000193fc8 sp=0xc000193fc0 pc=0x4722a1
created by go.etcd.io/etcd/server/v3/etcdserver.openBackend
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/backend.go:75 +0x12b
Or messages about illegal tag 0
{"level":"info","ts":"2022-06-09T16:05:09.621-0400","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"n2.etcd/member/snap/db","took":"190.961µs"}
{"level":"info","ts":"2022-06-09T16:05:09.634-0400","caller":"embed/etcd.go:368","msg":"closing etcd server","name":"n2","data-dir":"n2.etcd","advertise-peer-urls":["http://192.168.122.102:2380"],"advertise-client-urls":["http://192.168.122.102:2379"]}
{"level":"info","ts":"2022-06-09T16:05:09.635-0400","caller":"embed/etcd.go:370","msg":"closed etcd server","name":"n2","data-dir":"n2.etcd","advertise-peer-urls":["http://192.168.122.102:2380"],"advertise-client-urls":["http://192.168.122.102:2379"]}
{"level":"fatal","ts":"2022-06-09T16:05:09.635-0400","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"proto: Record: illegal tag 0 (wire type 0)","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
Or panic: cannot use none as id:
version":"go1.16.15","go-os":"linux","go-arch":"amd64","max-cpu-set":48,"max-cpu-available":48,"member-initialized":true,"name":"n1","data-dir":"n1.etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"n1.etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://192.168.122.101:2380"],"listen-peer-urls":["http://192.168.122.101:2380"],"advertise-client-urls":["http://192.168.122.101:2379"],"listen-client-urls":["http://192.168.122.101:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"existing","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
{"level":"info","ts":"2022-06-09T15:59:22.220-0400","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"n1.etcd/member/snap/db","took":"189.04µs"}
{"level":"info","ts":"2022-06-09T15:59:22.220-0400","caller":"etcdserver/server.go:529","msg":"No snapshot found. Recovering WAL from scratch!"}
{"level":"info","ts":"2022-06-09T15:59:22.222-0400","caller":"wal/repair.go:40","msg":"repairing","path":"n1.etcd/member/wal/0000000000000000-0000000000000000.wal"}
{"level":"info","ts":"2022-06-09T15:59:22.298-0400","caller":"wal/repair.go:96","msg":"repaired","path":"n1.etcd/member/wal/0000000000000000-0000000000000000.wal","error":"unexpected EOF"}
{"level":"info","ts":"2022-06-09T15:59:22.298-0400","caller":"etcdserver/storage.go:109","msg":"repaired WAL","error":"unexpected EOF"}
{"level":"warn","ts":"2022-06-09T15:59:22.299-0400","caller":"wal/util.go:90","msg":"ignored file in WAL directory","path":"0000000000000000-0000000000000000.wal.broken"}
{"level":"info","ts":"2022-06-09T15:59:22.299-0400","caller":"etcdserver/raft.go:483","msg":"restarting local member","cluster-id":"0","local-member-id":"0","commit-index":0}
panic: cannot use none as id
goroutine 1 [running]:
go.etcd.io/etcd/raft/v3.newRaft(0xc0007389e0, 0x7fdd9820b208)
/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:320 +0xd86
go.etcd.io/etcd/raft/v3.NewRawNode(0xc0007389e0, 0x1ab50a8, 0x7fddbf1a5b00, 0x203000)
/go/src/go.etcd.io/etcd/release/etcd/raft/rawnode.go:48 +0x2f
go.etcd.io/etcd/raft/v3.RestartNode(0xc0007389e0, 0xc000282010, 0xc0000f4050)
/go/src/go.etcd.io/etcd/release/etcd/raft/node.go:239 +0x45
go.etcd.io/etcd/server/v3/etcdserver.restartNode(0x7ffd7c1379ee, 0x2, 0x0, 0x0, 0x0, 0x0, 0xc00041a510, 0x1, 0x1, 0xc00041a750, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/raft.go:509 +0x805
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0x7ffd7c1379ee, 0x2, 0x0, 0x0, 0x0, 0x0, 0xc00041a510, 0x1, 0x1, 0xc00041a750, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:533 +0x404e
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc00040e000, 0xc00040e600, 0x0, 0x0)
/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc00040e000, 0x120893e, 0x6, 0xc00026c201, 0x2)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00020c000, 0x19, 0x1a)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00020c000, 0x19, 0x1a)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40 +0x13f
main.main()
/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32 +0x45
What did you expect to happen?
I know that corruption checks are experimental, so this bug is likely a low-priority issue, but it does feel like a corrupt data file should result in something a little less spooky than a SIGBUS or SIGSEGV.
How can we reproduce it (as minimally and precisely as possible)?
Check out https://github.com/jepsen-io/etcd 9624a6cebb051856622b27bd3878b1b2797d9fe6 and run (e.g.)
lein run test -w append --concurrency 2n --time-limit 300 --rate 1 --nemesis corrupt,kill --test-count 5
Anything else we need to know?
No response
Etcd version (please run commands below)
$ etcd --version
etcd Version: 3.5.3
Git SHA: 0452feec7
Go Version: go1.16.15
Go OS/Arch: linux/amd64
$ etcdctl version
etcdctl version: 3.5.3
API version: 3.5
Etcd configuration (command line flags or environment variables)
Each node is started without a config file, using CLI flags like
etcd --enable-v2 --log-outputs stderr --logger zap --name n1 --listen-peer-urls http://192.168.122.101:2380 --listen-client-urls http://192.168.122.101:2379 --advertise-client-urls http://192.168.122.101:2379 --initial-cluster-state new --initial-advertise-peer-urls http://192.168.122.101:2380 --initial-cluster
Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)
root@n1:/opt/etcd# ./etcdctl member list -w table --endpoints=n1:2379
+------------------+---------+------+-----------------------------+-----------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+------+-----------------------------+-----------------------------+------------+
| 1153c9690d2b2284 | started | n3 | http://192.168.122.103:2380 | http://192.168.122.103:2379 | false |
| 4824313a421b2502 | started | n5 | http://192.168.122.105:2380 | http://192.168.122.105:2379 | false |
| 4d6e27d122507e9c | started | n4 | http://192.168.122.104:2380 | http://192.168.122.104:2379 | false |
| a1ffd5acd6a88a6a | started | n2 | http://192.168.122.102:2380 | http://192.168.122.102:2379 | false |
| afa39e55dee6dc2e | started | n1 | http://192.168.122.101:2380 | http://192.168.122.101:2379 | false |
+------------------+---------+------+-----------------------------+-----------------------------+------------+
root@n1:/opt/etcd# ./etcdctl --endpoints=http://192.168.122.105:2379,http://192.168.122.103:2379,http://192.168.122.104:2379,http://192.168.122.102:2379,http://192.168.122.101:2379 endpoint status -w table
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.122.105:2379 | 4824313a421b2502 | 3.5.3 | 4.2 MB | true | false | 20 | 60226 | 60226 | |
| http://192.168.122.103:2379 | 1153c9690d2b2284 | 3.5.3 | 4.2 MB | false | false | 20 | 60226 | 60226 | |
| http://192.168.122.104:2379 | 4d6e27d122507e9c | 3.5.3 | 4.2 MB | false | false | 20 | 60228 | 60228 | |
| http://192.168.122.102:2379 | a1ffd5acd6a88a6a | 3.5.3 | 4.2 MB | false | false | 20 | 60228 | 60228 | |
| http://192.168.122.101:2379 | afa39e55dee6dc2e | 3.5.3 | 4.2 MB | false | false | 20 | 60228 | 60228 | |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Relevant log output
No response
@aphyr Did you manually & intentionally modify the db
file?
Yup! That's what Jepsen is for--fault injection testing. You might recall etcd contracting me to do this same kind of work in 2019. :-)
Thanks for the feedback. The BoltDB file isn't supposed to be manually modified, even in test. Just curious, had you ever seen such issue previously when you did the same test?
I mean yeah, sure, hardware is supposed to be perfect! However, non-ECC machines, disks, faulty network controllers, bad VM hypervisors, et al do occasionally cause bit-flip errors. Given that etcd has already done some work to detect these kinds of errors (for instance, given the CRC checks and --experimental-initial-corrupt-check
flag) I figure this might be within y'all's fault model. If nothing else, it points to a fruitful avenue for testing the corruption checker in the future.
And no, our previous tests didn't perform this type of fault injection. This is new work, motivated by faults that real systems have exhibited in the past.
Note that CRC
check is only for the WAL file for now, and --experimental-initial-corrupt-check
can't resolve data corruption, and the existing implementation also has flaw. Of course, we are working to improve it.
The BoltDB manages data via B+ tree page by page, and it has delicate structure. The data file can only be updated by BoltDB itself.
We do see a couple of data file corruption issue previously, i.e. 13406. I may consider to deliver a tool to automatically recover & fix the corrupted data file in future.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.