bbolt
bbolt copied to clipboard
Enhance the test to reproduce data corruption issues
See https://github.com/etcd-io/etcd/issues/16596#issuecomment-1719519519
@fuweid are you interested in working on this? The goal is to try to reproduce the data corruption issues.
My high level thought is:
- Implement a simple example application which concurrently reads and writes bbolt db file; Reference concurrent_test.go
- Intentionally injects failpoints (e.g. forcibly kill, lazyFS split_write, etc) to crash the application;
- check whether the db file is corrupted or not.
We can discuss the details on next meeting.
Note: the goal is to reproduce the data corruption issues instead of verify correctness.
Reference: https://github.com/dsrhaslab/lazyfs
cc @tjungblu
Sure! Let me take this one.
Hi @ahrtr , I tried to use dm-flakey device to drop_writes during block layer and cause data corruption. The test case is stable to reproduce the data corruption.
- https://github.com/fuweid/go-dmflakey#example (second one)
- https://github.com/fuweid/go-dmflakey/blob/main/contrib/datacorruption/boltdb/main_test.go#L42
$ go test -c -o /tmp/test ./datacorruption/boltdb && sudo /tmp/test -test.v
=== RUN TestDropWritesDuringBench
main_test.go:56: Init empty bbolt database with 128 MiB
main_test.go:60: Ensure the empty boltdb data persisted in the flakey device
main_test.go:63: Start to run bbolt-bench
main_test.go:85: Drop all the write IOs after 3 seconds
main_test.go:89: Let bbolt-bench run with DropWrites mode for 3 seconds
main_test.go:92: Start to allow all the write IOs for 2 seconds
main_test.go:96: Kill the bbolt process and simulate power failure
main_test.go:101: Invoke bbolt check to verify data
main_test.go:103:
Error Trace: /home/fuwei/workspace/go-dmflakey/contrib/datacorruption/boltdb/main_test.go:103
Error: Received unexpected error:
exit status 2
Test: TestDropWritesDuringBench
Messages: bbolt check output: panic: invalid freelist page: 0, page type is unknown<00>
goroutine 1 [running]:
go.etcd.io/bbolt.(*freelist).read(0x0?, 0x0?)
/home/fuwei/workspace/bbolt/freelist.go:270 +0x199
go.etcd.io/bbolt.(*DB).loadFreelist.func1()
/home/fuwei/workspace/bbolt/db.go:400 +0xc5
sync.(*Once).doSlow(0xc0001301c0?, 0x584020?)
/usr/local/go/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
/usr/local/go/src/sync/once.go:65
go.etcd.io/bbolt.(*DB).loadFreelist(0xc000130000?)
/home/fuwei/workspace/bbolt/db.go:393 +0x47
go.etcd.io/bbolt.Open({0x7ffc60ce9530, 0x38}, 0x670060?, 0xc00005fc18)
/home/fuwei/workspace/bbolt/db.go:275 +0x44f
main.(*checkCommand).Run(0xc000137e58, {0xc0000161a0, 0x1, 0x1})
/home/fuwei/workspace/bbolt/cmd/bbolt/main.go:212 +0x1e5
main.(*Main).Run(0xc00005ff40, {0xc000016190?, 0xc0000061a0?, 0x200000003?})
/home/fuwei/workspace/bbolt/cmd/bbolt/main.go:124 +0x4d4
main.main()
/home/fuwei/workspace/bbolt/cmd/bbolt/main.go:62 +0xae
--- FAIL: TestDropWritesDuringBench (8.29s)
FAIL
I think if the lazyFS can drop writes silently, that will be perfect for simulation. We can discuss this in next meeting.
@fuweid thx for the test case.
- I can reproduce the same error using your test case.
- After applying a simple patch (the first commit), then no reproduction on the issue any more. I think the key point is the system call syscall.Fdatasync's semantics is not respected. Please see more detailed info in https://github.com/fuweid/go-dmflakey/pull/1
We can have more discussion on Monday's meeting.
Confirmed that the file.WriteAt never fails when running your test case, nor the fdatasync.
- https://github.com/etcd-io/bbolt/blob/33db274134cc841af39ceb09013e16686636ae1f/tx.go#L457
- https://github.com/etcd-io/bbolt/blob/33db274134cc841af39ceb09013e16686636ae1f/tx.go#L511
- https://github.com/etcd-io/bbolt/blob/33db274134cc841af39ceb09013e16686636ae1f/bolt_linux.go#L9
I added log in https://github.com/etcd-io/bbolt/pull/616, also updated your test case in https://github.com/fuweid/go-dmflakey/pull/1. So it should be the test case's issue (specifically it should be flaky filesystem's issue) instead of bbolt's issue.
# go test -v
=== RUN TestDropWritesDuringBench
main_test.go:57: Init empty bbolt database with 128 MiB
main_test.go:61: Ensure the empty boltdb data persisted in the flakey device
main_test.go:64: Start to run bbolt-bench
main_test.go:92: Drop all the write IOs after 3 seconds
main_test.go:96: Let bbolt-bench run with DropWrites mode for 3 seconds
main_test.go:99: Start to allow all the write IOs for 2 seconds
main_test.go:103: Kill the bbolt process and simulate power failure
main_test.go:86: ####### bbolt output:
work: /tmp/TestDropWritesDuringBench3215672155/001/root/boltdb
starting write benchmark.
Completed 910 requests, 909/s
Completed 1805 requests, 894/s
Completed 2705 requests, 899/s
Completed 79925 requests, 77209/s
Completed 154430 requests, 74501/s
Completed 226100 requests, 71659/s
Completed 227845 requests, 1744/s
Completed 228710 requests, 864/s
main_test.go:115: Invoke bbolt check to verify data: "/tmp/boltdb"
main_test.go:117:
Error Trace: /root/go/src/github.com/fuweid/go-dmflakey/contrib/datacorruption/boltdb/main_test.go:117
Error: Received unexpected error:
exit status 2
Test: TestDropWritesDuringBench
Messages: bbolt check output: panic: invalid freelist page: 0, page type is unknown<00>
goroutine 1 [running]:
go.etcd.io/bbolt.(*freelist).read(0x0?, 0x0?)
/root/go/src/github.com/ahrtr/bbolt/freelist.go:270 +0x179
go.etcd.io/bbolt.(*DB).loadFreelist.func1()
/root/go/src/github.com/ahrtr/bbolt/db.go:400 +0xb8
sync.(*Once).doSlow(0xc0001341c0?, 0x581580?)
/root/software/go/src/sync/once.go:74 +0xbf
sync.(*Once).Do(...)
/root/software/go/src/sync/once.go:65
go.etcd.io/bbolt.(*DB).loadFreelist(0xc000134000?)
/root/go/src/github.com/ahrtr/bbolt/db.go:393 +0x45
go.etcd.io/bbolt.Open({0x7fff4ee2c62a, 0xb}, 0x677960?, 0xc00005fbd0)
/root/go/src/github.com/ahrtr/bbolt/db.go:275 +0x425
main.(*checkCommand).Run(0xc00013be18, {0xc000016140, 0x1, 0x1})
/root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:212 +0x1df
main.(*Main).Run(0xc00005ff00, {0xc000016130?, 0xc000006340?, 0x200000003?})
/root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:124 +0x469
main.main()
/root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:62 +0xa6
--- FAIL: TestDropWritesDuringBench (8.35s)
FAIL
exit status 1
FAIL github.com/fuweid/go-dmflakey/contrib/datacorruption/boltdb 8.350s
Hi @ahrtr , The original case is used to simulate the data loss in the device level. The DropWrites failpoint is to drop all the data submitted by fdatasync or fsync. It's like power failure and cause data committed loss.
I was trying to use the following script in pod:
run-bench-test &
sleep random-seconds
echo b > /proc/sysrq-trigger
And run pod as daemonset in the kubernetes cluster. It can't reproduce the data corruption. So I inject failpoint in device level to drop the writes, which seems to change one byte in the data.
I think we can add failpoint between writeAt and fdatasync so that we can see the recovery after power failure.
We can sync the detail in tomorrow meeting.
@fuweid can you add the test case TestDropWritesDuringBench (Possibly need to change the name) into bbolt?
@ahrtr Hi, I am still working on this. will file pull request when it's ready