bbolt Enhance the test to reproduce data corruption issues

See https://github.com/etcd-io/etcd/issues/16596#issuecomment-1719519519

Sep 15 '23 08:09 ahrtr

@fuweid are you interested in working on this? The goal is to try to reproduce the data corruption issues.

My high level thought is:

Implement a simple example application which concurrently reads and writes bbolt db file; Reference concurrent_test.go
Intentionally injects failpoints (e.g. forcibly kill, lazyFS split_write, etc) to crash the application;
check whether the db file is corrupted or not.

We can discuss the details on next meeting.

Note: the goal is to reproduce the data corruption issues instead of verify correctness.

Reference: https://github.com/dsrhaslab/lazyfs

cc @tjungblu

Nov 06 '23 15:11 ahrtr

Sure! Let me take this one.

Nov 06 '23 15:11 fuweid

Hi @ahrtr , I tried to use dm-flakey device to drop_writes during block layer and cause data corruption. The test case is stable to reproduce the data corruption.

https://github.com/fuweid/go-dmflakey#example (second one)
https://github.com/fuweid/go-dmflakey/blob/main/contrib/datacorruption/boltdb/main_test.go#L42

$ go test -c -o /tmp/test ./datacorruption/boltdb && sudo /tmp/test -test.v
=== RUN   TestDropWritesDuringBench
    main_test.go:56: Init empty bbolt database with 128 MiB
    main_test.go:60: Ensure the empty boltdb data persisted in the flakey device
    main_test.go:63: Start to run bbolt-bench
    main_test.go:85: Drop all the write IOs after 3 seconds
    main_test.go:89: Let bbolt-bench run with DropWrites mode for 3 seconds
    main_test.go:92: Start to allow all the write IOs for 2 seconds
    main_test.go:96: Kill the bbolt process and simulate power failure
    main_test.go:101: Invoke bbolt check to verify data
    main_test.go:103:
                Error Trace:    /home/fuwei/workspace/go-dmflakey/contrib/datacorruption/boltdb/main_test.go:103
                Error:          Received unexpected error:
                                exit status 2
                Test:           TestDropWritesDuringBench
                Messages:       bbolt check output: panic: invalid freelist page: 0, page type is unknown<00>

                                goroutine 1 [running]:
                                go.etcd.io/bbolt.(*freelist).read(0x0?, 0x0?)
                                        /home/fuwei/workspace/bbolt/freelist.go:270 +0x199
                                go.etcd.io/bbolt.(*DB).loadFreelist.func1()
                                        /home/fuwei/workspace/bbolt/db.go:400 +0xc5
                                sync.(*Once).doSlow(0xc0001301c0?, 0x584020?)
                                        /usr/local/go/src/sync/once.go:74 +0xc2
                                sync.(*Once).Do(...)
                                        /usr/local/go/src/sync/once.go:65
                                go.etcd.io/bbolt.(*DB).loadFreelist(0xc000130000?)
                                        /home/fuwei/workspace/bbolt/db.go:393 +0x47
                                go.etcd.io/bbolt.Open({0x7ffc60ce9530, 0x38}, 0x670060?, 0xc00005fc18)
                                        /home/fuwei/workspace/bbolt/db.go:275 +0x44f
                                main.(*checkCommand).Run(0xc000137e58, {0xc0000161a0, 0x1, 0x1})
                                        /home/fuwei/workspace/bbolt/cmd/bbolt/main.go:212 +0x1e5
                                main.(*Main).Run(0xc00005ff40, {0xc000016190?, 0xc0000061a0?, 0x200000003?})
                                        /home/fuwei/workspace/bbolt/cmd/bbolt/main.go:124 +0x4d4
                                main.main()
                                        /home/fuwei/workspace/bbolt/cmd/bbolt/main.go:62 +0xae
--- FAIL: TestDropWritesDuringBench (8.29s)
FAIL

I think if the lazyFS can drop writes silently, that will be perfect for simulation. We can discuss this in next meeting.

Nov 18 '23 10:11 fuweid

@fuweid thx for the test case.

I can reproduce the same error using your test case.
After applying a simple patch (the first commit), then no reproduction on the issue any more. I think the key point is the system call syscall.Fdatasync's semantics is not respected. Please see more detailed info in https://github.com/fuweid/go-dmflakey/pull/1

We can have more discussion on Monday's meeting.

Nov 18 '23 14:11 ahrtr

Confirmed that the file.WriteAt never fails when running your test case, nor the fdatasync.

https://github.com/etcd-io/bbolt/blob/33db274134cc841af39ceb09013e16686636ae1f/tx.go#L457
https://github.com/etcd-io/bbolt/blob/33db274134cc841af39ceb09013e16686636ae1f/tx.go#L511
https://github.com/etcd-io/bbolt/blob/33db274134cc841af39ceb09013e16686636ae1f/bolt_linux.go#L9

I added log in https://github.com/etcd-io/bbolt/pull/616, also updated your test case in https://github.com/fuweid/go-dmflakey/pull/1. So it should be the test case's issue (specifically it should be flaky filesystem's issue) instead of bbolt's issue.

# go test -v
=== RUN   TestDropWritesDuringBench
    main_test.go:57: Init empty bbolt database with 128 MiB
    main_test.go:61: Ensure the empty boltdb data persisted in the flakey device
    main_test.go:64: Start to run bbolt-bench
    main_test.go:92: Drop all the write IOs after 3 seconds
    main_test.go:96: Let bbolt-bench run with DropWrites mode for 3 seconds
    main_test.go:99: Start to allow all the write IOs for 2 seconds
    main_test.go:103: Kill the bbolt process and simulate power failure
    main_test.go:86: ####### bbolt output: 
         work: /tmp/TestDropWritesDuringBench3215672155/001/root/boltdb
        starting write benchmark.
        Completed 910 requests, 909/s 
        Completed 1805 requests, 894/s 
        Completed 2705 requests, 899/s 
        Completed 79925 requests, 77209/s 
        Completed 154430 requests, 74501/s 
        Completed 226100 requests, 71659/s 
        Completed 227845 requests, 1744/s 
        Completed 228710 requests, 864/s 
    main_test.go:115: Invoke bbolt check to verify data: "/tmp/boltdb"
    main_test.go:117: 
        	Error Trace:	/root/go/src/github.com/fuweid/go-dmflakey/contrib/datacorruption/boltdb/main_test.go:117
        	Error:      	Received unexpected error:
        	            	exit status 2
        	Test:       	TestDropWritesDuringBench
        	Messages:   	bbolt check output: panic: invalid freelist page: 0, page type is unknown<00>
        	            	
        	            	goroutine 1 [running]:
        	            	go.etcd.io/bbolt.(*freelist).read(0x0?, 0x0?)
        	            		/root/go/src/github.com/ahrtr/bbolt/freelist.go:270 +0x179
        	            	go.etcd.io/bbolt.(*DB).loadFreelist.func1()
        	            		/root/go/src/github.com/ahrtr/bbolt/db.go:400 +0xb8
        	            	sync.(*Once).doSlow(0xc0001341c0?, 0x581580?)
        	            		/root/software/go/src/sync/once.go:74 +0xbf
        	            	sync.(*Once).Do(...)
        	            		/root/software/go/src/sync/once.go:65
        	            	go.etcd.io/bbolt.(*DB).loadFreelist(0xc000134000?)
        	            		/root/go/src/github.com/ahrtr/bbolt/db.go:393 +0x45
        	            	go.etcd.io/bbolt.Open({0x7fff4ee2c62a, 0xb}, 0x677960?, 0xc00005fbd0)
        	            		/root/go/src/github.com/ahrtr/bbolt/db.go:275 +0x425
        	            	main.(*checkCommand).Run(0xc00013be18, {0xc000016140, 0x1, 0x1})
        	            		/root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:212 +0x1df
        	            	main.(*Main).Run(0xc00005ff00, {0xc000016130?, 0xc000006340?, 0x200000003?})
        	            		/root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:124 +0x469
        	            	main.main()
        	            		/root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:62 +0xa6
--- FAIL: TestDropWritesDuringBench (8.35s)
FAIL
exit status 1
FAIL	github.com/fuweid/go-dmflakey/contrib/datacorruption/boltdb	8.350s

Nov 19 '23 13:11 ahrtr

Hi @ahrtr , The original case is used to simulate the data loss in the device level. The DropWrites failpoint is to drop all the data submitted by fdatasync or fsync. It's like power failure and cause data committed loss.

I was trying to use the following script in pod:

run-bench-test &
sleep random-seconds
echo b > /proc/sysrq-trigger

And run pod as daemonset in the kubernetes cluster. It can't reproduce the data corruption. So I inject failpoint in device level to drop the writes, which seems to change one byte in the data.

I think we can add failpoint between writeAt and fdatasync so that we can see the recovery after power failure.

We can sync the detail in tomorrow meeting.

Nov 19 '23 14:11 fuweid

@fuweid can you add the test case TestDropWritesDuringBench (Possibly need to change the name) into bbolt?

Nov 22 '23 10:11 ahrtr

@ahrtr Hi, I am still working on this. will file pull request when it's ready

Nov 22 '23 11:11 fuweid

bbolt bbolt copied to clipboard

Enhance the test to reproduce data corruption issues

bbolt
bbolt copied to clipboard