[DO NOT REVIEW] Try to reproduce the db corruption issue
Please do not review this PR.
FYI: I am trying to reproduce the db corruption issue using this PR together with the following script.
#!/usr/bin/env bash
# Run this script from the root directory of the bbolt repository,
# using a command like the one below:
#   nohup ./reproduce_corruption.sh > test.log &
set -euo pipefail
# Build the bbolt CLI used for the consistency checks.
go build ./cmd/bbolt/
# The test is killed after a random delay of minwait..maxwait-1 seconds.
minwait=100
maxwait=250
for i in {1..10000}
do
    echo
    echo "-----------------------------------"
    echo "Round $i: $(date)"
    rm -f case.log
    # Start the concurrent read/write test in the background.
    TEST_CONCURRENT_CASE_DURATION=300s go test -run TestConcurrentGenericReadAndWrite -v > case.log &
    sleep $((minwait + RANDOM % (maxwait-minwait)))
    # Find the running test process and kill it abruptly.
    pid=$(ps -ef | grep bbolt | grep -v grep | awk '{print $2}')
    echo "Killing ${pid}..."
    kill -9 ${pid}
    sleep 10
    echo "Checking db consistency..."
    # Dump both meta pages, then verify the whole db file.
    ./bbolt page ./bbolt.db 0
    ./bbolt page ./bbolt.db 1
    ./bbolt check ./bbolt.db
    sleep 5
done
echo "All done!"
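The `sleep` line in the loop picks the kill time at random, so each round interrupts the test at a different point in its 300s run. A minimal sketch of that arithmetic in isolation, assuming a hypothetical helper name `random_wait` that is not part of the script above:

```shell
#!/usr/bin/env bash
# random_wait MIN MAX (hypothetical helper): print a random integer
# in the range MIN..MAX-1, using the same expression as the script.
random_wait() {
    local min=$1 max=$2
    echo $((min + RANDOM % (max - min)))
}
```

With the script's values, `random_wait 100 250` yields a delay between 100 and 249 seconds, because `RANDOM % 150` is at most 149.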
It has been running for 4 days and 19 hours. No issues so far! It's still running.
-----------------------------------
Round 1: Sun 9 Jun 11:33:54 PDT 2024
Killing 385023...
Checking db consistency...
Page ID: 0
Page Type: meta
Total Size: 4096 bytes
Overflow pages: 0
Version: 2
Page Size: 4096 bytes
Flags: 00000000
Root: <pgid=9>
Freelist: <pgid=4>
HWM: <pgid=1219>
Txn ID: 2812
Checksum: cc65e26ed4ea7d37
Page ID: 1
Page Type: meta
Total Size: 4096 bytes
Overflow pages: 0
Version: 2
Page Size: 4096 bytes
Flags: 00000000
Root: <pgid=9>
Freelist: <pgid=12>
HWM: <pgid=1219>
Txn ID: 2811
Checksum: b50666a0f0774fcc
OK
-----------------------------------
Round 2: Sun 9 Jun 11:36:20 PDT 2024
Killing 385117...
Checking db consistency...
Page ID: 0
Page Type: meta
Total Size: 4096 bytes
Overflow pages: 0
Version: 2
Page Size: 4096 bytes
Flags: 00000000
Root: <pgid=17>
Freelist: <pgid=7>
HWM: <pgid=1003>
Txn ID: 2710
Checksum: de3e350417492bcf
Page ID: 1
Page Type: meta
Total Size: 4096 bytes
Overflow pages: 0
Version: 2
Page Size: 4096 bytes
Flags: 00000000
Root: <pgid=27>
Freelist: <pgid=28>
HWM: <pgid=1014>
Txn ID: 2711
Checksum: 8328e60775bb0806
OK
-----------------------------------
......
-----------------------------------
Round 2193: Fri 14 Jun 06:30:09 PDT 2024
cc @fuweid @tjungblu @ivanvc @Elbehery
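In each round above, the two meta pages carry consecutive Txn IDs, and the page with the higher Txn ID reflects the most recent commit. A minimal sketch for picking it out of the dump, assuming the `Page ID:` / `Txn ID:` line format shown in the log and a hypothetical function name `newest_meta`:

```shell
#!/usr/bin/env bash
# newest_meta (hypothetical helper): read the concatenated output of
# `./bbolt page ./bbolt.db 0` and `./bbolt page ./bbolt.db 1` on stdin
# and report which meta page has the higher Txn ID.
newest_meta() {
    awk '
        BEGIN       { best_txn = -1 }
        /^Page ID:/ { page = $3 }
        /^Txn ID:/  { if ($3 + 0 > best_txn) { best_txn = $3 + 0; best_page = page } }
        END         { printf "newest meta: page %s (txn %d)\n", best_page, best_txn }
    '
}
```

It could be fed from the two dump commands, for example `{ ./bbolt page ./bbolt.db 0; ./bbolt page ./bbolt.db 1; } | newest_meta`.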
It has been running for about 18 days. No issues so far! It's still running.
-----------------------------------
Round 8093: Thu 27 Jun 06:58:41 PDT 2024
I left it running on two machines, too. And so far, neither has presented any issues.
-----------------------------------
Round 5678: Thu Jun 27 06:10:26 PM UTC 2024
-----------------------------------
Round 5692: Thu Jun 27 06:11:30 PM UTC 2024
> I left it running on two machines, too. And so far, neither has presented any issues.
thx.
- Most of the corruption issues were caused by a machine or VM suddenly powering off. I think we need to simulate a similar scenario (power off), e.g. using nested virtualization.
- Some issues (e.g. #778, #705) indicate that there might be potential issue(s) in the freelist management. We need to invest more effort in testing, reviewing, and refactoring that code.
No issues after 22 days of continuous running.
-----------------------------------
......
-----------------------------------
Round 10000: Mon 1 Jul 11:05:39 PDT 2024
Killing 1242747...
Checking db consistency...
Page ID: 0
Page Type: meta
Total Size: 4096 bytes
Overflow pages: 0
Version: 2
Page Size: 4096 bytes
Flags: 00000000
Root: <pgid=33>
Freelist: <pgid=66>
HWM: <pgid=1521>
Txn ID: 3988
Checksum: 9db4db789a9ef0f7
Page ID: 1
Page Type: meta
Total Size: 4096 bytes
Overflow pages: 0
Version: 2
Page Size: 4096 bytes
Flags: 00000000
Root: <pgid=65>
Freelist: <pgid=4>
HWM: <pgid=1513>
Txn ID: 3987
Checksum: 0a6f95bfd5ea1742
OK
All done!
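With thousands of rounds in `test.log`, a quick tally can confirm that every round ended with `OK`. A minimal sketch, assuming the log format shown above (one `Round N:` line per round and a lone `OK` line when the check passes) and a hypothetical function name `summarize_rounds`:

```shell
#!/usr/bin/env bash
# summarize_rounds LOGFILE (hypothetical helper): count the rounds that
# were started and the rounds whose consistency check printed "OK".
summarize_rounds() {
    awk '
        /^Round [0-9]+:/ { rounds++ }
        /^OK$/           { ok++ }
        END              { printf "rounds=%d ok=%d\n", rounds, ok }
    ' "$1"
}
```

Running `summarize_rounds test.log` on a finished run should report equal counts; while the script is still running, the round in progress has no `OK` line yet, so `rounds` can be one higher than `ok`.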
@ahrtr: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-bbolt-robustness-arm64 | 2735c9f3d40dbe0d3602441018c1bb9806a693f4 | link | true | /test pull-bbolt-robustness-arm64 |