foundationdb
Redwood is not handling injected corrupted bit properly
In DiskFailureCycle, random bits are flipped when writing to disk. When the data is read back, Redwood detects the checksum mismatch and raises a page_decode_failed exception with asInjectedFault enabled. This exception is translated to a broken_promise in VersionedBTree::commit_impl and fails updateStorage in the storage server. This triggers the storage server failure and causes a SevError.
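For readers unfamiliar with the failure mode, here is a minimal sketch of a flipped bit being caught by a checksum on read. It is not Redwood code: the Page struct, writePage, flipBit, readPage, and PageDecodeError are hypothetical names, FNV-1a is only a stand-in for whatever checksum Redwood actually uses, and the injected flag merely imitates the idea behind asInjectedFault.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an error such as page_decode_failed. The
// `injected` flag imitates marking a fault as deliberately injected.
struct PageDecodeError : std::runtime_error {
    bool injected;
    explicit PageDecodeError(bool injectedFault)
      : std::runtime_error("page decode failed: checksum mismatch"),
        injected(injectedFault) {}
};

// FNV-1a 32-bit hash, used here only as a simple illustrative checksum.
uint32_t fnv1a(const std::vector<uint8_t>& data) {
    uint32_t h = 2166136261u;
    for (uint8_t b : data) {
        h ^= b;
        h *= 16777619u;
    }
    return h;
}

// Hypothetical on-disk page: payload plus the checksum stored at write time.
struct Page {
    std::vector<uint8_t> payload;
    uint32_t checksum = 0;
};

Page writePage(const std::vector<uint8_t>& payload) {
    Page p;
    p.payload = payload;
    p.checksum = fnv1a(payload);
    return p;
}

// Simulates disk corruption by flipping one bit of the stored payload.
void flipBit(Page& p, size_t bitIndex) {
    assert(bitIndex / 8 < p.payload.size());
    p.payload[bitIndex / 8] ^= static_cast<uint8_t>(1u << (bitIndex % 8));
}

// Recomputes the checksum on read and throws if it does not match.
const std::vector<uint8_t>& readPage(const Page& p, bool faultWasInjected) {
    if (fnv1a(p.payload) != p.checksum) {
        throw PageDecodeError(faultWasInjected);
    }
    return p.payload;
}
```

In this sketch, a caller can inspect the injected flag on the caught error to decide whether the failure was expected by the test. How the caller then reacts is the open question in this issue.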
Must be a cancellation order problem. @xis19 Do you have a repro seed/hash and which compiler was used? It would be useful to know if the commit path or the read path first saw the checksum error.
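As an analogy for how a teardown-order problem can hide the original error behind a broken_promise, here is a small sketch using std::promise and std::future from the C++ standard library. It is not Flow code, the function names are hypothetical, and nothing in this thread establishes that this is the actual mechanism in commit_impl. It only shows that a waiter sees the original error when the producer forwards it, and a broken promise when the producer is destroyed first.

```cpp
#include <cassert>
#include <exception>
#include <future>
#include <stdexcept>
#include <string>

// Describes what a waiter observes when it consumes the future.
std::string waitOn(std::future<int>& f) {
    assert(f.valid());
    try {
        f.get();
        return "value";
    } catch (const std::future_error& e) {
        if (e.code() == std::future_errc::broken_promise) {
            return "broken_promise";
        }
        return "other future_error";
    } catch (const std::runtime_error& e) {
        return std::string("original error: ") + e.what();
    }
}

// Case 1: the producer catches the read error and forwards it to the waiter.
std::string producerForwardsError() {
    std::promise<int> p;
    std::future<int> f = p.get_future();
    try {
        throw std::runtime_error("page_decode_failed");
    } catch (...) {
        p.set_exception(std::current_exception());
    }
    return waitOn(f);
}

// Case 2: the producer is torn down before it delivers anything, so the
// waiter sees only a broken promise and the original error is lost.
std::string producerDroppedFirst() {
    std::future<int> f;
    {
        std::promise<int> p;
        f = p.get_future();
    } // promise destroyed here without a value or an exception
    return waitOn(f);
}
```

Here producerForwardsError() returns "original error: page_decode_failed", while producerDroppedFirst() returns "broken_promise".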
I think it is readPhysicalPage finding this checksum error at page->postReadPayload(pageID). The error then gets propagated to VersionedBTree::commit_impl.
@sfc-gh-satherton -- to reproduce the situation, please go to #8946, check out commit 33b1c00b3965b80f58922a4d5dedd199b7a3764a, and run:
bin/fdbserver -r simulation --crash -s 519637099 -b off -f /root/src/tests/slow/DiskFailureCycle.toml
The corrupted page is 4248, on the storage server with ID fc20d299b8ccb7a1e28f5635b71cebb3. When committing version 93, commit_impl throws the broken_promise. The function is too long, so I am having difficulty digging into the root cause.
#8946 is based on 31dd702f8c15c54d8dd66113050aebead817c008
I don't recall investigating this specific seed/commit, but since the original commit, some other things about error propagation and futures held in VersionedBTree have been refactored, so I don't think this problem would still exist.