Redwood is not handling injected corrupted bits properly

Open xis19 opened this issue 2 years ago • 3 comments

In DiskFailureCycle, random bits are flipped when data is written to disk. When the data is read back, Redwood detects the checksum mismatch and raises a page_decode_failed exception with asInjectedFault enabled. In VersionedBTree::commit_impl this exception is translated into a broken_promise, which fails updateStorage on the storage server. The storage server failure then causes a SevError.
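
The detection step described above can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: `Page`, `writePage`, `flipBit`, `readPage`, and `PageDecodeFailed` are hypothetical names, and the FNV-1a checksum is used only because it is short. None of this is Redwood's actual page format, checksum, or error type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for the page_decode_failed error.
struct PageDecodeFailed : std::runtime_error {
    PageDecodeFailed() : std::runtime_error("page_decode_failed") {}
};

// A toy page: payload bytes plus the checksum recorded at write time.
struct Page {
    std::vector<std::uint8_t> payload;
    std::uint32_t storedChecksum = 0;
};

// 32-bit FNV-1a. Chosen for brevity; not the checksum Redwood uses.
std::uint32_t checksum(const std::vector<std::uint8_t>& bytes) {
    std::uint32_t h = 2166136261u;
    for (std::uint8_t b : bytes) {
        h ^= b;
        h *= 16777619u;
    }
    return h;
}

// Write path: record the checksum of the payload alongside it.
Page writePage(std::vector<std::uint8_t> payload) {
    Page p;
    p.storedChecksum = checksum(payload);
    p.payload = std::move(payload);
    return p;
}

// Simulates a fault injector flipping one bit of the stored payload.
void flipBit(Page& page, std::size_t bitIndex) {
    assert(bitIndex < page.payload.size() * 8);
    page.payload[bitIndex / 8] ^= static_cast<std::uint8_t>(1u << (bitIndex % 8));
}

// Read path: recompute the checksum and throw on a mismatch.
const std::vector<std::uint8_t>& readPage(const Page& page) {
    if (checksum(page.payload) != page.storedChecksum) {
        throw PageDecodeFailed();
    }
    return page.payload;
}
```

In this sketch, calling `readPage` on a freshly written page returns the payload, while calling it after `flipBit` throws `PageDecodeFailed`. What the caller does with that exception is the part this issue is about.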

xis19, Dec 07 '22 08:12

Must be a cancellation order problem. @xis19 Do you have a repro seed/hash and which compiler was used? It would be useful to know if the commit path or the read path first saw the checksum error.

sfc-gh-satherton, Dec 07 '22 16:12

I think it is readPhysicalPage that finds the checksum error, at page->postReadPayload(pageID). The error then propagates to VersionedBTree::commit_impl.
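
As a rough analogy for how a read-side error can surface to a waiter as broken_promise, here is a sketch using standard C++ `std::promise`/`std::future` rather than Flow. The function names are hypothetical, and whether Redwood loses the original error in this exact way is not established in this thread. The sketch only shows the general mechanism: a promise destroyed without a value or an exception leaves the waiter with broken_promise instead of the original error.

```cpp
#include <exception>
#include <future>
#include <stdexcept>
#include <string>

// Hypothetical reader that hits a checksum error and drops its promise
// without forwarding the error to the waiting side.
std::future<int> readDroppingError() {
    std::promise<int> p;
    std::future<int> f = p.get_future();
    try {
        throw std::runtime_error("page_decode_failed");
    } catch (const std::exception&) {
        // Error swallowed here. p is destroyed at scope exit with nothing set,
        // so the shared state is completed with broken_promise.
    }
    return f;
}

// The same reader, but forwarding the original error through the promise.
std::future<int> readForwardingError() {
    std::promise<int> p;
    std::future<int> f = p.get_future();
    try {
        throw std::runtime_error("page_decode_failed");
    } catch (...) {
        p.set_exception(std::current_exception());
    }
    return f;
}

// What a commit-like waiter observes when it waits on the read.
std::string describeFailure(std::future<int> f) {
    try {
        f.get();
        return "ok";
    } catch (const std::future_error& e) {
        return e.code() == std::future_errc::broken_promise ? "broken_promise"
                                                            : "future_error";
    } catch (const std::exception& e) {
        return e.what();
    }
}
```

Here `describeFailure(readDroppingError())` reports broken_promise, while `describeFailure(readForwardingError())` reports the original page_decode_failed message. This is why it matters which path sees the checksum error first, and in which order the futures are torn down.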

xis19, Dec 07 '22 18:12

@sfc-gh-satherton To reproduce the situation, please check out #8946 at commit 33b1c00b3965b80f58922a4d5dedd199b7a3764a and run:

bin/fdbserver -r simulation --crash -s 519637099 -b off -f /root/src/tests/slow/DiskFailureCycle.toml

The corrupted page is 4248, on the storage server with ID fc20d299b8ccb7a1e28f5635b71cebb3. When committing version 93, commit_impl throws the broken_promise. The function is very long, so I am having difficulty digging into the root cause.

#8946 is based on 31dd702f8c15c54d8dd66113050aebead817c008

xis19, Dec 07 '22 18:12

I don't recall investigating this specific seed/commit, but since the original commit, some other things about error propagation and futures held in VersionedBTree have been refactored, so I don't think this problem would still exist.

sfc-gh-satherton, Aug 11 '23 19:08