[Fix] Unpause atomic writes on block insert error
This PR contains a potential fix for one of the atomic panics we've seen on the devnet:
```
May 15 15:55:28 ip-172-31-42-88 snarkos[58891]: 2024-05-15T15:55:28.771227Z WARN snarkos_node_sync::block_sync: Attempted to insert a block at the incorrect height into storage
May 15 15:55:29 ip-172-31-42-88 snarkos[58891]: thread 'tokio-runtime-worker' panicked at /Users/nkls/dev/snarkVM/ledger/store/src/helpers/rocksdb/internal/mod.rs:235:9:
May 15 15:55:29 ip-172-31-42-88 snarkos[58891]: assertion failed: !already_paused
```
I initially thought the error might indicate simultaneous block inserts from different threads (there are no await points, so it can't come from the same task), but the block lock ruled this out. That leaves improper handling of unpausing, which this changeset aims to fix.
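For intuition, here's a minimal, self-contained sketch of the suspected failure mode and the fix. All names (`Storage`, `pause_atomic_writes`, `unpause_atomic_writes`, `insert_block`) are illustrative stand-ins rather than the real snarkVM API; only the `already_paused` assertion mirrors the panic above:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Toy model of the flag guarding paused atomic (batched) writes.
struct Storage {
    atomic_writes_paused: AtomicBool,
}

impl Storage {
    fn pause_atomic_writes(&self) {
        let already_paused = self.atomic_writes_paused.swap(true, Ordering::SeqCst);
        // This is the assertion that fired on devnet: pausing twice in a row
        // means a previous operation returned without unpausing.
        assert!(!already_paused);
    }

    fn unpause_atomic_writes(&self) {
        self.atomic_writes_paused.store(false, Ordering::SeqCst);
    }

    /// Stand-in for the block insert path; fails on an incorrect height.
    fn insert_block(&self, height: u32, expected_height: u32) -> Result<(), String> {
        self.pause_atomic_writes();
        if height != expected_height {
            // Before the fix: returning here left the flag set, so the next
            // insert panicked on `assert!(!already_paused)`. The fix is to
            // unpause on the error path before propagating the error.
            self.unpause_atomic_writes();
            return Err("Attempted to insert a block at the incorrect height into storage".to_string());
        }
        self.unpause_atomic_writes();
        Ok(())
    }
}

fn main() {
    let storage = Storage { atomic_writes_paused: AtomicBool::new(false) };
    // A failed insert no longer poisons the pause flag...
    assert!(storage.insert_block(5, 3).is_err());
    // ...so a subsequent, correct insert proceeds without panicking.
    assert!(storage.insert_block(3, 3).is_ok());
}
```

The key point is that every early return from the insert path must restore the pause flag; otherwise the next insert trips the assertion.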
Drafting this PR as I haven't yet been able to confirm this fix is sufficient to resolve the issue entirely.
Is there a way to force this error to occur and validate that this fix correctly addresses the failure case?
In addition, is there a way to determine why the RocksDB write might have failed? Could there be any issues with the BFTPersistentStorage RocksDB usage in snarkOS? Are there any locks we need to enforce between the two RocksDB writes (ledger vs. BFT)?
@niklaslong correct me if I'm mistaken, but I believe the error was not RocksDB-related, but rather "couldn't insert block at height N", i.e. a logic error.
I believe the only errors in `self.block_store().insert(block)` are from the Merkle tree update or the RocksDB insert.
`BlockStore::insert` can also fail as follows:

```rust
if block.height() != u32::try_from(updated_tree.number_of_leaves())? - 1 {
    bail!("Attempted to insert a block at the incorrect height into storage")
}
```
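For context, here is a standalone recreation of when that guard fires; the `Block` and `MerkleTree` types are toy stand-ins, and only the quoted condition is taken from the snippet above:

```rust
use anyhow::{bail, Result};

// Toy stand-ins for the real snarkVM types.
struct Block { height: u32 }
struct MerkleTree { leaves: u64 }

impl Block { fn height(&self) -> u32 { self.height } }
impl MerkleTree { fn number_of_leaves(&self) -> u64 { self.leaves } }

/// The quoted guard: after appending the new block's leaf, the block's
/// height must equal the index of that leaf (leaf count minus one).
fn check_height(block: &Block, updated_tree: &MerkleTree) -> Result<()> {
    if block.height() != u32::try_from(updated_tree.number_of_leaves())? - 1 {
        bail!("Attempted to insert a block at the incorrect height into storage")
    }
    Ok(())
}

fn main() {
    // In sync: block 3 is the fourth leaf, so the check passes.
    assert!(check_height(&Block { height: 3 }, &MerkleTree { leaves: 4 }).is_ok());
    // Out of sync (e.g. a duplicate or out-of-order block): the check bails.
    assert!(check_height(&Block { height: 3 }, &MerkleTree { leaves: 3 }).is_err());
}
```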
Yes, the error I managed to trigger on the devnet was the "incorrect height" ☝️
> Is there a way to force this error to occur and validate that this fix correctly addresses the failure case?
I didn't have any further luck last week on the devnets, but we could try triggering an error on insert manually to check whether the issue can be replicated that way. It would at least validate that it's possible to trigger it on that code path; a sketch of such a test follows below.
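Something along these lines, reusing the toy `Storage` sketch from the description above (not the real snarkVM test harness):

```rust
#[cfg(test)]
mod tests {
    use super::*;

    // Force the "incorrect height" error, then verify the next insert does
    // not hit the `already_paused` assertion. Without the fix, the second
    // call panics inside `pause_atomic_writes`.
    #[test]
    fn failed_insert_does_not_poison_pause_flag() {
        let storage = Storage { atomic_writes_paused: AtomicBool::new(false) };
        assert!(storage.insert_block(10, 7).is_err());
        assert!(storage.insert_block(7, 7).is_ok());
    }
}
```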
Is this ready for review or are there still changes/testing to be done?
Should be good to go in theory. Fabiano has encountered the error again on a client; I've sent him a patched branch to try, so hopefully we can validate the fix in practice that way 🤞
Tested this on my 3-node setup and it works. The node would still corrupt the database or stop syncing for unknown reasons, but there's no `assertion failed: !already_paused` anymore.
@HarukaMa thank you for testing the branch! Did you spot any other atomic related panics in your logs?
Hmm, after grepping the logs I did see a few panics from about 16 hours ago, but since stderr is not logged to the log file, I'm not sure what they were about.
Ran the `e2e.yml` integration test on this; it passed successfully.