[Fix] Unpause atomic writes on block insert error
This PR contains a potential fix for one of the atomic panics we've seen on the devnet:
```
May 15 15:55:28 ip-172-31-42-88 snarkos[58891]: 2024-05-15T15:55:28.771227Z WARN snarkos_node_sync::block_sync: Attempted to insert a block at the incorrect height into storage
May 15 15:55:29 ip-172-31-42-88 snarkos[58891]: thread 'tokio-runtime-worker' panicked at /Users/nkls/dev/snarkVM/ledger/store/src/helpers/rocksdb/internal/mod.rs:235:9:
May 15 15:55:29 ip-172-31-42-88 snarkos[58891]: assertion failed: !already_paused
```
I initially thought the error might indicate simultaneous block inserts from different threads (there are no await points, so it can't come from the same task), but the block lock ruled this out. That leaves improper handling of unpausing, which this changeset aims to fix.
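For intuition, here's a minimal, self-contained sketch of the suspected failure mode and the fix. All names (`Storage`, `pause_atomic_writes`, `unpause_atomic_writes`, `insert_block`) are illustrative stand-ins rather than the real snarkVM API; only the `already_paused` assertion mirrors the panic above:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Toy model of the flag guarding paused atomic (batched) writes.
struct Storage {
    atomic_writes_paused: AtomicBool,
}

impl Storage {
    fn pause_atomic_writes(&self) {
        let already_paused = self.atomic_writes_paused.swap(true, Ordering::SeqCst);
        // This is the assertion that fired on devnet: pausing twice in a row
        // means a previous operation returned without unpausing.
        assert!(!already_paused);
    }

    fn unpause_atomic_writes(&self) {
        self.atomic_writes_paused.store(false, Ordering::SeqCst);
    }

    /// Stand-in for the block insert path; fails on an incorrect height.
    fn insert_block(&self, height: u32, expected_height: u32) -> Result<(), String> {
        self.pause_atomic_writes();
        if height != expected_height {
            // Before the fix: returning here left the flag set, so the next
            // insert panicked on `assert!(!already_paused)`. The fix is to
            // unpause on the error path before propagating the error.
            self.unpause_atomic_writes();
            return Err("Attempted to insert a block at the incorrect height into storage".to_string());
        }
        self.unpause_atomic_writes();
        Ok(())
    }
}

fn main() {
    let storage = Storage { atomic_writes_paused: AtomicBool::new(false) };
    // A failed insert no longer poisons the pause flag...
    assert!(storage.insert_block(5, 3).is_err());
    // ...so a subsequent, correct insert proceeds without panicking.
    assert!(storage.insert_block(3, 3).is_ok());
}
```

The key point is that every early return from the insert path must restore the pause flag; otherwise the next insert trips the assertion.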
Drafting this PR as I haven't yet been able to confirm this fix is sufficient to resolve the issue entirely.
Is there a way to force this error to occur and validate that this fix correctly addresses the failure case?
In addition, is there a way to determine why the RocksDB write might have failed? Could there be any issues with the BFTPersistentStorage RocksDB usage in snarkOS? Are there any locks we need to enforce between the two RocksDB writes (ledger vs. BFT)?
@niklaslong correct me if I'm mistaken, but I believe the error was not RocksDB-related, but rather "couldn't insert block at height N", i.e. a logic error.
I believe the only errors in `self.block_store().insert(block)` are from the Merkle tree update or the RocksDB insert.
`BlockStore::insert` can also fail as follows:

```rust
if block.height() != u32::try_from(updated_tree.number_of_leaves())? - 1 {
    bail!("Attempted to insert a block at the incorrect height into storage")
}
```
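For context, here is a standalone recreation of when that guard fires; the `Block` and `MerkleTree` types are toy stand-ins, and only the quoted condition is taken from the snippet above:

```rust
use anyhow::{bail, Result};

// Toy stand-ins for the real snarkVM types.
struct Block { height: u32 }
struct MerkleTree { leaves: u64 }

impl Block { fn height(&self) -> u32 { self.height } }
impl MerkleTree { fn number_of_leaves(&self) -> u64 { self.leaves } }

/// The quoted guard: after appending the new block's leaf, the block's
/// height must equal the index of that leaf (leaf count minus one).
fn check_height(block: &Block, updated_tree: &MerkleTree) -> Result<()> {
    if block.height() != u32::try_from(updated_tree.number_of_leaves())? - 1 {
        bail!("Attempted to insert a block at the incorrect height into storage")
    }
    Ok(())
}

fn main() {
    // In sync: block 3 is the fourth leaf, so the check passes.
    assert!(check_height(&Block { height: 3 }, &MerkleTree { leaves: 4 }).is_ok());
    // Out of sync (e.g. a duplicate or out-of-order block): the check bails.
    assert!(check_height(&Block { height: 3 }, &MerkleTree { leaves: 3 }).is_err());
}
```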
Yes, the error I managed to trigger on the devnet was the "incorrect height" ☝️
> Is there a way to force this error to occur and validate that this fix correctly addresses the failure case?
I didn't have any further luck last week on the devnets, but we could try triggering an error on insert manually to check whether the issue can be replicated that way. It would at least validate that it's possible to trigger it on that code path; a sketch of such a test follows below.
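Something along these lines, reusing the toy `Storage` sketch from the description above (not the real snarkVM test harness):

```rust
#[cfg(test)]
mod tests {
    use super::*;

    // Force the "incorrect height" error, then verify the next insert does
    // not hit the `already_paused` assertion. Without the fix, the second
    // call panics inside `pause_atomic_writes`.
    #[test]
    fn failed_insert_does_not_poison_pause_flag() {
        let storage = Storage { atomic_writes_paused: AtomicBool::new(false) };
        assert!(storage.insert_block(10, 7).is_err());
        assert!(storage.insert_block(7, 7).is_ok());
    }
}
```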
Is this ready for review or are there still changes/testing to be done?
Should be good to go in theory. Fabiano has encountered the error again on a client; I've sent him a patched branch to try, so hopefully we can validate the fix in practice that way 🤞
Tested this on my 3-node setup and it works. The node would still corrupt the database or stop syncing for unknown reasons, but there's no `assertion failed: !already_paused` anymore.
@HarukaMa thank you for testing the branch! Did you spot any other atomic related panics in your logs?
Hmm, after grepping the logs I did see a few panics from about 16 hours ago, but since stderr is not logged to the log file, I'm not sure what they were about.
Ran the `e2e.yml` integration test on this; it passed successfully.