
[Feature] Ability to revert ledger db

Open HarukaMa opened this issue 7 months ago • 9 comments

🚀 Feature

Add ability to revert ledger database in case of corruption.

Motivation

Currently, if the ledger database is corrupted, the only way to recover is to either resync from genesis or download the ledger snapshot. Resyncing is not really practical right now, as syncing 7 million blocks would take months even when syncing from the CDN; downloading the snapshot is largely limited by network bandwidth: the snapshot is currently 400+ GB, and at the 500-600 Mbps I can reach, downloading the whole file takes around 2 hours. Either way, the downtime would be too long to comfortably recover a downed service.

Most of the corruption would happen around the tip of the database (incorrect current state, etc.), so most of the data in the database is actually fine, and it's very wasteful to discard it all along with the corrupted data. If we were able to revert the database to a previous point, we could avoid the costly recovery methods we currently rely on.

Implementation

The major issue is the mapping storage (FinalizeStorage), as it represents only the current state of the mappings with no history. To enable reverting, we would need to add periodic snapshots of the mapping storage, so we can go directly back to that point in time. For the other data, reverting would still be tedious to implement: we would need to find all data related to the block being reverted and remove it entry by entry, since we don't have foreign keys to automatically cascade the deletion.
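
To make the shape of this concrete, here is a rough sketch of such a revert. None of these types or methods exist in snarkVM today; `RevertableLedger` and its methods are hypothetical placeholders for the operations described above:

    type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

    /// Hypothetical interface only; snarkVM does not expose these methods today.
    trait RevertableLedger {
        fn latest_height(&self) -> Result<u32>;
        /// Height of the most recent FinalizeStorage snapshot at or below `height`.
        fn latest_mapping_snapshot_at_or_below(&self, height: u32) -> Result<u32>;
        /// Replace the live mapping state with the snapshot taken at `height`.
        fn restore_mapping_snapshot(&mut self, height: u32) -> Result<()>;
        /// Delete all block-indexed data (block, transactions, transitions, ...) at `height`.
        fn remove_block_data(&mut self, height: u32) -> Result<()>;
    }

    /// Revert to the nearest mapping snapshot at or below `target_height`; the
    /// removed blocks would then be re-synced from the network or the CDN.
    fn revert_ledger<L: RevertableLedger>(ledger: &mut L, target_height: u32) -> Result<()> {
        let snapshot_height = ledger.latest_mapping_snapshot_at_or_below(target_height)?;
        let latest_height = ledger.latest_height()?;
        ledger.restore_mapping_snapshot(snapshot_height)?;
        // There are no foreign keys to cascade deletions, so every store has to be
        // cleaned up explicitly, block by block, from the tip down to the snapshot.
        for height in ((snapshot_height + 1)..=latest_height).rev() {
            ledger.remove_block_data(height)?;
        }
        Ok(())
    }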

Are you willing to open a pull request? (See CONTRIBUTING) hmm

HarukaMa avatar May 08 '25 03:05 HarukaMa

Thank you for bringing this up, Haruka. 🙏 I agree this is an important concern for speeding up our recovery, in line with @damons' recent warnings. @Meshiest, do you have any recommendations for a design addressing @HarukaMa's concerns, given your experience with snops?

vicsn avatar May 08 '25 11:05 vicsn

We implemented something called "Checkpoints" which allows for parallelized removal of blocks. We periodically create a copy of the mappings while our nodes are running. When rollback is desired the current mappings are deleted and the copy is re-inserted. The latest round also needs to be updated given the desired height.

This does not magically fix ledger corruption, but it does allow walking back any number of blocks if a node forks or misses an update window.
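
For illustration only, the flow roughly looks like this; `Checkpoint`, `MappingStore`, and their methods are placeholders here, not the actual snops or snarkVM types:

    /// Placeholder for a full copy of the finalize mappings taken at some height.
    struct Checkpoint {
        height: u32,
        round: u64,
        mappings: Vec<(Vec<u8>, Vec<u8>)>,
    }

    /// Placeholder for the mapping storage; not a real snarkVM trait.
    trait MappingStore {
        fn dump_mappings(&self) -> Vec<(Vec<u8>, Vec<u8>)>;
        fn clear_mappings(&mut self);
        fn insert_mappings(&mut self, entries: &[(Vec<u8>, Vec<u8>)]);
        fn set_current_round(&mut self, round: u64);
    }

    /// Taken periodically while the node is running.
    fn create_checkpoint<S: MappingStore>(store: &S, height: u32, round: u64) -> Checkpoint {
        Checkpoint { height, round, mappings: store.dump_mappings() }
    }

    /// Applied when a rollback to `checkpoint.height` is desired.
    fn rollback<S: MappingStore>(store: &mut S, checkpoint: &Checkpoint) {
        store.clear_mappings();                      // delete the current mappings
        store.insert_mappings(&checkpoint.mappings); // re-insert the copied ones
        store.set_current_round(checkpoint.round);   // update the latest round for the target height
    }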

Meshiest avatar May 08 '25 14:05 Meshiest

Additionally, we should consider clearing the current proposal cache programmatically, given that it will reference transmissions which may no longer be valid.

vicsn avatar May 09 '25 14:05 vicsn

@vicsn It does that too with this trick:

        // the act of creating this ledger service with a "max_gc_rounds" set to 0
        // should clear all BFT documents
        let ledger_service = Arc::new(CoreLedgerService::new(ledger.clone(), Default::default()));
        Storage::new(ledger_service, Arc::new(BFTMemoryService::new()), 0);

Meshiest avatar May 09 '25 14:05 Meshiest

I agree that we need a robust way to facilitate rollbacks in case of any issues, ranging from problems at just the tip to full corruption for any reason.

Proposal: full rocksdb-native checkpoints

rocksdb checkpoints

These are read-only (while live) snapshots of rocksdb databases which can be used as full storage backups. They are based on hard links (which means the disk use is typically lower than a raw OS copy if it's kept locally), and can be created live with relatively little runtime performance impact.

There are 2 possible approaches to handling them - one is directly using checkpoints and the other is the backup API built on top of them. It should be noted that creating rocksdb backups via raw OS copies while it's running is not recommended, as it might lead to inconsistent metadata; only the 2 following methods are officially supported for this purpose.

direct checkpoints

pros:

  • very small diff
  • full freedom in terms of creation time (can be done via a simple POST request to the REST server)
  • easy to automate
  • can be saved to any path that may change without restarts

cons:

  • rollbacks are manual (by replacing the ledger with the checkpoint or pointing to its path)
  • old snapshots need to be handled manually (unless additionally automated)
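
For reference, creating a direct checkpoint with the Rust `rocksdb` crate is only a few lines; this is a minimal sketch (paths are illustrative, and the destination directory must not exist yet):

    use rocksdb::{checkpoint::Checkpoint, DB};
    use std::path::Path;

    fn create_ledger_checkpoint(db: &DB, dest: &Path) -> Result<(), rocksdb::Error> {
        // The checkpoint is built from hard links where possible, so creating it is
        // fast and it shares most of its on-disk data with the live database.
        let checkpoint = Checkpoint::new(db)?;
        // `dest` must not exist yet; rocksdb creates the directory itself.
        checkpoint.create_checkpoint(dest)
    }

    fn main() -> Result<(), rocksdb::Error> {
        let db = DB::open_default("/tmp/ledger-db")?;
        create_ledger_checkpoint(&db, Path::new("/tmp/ledger-checkpoint-001"))
    }

A rollback then amounts to stopping the node and pointing it at (or copying back) the checkpoint directory, which is the manual step listed in the cons above.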

the backup API

pros:

  • old snapshots can be removed automatically (by leaving N latest ones)
  • a rollback from the last checkpoint can be done automatically (e.g. snarkos --restore-backup)
  • rollbacks from older checkpoints are also possible if the backup ID is given

cons:

  • larger diff
  • the backups are all local and at the same location as the ledger
  • a slight performance penalty (the BackupEngine object should be live at all times)
  • a more involved configuration
  • full automation requires a dedicated background task or binding to an event (e.g. every N blocks)
  • obtaining the backup ID requires either manual probing or handling it from REST server's response (if it's not done automatically by snarkOS)
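
And a sketch of the backup-API variant, again using the Rust `rocksdb` crate (the exact signatures of the `backup` module differ slightly between crate versions, so treat this as an outline rather than exact code):

    use rocksdb::backup::{BackupEngine, BackupEngineOptions, RestoreOptions};
    use rocksdb::{DB, Env};

    fn backup_and_prune(db: &DB, backup_dir: &str, keep: usize) -> Result<(), rocksdb::Error> {
        let env = Env::new()?;
        let opts = BackupEngineOptions::new(backup_dir)?;
        let mut engine = BackupEngine::open(&opts, &env)?;
        // Flush memtables first so the backup captures the latest state.
        engine.create_new_backup_flush(db, true)?;
        // Keep only the N most recent backups, removing older ones automatically.
        engine.purge_old_backups(keep)
    }

    fn restore_latest(backup_dir: &str, db_dir: &str) -> Result<(), rocksdb::Error> {
        let env = Env::new()?;
        let opts = BackupEngineOptions::new(backup_dir)?;
        let mut engine = BackupEngine::open(&opts, &env)?;
        // Something like the hypothetical `snarkos --restore-backup` could call into this.
        engine.restore_from_latest_backup(db_dir, db_dir, &RestoreOptions::default())
    }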

Implementation details

I've already implemented (snarkVM, snarkOS) a simple checkpoint-based solution that can be triggered through the REST server and serve both as local incremental backups and as full remote backups; it could be extended so that periodic backups are also performed automatically every N blocks (locally, in case only a partial rollback is needed).

I'm now doing some additional benchmarking, but I'd be happy to get some feedback in the meantime.

ljedrz avatar May 13 '25 09:05 ljedrz

It should be noted that creating rocksdb backups via raw OS copies while it's running is not recommended, as it might lead to inconsistent metadata; only the 2 following methods are officially supported for this purpose.

Hmm, given this alone, the snapshots seem to be essential for us.

However, checkpointing the db at every single block won't be feasible, so even if we were to use rocksdb's native checkpointing for general backups, we should have an answer for validators to recover all the way to a specific block height. Three questions:

  1. Which approach is less brittle and less likely to end up with inconsistent data: @Meshiest's or @ljedrz's?
  2. Let's say, hypothetically, the latest block is incorrect or corrupt for 90% of validators; can they sync from the valid 10%? @kaimast
  3. Let's say, hypothetically, the latest block is incorrect or corrupt for 100% of validators; can they sync from the fastsync server up to a specific block? @kaimast

Along with rolling out fixes, we should do some testing to cover questions 2 and 3.

Super grateful for everyone bringing this up and helping. All of this should not be a huge lift but will make the software a lot more robust.

vicsn avatar May 13 '25 11:05 vicsn

I've checked how long native rocksdb checkpoints take for the full ledger: no more than a quarter of a second locally, regardless of whether it's the 1st or any subsequent checkpoint. (It would take longer on a network share, but a local checkpoint can be shipped elsewhere by external means once created.) So they are certainly a viable approach, albeit not after every block.

ljedrz avatar May 19 '25 07:05 ljedrz

We could also ensure that updates are written as a single batch (isn't the code already doing this?) or even as a RocksDB transaction. That way, the on-disk state would always be consistent.
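
For context, this is what an atomic multi-key update looks like with the Rust `rocksdb` crate; a minimal sketch with illustrative keys, not the actual snarkVM write path:

    use rocksdb::{DB, WriteBatch};

    fn write_block_atomically(db: &DB) -> Result<(), rocksdb::Error> {
        // Everything in a WriteBatch is applied as a single unit through the WAL,
        // so a crash mid-write cannot leave only part of the update on disk.
        let mut batch = WriteBatch::default();
        batch.put(b"block/42", b"...serialized block...");
        batch.put(b"height/current", b"42");
        batch.delete(b"pending/42");
        db.write(batch)
    }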

Another option would be to have something like fsck that checks for inconsistencies and moves the ledger to some consistent state. This would most likely involve re-executing some blocks, from genesis or from a checkpoint.

kaimast avatar May 20 '25 17:05 kaimast

As far as I'm aware, there hasn't been a true corruption in a long time - all the writes are indeed atomic.

ljedrz avatar May 20 '25 17:05 ljedrz