reth
High-level spec for `Full Node` and `Snapshots`
## High Level
A draft of the high-level specs for snapshots and the full node. Joining them here since they share some important similarities, and it might help to keep that in mind: they both work on sliding-window intervals during syncing, and any table that can be pruned is a candidate for a snapshot file in the end.
## Categories (prunable or moveable to snapshots)
- Historical: `AccountChangeSet` / `StorageChangeSet` / `AccountHistory` / `StorageHistory`
- Transactions: `Transactions` / `TxSenders` / `TxLookup` / `TxHashNumber` / `TransactionBlock`
- Receipts
## Full Node
- Default pruning: keep data related to the 256 most recent blocks.
- Any table which has a `BlockNumber` or `TxNumber` as key is a candidate for pruning.
- Customization of what counts as the most recent blocks [allow keeping none of the block data, apart from state, after verifying].
- Each category can be fully pruned (e.g. no receipts) if so desired.
- Prune levels should allow skipping certain stages (e.g. `TxSenders` / `TxLookup`).
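The per-category customization above could be modeled as a prune mode per table. A minimal sketch, assuming a hypothetical `PruneMode` enum (names and variants are illustrative, not reth's actual API):

```rust
/// Hypothetical per-category prune configuration; names are illustrative.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PruneMode {
    /// Keep everything (no pruning).
    None,
    /// Keep only the last `n` blocks (e.g. 256 for the default full node).
    Distance(u64),
    /// Drop the category entirely (e.g. no receipts at all).
    Full,
}

impl PruneMode {
    /// Returns true if data for `block` should be pruned, given the current `tip`.
    fn should_prune(&self, block: u64, tip: u64) -> bool {
        match *self {
            PruneMode::None => false,
            PruneMode::Distance(n) => block < tip.saturating_sub(n),
            PruneMode::Full => true,
        }
    }
}
```

A full node would then default every prunable category to `Distance(256)`, while a user could opt into `Full` for e.g. receipts.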
## Snapshots
- Fixed interval [every N blocks, or every N transactions].
- Any table which has a `BlockNumber` or `TxNumber` as key is a candidate for moving to static files.
- Reproducible snapshots: there should be an option to sync from the beginning and let the node create all of its own snapshots, which should match existing ones.
- Shared through a centralized host and/or p2p.
- Static file per category or per table.
- Perfect hashing table for keys.
- Compressed values (overall dictionary (across blocks 0-17M) / per file / or both [double compression]).
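The fixed-interval idea can be sketched as a mapping from block number to the static-file segment that would hold it; the 500k-block interval below is an arbitrary placeholder, not a decided value:

```rust
/// Fixed-interval static files: map a block number to the inclusive
/// [start, end] range of the snapshot segment that would contain it.
/// The interval is an illustrative placeholder.
const INTERVAL: u64 = 500_000;

fn segment_range(block: u64) -> (u64, u64) {
    let start = (block / INTERVAL) * INTERVAL;
    (start, start + INTERVAL - 1)
}
```

Reproducibility then reduces to every node deterministically producing the same segment boundaries and the same contents within each segment.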
## Syncing
- Full node:
  - Requests block height from the CL.
  - Syncs with pruned options [e.g. don't store/calculate changesets, receipts, `TxSenders`, etc. until the most recent blocks].
  - When approaching a block number close to `block_height - 256`, rechecks the latest block height.
  - Once it reaches the desired height, registers a process with the wake-up register (more on it below) and sleeps.
- Snapshots:
  - ??
## Tip

### Register
- Wake-up register [condition -> wake-up registered process (channel) -> execute]
  - Loop through the registered processes.
  - If a condition is met, wake up the registered process.
  - Execute [full node: clean up tables | snapshot: move data out of tables].
  - On either success or failure, update the condition with a new one.
- Once a block has been handled, check the wake-up register conditions [e.g. a static file waits for an exact block height, while a full node might execute every N blocks].
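The register described above could be sketched roughly as follows; `Condition`, `Registered`, and the channel-based wake-up are illustrative names invented for this sketch, not an existing reth API:

```rust
use std::sync::mpsc;

/// When a registered process should wake up (illustrative).
enum Condition {
    /// Wake at an exact block height (e.g. a static-file job).
    AtBlock(u64),
    /// Wake every N blocks (e.g. full-node table clean-up).
    EveryNBlocks(u64),
}

struct Registered {
    condition: Condition,
    /// Sends the block height that triggered the wake-up.
    waker: mpsc::Sender<u64>,
}

/// Called once a block has been handled: wake every process whose condition is met.
fn check_register(register: &mut Vec<Registered>, block: u64) {
    register.retain_mut(|r| {
        let met = match &r.condition {
            Condition::AtBlock(h) => block >= *h,
            Condition::EveryNBlocks(n) => *n != 0 && block % *n == 0,
        };
        if met {
            let _ = r.waker.send(block);
            // On success or failure, the woken process would re-register
            // with a new condition; here we simply drop the old entry.
            return false;
        }
        true
    });
}
```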
## Priority
1. Full node syncing
2. Wake-up register
3. Full node @ tip wake-up process (registration & table clean-up)
4. Snapshot @ tip wake-up process (registration & moving table data)
5. Snapshot syncing
We need to specify here how this would even work given the current architecture, i.e.:
- What assumptions do we make in the stages, tree, and RPC about data availability? (e.g. unwinds assume changesets are available for exec/merkle)
- Since we only have one write transaction, how will we periodically delete data from the tables (both for pruning and snapshots) while syncing?
- How do we adjust our current database abstraction to work with static files? I assume the static files generated by "snapshots" can also be used by the node as a secondary store for historical data, to keep the MDBX database small? This was not mentioned here, but it should be accounted for.
@shekhirin will take this
## Data Availability assumptions
- In case of a reorg, we need to have account and storage changesets available up to the reorged block to unwind the execution, hashing and merkle stages. It applies to both pipeline and blockchain tree unwinds. We should be able to find an optimal and safe value for max reorg depth, e.g. 2 epochs = 64 blocks.
- The CL needs logs (and hence, receipts) available from the block where the `DepositContract` was deployed (or the first deposit was made). It's possible for the CL to operate without these logs if it syncs from a checkpoint. Alternatively, we could keep only deposit transaction receipts, which are required for the CL's `eth_getLogs` to succeed.
- Pruning headers and bodies by default is most likely a bad idea, because it affects network health by leaving the node unable to fulfil the devp2p peer requests `GetBlockHeaders` and `GetBlockBodies`. But having it as a configurable option might be a good idea.
## Pruning

### Pruning during the pipeline sync
- We want to do a pruned initial sync, not requiring the user to wait until the node fully syncs first, and only then prune.
- It will work by calculating and persisting the data only from the requested pruning height.
### Pruning during the live sync (blockchain tree)
- We want the node to continue syncing and being responsive to RPC requests.
- After every new block is processed, the background pruning task (see below) will handle pruning.
### Background pruning task
- Listens for new canonical blocks via `CanonStateNotifications`.
- Has a configurable minimal pruning interval, which determines how often we can prune the data across all stages.
- Setting the minimal pruning interval to 1 block (i.e. pruning after every new block) is not ideal because the disk will wear out faster.
- If the minimal pruning interval condition is met, checks if any stages require pruning, and prunes the data if so.
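The minimal-interval gating could look roughly like this; `Pruner` and its fields are hypothetical names for illustration, not reth's actual pruner:

```rust
/// Tracks when the background pruning task last ran. `min_interval` is the
/// configurable minimal pruning interval, in blocks (illustrative).
struct Pruner {
    min_interval: u64,
    last_pruned_block: Option<u64>,
}

impl Pruner {
    /// Called on every new-canonical-block notification; returns true if
    /// enough blocks have passed since the last prune run, in which case
    /// the caller would check which stages require pruning and prune them.
    fn is_pruning_needed(&mut self, tip: u64) -> bool {
        match self.last_pruned_block {
            Some(last) if tip < last + self.min_interval => false,
            _ => {
                self.last_pruned_block = Some(tip);
                true
            }
        }
    }
}
```

With `min_interval` above 1, the disk-wear concern from the bullet above is addressed by batching several blocks' worth of pruning into one run.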
### Special case for a full node, i.e. pruning all data
- For both the pipeline and the blockchain tree, we will need to not write the data that was requested to be fully pruned **at all**, so that we don't do double work: write, and then immediately prune on the next pruning interval.
## Interface
- We have a `DatabaseProvider` trait, which currently contains the methods for inserting/appending and unwinding the data.
- We will also add the pruning methods to it, and call these methods from the blockchain tree.
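As a rough illustration of what adding prune methods to a provider trait might look like — the trait, method names, and toy in-memory backend below are invented for this sketch and do not match reth's actual `DatabaseProvider`:

```rust
use std::collections::BTreeMap;

type ProviderResult<T> = Result<T, String>;

/// Hypothetical pruning extension alongside the existing insert/unwind methods.
trait PruneProvider {
    /// Remove changesets for blocks strictly below `block`; returns how many
    /// entries were removed.
    fn prune_changesets_below(&mut self, block: u64) -> ProviderResult<usize>;
}

/// Toy in-memory stand-in for the database, keyed by block number.
struct MemProvider {
    changesets: BTreeMap<u64, Vec<u8>>,
}

impl PruneProvider for MemProvider {
    fn prune_changesets_below(&mut self, block: u64) -> ProviderResult<usize> {
        // split_off keeps everything >= `block` in the returned map.
        let kept = self.changesets.split_off(&block);
        let removed = self.changesets.len();
        self.changesets = kept;
        Ok(removed)
    }
}
```

The blockchain tree would call such methods from the background pruning task described above.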
> Not persisting any changesets until the desired pruning height
I'd recommend avoiding this if we want to support forward syncing in the future, since we can only really do this because backwards sync gives us some sort of guarantee that all historical blocks should be executable with no issues. But if we forward sync, that might not be the case - we may encounter invalid blocks, and in that case we would need to unwind, which relies on changesets. So it would have to be a sliding window, i.e. always save e.g. the last 256 blocks of changesets in execution, and remove any older ones every time we commit.
If we don't want to think about this now, we should at least note it down in the implementation so we remember it later.
This is a good point.
> So it would have to be a sliding window, i.e. always save e.g. the last 256 blocks of changesets in execution, and remove any older ones every time we commit.
It should work, yeah, updated my comment.
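The sliding-window idea from the comment above could be sketched like this, with an in-memory map standing in for the changeset tables and 256 as the example window:

```rust
use std::collections::BTreeMap;

/// Sliding window over changesets: after committing `tip`, keep only the last
/// `window` blocks (e.g. 256) so unwinds within that depth remain possible.
fn prune_changeset_window(
    changesets: &mut BTreeMap<u64, Vec<u8>>, // block -> changeset (toy encoding)
    tip: u64,
    window: u64,
) {
    let cutoff = tip.saturating_sub(window);
    // split_off keeps blocks > cutoff, i.e. the (tip - window, tip] range.
    *changesets = changesets.split_off(&(cutoff + 1));
}
```

Run on every commit, this keeps the changesets needed for a bounded unwind while dropping everything older.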
Brain dump on what can be pruned and the side effects of it:
- Changesets and History can be pruned; it will affect `eth_getStorageAt`, `eth_getBalanceAt` and the tracing RPC methods.
- Transaction Senders can be pruned; it will affect execution performance.
- Transactions can be pruned; it will affect network health.
- Receipts can be pruned; it will affect the validator's ability to start up without a checkpoint, plus the `eth_getLogs`, `eth_getFilterLogs`, etc. RPC methods.
- Transaction Lookup Index can be pruned; it will affect the `eth_getTransactionByHash`, `eth_getTransactionReceipt` and `debug_getRawTransaction` RPC methods, as it's used to get a transaction by hash.
- Plain State can't be pruned.
- Hashed State and Tries can't be pruned, because recalculation takes a lot of time and we need them for the chain to progress.
Rough minimum size estimation:
- Transactions (435GB)
- Receipts (only deposit transaction receipts, needed for the validator to execute `eth_getLogs` to get deposits; not sure how much it is, but I'd say <5GB)
- Plain State (95GB)
- Hashed State (85GB)
- Tries (24GB)

Total is ~650GB.
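A quick sanity check on the arithmetic (figures taken from the list above; receipts use the guessed <5GB upper bound):

```rust
/// Sum the per-category minimum-size estimates from the list above, in GB.
fn minimum_size_gb() -> u64 {
    let transactions = 435;
    let receipts = 5; // deposit-transaction receipts only, guessed upper bound
    let plain_state = 95;
    let hashed_state = 85;
    let tries = 24;
    transactions + receipts + plain_state + hashed_state + tries
}
```

This comes to 644GB, consistent with the ~650GB total stated above.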
Very nice. @joshieDo @shekhirin:
- WDYT about making the transactions table even smaller? How much do we gain if we use a minimal perfect hash function to 'compress' the keys when we put the transactions on a static file?
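For a rough sense of the possible gain from the perfect-hashing question above: explicit 32-byte tx-hash keys versus a minimal perfect hash function at a few bits per key. All numbers here (transaction count, bits per key) are illustrative assumptions, not measured reth values:

```rust
/// Bytes spent storing explicit 32-byte transaction-hash keys.
fn key_bytes_explicit(num_txs: u64) -> u64 {
    num_txs * 32
}

/// Bytes spent by a minimal perfect hash function at `bits_per_key` bits
/// per key (typical MPHF constructions use roughly 2-4 bits per key).
fn key_bytes_mphf(num_txs: u64, bits_per_key: u64) -> u64 {
    (num_txs * bits_per_key) / 8
}
```

Under these assumptions, e.g. 2 billion transactions would need ~64GB of explicit keys versus under 1GB for the MPHF, at the cost of the MPHF not detecting lookups for keys that were never inserted.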
Avalanche also implemented a bespoke Snap sync mechanism which gossips state diffs (instead of raw state) which avoids the "healing" process of Geth's Snap sync, maybe something to learn here: https://github.com/ava-labs/avalanchego/tree/master/x/sync.
Please see https://github.com/paradigmxyz/reth/issues/2753 as well for additional ideas.