statedb: allow longer state pruning history
Fixes #11911
This PR introduces a database-backed queue for the state pruning window, which only keeps a few blocks in memory and loads more blocks on demand, thus reducing the burden of setting a long state pruning history.
Note that the queue requires the backend database to support reference counting (e.g. parity-db); databases that do not support reference counting (e.g. rocksdb) still need to keep all blocks in memory.
polkadot address: 1CWyso9Zw46QdR54sTvKJXTUdmZznYZa2aJfNHdG6xQ6L71
User @NingLin-P, please sign the CLA here.
PTAL @arkpar
Thank you for the PR! Consider adding your polkadot address in the format described here.
For now I only took a basic look and it looks mostly good. There's just one issue. I don't think there's any real need to maintain `DbBacked::hashs`. The problem with `hashs` is that while the memory requirement is much smaller, it is still O(n). The bigger problem is that it requires loading all journal records on startup, which can take a long time with millions of records.
We can always get the next hash to prune from the database by loading the next journal record. For `next_hash` to work, we can preload the next record. So instead of `hashs` we can maintain `next`, which is a preloaded `JournalRecord` for the next block to prune. When it is time to prune block N, we just use the data from `next` and then preload the next block (N + 1) from the database into `next`.
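Roughly, something along these lines (a minimal sketch with made-up, simplified types; `JournalRecord`, `load_journal_record` and the field names here are stand-ins, not the actual state-db definitions):

```rust
type Hash = [u8; 32];
type Key = Vec<u8>;

// Simplified stand-in for a journal record; the real one also carries
// the inserted keys.
struct JournalRecord {
    hash: Hash,
    deleted: Vec<Key>,
}

/// Placeholder for loading the journal record of block `number` from the db.
fn load_journal_record(number: u64) -> Option<JournalRecord> {
    let _ = number;
    None
}

/// Keeps only the record of the next block to prune in memory,
/// instead of an O(n) queue of all hashes in the window.
struct DbBackedWindow {
    base: u64,                   // number of the next block to prune
    next: Option<JournalRecord>, // preloaded record for block `base`
}

impl DbBackedWindow {
    fn next_hash(&self) -> Option<Hash> {
        self.next.as_ref().map(|r| r.hash)
    }

    /// Prune block `base` using the preloaded record, then preload `base + 1`.
    fn prune_one(&mut self, deleted: &mut Vec<Key>) {
        if let Some(record) = self.next.take() {
            deleted.extend(record.deleted);
            self.base += 1;
            self.next = load_journal_record(self.base);
        }
    }
}
```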
> There's just one issue. I don't think there's any real need to maintain `DbBacked::hashs`. The problem with `hashs` is that while the memory requirement is much smaller, it is still O(n). The bigger problem is that it requires loading all journal records on startup, which can take a long time with millions of records.
Indeed, `DbBacked::hashs` still has these problems, but the real need for maintaining `DbBacked::hashs` is `RefWindow::have_block` (we should replace `hashs: VecDeque` with a `BTreeSet` though).
I have also checked the upper callers that require `RefWindow::have_block`, but it seems hard to remove `DbBacked::hashs` without another way to determine whether a block is canonicalized (then we could implement `have_block` just by checking the block number).
> I have also checked the upper callers that require `RefWindow::have_block`, but it seems hard to remove `DbBacked::hashs` without another way to determine whether a block is canonicalized (then we could implement `have_block` just by checking the block number).
All the blocks in `RefWindow` are guaranteed to be canonicalized. Blocks that are not yet canonicalized are handled by `NonCanonicalOverlay`, so that's not an issue.
`have_block` is currently used in:
- `is_pruned` - we can check by number here (see the sketch after this list).
- `pin` - it is fine to just remove the check and always add to the pinned set.
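For illustration only, a rough sketch of those two call sites after the change (names and fields are assumptions, not the actual `RefWindow` layout):

```rust
use std::collections::HashSet;

type Hash = [u8; 32];

struct Window {
    base: u64,             // number of the oldest block still in the window
    pinned: HashSet<Hash>, // pinned blocks
}

impl Window {
    /// `is_pruned` can be answered from the number alone: anything below
    /// the window base has already been pruned.
    fn is_pruned(&self, number: u64) -> bool {
        number < self.base
    }

    /// `pin` no longer checks membership; it just records the pin.
    fn pin(&mut self, hash: Hash) {
        self.pinned.insert(hash);
    }
}
```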
> All the blocks in `RefWindow` are guaranteed to be canonicalized
But the block in question may not be canonicalized (like the parameter passed to `is_pruned`); the call chain of `is_pruned` involves a lot of the functions of the `sc_client_api::backend::Backend` trait implemented for `sc_client_db::Backend` (e.g. `Backend::state_at`).
`is_pruned` and `pin` already check if the block is in the non-canonicalized overlay by calling `NonCanonicalOverlay::have_block`. Not to be confused with `RefWindow::have_block` - that's a different `have_block` function and there's no need to change it.
The block passed to `is_pruned`, or rather `have_state_at`, may be in one of the following states:
- Pending canonicalization - it should be in `NonCanonicalOverlay` and `is_pruned` will return it here: https://github.com/paritytech/substrate/blob/92517913b239a6e20f8b26bad9ce2238e9b978e8/client/state-db/src/lib.rs#L355
- Pending pruning. This may be checked by number as long as the header is known.
- Already pruned. This is the case where the header is in the db, but the number is out of the pruning window.
- Unknown, as in it was never inserted. This is when the header is missing.
So I suppose we'd need to document that `is_pruned` and `pin` should only be called for blocks with existing headers, and add an additional check that the header exists in `have_state_at`, and it should be fine.
For additional safety I'd add a number argument to `pin` and check that the number is in the ref window there as well.
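A rough sketch of the suggested checks, with made-up helpers (`header_exists`, `window_contains`, `non_canonical_have_block`) standing in for the real backend and state-db lookups:

```rust
type Hash = [u8; 32];

// Hypothetical stand-ins for the real lookups.
fn header_exists(_hash: &Hash) -> bool { true }
fn window_contains(_number: u64) -> bool { true }
fn non_canonical_have_block(_hash: &Hash) -> bool { false }

/// `have_state_at` first makes sure the header is known, so that
/// `is_pruned`/`pin` are only consulted for blocks that actually exist.
fn have_state_at(hash: &Hash, number: u64) -> bool {
    if !header_exists(hash) {
        return false;
    }
    non_canonical_have_block(hash) || window_contains(number)
}

/// `pin` takes the number as well, so membership in the ref window can be
/// checked in addition to the non-canonical overlay.
fn pin(hash: &Hash, number: u64) -> Result<(), ()> {
    if non_canonical_have_block(hash) || window_contains(number) {
        // record the pin here
        Ok(())
    } else {
        Err(())
    }
}
```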
@arkpar Thanks for the explanation! I have changed the implementation, but two tests of `is_pruned` are broken; I will fix them later.
PTAL again @arkpar
After some consideration I can see that `is_pruned`/`pin` is still a bit problematic, considering that the caller indeed does not know whether the hash or the number is taken into account. I think we can fix this with the following:
`pin` should return an enum:
- `Pinned` when the given hash and number are definitely known to the state DB. This is when the block is pending canonicalization, or when the hash is in the in-mem backend for `RefWindow` or in the cache.
- `MaybePinned` when the block with the given number is in the pruning window, but we can't verify that the hash matches. `MaybePinned` should also be returned when archive mode is enabled. When this is returned, the backend in `sc-client-db` should do an additional check for the state root, like this: https://github.com/paritytech/substrate/blob/d042666e55956080a2fc02f4e601cc2c1b839af0/client/db/src/lib.rs#L2286-L2290
- `Err(..)` when we know that the block definitely does not exist.
`is_pruned` should have a similar enum return value with a similar check.
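A sketch of what such a return type could look like; the variant names mirror the ones above, but this is not the final API:

```rust
type Hash = [u8; 32];

/// Hypothetical result for a `pin`-style query.
enum PinResult {
    /// The given hash and number are definitely known to the state DB
    /// (pending canonicalization, in the in-memory part of the window,
    /// or in the cache).
    Pinned,
    /// The number is inside the pruning window (or archive mode is on),
    /// but the hash could not be verified; the caller in `sc-client-db`
    /// should additionally check that the state root exists.
    MaybePinned,
}

/// A definite miss is reported as an error.
enum PinError {
    /// The block definitely does not exist in the state DB.
    InvalidBlock,
}

fn pin(_hash: &Hash, _number: u64) -> Result<PinResult, PinError> {
    // Decide between Pinned, MaybePinned and Err(InvalidBlock) here.
    Ok(PinResult::MaybePinned)
}
```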
A call to `have_state_at` before `pin` should be removed. This use case is racy anyway. You can't do anything useful with the value returned by `have_state_at`, since some other thread may cause pruning to happen and the result is obsolete right after the call. The only use for `have_state_at` should be for cases when the result being up to date is not that important, e.g. when showing block status to the user.
Comments addressed
`pin` is implemented a bit differently, because `pin` needs to pin the block according to the `have_block` result immediately, so we can't return a `MaybePin` to the caller to do further checking and then pin again. I pass a closure to `pin` instead, which will do the further checking inside `pin` if needed.
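Roughly this shape (a simplified sketch; `have_block_exact` and the error type are placeholders):

```rust
type Hash = [u8; 32];

/// Stand-in for the exact-match check against the journal record.
fn have_block_exact(_hash: &Hash, _number: u64) -> bool {
    false
}

/// When the window can only say "maybe", the provided closure performs the
/// extra check (e.g. the state-root lookup) before the pin is recorded.
fn pin<F>(hash: &Hash, number: u64, hint: F) -> Result<(), ()>
where
    F: Fn() -> bool,
{
    if have_block_exact(hash, number) || hint() {
        // record the pin here
        Ok(())
    } else {
        Err(())
    }
}
```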
Perhaps we can handle the `MayHaveBlock` case in a simpler way. Currently, with the block number passed to `RefWindow::have_block`, we can load the exact single journal record from the db and check if its hash matches the hash parameter (thanks to the block number, we don't need to iterate over the whole queue to find the hash as the previous implementation does).
Since `sp_state_machine::Storage::get` may also result in a db read, this approach does not add much cost but is much simpler.
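A sketch of that check; the journal access is a placeholder for a single keyed db read:

```rust
type Hash = [u8; 32];

/// Placeholder: load the hash stored in the journal record of block `number`.
fn journal_hash_at(number: u64) -> Option<Hash> {
    let _ = number;
    None
}

/// With the number known, `have_block` needs one db lookup instead of
/// scanning the whole queue for the hash.
fn have_block(hash: &Hash, number: u64) -> bool {
    match journal_hash_at(number) {
        Some(stored) => &stored == hash,
        None => false,
    }
}
```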
> `pin` is implemented a bit differently, because `pin` needs to pin the block according to the `have_block` result immediately, so we can't return a `MaybePin` to the caller to do further checking and then pin again. I pass a closure to `pin` instead, which will do the further checking inside `pin` if needed.
I was imagining that `MaybePinned` actually places the pin, and if the additional check fails, `unpin` should be called. But a closure works as well. `pin` should really be made to return a guard, but that's out of scope of this PR.
> Perhaps we can handle the `MayHaveBlock` case in a simpler way. Currently, with the block number passed to `RefWindow::have_block`, we can load the exact single journal record from the db and check if its hash matches the hash parameter (thanks to the block number, we don't need to iterate over the whole queue to find the hash as the previous implementation does).
The difference is how much data needs to be queried from the DB. A single root trie node is small, but the journal record may contain a lot of keys. An alternative storage scheme would be more efficient, but I don't think it is worth the effort to change it now.
Should we really put `db` as a field? (I am a bit uncomfortable with the `Clone` bound on `MetaDb`, but with the current underlying type it is fine, I just wouldn't want actual data being cloned by mistake.)
The `Clone` bound is indeed quite strange for `MetaDb`. How about we remove this `Clone` bound and pass a `D` instead of `&D`, so that we don't need to do the cloning in `RefWindow::new` but in `client/db/src/lib.rs`, where we can see it is wrapped in an `Arc`?
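A sketch of that ownership change, using a simplified `MetaDb` stand-in; the blanket `Arc` impl is an assumption about how the cheap clone would be provided:

```rust
use std::sync::Arc;

// Simplified stand-in for the metadata-db trait; note: no `Clone` bound.
trait MetaDb {
    fn get_meta(&self, key: &[u8]) -> Option<Vec<u8>>;
}

// An `Arc<T>` delegates to the inner db, so `client/db/src/lib.rs` can hand
// the window a cheap `Arc` clone instead of requiring `D: Clone` itself.
impl<T: MetaDb + ?Sized> MetaDb for Arc<T> {
    fn get_meta(&self, key: &[u8]) -> Option<Vec<u8>> {
        (**self).get_meta(key)
    }
}

// The window now owns its handle rather than borrowing and cloning it.
struct RefWindow<D: MetaDb> {
    db: D,
}

impl<D: MetaDb> RefWindow<D> {
    fn new(db: D) -> Self {
        RefWindow { db }
    }

    fn read(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.db.get_meta(key)
    }
}
```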
The batch logic of the cache (the `uncached_block` counter) may not be the most direct approach; maybe switch to a simple FIFO cache (I mean one that always adds to the cache).
Actually, the pruning window is a FIFO queue now: blocks are pushed to the back (`import`) and popped from the head (`prune_one` and `apply_pending`). The cache is only a slice of successive blocks at the head of the window (it is also a FIFO queue), to reduce the burden of keeping all blocks in memory and thus allow a longer state pruning window.
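A sketch of that layout, with simplified illustrative types (not the actual implementation):

```rust
use std::collections::VecDeque;

struct JournalRecord {
    hash: [u8; 32],
    deleted: Vec<Vec<u8>>,
}

/// The whole window lives in the database; only a consecutive slice of
/// blocks at the head (the ones to be pruned next) is cached in memory.
struct DbBackedQueue {
    base: u64,                      // number of the oldest block in the window
    last: u64,                      // number of the newest block in the window
    cache: VecDeque<JournalRecord>, // records for blocks [base, base + cache.len())
}

impl DbBackedQueue {
    /// New canonicalized blocks are appended to the back; the record is
    /// written to the db rather than kept in memory.
    fn import(&mut self, _record: &JournalRecord) {
        self.last += 1;
    }

    /// Blocks are pruned from the head; the cache is refilled in batches
    /// from the db whenever it runs dry.
    fn prune_one(&mut self) -> Option<JournalRecord> {
        if self.cache.is_empty() {
            self.refill_cache();
        }
        let record = self.cache.pop_front()?;
        self.base += 1;
        Some(record)
    }

    fn refill_cache(&mut self) {
        // Load the next batch of journal records starting at `base` from the db.
    }
}
```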
PS: now that we have the DB as a type parameter, the function `insert_block` of `state-db/src/lib.rs` can just return `D::Error` (it is the last function using an error as a function generic).
`insert_block` actually does not touch the db at all; the error it may return is a `StateDbError` from `non_canonical`.
> `insert_block` actually does not touch the db at all; the error it may return is a `StateDbError` from `non_canonical`

It might be the same error type; using `D::Error` was not really an issue when I tested.
> Actually, the pruning window is a FIFO queue now: blocks are pushed to the back (`import`) and popped from the head (`prune_one` and `apply_pending`). The cache is only a slice of successive blocks at the head of the window (it is also a FIFO queue), to reduce the burden of keeping all blocks in memory and thus allow a longer state pruning window.
Yes, FIFO was the wrong wording from me; what I meant was that in your code you give priority to the first-out block over the first-in one, and the opposite could be done. The fact that we force a consecutive range means we are emptying the back of the `VecDeque`, batch-refilling it, emptying it again, and so on. So one idea would be to just keep the index of the block and not force a consecutive range (avoiding the batch of queries this way). But that is only for when the pruning range doesn't fit in memory.
> So one idea would be to just keep the index of the block and not force a consecutive range (avoiding the batch of queries this way). But that is only for when the pruning range doesn't fit in memory.
Oh, the code emphasizes consecutiveness a lot because blocks are pruned in consecutive order. The pruning window is a FIFO queue: newly canonicalized blocks are pushed to the back of the queue, and when the queue's size exceeds the configured size, blocks at the head are popped and pruned until the size is <= the configured size again. Since the blocks in the pruning window are only used for pruning, we only load a block into the cache when it is about to be pruned.
I think I looked in the wrong direction; what matters is to have a cache for the case where the window is at the default size (I believe it is 256, from https://github.com/paritytech/substrate/blob/a1a9b475c354d873f91135ed5fa028ac9ef6a2c4/client/state-db/src/lib.rs#L269 ; it should be exposed and reused as a constant). I was looking at what happens when we get a window bigger than the cache, and then just not having the active caching is also fine.
bot merge
Waiting for commit status.
@NingLin-P and @arkpar I am testing this. I can see the effect in memory, but I still cannot query more than 4096 blocks; I have filed https://github.com/paritytech/polkadot/issues/6378 while trying to understand what is going on.