statedb: allow longer state pruning history
Fixes #11911
This PR introduces a database-backed queue for the state pruning window, which only keeps a few blocks in memory and loads more blocks on demand, thus reducing the burden of setting a long state pruning history.
Note that the queue requires the backend database to support reference counting (e.g. parity-db); databases that do not support reference counting (e.g. rocksdb) still need to keep all blocks in memory.
polkadot address: 1CWyso9Zw46QdR54sTvKJXTUdmZznYZa2aJfNHdG6xQ6L71
User @NingLin-P, please sign the CLA here.
PTAL @arkpar
Thank you for the PR! Consider adding your polkadot address in the format described here.
For now I only took a basic look and it looks mostly good. There's just one issue. I don't think there's any real need to maintain `DbBacked::hashs`. The problem with `hashs` is that while the memory requirement is much smaller, it is still O(n). The bigger problem is that it requires loading all journal records on startup, which can take a long time with millions of records.
We can always get the next hash to prune from the database by loading the next journal record. For `next_hash` to work, we can preload the next record. So instead of `hashs` we can maintain `next`, which is a preloaded `JournalRecord` for the next block to prune. When it is time to prune block N, we just use the data from `next` and then preload the next block (N + 1) from the database into `next`.
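Roughly, something along these lines (a minimal sketch with made-up, simplified types; `JournalRecord`, `load_journal_record` and the field names here are stand-ins, not the actual state-db definitions):

```rust
type Hash = [u8; 32];
type Key = Vec<u8>;

// Simplified stand-in for a journal record; the real one also carries
// the inserted keys.
struct JournalRecord {
    hash: Hash,
    deleted: Vec<Key>,
}

/// Placeholder for loading the journal record of block `number` from the db.
fn load_journal_record(number: u64) -> Option<JournalRecord> {
    let _ = number;
    None
}

/// Keeps only the record of the next block to prune in memory,
/// instead of an O(n) queue of all hashes in the window.
struct DbBackedWindow {
    base: u64,                   // number of the next block to prune
    next: Option<JournalRecord>, // preloaded record for block `base`
}

impl DbBackedWindow {
    fn next_hash(&self) -> Option<Hash> {
        self.next.as_ref().map(|r| r.hash)
    }

    /// Prune block `base` using the preloaded record, then preload `base + 1`.
    fn prune_one(&mut self, deleted: &mut Vec<Key>) {
        if let Some(record) = self.next.take() {
            deleted.extend(record.deleted);
            self.base += 1;
            self.next = load_journal_record(self.base);
        }
    }
}
```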
> There's just one issue. I don't think there's any real need to maintain `DbBacked::hashs`. The problem with `hashs` is that while the memory requirement is much smaller, it is still O(n). The bigger problem is that it requires loading all journal records on startup, which can take a long time with millions of records.
Indeed, `DbBacked::hashs` still has these problems, but the real need for maintaining `DbBacked::hashs` is `RefWindow::have_block` (we should replace `hashs: VecDeque` with a `BTreeSet` though).
I have also checked the upper callers that require `RefWindow::have_block`, but it seems hard to remove `DbBacked::hashs` without another way to determine whether a block is canonicalized (then we could implement `have_block` just by checking the block number).
> I have also checked the upper callers that require `RefWindow::have_block`, but it seems hard to remove `DbBacked::hashs` without another way to determine whether a block is canonicalized (then we could implement `have_block` just by checking the block number).
All the blocks in `RefWindow` are guaranteed to be canonicalized. Blocks that are not yet canonicalized are handled by `NonCanonicalOverlay`, so that's not an issue.
`have_block` is currently used in:
- `is_pruned` - we can check by number here (see the sketch after this list).
- `pin` - it is fine to just remove the check and always add to the pinned set.
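For illustration only, a rough sketch of those two call sites after the change (names and fields are assumptions, not the actual `RefWindow` layout):

```rust
use std::collections::HashSet;

type Hash = [u8; 32];

struct Window {
    base: u64,             // number of the oldest block still in the window
    pinned: HashSet<Hash>, // pinned blocks
}

impl Window {
    /// `is_pruned` can be answered from the number alone: anything below
    /// the window base has already been pruned.
    fn is_pruned(&self, number: u64) -> bool {
        number < self.base
    }

    /// `pin` no longer checks membership; it just records the pin.
    fn pin(&mut self, hash: Hash) {
        self.pinned.insert(hash);
    }
}
```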
> All the blocks in `RefWindow` are guaranteed to be canonicalized
But the block in question may not be canonicalized (like the parameter passed to `is_pruned`); the call chain of `is_pruned` involves a lot of the functions of the `sc_client_api::backend::Backend` trait implemented for `sc_client_db::Backend` (e.g. `Backend::state_at`).
`is_pruned` and `pin` already check if the block is in the non-canonicalized overlay by calling `NonCanonicalOverlay::have_block`. Not to be confused with `RefWindow::have_block` - that's a different `have_block` function and there's no need to change it.
The block passed to `is_pruned`, or rather `have_state_at`, may be in one of the following states:
- Pending canonicalization - it should be in `NonCanonicalOverlay` and `is_pruned` will return it here: https://github.com/paritytech/substrate/blob/92517913b239a6e20f8b26bad9ce2238e9b978e8/client/state-db/src/lib.rs#L355
- Pending pruning. This may be checked by number as long as the header is known.
- Already pruned. This is the case where the header is in the db, but the number is out of the pruning window.
- Unknown, as in it was never inserted. This is when the header is missing.
So I suppose we'd need to document that `is_pruned` and `pin` should only be called for blocks with existing headers, and add an additional check that the header exists in `have_state_at`, and it should be fine.
For additional safety I'd add a number argument to `pin` and check that the number is in the ref window there as well.
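A rough sketch of the suggested checks, with made-up helpers (`header_exists`, `window_contains`, `non_canonical_have_block`) standing in for the real backend and state-db lookups:

```rust
type Hash = [u8; 32];

// Hypothetical stand-ins for the real lookups.
fn header_exists(_hash: &Hash) -> bool { true }
fn window_contains(_number: u64) -> bool { true }
fn non_canonical_have_block(_hash: &Hash) -> bool { false }

/// `have_state_at` first makes sure the header is known, so that
/// `is_pruned`/`pin` are only consulted for blocks that actually exist.
fn have_state_at(hash: &Hash, number: u64) -> bool {
    if !header_exists(hash) {
        return false;
    }
    non_canonical_have_block(hash) || window_contains(number)
}

/// `pin` takes the number as well, so membership in the ref window can be
/// checked in addition to the non-canonical overlay.
fn pin(hash: &Hash, number: u64) -> Result<(), ()> {
    if non_canonical_have_block(hash) || window_contains(number) {
        // record the pin here
        Ok(())
    } else {
        Err(())
    }
}
```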
@arkpar Thanks for the explanation! I have changed the implementation, but two tests of `is_pruned` are broken; I will fix them later.
PTAL again @arkpar
After some consideration I can see that `is_pruned`/`pin` is still a bit problematic, considering that the caller indeed does not know whether the hash or the number is taken into account. I think we can fix this with the following:
`pin` should return an enum:
- `Pinned` when the given hash and number are definitely known to the state DB. This is when the block is pending canonicalization, or when the hash is in the in-mem backend for `RefWindow` or in the cache.
- `MaybePinned` when the block with the given number is in the pruning window, but we can't verify that the hash matches. `MaybePinned` should also be returned when archive mode is enabled. When this is returned, the backend in `sc-client-db` should do an additional check for the state root, like this: https://github.com/paritytech/substrate/blob/d042666e55956080a2fc02f4e601cc2c1b839af0/client/db/src/lib.rs#L2286-L2290
- `Err(..)` when we know that the block definitely does not exist.
`is_pruned` should have a similar enum return value with a similar check.
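A sketch of what such a return type could look like; the variant names mirror the ones above, but this is not the final API:

```rust
type Hash = [u8; 32];

/// Hypothetical result for a `pin`-style query.
enum PinResult {
    /// The given hash and number are definitely known to the state DB
    /// (pending canonicalization, in the in-memory part of the window,
    /// or in the cache).
    Pinned,
    /// The number is inside the pruning window (or archive mode is on),
    /// but the hash could not be verified; the caller in `sc-client-db`
    /// should additionally check that the state root exists.
    MaybePinned,
}

/// A definite miss is reported as an error.
enum PinError {
    /// The block definitely does not exist in the state DB.
    InvalidBlock,
}

fn pin(_hash: &Hash, _number: u64) -> Result<PinResult, PinError> {
    // Decide between Pinned, MaybePinned and Err(InvalidBlock) here.
    Ok(PinResult::MaybePinned)
}
```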
A call to `have_state_at` before `pin` should be removed. This use case is racy anyway. You can't do anything useful with the value returned by `have_state_at`, since some other thread may cause pruning to happen and the result is obsolete right after the call. The only use for `have_state_at` should be for cases when the result being up to date is not that important, e.g. when showing block status to the user.
Comments addressed
`pin` is implemented a bit differently, because `pin` needs to pin the block according to the `have_block` result immediately, so we can't return a `MaybePin` to the caller to do further checking and then pin again. I pass a closure to `pin` instead, which will do the further checking inside `pin` if needed.
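Roughly this shape (a simplified sketch; `have_block_exact` and the error type are placeholders):

```rust
type Hash = [u8; 32];

/// Stand-in for the exact-match check against the journal record.
fn have_block_exact(_hash: &Hash, _number: u64) -> bool {
    false
}

/// When the window can only say "maybe", the provided closure performs the
/// extra check (e.g. the state-root lookup) before the pin is recorded.
fn pin<F>(hash: &Hash, number: u64, hint: F) -> Result<(), ()>
where
    F: Fn() -> bool,
{
    if have_block_exact(hash, number) || hint() {
        // record the pin here
        Ok(())
    } else {
        Err(())
    }
}
```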
Perhaps we can handle the `MayHaveBlock` case in a simpler way. Currently, with the block number passed to `RefWindow::have_block`, we can load the exact single journal record from the db and check if its hash matches the hash parameter (thanks to the block number, we don't need to iterate over the whole queue to find the hash as the previous implementation does).
Since `sp_state_machine::Storage::get` may also result in a db read, this approach does not add much cost but is much simpler.
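A sketch of that check; the journal access is a placeholder for a single keyed db read:

```rust
type Hash = [u8; 32];

/// Placeholder: load the hash stored in the journal record of block `number`.
fn journal_hash_at(number: u64) -> Option<Hash> {
    let _ = number;
    None
}

/// With the number known, `have_block` needs one db lookup instead of
/// scanning the whole queue for the hash.
fn have_block(hash: &Hash, number: u64) -> bool {
    match journal_hash_at(number) {
        Some(stored) => &stored == hash,
        None => false,
    }
}
```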
> `pin` is implemented a bit differently, because `pin` needs to pin the block according to the `have_block` result immediately, so we can't return a `MaybePin` to the caller to do further checking and then pin again. I pass a closure to `pin` instead, which will do the further checking inside `pin` if needed.
I was imagining that `MaybePinned` actually places the pin, and if the additional check fails, `unpin` should be called. But a closure works as well. `pin` should really be made to return a guard, but that's out of scope of this PR.
> Perhaps we can handle the `MayHaveBlock` case in a simpler way. Currently, with the block number passed to `RefWindow::have_block`, we can load the exact single journal record from the db and check if its hash matches the hash parameter (thanks to the block number, we don't need to iterate over the whole queue to find the hash as the previous implementation does).
The difference is how much data needs to be queried from the DB. A single root trie node is small, but the journal record may contain a lot of keys. An alternative storage scheme would be more efficient, but I don't think it is worth the effort to change it now.
Should we really put `db` as a field? (I am a bit uncomfortable with the `Clone` bound on `MetaDb`, but with the current underlying type it is fine, I just wouldn't want actual data being cloned by mistake.)
The `Clone` bound is indeed quite strange for `MetaDb`. How about we remove this `Clone` bound and pass a `D` instead of `&D`, so that we don't need to do the cloning in `RefWindow::new` but in `client/db/src/lib.rs`, where we can see it is wrapped in an `Arc`?
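A sketch of that ownership change, using a simplified `MetaDb` stand-in; the blanket `Arc` impl is an assumption about how the cheap clone would be provided:

```rust
use std::sync::Arc;

// Simplified stand-in for the metadata-db trait; note: no `Clone` bound.
trait MetaDb {
    fn get_meta(&self, key: &[u8]) -> Option<Vec<u8>>;
}

// An `Arc<T>` delegates to the inner db, so `client/db/src/lib.rs` can hand
// the window a cheap `Arc` clone instead of requiring `D: Clone` itself.
impl<T: MetaDb + ?Sized> MetaDb for Arc<T> {
    fn get_meta(&self, key: &[u8]) -> Option<Vec<u8>> {
        (**self).get_meta(key)
    }
}

// The window now owns its handle rather than borrowing and cloning it.
struct RefWindow<D: MetaDb> {
    db: D,
}

impl<D: MetaDb> RefWindow<D> {
    fn new(db: D) -> Self {
        RefWindow { db }
    }

    fn read(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.db.get_meta(key)
    }
}
```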
The batch logic of the cache (the `uncached_block` counter) may not be the most direct approach; maybe switch to a simple FIFO cache (I mean one that always adds to the cache).
Actually, the pruning window is a FIFO queue now: blocks are pushed to the back (`import`) and popped from the head (`prune_one` and `apply_pending`). The cache is only a slice of successive blocks at the head of the window (it is also a FIFO queue), to reduce the burden of keeping all blocks in memory and thus allow a longer state pruning window.
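A sketch of that layout, with simplified illustrative types (not the actual implementation):

```rust
use std::collections::VecDeque;

struct JournalRecord {
    hash: [u8; 32],
    deleted: Vec<Vec<u8>>,
}

/// The whole window lives in the database; only a consecutive slice of
/// blocks at the head (the ones to be pruned next) is cached in memory.
struct DbBackedQueue {
    base: u64,                      // number of the oldest block in the window
    last: u64,                      // number of the newest block in the window
    cache: VecDeque<JournalRecord>, // records for blocks [base, base + cache.len())
}

impl DbBackedQueue {
    /// New canonicalized blocks are appended to the back; the record is
    /// written to the db rather than kept in memory.
    fn import(&mut self, _record: &JournalRecord) {
        self.last += 1;
    }

    /// Blocks are pruned from the head; the cache is refilled in batches
    /// from the db whenever it runs dry.
    fn prune_one(&mut self) -> Option<JournalRecord> {
        if self.cache.is_empty() {
            self.refill_cache();
        }
        let record = self.cache.pop_front()?;
        self.base += 1;
        Some(record)
    }

    fn refill_cache(&mut self) {
        // Load the next batch of journal records starting at `base` from the db.
    }
}
```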
PS: now that we have the DB as a type parameter, the function `insert_block` of `state-db/src/lib.rs` can just return `D::Error` (it is the last function using an error as a function generic).
`insert_block` actually does not touch the db at all; the error it may return is a `StateDbError` from `non_canonical`.
> `insert_block` actually does not touch the db at all; the error it may return is a `StateDbError` from `non_canonical`

It might be the same error type; using `D::Error` was not really an issue when I tested.
> Actually, the pruning window is a FIFO queue now: blocks are pushed to the back (`import`) and popped from the head (`prune_one` and `apply_pending`). The cache is only a slice of successive blocks at the head of the window (it is also a FIFO queue), to reduce the burden of keeping all blocks in memory and thus allow a longer state pruning window.
Yes, FIFO was the wrong wording from me; what I meant was that in your code you give priority to the first-out block over the first-in one, and the opposite could be done. The fact that we force a consecutive range means we are emptying the back of the `VecDeque`, batch-refilling it, emptying it again, and so on. So one idea would be to just keep the index of the block and not force a consecutive range (avoiding the batch of queries this way). But that is only for when the pruning range doesn't fit in memory.
> So one idea would be to just keep the index of the block and not force a consecutive range (avoiding the batch of queries this way). But that is only for when the pruning range doesn't fit in memory.
Oh, the code emphasizes consecutiveness a lot because blocks are pruned in consecutive order. The pruning window is a FIFO queue: newly canonicalized blocks are pushed to the back of the queue, and when the queue's size exceeds the configured size, blocks at the head are popped and pruned until the size is <= the configured size again. Since the blocks in the pruning window are only used for pruning, we only load a block into the cache when it is about to be pruned.
I think I looked in the wrong direction; what matters is to have a cache for the case where the window is at the default size (I believe it is 256, from https://github.com/paritytech/substrate/blob/a1a9b475c354d873f91135ed5fa028ac9ef6a2c4/client/state-db/src/lib.rs#L269 ; it should be exposed and reused as a constant). I was looking at what happens when we get a window bigger than the cache, and then just not having the active caching is also fine.
bot merge
Waiting for commit status.
@NingLin-P and @arkpar I am testing this. I can see the effect in memory, but I still cannot query more than 4096 blocks; I have filed https://github.com/paritytech/polkadot/issues/6378 while trying to understand what is going on.