Introduce trie level cache and remove state cache
This pr introduces a trie level cache and removes the old state cache. As seen on a lot of parachains the state cache was/is buggy: https://github.com/paritytech/cumulus/issues/474 We looked into it, but couldn't really identify the bug(s). Besides these bugs collator were also building blocks without using any cache: https://github.com/paritytech/substrate/issues/9770 This means that every state access has gone to the database and this is costly. If you look at the benchmarks below, reading a single key is 10-8x slower than reading from the cache.
So, we started to write a new cache on the trie level. This cache would "automatically" solve the invalidation of old nodes because each node is addressed by its hash, which changes when the content changes. This was the first version of the pull request, a trie node cache. But benchmarking has shown that this trie node cache wasn't enough, the state cache was still faster. The reason for this is that the state cache has in optimum a complexity of O(1) for accessing cached data, while with the trie node cache we still need to walk down the trie which is O(log N). To bring both caches closer to each other, a value cache was introduced. This value cache also providing O(1) in the optimum case. To not run into the same problems as the state cache, the key to this cache is (storage_root, key). Using this key we ensure that we don't accidentally access outdated data, resulting in the problems as seen with the Parachains.
While introducing the trie cache, we also needed to change the way we record storage proofs. Before this pr there wasn't any cache when recording as already explained above. The recorder was between the trie and the database, recording anything that was read from the database. However, that wouldn't work with the cache as it sits "inside the trie" intercepting calls to the database as well. So, the recorder also needed to move there. The recorder also needed to change for when generating the final storage proof. Before we just collected all accessed nodes and put them together into the storage proof. However, now with the value cache we may not even access a trie node to access data in the trie and thus, we can not record a trie node. To solve this we keep track of all keys accessed in this value cache. When the storage proof is generated, we will need to walk down the trie to record all nodes to access these values. However, as we only do this once and the trie nodes are already cached (otherwise the value cache wouldn't have the data) it will not take that much time as always accessing the database as before.
To compare the different caches, there was the state-access benchmark added. With the following results:
| Benchmark | State cache | Trie cache | No cache |
|---|---|---|---|
| Reading the entire state (1Mio keys) | 372ms | 524ms | 6.2s |
| Reading a single key | 386ns | 561ns | 4.4us |
Hashing :code |
327ns | 521ns | 4.9us |
As we can see the trie cache is around 1.4x slower than the state cache. This is good enough for the first implementation ;) Especially as on testing syncing speed we don't have seen any performance impacts which indicates that it is not IO bound at the moment.
This pr requires the following pr to be merged before and released: https://github.com/paritytech/trie/pull/157
Closes: https://github.com/paritytech/substrate/issues/9770 Closes: https://github.com/paritytech/substrate/issues/9769 Closes: https://github.com/paritytech/cumulus/issues/607 Closes: https://github.com/paritytech/substrate/issues/9320 Closes: https://github.com/paritytech/substrate/issues/9697
CLI changes
This removes the --state-cache parameter and adds the new --trie-cache-size cli parameter. Parachain operators should now be able to just drop --state-cache 0 and are not required to add the --trie-cache-size as the cache is enabled by default.
- It'd be nice to also have a multithreaded benchmark which would access the cache in parallel and see how this new implementation compares to the old one.
- Have you tried fuzzing this code? The code's a lot nicer than the old state cache so it might not find anything, but still, there's a lot of code and it is critical that it is correct. If not I could probably throw a few testcases at a fuzzer for you and see if it finds anything.
2. Have you tried fuzzing this code? The code's a lot nicer than the old state cache so it might not find anything, but still, there's a lot of code and it is critical that it is correct. If not I could probably throw a few testcases at a fuzzer for you and see if it finds anything.
I didn't fuzz it, but I'm also no "fuzzing expert". So, if you want to push some fuzzing tests, I would be thankful.
I didn't fuzz it, but I'm also no "fuzzing expert". So, if you want to push some fuzzing tests, I would be thankful.
Okay; I'll see what I can do!
This pr introduces a trie level cache and removes the old state cache. As seen on a lot of parachains the state cache was/is buggy: https://github.com/paritytech/cumulus/issues/474
Is this issue really related? It does not mention any state cache "bugs".
This pr introduces a trie level cache and removes the old state cache. As seen on a lot of parachains the state cache was/is buggy: paritytech/cumulus#474
Is this issue really related? It does not mention any state cache "bugs".
Ohh yeah, that was the wrong one :see_no_evil: I updated the issues in the pr description.
Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.
not stale
I also notice that state-machine/src/proving_backend.rs is not linked anymore so file could be removed (maybe some test could be kept or were already copied out of it).
Ahh probably some merge fuck up. I already had removed the file. I also copied the tests.
The CI pipeline was cancelled due to failure one of the required jobs. The job name - cargo-check-benches The job logs - https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1752658
The CI pipeline was cancelled due to failure one of the required jobs. The job name - cargo-check-benches The job logs - https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1752703
The CI pipeline was cancelled due to failure one of the required jobs. The job name - cargo-check-benches The job logs - https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1753071
The CI pipeline was cancelled due to failure one of the required jobs. The job name - cargo-check-subkey The job logs - https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1754036
The CI pipeline was cancelled due to failure one of the required jobs. The job name - cargo-check-benches The job logs - https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1754034
The CI pipeline was cancelled due to failure one of the required jobs. The job name - cargo-check-benches The job logs - https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1754542
bot merge
Error: Statuses failed for bd972a282c76fda2010baa3353c5ea0dac6e796a
bot merge