
Introduce trie level cache and remove state cache

Open bkchr opened this issue 3 years ago • 7 comments

This PR introduces a trie level cache and removes the old state cache. As seen on a lot of parachains, the state cache was/is buggy: https://github.com/paritytech/cumulus/issues/474 We looked into it, but couldn't really identify the bug(s). Besides these bugs, collators were also building blocks without using any cache: https://github.com/paritytech/substrate/issues/9770 This means that every state access went to the database, which is costly. As the benchmarks below show, reading a single key without a cache is roughly 8-10x slower than reading it from a cache.

So, we started to write a new cache at the trie level. This cache "automatically" solves the invalidation of old nodes, because each node is addressed by its hash, which changes when the content changes. This was the first version of the pull request: a trie node cache. But benchmarking showed that the trie node cache alone wasn't enough; the state cache was still faster. The reason is that the state cache has, in the optimal case, a complexity of O(1) for accessing cached data, while with the trie node cache we still need to walk down the trie, which is O(log N). To bring both caches closer together, a value cache was introduced. This value cache also provides O(1) access in the optimal case. To not run into the same problems as the state cache, the key of this cache is (storage_root, key). Using this key ensures that we don't accidentally access outdated data, which would result in the problems seen on the parachains.

While introducing the trie cache, we also needed to change the way we record storage proofs. Before this PR there wasn't any cache when recording, as already explained above. The recorder sat between the trie and the database, recording everything that was read from the database. However, that doesn't work with the cache, as it sits "inside the trie" and intercepts calls to the database as well. So, the recorder also needed to move there. The recorder also needed to change how the final storage proof is generated. Before, we just collected all accessed nodes and put them together into the storage proof. However, with the value cache we may not even touch a trie node when accessing data in the trie, and thus there is no trie node to record. To solve this, we keep track of all keys accessed through the value cache. When the storage proof is generated, we walk down the trie to record all nodes required to access these values. As we only do this once, and the trie nodes are already cached (otherwise the value cache wouldn't have had the data), this takes much less time than always going to the database as before.

To compare the different caches, a state-access benchmark was added, with the following results:

| Benchmark | State cache | Trie cache | No cache |
| --- | --- | --- | --- |
| Reading the entire state (1 Mio keys) | 372ms | 524ms | 6.2s |
| Reading a single key | 386ns | 561ns | 4.4us |
| Hashing `:code` | 327ns | 521ns | 4.9us |

As we can see, the trie cache is around 1.4x slower than the state cache. This is good enough for the first implementation ;) Especially as, when testing syncing speed, we haven't seen any performance impact, which indicates that syncing is not IO bound at the moment.

This pr requires the following pr to be merged before and released: https://github.com/paritytech/trie/pull/157

Closes: https://github.com/paritytech/substrate/issues/9770 Closes: https://github.com/paritytech/substrate/issues/9769 Closes: https://github.com/paritytech/cumulus/issues/607 Closes: https://github.com/paritytech/substrate/issues/9320 Closes: https://github.com/paritytech/substrate/issues/9697

CLI changes

This removes the --state-cache parameter and adds the new --trie-cache-size CLI parameter. Parachain operators should now be able to just drop --state-cache 0; they are not required to add --trie-cache-size, as the cache is enabled by default.

bkchr avatar May 12 '22 18:05 bkchr

  1. It'd be nice to also have a multithreaded benchmark which would access the cache in parallel and see how this new implementation compares to the old one.
  2. Have you tried fuzzing this code? The code's a lot nicer than the old state cache so it might not find anything, but still, there's a lot of code and it is critical that it is correct. If not I could probably throw a few testcases at a fuzzer for you and see if it finds anything.

koute avatar May 13 '22 11:05 koute

2. Have you tried fuzzing this code? The code's a lot nicer than the old state cache so it might not find anything, but still, there's a lot of code and it is critical that it is correct. If not I could probably throw a few testcases at a fuzzer for you and see if it finds anything.

I didn't fuzz it, but I'm also no "fuzzing expert". So, if you want to push some fuzzing tests, I would be thankful.

bkchr avatar May 13 '22 13:05 bkchr

I didn't fuzz it, but I'm also no "fuzzing expert". So, if you want to push some fuzzing tests, I would be thankful.

Okay; I'll see what I can do!

koute avatar May 13 '22 13:05 koute

This pr introduces a trie level cache and removes the old state cache. As seen on a lot of parachains the state cache was/is buggy: https://github.com/paritytech/cumulus/issues/474

Is this issue really related? It does not mention any state cache "bugs".

arkpar avatar May 13 '22 18:05 arkpar

This pr introduces a trie level cache and removes the old state cache. As seen on a lot of parachains the state cache was/is buggy: paritytech/cumulus#474

Is this issue really related? It does not mention any state cache "bugs".

Ohh yeah, that was the wrong one :see_no_evil: I updated the issues in the pr description.

bkchr avatar May 13 '22 18:05 bkchr

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 22 '22 02:07 stale[bot]

not stale

koute avatar Jul 22 '22 05:07 koute

I also notice that state-machine/src/proving_backend.rs is not linked anymore, so the file could be removed (maybe some tests could be kept, or they were already copied out of it).

Ahh, probably some merge fuck up. I had already removed the file. I also copied the tests.

bkchr avatar Aug 12 '22 08:08 bkchr

The CI pipeline was cancelled due to the failure of one of the required jobs. The job name - cargo-check-benches The job logs - https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1752658

paritytech-cicd-pr avatar Aug 17 '22 12:08 paritytech-cicd-pr

The CI pipeline was cancelled due to the failure of one of the required jobs. The job name - cargo-check-benches The job logs - https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1752703

paritytech-cicd-pr avatar Aug 17 '22 12:08 paritytech-cicd-pr

The CI pipeline was cancelled due to the failure of one of the required jobs. The job name - cargo-check-benches The job logs - https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1753071

paritytech-cicd-pr avatar Aug 17 '22 14:08 paritytech-cicd-pr

The CI pipeline was cancelled due to the failure of one of the required jobs. The job name - cargo-check-subkey The job logs - https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1754036

paritytech-cicd-pr avatar Aug 17 '22 21:08 paritytech-cicd-pr

The CI pipeline was cancelled due to the failure of one of the required jobs. The job name - cargo-check-benches The job logs - https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1754034

paritytech-cicd-pr avatar Aug 17 '22 21:08 paritytech-cicd-pr

The CI pipeline was cancelled due to the failure of one of the required jobs. The job name - cargo-check-benches The job logs - https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1754542

paritytech-cicd-pr avatar Aug 18 '22 08:08 paritytech-cicd-pr

bot merge

bkchr avatar Aug 18 '22 18:08 bkchr

Error: Statuses failed for bd972a282c76fda2010baa3353c5ea0dac6e796a

bot merge

bkchr avatar Aug 18 '22 18:08 bkchr