EPIC: Storage
Summary
Storage is where the cosmos-sdk spends the most time in execution. The original store package was written many years ago and has been left to ferment. As the needs of blockchains expanded and we gained better insight into how applications built on the cosmos-sdk read and write, we have learned how to better design the storage layer.
Problem Definition
The cosmos-sdk spends an immense amount of time at the storage layer, and this causes many bottlenecks.
Proposal
We need to refactor many parts of the storage layer so that users of the cosmos-sdk can get the highest performance for their applications.
Things we can do today:
- [ ] Refactor Cache kv store
- [ ] Refactor multistore
- [ ] Write benchmarks for many things in the store package
- [ ] Profile current store and bring performance improvements to users.
- [ ] Refactor key structure in IAVL to take advantage of data locality.
Things to do in the future:
- [ ] Refactor the design of storage to do something along the lines of ADR-40
- [ ] Use a different tree structure
- [ ] Use a different database
Previously there was work on a new storage layer (store/v2 and ADR-040); this work and design are currently under review by the core sdk team.
Work Breakdown
Phase 1
- [ ] #13223
- [ ] cachekv
- [ ] multistore
- [ ] Refactor store package (API cleaning and improvement)
- [ ] cachekv store docs and refactor to provide strong guarantees and more performance
- It mixes many jobs and does a poor job at them, especially around iterators (a rough sketch of the current behavior follows below)
- A ground-up rewrite, together with writing out the guarantees we want, will save a lot of headache
- (Make the expectations of parallel iterators explicitly clear)
- [ ] #13224
- [ ] cachekv store docs and refactor to provide strong guarantees and more performance
ref: https://github.com/cosmos/cosmos-sdk/issues/7101, https://github.com/cosmos/cosmos-sdk/issues/6687
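For context, here is a rough sketch of what the cachekv store does today: it buffers writes over a parent store and overlays them on the parent during iteration. The types and method names below are illustrative stand-ins, not the real store/cachekv API; the merge during iteration is where most of the complexity called out above lives.

```go
// Rough sketch (not the real store/cachekv API) of the cachekv idea.
package cachekvsketch

import "sort"

type cacheKV struct {
	parent map[string]string // stands in for the parent KVStore
	dirty  map[string]string // uncommitted writes buffered by this layer
}

func newCacheKV(parent map[string]string) *cacheKV {
	return &cacheKV{parent: parent, dirty: map[string]string{}}
}

func (c *cacheKV) Get(key string) (string, bool) {
	if v, ok := c.dirty[key]; ok {
		return v, true
	}
	v, ok := c.parent[key]
	return v, ok
}

func (c *cacheKV) Set(key, value string) { c.dirty[key] = value }

// Write flushes the buffered writes into the parent and resets the cache.
func (c *cacheKV) Write() {
	for k, v := range c.dirty {
		c.parent[k] = v
	}
	c.dirty = map[string]string{}
}

// Iterate must present dirty entries overlaid on the parent, in key order;
// getting this merge (and its interaction with concurrent use) right is the
// hard part the refactor targets.
func (c *cacheKV) Iterate(fn func(key, value string)) {
	merged := make(map[string]string, len(c.parent)+len(c.dirty))
	for k, v := range c.parent {
		merged[k] = v
	}
	for k, v := range c.dirty {
		merged[k] = v
	}
	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fn(k, merged[k])
	}
}
```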
Phase 2
- [ ] POC to prove the new iavl key format leads to higher performance
- [ ] Refactor the IAVL key layout to take advantage of data locality
- [ ] Tests and benchmarks
- [ ] Migration for current state
Phase 3
- [ ] evaluate different data structures
- [ ] evaluate the ADR-040 design and implementation for how to progress toward separating state storage (SS) from state commitment (SC); a rough sketch of the split follows this list
- [ ] .... TBD
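For reference, a minimal sketch of the SS/SC separation idea from ADR-040, under the usual framing: raw key/value pairs live in a plain state-storage bucket for fast reads, while the commitment tree only tracks hashes for proofs and the app hash. The `sparseMerkleTree` interface and `Store` type here are placeholders, not the actual ADR-040 implementation.

```go
// Hedged sketch of separating state storage (SS) from state commitment (SC).
package ssscsketch

import "crypto/sha256"

type sparseMerkleTree interface {
	Update(keyHash, valueHash []byte)
	Root() []byte
}

type Store struct {
	ss map[string][]byte // state storage: raw key -> raw value
	sc sparseMerkleTree  // state commitment: hash(key) -> hash(value)
}

func NewStore(sc sparseMerkleTree) *Store {
	return &Store{ss: map[string][]byte{}, sc: sc}
}

func (s *Store) Set(key, value []byte) {
	s.ss[string(key)] = value
	kh := sha256.Sum256(key)
	vh := sha256.Sum256(value)
	s.sc.Update(kh[:], vh[:])
}

// Reads never touch the commitment tree.
func (s *Store) Get(key []byte) ([]byte, bool) {
	v, ok := s.ss[string(key)]
	return v, ok
}

// Commit returns the app hash derived only from the commitment tree.
func (s *Store) Commit() []byte { return s.sc.Root() }
```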
Thanks for writing this up @marbar3778! Seems like a pretty good summary to me. Just to highlight and reiterate, things that I think are paramount:
- Single logical DB utilization for atomic commits (giving optionality for separate logical DBs when desired by a module); a minimal sketch of what this buys us follows this list
- Refactor store APIs and how stores are registered
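As an illustration of the first point, the sketch below shows what a single logical DB buys: every module store's writes go into one batch, so the multistore commit is atomic across stores instead of each store committing its own DB independently. The `Batch` interface, the `commitAll` helper, and the key-prefix scheme are hypothetical, not the SDK API.

```go
// Illustrative sketch of committing all module stores through one logical DB.
package atomiccommitsketch

// Batch stands in for a backing-DB write batch with an atomic flush.
type Batch interface {
	Set(key, value []byte)
	WriteSync() error
}

type storeWrites map[string][]byte

// commitAll stages every store's dirty writes under a per-store prefix and
// persists them in a single atomic write.
func commitAll(newBatch func() Batch, stores map[string]storeWrites) error {
	b := newBatch()
	for name, writes := range stores {
		prefix := []byte("s/" + name + "/") // hypothetical key-prefix scheme
		for k, v := range writes {
			key := append(append([]byte{}, prefix...), k...)
			b.Set(key, v)
		}
	}
	return b.WriteSync()
}
```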
One thing I'm still hazy on is IAVL vs some future better accumulator tree structure (e.g. JMT). How much time do we want to spend on IAVL and say refactoring the key layout vs just switching to something better like the JMT?
> One thing I'm still hazy on is IAVL vs some future better accumulator tree structure (e.g. JMT). How much time do we want to spend on IAVL and say refactoring the key layout vs just switching to something better like the JMT?
Talking with @ValarDragon he believes the change would take 1 week or less of work after a repo cleanup is upstreamed to our version. The JMT work would take much longer. This could be an easy win until we land JMT or something similar.
Would you want to dive into the key format change?
Awesome!
> Would you want to dive into the key format change?
Sure! From what I've read in the JMT paper, we can adopt a similar tactic, i.e. keys take the form of `version || nibble path` (e.g. `45 || 0010101`), where `||` denotes concatenation. I'm not sure IAVL uses bit nibble paths for keys, but it is binary and thus we can use a similar/identical format. Correct me if I'm wrong @ValarDragon.
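A minimal sketch of the `version || path` layout described above. The encoding choices (big-endian version, raw path bytes) and the `nodeKey` helper are illustrative assumptions, not the final IAVL/JMT key format.

```go
// Sketch of a version || path node key for data locality.
package keyformatsketch

import "encoding/binary"

// nodeKey lays out keys as version || path so that all nodes written at the
// same version sort together in the backing DB, and nodes within a version
// stay ordered by their tree path.
func nodeKey(version uint64, path []byte) []byte {
	key := make([]byte, 8+len(path))
	binary.BigEndian.PutUint64(key[:8], version) // big-endian keeps versions ordered
	copy(key[8:], path)
	return key
}
```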
are we planning on switching to the JMT and refactoring the whole storage?
For the current store v2 work, besides the SMT itself compared with the IAVL tree, I personally think the new store v2 comes with two major differences from store v1 (roughly sketched below):
- store v2 uses the backing db transaction to commit and control versions, which means every time the store is committed, the db tx is discarded and the root store needs to be reloaded before the next read/write, while store v1 manages versions in the tree itself and doesn't need to discard the tx and reload the store after commitment.
- store v1's iavlStore caches store nodes in memory; I don't think store v2 caches the smt/databucket/indexbucket in memory.
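A rough sketch of the two commit models as described in the comment above; both types are stand-ins for illustration, not the real store v1/v2 code.

```go
// Sketch contrasting tree-managed versions (v1 style) with tx-backed commits
// (v2 style, as described above).
package commitmodelsketch

// v1 style: the tree tracks versions itself, so the store stays loaded across
// commits and no transaction needs to be discarded.
type treeVersionedStore struct {
	version int64
}

func (s *treeVersionedStore) Commit() int64 {
	s.version++ // the tree persists the new version internally
	return s.version
}

// v2 style: each commit goes through a backing-DB transaction that is
// discarded afterwards, so a fresh tx (and a reloaded root store) is needed
// before the next read/write.
type dbTx interface {
	Commit() error
	Discard()
}

type txBackedStore struct {
	openTx func() dbTx
	tx     dbTx
}

func newTxBackedStore(openTx func() dbTx) *txBackedStore {
	return &txBackedStore{openTx: openTx, tx: openTx()}
}

func (s *txBackedStore) Commit() error {
	err := s.tx.Commit()
	s.tx.Discard()
	s.tx = s.openTx() // reload for the next version
	return err
}
```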
> are we planning on switching to the JMT and refactoring the whole storage?
The strategy will be as follows (in order):
- Refactor store package (API cleaning and improvement -- I'd like to use a single logical DB)
- Relatively low-lift/small refactor & improvement of IAVL
- Eventual implementation of JMT (~q1 2023)
As for the current v2, I think there is a chance it actually gets discarded -- I'll let @marbar3778 chime in on that.
V2 is still being evaluated. Much of the team has been in and out the past couple of weeks, so we have been unable to review in a timely manner. There are some concerns with the current design, and the performance of v2 as it currently stands is not drastically better than v1 with iavl.
With a week or two of work on iavl and another week or two in the store package, there is a high probability that v1+iavl will become more performant than v2+smt. For now we are opting to move forward with v1+iavl until we fully evaluate the design decisions of v2+smt.
> The strategy will be as follows (in order):
> - Refactor store package (API cleaning and improvement -- I'd like to use a single logical DB)
> - Relatively low-lift/small refactor & improvement of IAVL
> - Eventual implementation of JMT (~q1 2023)
Let's start with the first two and add in specs/docs for the store package. Do you want to lead it @alexanderbez?
> The strategy will be as follows (in order):
> - Refactor store package (API cleaning and improvement -- I'd like to use a single logical DB)
> - Relatively low-lift/small refactor & improvement of IAVL
> - Eventual implementation of JMT (~q1 2023)
>
> Let's start with the first two and add in specs/docs for the store package. Do you want to lead it @alexanderbez?
yes ser
FYI, the v2 store doesn't look bad in terms of db size reduction: https://github.com/evmos/ethermint/pull/1304#issuecomment-1246128879. But this benchmark didn't account for the cost of snapshot versioning.
The low-level db snapshot/checkpoints hard-link the db files; intuitively that seems pretty costly in terms of db size (need some tests to confirm).
I wrote a proposal to store historical versions explicitly, in a similar way to erigon, so we don't have to rely on low-level db checkpoints: https://hackmd.io/@2O2cXDfdQpijemGc_vlHCA/SkXxuTAli
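For readers who don't follow the link, here is a hedged sketch of the general idea of storing history explicitly (erigon-style) instead of relying on low-level DB checkpoints: keep the latest values flat for normal reads, and record every write under `key || version` so any past version can be looked up without snapshotting the whole DB. The layout and all names below are assumptions for illustration only, not the proposal's actual schema.

```go
// Hedged sketch of explicit historical versioning without DB checkpoints.
package historysketch

import "encoding/binary"

type archiveStore struct {
	latest  map[string][]byte // flat current state, used for normal reads
	history map[string][]byte // key||version -> value written at that version
}

func newArchiveStore() *archiveStore {
	return &archiveStore{latest: map[string][]byte{}, history: map[string][]byte{}}
}

func histKey(key []byte, version uint64) string {
	buf := make([]byte, len(key)+8)
	copy(buf, key)
	binary.BigEndian.PutUint64(buf[len(key):], version)
	return string(buf)
}

// Set records the value written at `version` and keeps it as the latest.
func (s *archiveStore) Set(key, value []byte, version uint64) {
	s.history[histKey(key, version)] = value
	s.latest[string(key)] = value
}

// GetAt returns the value the key had at `version`: the newest recorded write
// with version <= the requested one. A linear scan is used for clarity; a
// real layout would rely on an ordered DB iterator instead.
func (s *archiveStore) GetAt(key []byte, version uint64) ([]byte, bool) {
	var bestVer uint64
	var bestVal []byte
	found := false
	for k, v := range s.history {
		if len(k) != len(key)+8 || k[:len(key)] != string(key) {
			continue
		}
		ver := binary.BigEndian.Uint64([]byte(k[len(key):]))
		if ver <= version && (!found || ver > bestVer) {
			bestVer, bestVal, found = ver, v, true
		}
	}
	return bestVal, found
}
```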
> FYI, the v2 store doesn't look bad in terms of db size reduction: evmos/ethermint#1304 (comment). But this benchmark didn't account for the cost of snapshot versioning.
I agree that if we really want to make blockchain networks more usable, we have to take the db size of archive/full nodes into consideration.
Seems like the size comparison is of smt vs iavl. The smt having a smaller footprint is known; that is not the worry. The worry is that the semantics of iavl and store v1 were not well understood, and a new implementation was done without that understanding.
Did the benchmark account for the historical versions on disk, or only what is in the tree? The data in the smt is not historical but only current, so it makes sense that it's smaller.
> The low-level db snapshot/checkpoints hard-link the db files; intuitively that seems pretty costly in terms of db size (need some tests to confirm).
Not sure why the checkpoints consume a lot of disk space, but it definitely needs to be checked out: https://github.com/cosmos/cosmos-sdk/issues/12251#issue-1271071521
An alternative db design for historical states (moving the discussion here):
- reduce archive node db size dramatically (10x)
- not consensus breaking
Closing this epic for now. @alexanderbez, when you have a chance could you create a new one for the upcoming storage work?