Database API architecture.
What is wrong
It seems our database wrapping architecture is starting to break down in some ways. Recent work by @carver has uncovered a number of needed units of functionality, and supporting them has caused the complexity of some of the database wrappers to grow beyond what may be reasonable to maintain.
How can it be fixed?
Not clear.
Whatever our solution is, it needs to have the following functionality (please add to this list):
- batching of writes (see the sketch after this list)
- journaling with checkpoints
- trie functionality (state roots)
- committing without applying deletes
  - https://github.com/ethereum/py-evm/blob/0cd3ebac9c7c336b07f2f52c52b069fbe400bcef/eth/db/batch.py#L48
- ability to read through to the underlying database (@carver should clarify this one)
- generation of diffs
  - https://github.com/ethereum/py-evm/blob/0cd3ebac9c7c336b07f2f52c52b069fbe400bcef/eth/db/journal.py#L235
- atomic reads and writes
  - https://github.com/ethereum/py-evm/blob/0cd3ebac9c7c336b07f2f52c52b069fbe400bcef/eth/db/atomic.py#L49
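To make a couple of these concrete, here is a minimal sketch of a batching wrapper that buffers writes and deletes and can commit without applying the deletes. The names (`SketchBatchDB`, `commit`) are hypothetical, not the current py-evm API:

```python
from typing import Dict, Set


class SketchBatchDB:
    """Hypothetical batching wrapper: buffers writes and deletes in
    memory and flushes them to the wrapped store in a single commit."""

    def __init__(self, wrapped: Dict[bytes, bytes]) -> None:
        self._wrapped = wrapped                 # underlying key/value store
        self._pending: Dict[bytes, bytes] = {}  # buffered writes
        self._deleted: Set[bytes] = set()       # buffered deletes

    def __setitem__(self, key: bytes, value: bytes) -> None:
        self._deleted.discard(key)
        self._pending[key] = value

    def __delitem__(self, key: bytes) -> None:
        self._pending.pop(key, None)
        self._deleted.add(key)

    def __getitem__(self, key: bytes) -> bytes:
        if key in self._deleted:
            raise KeyError(key)
        if key in self._pending:
            return self._pending[key]
        return self._wrapped[key]               # read through to the underlying db

    def commit(self, apply_deletes: bool = True) -> None:
        # "committing without applying deletes": skip the delete set on demand
        for key, value in self._pending.items():
            self._wrapped[key] = value
        if apply_deletes:
            for key in self._deleted:
                self._wrapped.pop(key, None)
        self._pending.clear()
        self._deleted.clear()
```

Note that the `apply_deletes` argument here is exactly the kind of behavior flag called out as a smell below.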
Things that I think might be code smells or that in general result in increased complexity.
- Functionality flags like `apply_deletes` or `read_through_deletes`.
  - These require us to test all combinations of the various flags (see the sketch after this list).
- Ability to access the underlying wrapped databases.
  - If we stick with the wrapping API, it should be hard to get at the underlying database.
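To spell out the testing cost of those flags: every independent boolean flag doubles the number of behavior combinations a test suite has to cover.

```python
import itertools

# Each independent boolean flag doubles the combinations a test suite must cover.
flags = ["apply_deletes", "read_through_deletes"]
combos = list(itertools.product([True, False], repeat=len(flags)))
assert len(combos) == 2 ** len(flags)  # 4 today; a third flag makes it 8, and so on
```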
Things that I think might help us test these better.
- Simple DSL for declarative testing.
- State machine based `hypothesis` tests to better define and test the rules for each database (see the sketch after this list).
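For the `hypothesis` idea, here is a rough sketch of a stateful test that drives a wrapper and a plain-dict model with the same random operations. `SketchBatchDB` is the hypothetical wrapper sketched earlier; the real targets would be `BatchDB`, `JournalDB`, and friends:

```python
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, invariant, rule


class BatchDBMachine(RuleBasedStateMachine):
    """Apply the same random operations to the wrapper and to a plain
    dict model, then check that all reads agree."""

    keys = st.binary(min_size=1, max_size=4)
    values = st.binary(max_size=8)

    def __init__(self):
        super().__init__()
        self.db = SketchBatchDB({})  # hypothetical wrapper from the sketch above
        self.model = {}

    @rule(key=keys, value=values)
    def set_item(self, key, value):
        self.db[key] = value
        self.model[key] = value

    @rule(key=keys)
    def delete_item(self, key):
        del self.db[key]  # the sketch buffers deletes rather than raising
        self.model.pop(key, None)

    @invariant()
    def reads_match_model(self):
        for key, expected in self.model.items():
            assert self.db[key] == expected


TestBatchDB = BatchDBMachine.TestCase
```

Hypothesis then searches for an operation sequence where the wrapper and the model disagree, which is a compact way of pinning down the rules each database must obey.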
Ideas:
- Does a middleware pattern work for this? It would keep each database from having direct access to the underlying database, allowing us to expose it only via a well-defined API. (Rough sketch below.)
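A rough sketch of the middleware shape (all names here are hypothetical): each layer is constructed with only the read/write callables of the layer below it, so nothing can reach past the defined API to the raw database.

```python
from typing import Callable, Dict

Reader = Callable[[bytes], bytes]
Writer = Callable[[bytes, bytes], None]


class JournalMiddleware:
    """Hypothetical middleware layer: it never holds the database object,
    only the read/write functions exposed by the layer below it."""

    def __init__(self, read: Reader, write: Writer) -> None:
        self._read = read
        self._write = write
        self._journal: Dict[bytes, bytes] = {}

    def __getitem__(self, key: bytes) -> bytes:
        if key in self._journal:
            return self._journal[key]
        return self._read(key)

    def __setitem__(self, key: bytes, value: bytes) -> None:
        self._journal[key] = value

    def commit(self) -> None:
        for key, value in self._journal.items():
            self._write(key, value)
        self._journal.clear()


# Stacking: the base store is only ever reachable through the two callables.
base: Dict[bytes, bytes] = {}
layer = JournalMiddleware(base.__getitem__, base.__setitem__)
```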
> applying only additive changes (don't apply deletes)

(To be explicit: it also applies updates, although that happens quite rarely since most data is content-addressed.)
> ability to read through to the underlying database (@carver should clarify this one)

The prime example of when this comes up is mutating a trie with pruning on. Pruning deletes all nodes that are no longer referenced once the trie is updated. That is good for the intermediate tries, but bad for the trie at the state root before the updates began: we always want to be able to keep referencing those nodes. So we want some concept of deleting only the intermediate nodes.
- One option is to make sure to `squash_changes()` for all changes to the trie.
- Another is to add a `HexaryTrie.batch_update()` that only starts regenerating hashes after all the new data changes. This would be a potentially large performance boost, and would also give us the concept of dropping all intermediate trie nodes for free. (Both options are sketched below.)
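As a sketch of the two options (assuming I am reading the current py-trie `squash_changes()` API correctly; `batch_update()` is purely hypothetical, since it is only proposed here):

```python
from trie import HexaryTrie

db = {}
trie = HexaryTrie(db, prune=True)

# Option 1 (exists today, as I understand py-trie): squash_changes() batches
# the mutations in a scratch layer, so pruning only deletes nodes created
# inside the batch, never nodes reachable from the starting root.
with trie.squash_changes() as memory_trie:
    memory_trie[b'key-1'] = b'value-1'
    memory_trie[b'key-2'] = b'value-2'

# Option 2 (hypothetical, proposed above): apply all updates first, then
# regenerate hashes once, dropping intermediate trie nodes for free.
# trie.batch_update({b'key-1': b'value-1', b'key-2': b'value-2'})
```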
> - Ability to access the underlying wrapped databases.
>   - If we stick with the wrapping API, it should be hard to get at the underlying database.
Yeah, a few of these APIs have evolved to the point where they aren't really wrappers anymore; each is just a database with a spigot to dump into another database on demand (e.g. `BatchDB.commit_to()`).
> Does a middleware pattern work for this? It would keep each database from having direct access to the underlying database, allowing us to expose it only via a well-defined API.

I think the only way it helps is by showing that we can't (easily) get it done without direct access to the underlying database, so maybe we should drop the wrapping concept altogether.