
Add support for branchable collections

Open AndrewSisley opened this issue 5 months ago • 0 comments

There is some high level info in an almanac doc here: https://source.almanac.io/folders/database-team-v1-q4qgzK/branchable-collections-Z0SmQmyPcgaYimjMlHuLZPEGM1s38Ewg?mention_id=4192&docView=Commenting

WIP thoughts:

  • We do not need a composite for each document change; we only need one per transaction commit. This would considerably reduce the cost (performance, size on disk) of a branchable collection in many use cases. Some event-based code may need to change for this. The composite commit would contain the cids of all changes to the collection within the transaction, instead of essentially one cid per composite.
    • It might be best to publish a txn-committed event and include txn.id on update events, rather than aggregating update events into a set before publishing them (see the event sketch after this list).
    • Make sure the branchable collection is modified within the commit, not after it is committed - this may need before/after on-success operations?
    • The collection-commit could be queried mid-transaction, so if that is possible now, handle it (the collection-commit needs to mutate with every doc-update).
    • ~~Remember that the order of operations within a transaction is important, so make sure that the contents of the composite commit handles that. And remember that a document can be mutated multiple times within a single transaction (although the head cids will always differ).~~ doc-commit height handles this.
  • We can allow users to query branchable collections at a specific version, like we do with documents (the cid query param). This does not, however, need to be implemented in the initial PR; it is just something to bear in mind for now.
  • The existing CRDT-related code is still overly complex and abstracted. The important file is internal/block/block.go (the Block struct is also used in P2P); strongly consider bypassing all the other noise, though you may get complaints at review time. ~~Do not use DAGLink, use the ipld Link directly in the payload.~~
    • You can store the head cid in the headstore, but use a sensible Key; do not use the HeadStoreKey (see the key sketch after this list).
    • ~~Decoupling Block from DAGLink might be a pain due to the existing code structure.~~
    • ~~DAGLink.Name is only ever used for the "_head" magic value, and commit queries. "_head" links are the commit cids that this commit is based off of, there can be multiple if multiple blocks share the same height. Non-head values are duplicated and waste space as they are already stored within the other commit.~~
    • DAGLink removal was rejected by team.
  • Commit queries:
    • There is already a collectionID prop; use it. This prop can be changed to use the collection cid.
    • Do not store the collection cid in the block for field/doc-level commits, as it is local-only and will take up space; you can rely on the one-to-one schema-collection relationship internally whilst presenting the cid at the field/doc commit level.
    • We can add a collection name prop later (#850)
    • Check the existing gql documentation for props like fieldName and adjust it if needed.
  • Testing this will increase the number of test cids; strongly consider removing the need to hardcode them first (CidIndex?).
  • ACP question - long term these need to be permissioned; hopefully we can skip this for now (a document create is an update to the collection, so this could probably be handled by collection-level write permissions?).
    • Make sure that adding a policy to a branchable collection errors, and that making a collection that already has a policy branchable errors too.
  • The cid param in normal queries can return an error for now if given a collectionCID (document this in the gql description); long term this needs to work.
  • I can't think of any schema-migration problems that could come up with the current implementation; however, there has been talk of making Lens migrations DAG-aware, which could introduce new problems. It might be worth adding a basic test or two in case they manage to catch anything silly in the future.
  • RE block design (size is important here):
    • optional removes block-storage-size overhead for empty fields, but is limited to certain types (not string)
    • Property names affect overhead: longer names mean bigger storage costs (see the encoding-size sketch after this list)
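
A minimal Go sketch of the one-composite-per-transaction idea above. All names here (`Update`, `TxnCommitted`, `compositeBuilder`) are hypothetical and are not DefraDB's actual event-bus or CRDT types; the sketch only shows the shape: per-document update events carry the txn id, and a single set of links for the collection composite is produced when the transaction commits. As noted above, in practice the composite would have to be written as part of the commit itself (before/after on-success hooks), not by a subscriber reacting after the fact.

```go
package branchable

import (
	"github.com/ipfs/go-cid"
)

// Hypothetical event shapes - not DefraDB's actual event types.

// Update is published for each document mutation. TxnID lets a
// subscriber group updates belonging to the same transaction.
type Update struct {
	DocID string
	Cid   cid.Cid
	TxnID uint64
}

// TxnCommitted is published once, when the transaction commits.
type TxnCommitted struct {
	TxnID uint64
}

// compositeBuilder collects document-update cids per transaction and
// produces the links for a single collection-level composite commit.
type compositeBuilder struct {
	pending map[uint64][]cid.Cid
}

func newCompositeBuilder() *compositeBuilder {
	return &compositeBuilder{pending: map[uint64][]cid.Cid{}}
}

func (b *compositeBuilder) onUpdate(e Update) {
	b.pending[e.TxnID] = append(b.pending[e.TxnID], e.Cid)
}

// onTxnCommitted returns the cids that the collection composite for this
// transaction should link to - one composite per commit, not per document.
func (b *compositeBuilder) onTxnCommitted(e TxnCommitted) []cid.Cid {
	links := b.pending[e.TxnID]
	delete(b.pending, e.TxnID)
	return links
}
```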
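A minimal sketch of a dedicated headstore key for the collection head, assuming a hypothetical "/collection/head/<short id>/<cid>" layout; the type name and key format are made up for illustration and are not an existing DefraDB key type.

```go
package keys

import (
	"fmt"

	"github.com/ipfs/go-cid"
)

// CollectionHeadStoreKey is a hypothetical key for the head of a
// branchable collection, kept separate from the document HeadStoreKey.
type CollectionHeadStoreKey struct {
	CollectionShortID uint32
	Cid               cid.Cid
}

func (k CollectionHeadStoreKey) ToString() string {
	// e.g. "/collection/head/1/bafy..."
	return fmt.Sprintf("/collection/head/%d/%s", k.CollectionShortID, k.Cid.String())
}
```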
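To make the block-size points concrete, the snippet below encodes the same single-field map under a short and a long field name with go-ipld-prime's dag-cbor codec and prints the encoded lengths; map keys are stored as strings in every block, so longer property names cost extra bytes per block, while an absent optional field is simply not encoded at all. This is an illustration, not DefraDB code.

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/ipld/go-ipld-prime/codec/dagcbor"
	"github.com/ipld/go-ipld-prime/datamodel"
	"github.com/ipld/go-ipld-prime/fluent/qp"
	"github.com/ipld/go-ipld-prime/node/basicnode"
)

// encodedSize builds a single-entry map {key: "value"} and returns
// its dag-cbor encoded length in bytes.
func encodedSize(key string) int {
	n, err := qp.BuildMap(basicnode.Prototype.Map, 1, func(ma datamodel.MapAssembler) {
		qp.MapEntry(ma, key, qp.String("value"))
	})
	if err != nil {
		panic(err)
	}
	var buf bytes.Buffer
	if err := dagcbor.Encode(n, &buf); err != nil {
		panic(err)
	}
	return buf.Len()
}

func main() {
	// Every extra character in a property name is an extra byte in
	// every block that carries the field.
	fmt.Println("short name :", encodedSize("c"))
	fmt.Println("long name  :", encodedSize("collectionCid"))
}
```
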
### Tasks
- [ ] https://github.com/sourcenetwork/defradb/issues/3056
- [ ] https://github.com/sourcenetwork/defradb/issues/3085

AndrewSisley, Sep 19 '24 16:09