bonsaidb

Refactor Documents and Views to better utilize Nebari

Open ecton opened this issue 2 years ago • 1 comments

Closes #76. Closes #225.

The primary goal of this PR is to improve the speed of view indexing (see #251 for more info) by tackling #76 in such a way that indexing can be executed safely without fsync.

Now that work has been done, the goals are slightly different:

  • Reduce Views from 3 Trees to 1 by making the view indexing system sequence-based rather than invalidated-key-based.
  • Allow lazy views to execute without fsync, and eager views to execute fully synchronized in their transaction (it might still be safe to skip fsync in the transaction context, but more thought is needed in that direction).
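To illustrate what "sequence-based" means in the first goal, here is a minimal sketch. It does not use Nebari or BonsaiDb APIs: `SequenceEntry`, `ViewState`, and `update_view` are hypothetical names, and the "map" step is reduced to storing the content length. The assumption is that the view only needs to remember the highest sequence id it has mapped, instead of keeping a separate set of invalidated keys.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one entry in a versioned tree's sequence log.
#[derive(Debug, Clone, PartialEq)]
struct SequenceEntry {
    sequence_id: u64,
    document_id: u64,
    /// `None` models a deleted document.
    contents: Option<String>,
}

/// Hypothetical view state: mapped entries plus a cursor into the log.
#[derive(Debug, Default)]
struct ViewState {
    /// Highest sequence id that has already been mapped.
    last_indexed_sequence: u64,
    /// document_id -> mapped value (here, simply the content length).
    mapped: BTreeMap<u64, usize>,
}

/// Maps every change newer than the view's cursor and advances the cursor.
/// Returns how many log entries were processed.
fn update_view(view: &mut ViewState, log: &[SequenceEntry]) -> usize {
    let start = view.last_indexed_sequence;
    let mut processed = 0;
    for entry in log.iter().filter(|e| e.sequence_id > start) {
        match &entry.contents {
            Some(contents) => {
                view.mapped.insert(entry.document_id, contents.len());
            }
            None => {
                view.mapped.remove(&entry.document_id);
            }
        }
        view.last_indexed_sequence = view.last_indexed_sequence.max(entry.sequence_id);
        processed += 1;
    }
    processed
}

fn main() {
    let mut view = ViewState::default();
    let mut log = vec![SequenceEntry {
        sequence_id: 1,
        document_id: 10,
        contents: Some("hello".to_string()),
    }];
    println!("processed {}", update_view(&mut view, &log));

    // A later delete is picked up purely from its sequence id.
    log.push(SequenceEntry { sequence_id: 2, document_id: 10, contents: None });
    println!("processed {}", update_view(&mut view, &log));
    println!("{:?}", view);
}
```

In this model, a view that is interrupted simply resumes from its cursor, which is one reason a single tree could be enough.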

Document Storage:

Documents are no longer serialized in a wrapper document type. Instead, the documents tree is now a versioned tree with an embedded index that stores the document's hash. The Revision's id is now the versioned tree's sequence_id.

This means that instead of simply pulling a document out of the database and deserializing it, we must pull the value and index out for a key and combine them with the key to create our document.
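A rough sketch of that reassembly step follows. The types are hypothetical and are not BonsaiDb's or Nebari's actual definitions: `StoredEntry` models whatever a versioned-tree lookup returns (value, sequence id, and the embedded index holding the hash), and `assemble_document` combines it with the key.

```rust
/// Hypothetical shape of what a versioned tree lookup returns for one key.
#[derive(Debug, Clone, PartialEq)]
struct StoredEntry {
    /// Raw document bytes, stored without a wrapper type.
    value: Vec<u8>,
    /// Sequence id the versioned tree assigned to this write.
    sequence_id: u64,
    /// Embedded index: the hash of the document contents.
    hash: [u8; 32],
}

/// Hypothetical revision: its id is taken from the tree's sequence id.
#[derive(Debug, Clone, PartialEq)]
struct Revision {
    id: u64,
    hash: [u8; 32],
}

/// Hypothetical in-memory document.
#[derive(Debug, Clone, PartialEq)]
struct Document {
    id: Vec<u8>,
    revision: Revision,
    contents: Vec<u8>,
}

/// Combines the key, the stored value, and the embedded index into a document.
fn assemble_document(key: &[u8], entry: &StoredEntry) -> Document {
    Document {
        id: key.to_vec(),
        revision: Revision {
            id: entry.sequence_id,
            hash: entry.hash,
        },
        contents: entry.value.clone(),
    }
}

fn main() {
    let entry = StoredEntry {
        value: b"{\"name\":\"a\"}".to_vec(),
        sequence_id: 7,
        hash: [0xAB; 32],
    };
    let doc = assemble_document(b"doc-1", &entry);
    println!("revision id = {}", doc.revision.id);
}
```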

The other major change is introduced by the constraints of working within Nebari's modification system. Because we don't have access to the index for a key we're about to set, most of the logic for creating the OperationResult has been moved outside of the CompareSwap operation.

View Storage:

Views have been refactored to store the reduced value in Nebari by using an embedded index. Instead of storing the entire ViewEntry structure in the view, we now only store the serialized Vec<EntryMapping>. The major change here is that Nebari will now reduce the stored index via the new ViewIndexer. reduce/reduce_grouped have not yet been changed to use Nebari's native reduce function, but that is the inspiration for these changes.

When retrieving a view entry, we reconstruct the ViewEntry using the stored index to maintain compatibility with the existing code that worked with the ViewEntry structure.
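The following is a simplified sketch of the idea, not the actual ViewIndexer. It assumes a view whose reduce function is a sum over `u64` values; `ReducedIndex`, `index_for`, `rereduce`, and `rebuild_entry` are hypothetical names, and the `EntryMapping` and `ViewEntry` fields shown here are illustrative rather than the real definitions.

```rust
/// Illustrative mapping: one emitted value from one source document.
#[derive(Debug, Clone, PartialEq)]
struct EntryMapping {
    source_document: u64,
    value: u64,
}

/// Hypothetical embedded index stored next to each view key.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ReducedIndex {
    reduced: u64,
}

/// Illustrative ViewEntry, rebuilt on read for existing callers.
#[derive(Debug, Clone, PartialEq)]
struct ViewEntry {
    key: Vec<u8>,
    mappings: Vec<EntryMapping>,
    reduced_value: u64,
}

/// Computes the index for a single key by reducing its mappings.
fn index_for(mappings: &[EntryMapping]) -> ReducedIndex {
    ReducedIndex {
        reduced: mappings.iter().map(|m| m.value).sum(),
    }
}

/// Combines child indexes, as a tree could do for its interior nodes.
fn rereduce(children: &[ReducedIndex]) -> ReducedIndex {
    ReducedIndex {
        reduced: children.iter().map(|c| c.reduced).sum(),
    }
}

/// Rebuilds a ViewEntry from the stored mappings and the stored index.
fn rebuild_entry(key: &[u8], mappings: Vec<EntryMapping>, index: ReducedIndex) -> ViewEntry {
    ViewEntry {
        key: key.to_vec(),
        mappings,
        reduced_value: index.reduced,
    }
}

fn main() {
    let a = vec![
        EntryMapping { source_document: 1, value: 2 },
        EntryMapping { source_document: 2, value: 3 },
    ];
    let b = vec![EntryMapping { source_document: 3, value: 10 }];

    let (index_a, index_b) = (index_for(&a), index_for(&b));
    // The reduction across both keys comes from the indexes alone.
    println!("total = {}", rereduce(&[index_a, index_b]).reduced);
    println!("{:?}", rebuild_entry(b"key-a", a, index_a));
}
```

The point of the sketch is that a reduction over a key range can be answered from the stored indexes without deserializing every mapping.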

There are a lot of remaining tasks:

  • [ ] Update reduce/reduce_grouped() to use Nebari's built-in reduction.
  • [x] Remove the invalidated_entries map and make the view mapper sequence-based.
  • [ ] Embed the DocumentMap tree in the ViewEntries tree by creating a custom Root.
  • [ ] Once all the above are done, when the view indexer is running outside of a transaction (lazy views), the view can be persisted without fsync and be 100% safe to use due to the append-only file format.
  • [ ] Figure out if we want a new PR for the version migration work or to write it here.

ecton, May 09 '22 15:05

I've been starting work on a new file format that is my best theorycraft at something that could sit beneath Nebari -- https://github.com/khonsulabs/sediment. At its core is the basic idea that while an fsync is happening, other transactions can proceed with updating the database and then be confirmed together by a single batched sync. Each thread's fsync would still take about as long as a normal sync on average, but transactions could now be batched.
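As a toy, single-threaded model of that batching idea (not Sediment's actual design or API), the sketch below counts how many simulated syncs are needed to confirm a group of transactions. `BatchedLog`, `write`, and `sync` are hypothetical names, and no real file I/O or fsync is performed.

```rust
/// Toy model of a log where one sync confirms every pending transaction.
#[derive(Debug, Default)]
struct BatchedLog {
    /// Transactions written but not yet confirmed by a sync.
    pending: Vec<u64>,
    /// Transactions confirmed as durable.
    durable: Vec<u64>,
    /// Number of simulated fsync calls issued so far.
    sync_count: usize,
}

impl BatchedLog {
    /// Records a transaction without waiting for a sync.
    fn write(&mut self, transaction_id: u64) {
        self.pending.push(transaction_id);
    }

    /// Simulates one fsync that confirms the whole pending batch.
    /// Returns how many transactions this sync confirmed.
    fn sync(&mut self) -> usize {
        if self.pending.is_empty() {
            return 0;
        }
        self.sync_count += 1;
        let confirmed = self.pending.len();
        self.durable.append(&mut self.pending);
        confirmed
    }
}

fn main() {
    let mut log = BatchedLog::default();

    // Three transactions arrive while an earlier sync would still be running.
    for transaction_id in 1..=3 {
        log.write(transaction_id);
    }

    // A single sync confirms all three instead of one sync per transaction.
    let confirmed = log.sync();
    println!("confirmed {} transactions with {} sync(s)", confirmed, log.sync_count);
}
```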

That core idea is actually somewhat compatible with the append-only format, except that only one writer can be modifying the tree at any given moment. I attempted to bring this idea into Nebari without the new project today, but I ran into another issue that Sediment wouldn't suffer from: multi-file synchronizations.

The reason my work today didn't do much is that each tree file is still being synced for each write. I don't have a good way to batch these operations at the moment, but it's one of the things Sediment aims to solve. I may come up with an idea in the meantime and try again -- but the more I think about Sediment the more I'm hopeful it will be able to be significantly better than an append-only format, so I probably still want to get there anyways.

ecton, May 29 '22 23:05