Feature: Secondary indexes only at HEAD
Customer Problem: Secondary indexes on historical data versions may not be needed by customers, but they still consume space in the database. Customers may want secondary indexes to only exist at the tip of each branch so that the index data can be used for the current data, but doesn't result in storage use for any older versions.
Today, customers can reduce secondary index storage space by using dolt filter-branch to process commits and remove a secondary index from all historical commits, which generates new commits (i.e. new commit IDs). This is a manual process and generating new commits may be disruptive for some use cases, such as when a history is shared between multiple people or multiple sql-servers.
A more automatic way for customers to declare they want this behavior would be convenient, but has several large challenges:
- How to remove secondary index data from an existing commit? Commit IDs are computed based on the stored data (including the secondary index data) so changing an existing commit to remove secondary index data would invalidate its commit ID.
- A branch's HEAD can be changed in many ways. For example, a customer could dolt_reset to an older commit, and we would need to generate the secondary index data from scratch, which could be expensive in terms of IO to reread the full table and rewrite a new index, and in terms of latency to build the index.
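To make the first challenge concrete, here is a minimal Python sketch of content addressing. It is not Dolt's actual storage format or hash function: `chunk_hash` and `commit_id` are hypothetical helpers, and SHA-256 over JSON stands in for the real serialization. It only illustrates why a commit ID derived from both table data and index data changes when the index is dropped.

```python
import hashlib
import json


def chunk_hash(data) -> str:
    """Content address for a value (illustrative: SHA-256 of canonical JSON)."""
    encoded = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def commit_id(parent, table_rows, index_entries) -> str:
    """Hypothetical commit ID covering the parent, table data, and index data."""
    root = {
        "parent": parent,
        "table": chunk_hash(table_rows),
        "index": chunk_hash(index_entries) if index_entries is not None else None,
    }
    return chunk_hash(root)


rows = [[1, "alice"], [2, "bob"]]
# Secondary index on the name column: (name, primary key) pairs in sorted order.
name_index = sorted([name, pk] for pk, name in rows)

with_index = commit_id(None, rows, name_index)
without_index = commit_id(None, rows, None)

# Same table data, but removing the index yields a different commit ID.
print(with_index != without_index)
```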
To get around some of those problems, it seems like we would need to store the secondary index data for these "head-only indexes" as data outside of the commit graph, likely local to the sql-server instance and regenerated from scratch on clones/forks.
Another possible alternative for this scenario is to ask forks of the main database to add the secondary index, instead of adding the index on the main instance itself. This means the secondary index will not exist in the commits from the main instance that get pulled into the fork, but it will exist in the unique merge commits created on the fork when pulling those changes from the main instance. Historical merge commits will still contain the secondary index data, but depending on the usage patterns, this approach may result in fewer commits storing the secondary index data. Additionally, if the index is only needed by a fork, this has the advantage of letting the consumer of the index data cover its storage cost. This approach won't work for all cases, but is a useful technique to know about, and is described in more detail on the DoltHub blog.
I would also add one more related use case:
- branch-only indexes, e.g., having indexes only on the main and develop branches
> How to remove secondary index data from an existing commit? Commit IDs are computed based on the stored data (including the secondary index data) so changing an existing commit to remove secondary index data would invalidate its commit ID.
One option is to keep the commit ID the same and just garbage collect the index chunks. That is, change how garbage collection and validation work so that it's okay to garbage collect a chunk if the only reference to it is a non-HEAD secondary index, and so that a chunk store is still considered valid if a non-HEAD secondary index is missing.
There may be other cases where we can apply this too: structures that can be deterministically recreated don't have to actually be stored in the chunk store. Continuing to use the same commit ID and keeping the index hash (and just allowing that chunk to be absent from the chunk store) also means that we can verify that the regenerated index matches the one that was garbage collected.
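The idea can be sketched with a toy in-memory model. This is not Dolt's chunk store or GC algorithm: `ChunkStore`, `build_index`, `gc`, and `load_index` are hypothetical names, and SHA-256 over JSON is an assumed stand-in for the real hashing. Each commit keeps the hash of its index chunk; GC drops index chunks that only non-HEAD commits reference, and a later read rebuilds the index from the table data and checks it against the recorded hash.

```python
import hashlib
import json


def chunk_hash(data) -> str:
    """Content address (illustrative: SHA-256 of canonical JSON)."""
    return hashlib.sha256(json.dumps(data, sort_keys=True).encode("utf-8")).hexdigest()


def build_index(rows):
    """Deterministically derive a secondary index on column 1 from table rows."""
    return sorted([value, pk] for pk, value in rows)


class ChunkStore:
    def __init__(self):
        self.chunks = {}   # hash -> chunk data
        self.commits = []  # {"table": hash, "index": hash}, oldest first

    def put(self, data) -> str:
        h = chunk_hash(data)
        self.chunks[h] = data
        return h

    def commit(self, rows):
        self.commits.append(
            {"table": self.put(rows), "index": self.put(build_index(rows))}
        )

    def gc(self):
        """Keep all table chunks, but only the index chunk referenced by HEAD."""
        head = self.commits[-1]
        keep = {c["table"] for c in self.commits} | {head["index"]}
        self.chunks = {h: d for h, d in self.chunks.items() if h in keep}

    def load_index(self, commit_no):
        """Return a commit's index, regenerating it if it was collected."""
        c = self.commits[commit_no]
        if c["index"] in self.chunks:
            return self.chunks[c["index"]]
        rebuilt = build_index(self.chunks[c["table"]])
        # The commit still records the index hash, so the rebuild can be verified.
        if chunk_hash(rebuilt) != c["index"]:
            raise ValueError("regenerated index does not match recorded hash")
        return rebuilt


store = ChunkStore()
store.commit([[1, "alice"], [2, "bob"]])
store.commit([[1, "alice"], [2, "bob"], [3, "carol"]])
old_index_hash = store.commits[0]["index"]

store.gc()
print(old_index_hash in store.chunks)  # whether the old index chunk survived GC
print(store.load_index(0))             # rebuilt from table data and verified
```

Note that the commit records are never rewritten here, which is the point of the proposal: only the validity rule for the store changes, so existing commit IDs stay stable.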