tantivy
Consider Capability Based API
We should consider moving to a capability-based API. Instead of checking against INDEX_FORMAT_VERSION, the current tantivy version would have a list of capabilities it supports, and each index would be stored with a list of the capabilities it requires.
For example, when we add a new configurable compression mechanism for the posting list, we would associate the capability PostinglistCompressionV1 with any stored index that uses it. Old versions of tantivy would fail to open such an index, but if the feature is not used, they may still be able to open an index written by a newer version. This would maximize compatibility.
On the other hand, we could also remove capabilities from tantivy for features we no longer want to support. They could be listed in an UnsupportedCapabilities list, so that opening an index that requires one of them produces a clear error message.
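A minimal sketch of how such a check could look when opening an index. This is not tantivy's API: the function `check_capabilities`, the `OpenError` type, and the string capability names are all hypothetical, chosen only to illustrate the idea of a supported list, a removed list, and a per-index required list.

```rust
use std::collections::BTreeSet;
use std::fmt;

/// Hypothetical error type for a capability mismatch (not part of tantivy).
#[derive(Debug, PartialEq, Eq)]
pub enum OpenError {
    /// The capability was supported once and has been removed on purpose.
    Removed(String),
    /// The capability is not known at all, e.g. the index was written by a newer version.
    Unknown(String),
}

impl fmt::Display for OpenError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            OpenError::Removed(cap) => {
                write!(f, "index requires capability `{}`, which is no longer supported", cap)
            }
            OpenError::Unknown(cap) => {
                write!(f, "index requires unknown capability `{}`; a newer version may be needed", cap)
            }
        }
    }
}

/// Succeeds only if every capability required by the index is supported.
/// Capabilities are plain strings here, so an old reader can still name
/// a capability it has never heard of in its error message.
pub fn check_capabilities(
    supported: &BTreeSet<&str>,
    removed: &BTreeSet<&str>,
    required: &[&str],
) -> Result<(), OpenError> {
    for cap in required {
        if supported.contains(cap) {
            continue;
        }
        if removed.contains(cap) {
            return Err(OpenError::Removed((*cap).to_string()));
        }
        return Err(OpenError::Unknown((*cap).to_string()));
    }
    Ok(())
}
```

With this shape, an index that requires no capabilities beyond the supported set opens on any version, which is the compatibility gain described above.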
Capability lists can get big, so they should probably not be JSON-serialized, but binary-encoded.
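One possible binary encoding, sketched under the assumption that each capability is assigned a small numeric ID: a bitset, where bit `id % 8` of byte `id / 8` marks capability `id` as required. The ID assignment and the function names are hypothetical, and the proposal does not specify an encoding; a varint list would be an alternative if IDs are sparse.

```rust
/// Encodes capability IDs as a bitset: bit (id % 8) of byte (id / 8) is set.
pub fn encode_capabilities(ids: &[u32]) -> Vec<u8> {
    let mut bytes: Vec<u8> = Vec::new();
    for &id in ids {
        let byte_idx = (id / 8) as usize;
        // Grow the buffer on demand so it is only as long as the highest ID needs.
        if byte_idx >= bytes.len() {
            bytes.resize(byte_idx + 1, 0u8);
        }
        bytes[byte_idx] |= 1u8 << (id % 8);
    }
    bytes
}

/// Decodes a bitset back into capability IDs, in ascending order.
pub fn decode_capabilities(bytes: &[u8]) -> Vec<u32> {
    let mut ids = Vec::new();
    for (byte_idx, byte) in bytes.iter().enumerate() {
        for bit in 0..8u32 {
            if byte & (1u8 << bit) != 0 {
                ids.push(byte_idx as u32 * 8 + bit);
            }
        }
    }
    ids
}
```

In this sketch, checking whether a reader supports an index reduces to a byte-wise comparison of two bitsets, and IDs of removed capabilities must never be reused.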
Contra
There is some overhead to managing and identifying capabilities.
Pro
In a deployment where both indexers and search instances of tantivy are running, an upgrade currently requires updating all search instances before upgrading the indexers. This requirement would be relaxed, as long as the new indexers do not use new capabilities until all search instances are updated.
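The rolling-upgrade rule above can be sketched as a set intersection: an indexer may only use capabilities that every running search instance supports. The function name and the string capability names are hypothetical, and how an indexer would learn what the searchers support is outside the scope of this sketch.

```rust
use std::collections::BTreeSet;

/// Returns the capabilities the indexer may use without breaking any searcher,
/// i.e. those of its own capabilities that every searcher also supports.
pub fn usable_capabilities<'a>(
    indexer: &BTreeSet<&'a str>,
    searchers: &[BTreeSet<&'a str>],
) -> BTreeSet<&'a str> {
    indexer
        .iter()
        .copied()
        .filter(|cap| searchers.iter().all(|s| s.contains(cap)))
        .collect()
}
```

Once the last old search instance is updated, the result grows to the indexer's full set and the new format features can be switched on.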
I'm ok with this change if you think it will make our life easier.
Assuming we change the codec of the posting list: do we express that with one capability per codec version, like postings_codec_1 and postings_codec_2?
Yes, every change in the format should be expressed as a capability. Capabilities can also cover other parts, like search features.
I think it will make consuming tantivy easier, but not necessarily developing it. It also allows us to phase out features without hard version cuts.
Should we postpone this for the moment then?
I think if we add this early, we can iron out some quirks before it gets serious, and it may help with experimenting without breaking everything.
Some sort of backwards compatibility or in-place upgrade would be very useful.
I'm struggling with tantivy on lib.rs, because every version bump requires me to reindex all crates. With 100K crates now this is very slow. I've recently added cache expiration for GitHub data, and found out the hard way that after the caches expired, I can't even reindex all the crates now due to GitHub API rate limits.
@kornelski Thanks for the feedback. We are working on a concept for a more stable format. The next version will be breaking again though, since too many fundamentals will change (e.g. null handling). Sorry for the inconvenience in advance.
An upgrade tool is not currently planned, but it's probably possible to hack something together. I would help by providing the format changes if someone is interested.