Use SPI instead of Enum for VectorSimilarityFunctions
Description
This PR is to get feedback on the idea and any major changes required in the commit.
In this commit we are using Java SPI instead of ENUM to define VectorSimilarityFunction used for vector search. Current implementation in Lucene tightly couples ENUM to Lucene index i.e. enum values are directly used in field info which are stored in index and later used for reading purpose. This makes adding or removing any similarity function to the ENUM very difficult as removing any function from ENUM can change the ordinal value of the functions and cause mismatch of field while reading fields from the Lucene index. On the other hand Java SPI makes life easier by providing pluggable implementation of the similarity function. With SPI, we can extend or remove similarity function easily without the need for changing things at indexing and searching sides.
For backward compatibility, I have kept ordinals with the similarity functions which can be directly used for writing/reading fields to/from the index. For Lucene version >=10 we can avoid using ordinal values and directly use function name for reading/writing the index.
@benwtrent @uschindler @ChrisHegarty Could you please take a look, if you get a chance?
@Pulkitg64 +1 on the feature and functionality. I would like to recommend one thing here:
Can we add the reloading the SPIs functionality for VectorSimilarityFunctions just like we have for codecs. Ref: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/Codec.java#L126-L137
this will ensure that vector similarity functions which are not inside Lucene can be created if anyone want to put a different similarity functions like l1, linf etc for doing similarity search.
Makes sense. Thanks @navneet1v for the suggestion.
I did kind of change before, and the added complexity and backwards compatibility concerns just didn't seem warranted. This is why the decision to do the scorer pluggability was added to the codecs.
How do you communicate to codecs the similarity kind?
For example, Vector quantization needs to know if the similarity is cosine. Do we do a "cosine".equals(similarity.name())? That is very fragile. What if some one does an "optimized cosine"? Name conflict or they just cant use vector quantization.
What if someone adds an SPI similarity that should be normalized before being quantized?
I don't see a way of doing this without leaking this internal dependency (normalization being required) into the similarity SPI, which is weird to me.
I just don't think this is feasible.
My old PR: https://github.com/apache/lucene/pull/13200
I have some comments, but this is not a final review. Just things that I stumbled upon on first walkthrough.
I will have no time to do a closer review soon, so please give me some time.
Thank you @uschindler for all the comments, I have tried to address them in the latest commit. Looking forward for more.
I did kind of change before, and the added complexity and backwards compatibility concerns just didn't seem warranted. This is why the decision to do the scorer pluggability was added to the codecs.
How do you communicate to codecs the similarity kind?
For example, Vector quantization needs to know if the similarity is cosine. Do we do a "cosine".equals(similarity.name())? That is very fragile. What if some one does an "optimized cosine"? Name conflict or they just cant use vector quantization.
What if someone adds an SPI similarity that should be normalized before being quantized?
I don't see a way of doing this without leaking this internal dependency (normalization being required) into the similarity SPI, which is weird to me.
I just don't think this is feasible.
Thank you @benwtrent for the feedback and sharing your PR. I didn't know, this attempt was already made before. The main motivation behind this change was to get rid of ENUM implementation which is tightly coupled to field-info. This has caused inconvenience in deprecating the COSINE function from the list. Not sure what's the best approach then. Will take a look at your PR and comments to understand more about this.
The main motivation behind this change was to get rid of ENUM implementation which is tightly coupled to field-info. This has caused inconvenience in deprecating the COSINE function from the list. Not sure what's the best approach then. Will take a look at your PR and comments to understand more about this.
This is not a very compelling reason supporting the proposed change. I agree with @benwtrent, there are several challenges that would need to be ironed out before we could consider moving this forward.
The recent addition of FlatVectorsScorer has certainly improved the extensibility in this area. For now at least, it appears to offer what is needed to plugin in new implementations.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
Wanted to touch base on this PR as it seems to have been stalled, mainly by me.
The only format that would support pluggable similarities would be Lucene99HnswVectorsFormat. Any of the quantized codecs would have to throw an exception on an unknown similarity name.
This now complicates a user's mental model support matrix. Having to consider not only codecs, but similarities, all of which are pluggable.
This inherit complexity of making things pluggable is why I think the implication of the "pluggable similarities" is "just make your own format".
However, I am not against moving away from enum and moving towards a nominal/id set of core interfaces. Enums are notoriously painful for BWC as removing one adjusts its "id" and now various edge's have to be smoothed out all over the place.
All this talk of a pluggable SPI for vector similarities spawned out of the complexities of adding fully BWC similarity functions and the difficulty of deprecating and moving on.
So, I propose:
- We deprecate cosine (as we already have) and remove it from being writable in v10
- Move away from enums to an id/nominal system for the similarities (what this PR could do)
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!