Introduces efSearch as a separate parameter in KNN{Byte:Float}VectorQuery
Description
efSearch is one of the parameters in HNSW algorithm which can help increase recall. efSearch value indicates the neighbors that the algorithm explores to get the nearest ones on L0 layer. Currently k is being used as the efSearch value and there isn't a way to distinguish these two values. This PR adds support for efSearch
CC: @jimczi, @benwtrent similar to #12551 except efSearch is static here.
I am not overly convinced that this is necessary.
I am not hard and fast against it, I just don't see the need.
For HNSW efSearch is a core parameters during search time. This is convenient for users to not have to have the logic to strip off top k values on their end.
Why not Lucene have the support for something that is a core parameter, when its cheap enough to have it so users have the option to choose how they want to use it?
For HNSW efSearch is a core parameters during search time. This is convenient for users to not have to have the logic to strip off top k values on their end.
I understand, but this is trivial to do yourself by collecting and rewriting the query directly.
Why not Lucene have the support for something that is a core parameter, when its cheap enough to have it so users have the option to choose how they want to use it?
While it is cheap, it is a further coupling of kNN to HNSW. Now the user interface itself when querying assumes "HNSW-esque" things. Instead of a simple "Give me k values" its now "give me k values with this particular HNSW parameter".
There has been talk of adding a "flat" codec to Lucene, that simply takes advantage of quantization and stores the vectors not in the HNSW graph. In that instance, what would efSearch mean?
There has been talk of adding a "flat" codec to Lucene, that simply takes advantage of quantization and stores the vectors not in the HNSW graph. In that instance, what would efSearch mean?
@benwtrent I really excited to see what are developing on the flat codec part. But coming back to ef_search as a parameter, I feel this is an essential parameter for HNSW. So, if we can abstract this and any other parameter that can come in future for any algorithm in a class SearchParameters or VectorSearchParameters it will solve this use case and also gives the flexibility to add new algorithm related search parameters later on.
We used to have this as a separate parameter, but after discussing, we realized this is completely equivalent to running search with larger k (=efSearch) and then discarding all but the desired number of hits. Given that, we chose to simplify the API in order to make it more generic and not expose this internal algorithm detail. This enables us to potentially switch to a different algorithm with different knobs without having to change the API.
If you're interested in looking at the history, see Lucene90OnHeapHnswGraph ( in backwards_codecs) and you can see we used to have this numSeed parameter which is the equivalent to what is being asked for here. But we removed it, because we found it wasn't needed.
Now the user interface itself when querying assumes "HNSW-esque" things.
I am wondering why this wasn't raised in #12551
There has been talk of adding a "flat" codec to Lucene, that simply takes advantage of quantization and stores the vectors not in the HNSW graph. In that instance, what would efSearch mean?
Apologies if my question sounds naive as I don't have context on "flat" codec, I am curious how will the user be able to choose that codec vs the current one which uses HNSW? Or will the user not have that option?. Are we planning to have a separate Query similar to KNN{Byte:Float}VectorQuery? In that case does it make sense to constraint the efsearch parameter to implementation of the abstract query?
Overall it seems like the implementation (Abstract class) is trying to prioritize making it generic over using params specific to algorithms. Doesn't this make the implementation rigid? There isn't a way to pass in additional parameters even if other algorithms are later used. For efSearch it might be easy enough to have users to have additional logic, assuming it will be for other implementations seems a bit optimistic. Shouldn't there be the flexibility for users to pass in parameters if the users want to while lucene providing defaults?
@benwtrent
if we can abstract this and any other parameter that can come in future for any algorithm in a class
SearchParametersorVectorSearchParameters
I thought about this and it isn't as simple as adding parameters. The current usage of k(=efsearch) is tied to TopKNNCollector. And if it needs to be generic, its doesn't seem to be possible to manipulate it until the code flow reaches the algorithm being used itself. Which can turn into an involved change compared to what is there currently in PR.
But we removed it, because we found it wasn't needed.
Is there a comment chain which I can look at to better understand this. Would be helpful if it is linked. Thanks!
@msokolov
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
Is there a comment chain which I can look at to better understand this. Would be helpful if it is linked. Thanks!
I went back to old email threads and old github issues & PRs, and as far as I can tell, even in Lucene 9.0, we have never had an efSearch parameter unless @msokolov can help us here discover the origins :).
But the reasoning to me is simple:
- Its not necessary and without it things are simpler (less code and less API features is a good thing by itself)
- Leaking the internal storage mechanism in the query API directly is frustrating.
I am not gonna block this PR or work. But I don't think its strictly necessary.
Sorry you had to dig unsuccessfully! We used to call this parameter fanout, and it was removed here: https://github.com/apache/lucene/commit/9b5e23396092ea1d4cfb19c8a996b8fc118c33e8. The discussion about it was here and here.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
@benwtrent
Its not necessary and without it things are simpler (less code and less API features is a good thing by itself)
This shouldn't come at the cost of not providing the core parameters of a widely used algorithm
Leaking the internal storage mechanism in the query API directly is frustrating.
This is more of a compute mechanism with ability to improve recall and not the storage mechanism, it leverages how its stored but not a storage mechanism in itself. Not providing those is limiting the users their ability to improve the results. Forcing them to have a work around (in this case bumping up the k) is even more frustrating as there will be some amount of logic that needs to added on each client end.
@msokolov
Thanks for linking those! Most of the discussion around removing the parameter is based on the fact that its in on an abstract layer and cannot be reused.
I do believe that there is a possibility of having an implementation which can make it abstract enough for other algorithms and provide flexibility to have parameters. It doesn't have to be either-or.
To conclude this discussion, I wouldn't want to introduce a non-optimal way to accomplish something when it isn't a blocker. The optimal way unfortunately will require significant refactor to accomplish both the goals.
Closing this for now