lucene icon indicating copy to clipboard operation
lucene copied to clipboard

Introduces efSearch as a separate parameter in KNN{Byte:Float}VectorQuery

Open shatejas opened this issue 1 year ago • 12 comments

Description

efSearch is one of the parameters in HNSW algorithm which can help increase recall. efSearch value indicates the neighbors that the algorithm explores to get the nearest ones on L0 layer. Currently k is being used as the efSearch value and there isn't a way to distinguish these two values. This PR adds support for efSearch

shatejas avatar May 23 '24 21:05 shatejas

CC: @jimczi, @benwtrent similar to #12551 except efSearch is static here.

shatejas avatar May 23 '24 21:05 shatejas

I am not overly convinced that this is necessary.

I am not hard and fast against it, I just don't see the need.

benwtrent avatar May 29 '24 17:05 benwtrent

For HNSW efSearch is a core parameters during search time. This is convenient for users to not have to have the logic to strip off top k values on their end.

Why not Lucene have the support for something that is a core parameter, when its cheap enough to have it so users have the option to choose how they want to use it?

shatejas avatar May 29 '24 17:05 shatejas

For HNSW efSearch is a core parameters during search time. This is convenient for users to not have to have the logic to strip off top k values on their end.

I understand, but this is trivial to do yourself by collecting and rewriting the query directly.

Why not Lucene have the support for something that is a core parameter, when its cheap enough to have it so users have the option to choose how they want to use it?

While it is cheap, it is a further coupling of kNN to HNSW. Now the user interface itself when querying assumes "HNSW-esque" things. Instead of a simple "Give me k values" its now "give me k values with this particular HNSW parameter".

There has been talk of adding a "flat" codec to Lucene, that simply takes advantage of quantization and stores the vectors not in the HNSW graph. In that instance, what would efSearch mean?

benwtrent avatar Jun 04 '24 20:06 benwtrent

There has been talk of adding a "flat" codec to Lucene, that simply takes advantage of quantization and stores the vectors not in the HNSW graph. In that instance, what would efSearch mean?

@benwtrent I really excited to see what are developing on the flat codec part. But coming back to ef_search as a parameter, I feel this is an essential parameter for HNSW. So, if we can abstract this and any other parameter that can come in future for any algorithm in a class SearchParameters or VectorSearchParameters it will solve this use case and also gives the flexibility to add new algorithm related search parameters later on.

navneet1v avatar Jun 04 '24 22:06 navneet1v

We used to have this as a separate parameter, but after discussing, we realized this is completely equivalent to running search with larger k (=efSearch) and then discarding all but the desired number of hits. Given that, we chose to simplify the API in order to make it more generic and not expose this internal algorithm detail. This enables us to potentially switch to a different algorithm with different knobs without having to change the API.

If you're interested in looking at the history, see Lucene90OnHeapHnswGraph ( in backwards_codecs) and you can see we used to have this numSeed parameter which is the equivalent to what is being asked for here. But we removed it, because we found it wasn't needed.

msokolov avatar Jun 10 '24 14:06 msokolov

Now the user interface itself when querying assumes "HNSW-esque" things.

I am wondering why this wasn't raised in #12551

There has been talk of adding a "flat" codec to Lucene, that simply takes advantage of quantization and stores the vectors not in the HNSW graph. In that instance, what would efSearch mean?

Apologies if my question sounds naive as I don't have context on "flat" codec, I am curious how will the user be able to choose that codec vs the current one which uses HNSW? Or will the user not have that option?. Are we planning to have a separate Query similar to KNN{Byte:Float}VectorQuery? In that case does it make sense to constraint the efsearch parameter to implementation of the abstract query?

Overall it seems like the implementation (Abstract class) is trying to prioritize making it generic over using params specific to algorithms. Doesn't this make the implementation rigid? There isn't a way to pass in additional parameters even if other algorithms are later used. For efSearch it might be easy enough to have users to have additional logic, assuming it will be for other implementations seems a bit optimistic. Shouldn't there be the flexibility for users to pass in parameters if the users want to while lucene providing defaults?

@benwtrent

shatejas avatar Jun 10 '24 18:06 shatejas

if we can abstract this and any other parameter that can come in future for any algorithm in a class SearchParameters or VectorSearchParameters

I thought about this and it isn't as simple as adding parameters. The current usage of k(=efsearch) is tied to TopKNNCollector. And if it needs to be generic, its doesn't seem to be possible to manipulate it until the code flow reaches the algorithm being used itself. Which can turn into an involved change compared to what is there currently in PR.

shatejas avatar Jun 10 '24 18:06 shatejas

But we removed it, because we found it wasn't needed.

Is there a comment chain which I can look at to better understand this. Would be helpful if it is linked. Thanks!

@msokolov

shatejas avatar Jun 10 '24 18:06 shatejas

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Jun 25 '24 00:06 github-actions[bot]

Is there a comment chain which I can look at to better understand this. Would be helpful if it is linked. Thanks!

I went back to old email threads and old github issues & PRs, and as far as I can tell, even in Lucene 9.0, we have never had an efSearch parameter unless @msokolov can help us here discover the origins :).

But the reasoning to me is simple:

  • Its not necessary and without it things are simpler (less code and less API features is a good thing by itself)
  • Leaking the internal storage mechanism in the query API directly is frustrating.

I am not gonna block this PR or work. But I don't think its strictly necessary.

benwtrent avatar Jun 25 '24 11:06 benwtrent

Sorry you had to dig unsuccessfully! We used to call this parameter fanout, and it was removed here: https://github.com/apache/lucene/commit/9b5e23396092ea1d4cfb19c8a996b8fc118c33e8. The discussion about it was here and here.

msokolov avatar Jun 25 '24 11:06 msokolov

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Jul 10 '24 00:07 github-actions[bot]

@benwtrent

Its not necessary and without it things are simpler (less code and less API features is a good thing by itself)

This shouldn't come at the cost of not providing the core parameters of a widely used algorithm

Leaking the internal storage mechanism in the query API directly is frustrating.

This is more of a compute mechanism with ability to improve recall and not the storage mechanism, it leverages how its stored but not a storage mechanism in itself. Not providing those is limiting the users their ability to improve the results. Forcing them to have a work around (in this case bumping up the k) is even more frustrating as there will be some amount of logic that needs to added on each client end.

@msokolov

Thanks for linking those! Most of the discussion around removing the parameter is based on the fact that its in on an abstract layer and cannot be reused.

I do believe that there is a possibility of having an implementation which can make it abstract enough for other algorithms and provide flexibility to have parameters. It doesn't have to be either-or.

To conclude this discussion, I wouldn't want to introduce a non-optimal way to accomplish something when it isn't a blocker. The optimal way unfortunately will require significant refactor to accomplish both the goals.

Closing this for now

shatejas avatar Jul 15 '24 17:07 shatejas