lucene Should we use the new `expectedVisitedNodes` estimator to speed up filtered vector search?

Description

@benwtrent 's PR at #14836 introduced an estimator for the number of visited nodes based on k and the number of vectors called expectedVisitedNodes(int k, int size).

Currently, filtered vector search works roughly this way:

compute a per-leaf k value based on the global k value
if (filterCost <= perLeafK) do an exact search
else run an approximate search, stop after visiting filterCost+1 nodes
if the approximate search stopped early (and is thus incomplete), run an exact search

Could/should we update step 2 above to do an exact search if (filterCost <= expectedVisitedNodes(perLeafK, size))? This should help save approximate searches that are almost certainly going to be stopped early and ignored anyway?

Jun 25 '25 12:06 jpountz

imho, the heuristic from https://github.com/apache/lucene/pull/14836 is good enough for the sparse vs fixed bitset decision, whereas it might be a bit too naive for the case of filtered search e.g., as per comment, we should probably take maxConn into account. nevertheless it might be just better than what we have today (perLeafK), so it's probably worth a try.

Jun 25 '25 15:06 tteofili

whereas it might be a bit too naive for the case of filtered search

I think we can augment the expected exploration required when executing a filtered search (e.g. more exploration will likely occur when filtering).

However, if the filtered set is already < the expectedVisited set, I don't see why we cannot simply drop to exact search.

Jun 25 '25 20:06 benwtrent

@jpountz I think I should be able to work on this if no one else is on it.

I did a little exploration around it and this is what I understood: per-leaf k value tells if the filter is so selective that it’s less than the expected number of results.

and with expectedVisitedNodes it implies, is the filter so selective that it’s less than the expected number of nodes the search would visit (i.e. the search would be exhaustive anyway)

Aug 09 '25 20:08 akhilesh-k

@akhilesh-k Nobody else is on it that I know of. I expect conflicts with #14963 but I don't think that it should refrain you from giving it a try.

Aug 10 '25 13:08 jpountz

thanks @jpountz. I checked #14963, seems interesting optimization and certainly doesn't block me to try this out.

Aug 14 '25 12:08 akhilesh-k

I opened https://github.com/apache/lucene/pull/15070 to address this issue. Looking forward to reviews and suggestions. Thanks!

Aug 14 '25 16:08 akhilesh-k