luceneutil icon indicating copy to clipboard operation
luceneutil copied to clipboard

Can we re-enable strict checking for KNN queries?

Open mikemccand opened this issue 2 years ago • 7 comments

A while back we discovered that KNN was producing non-deterministic results even on a deterministic index, and disabled strict top N hit checking for KNNVectorQuery.

We think/thought this was because the "deterministically random" source was seeded from each segment's GUID, but we've since fixed this query to use a fixed random seed (42 of course).

Can we now re-enable strict checking of the top KNN hits?

mikemccand avatar Sep 30 '22 11:09 mikemccand

I think there was an alternative proposal of tracking recall over time instead of checking that hits are strictly the same.

jpountz avatar Oct 04 '22 07:10 jpountz

I believe it's better to track recall overtime rather than strictly checking the hit, because KNN is an approximation itself so it is highly possible we'll able to improve recall by tuning the algorithm or implementation over the time.

Also it'd be nice if we can track recall over the time along with the performance.

zhaih avatar Oct 07 '22 18:10 zhaih

everything in search is an approximation: BM25, etc etc. There's absolutely no reason to give KNN some kind of free pass to leniency park. Leniency isn't going to help anything here.

rmuir avatar Oct 08 '22 11:10 rmuir

I think the idea is to track "recall" over time (ie how close the approximation is) so we can get an idea how the algorithm improves/degrades. We could also have strict checking, but sometimes we will change the results, and allow it to go forward anyway, and we still want to track recall. I posted a hinky chart in my ApacheCon talk: http://falutin.net/2022/10/11/apachecon-2022/ see slide #20 - it shows the recall as a blue line with arrows on top of the underlying QPS graph

On Sat, Oct 8, 2022 at 7:13 AM Robert Muir @.***> wrote:

everything in search is an approximation: BM25, etc etc. There's absolutely no reason to give KNN some kind of free pass to leniency park. Leniency isn't going to help anything here.

— Reply to this email directly, view it on GitHub https://github.com/mikemccand/luceneutil/issues/201#issuecomment-1272295116, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAHHUQOS76YT6NWI7WUCM7LWCFJOJANCNFSM6AAAAAAQZVXVMM . You are receiving this because you are subscribed to this thread.Message ID: @.***>

msokolov avatar Oct 11 '22 16:10 msokolov

we could track recall/precision/MAP for BM25 scoring too, but we don't. we are strict and it gives some confidence that scoring is working correctly: hasn't changed unless we intended it to. KNN isn't any different/special here.

rmuir avatar Oct 12 '22 00:10 rmuir

It's OK for the KNN results to change -- that's just needs a nightly regolding as long as we know/accept the source/reason for the change.

mikemccand avatar Oct 13 '22 09:10 mikemccand

there's also no harm in tracking relevance, if its not difficult to do. I just think we should stay "strict"

rmuir avatar Oct 13 '22 11:10 rmuir