
Wrong distance value sometimes when using a separate candidates pool

Open maksle opened this issue 2 years ago • 0 comments

I realize this project is dead, but leaving this here in case it helps someone else.

When using getAllNearestNeighbors with different RDDs for items and candidates, I noticed that the distance column is often incorrect. The cause is that updateHashBuckets is called with the same itemVectors for both the items and the candidates, and it maintains a mapping from itemId to item vector. If any IDs overlap between the item and candidate RDDs, the reported distance is between an item vector and another item vector that happened to be in the same hash bucket and shares an ID with the intended candidate, rather than between the item vector and the candidate vector itself.
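To make the failure mode concrete, here is a toy sketch in plain Python. It is not the scanns code: the data and the function names `distance_shared_map` and `distance_separate_maps` are made up for illustration. It only shows how a single ID-to-vector map shared by both pools returns the wrong distance when an ID appears in both.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy data: both pools use ID 1, but for different vectors.
items = {1: (0.0, 0.0), 2: (1.0, 0.0)}
candidates = {1: (3.0, 4.0)}


def distance_shared_map(item_id, candidate_id):
    # Buggy pattern: one id -> vector map, built from the items only,
    # is consulted for both sides of the pair.
    vectors = dict(items)
    return euclidean(vectors[item_id], vectors[candidate_id])


def distance_separate_maps(item_id, candidate_id):
    # Intended pattern: each side is looked up in its own pool.
    return euclidean(items[item_id], candidates[candidate_id])


# Item 2 is supposed to be compared against candidate 1.
buggy = distance_shared_map(2, 1)       # actually item 2 vs item 1
correct = distance_separate_maps(2, 1)  # item 2 vs candidate 1

print(buggy, correct)
```

In this toy model the shared map silently resolves candidate ID 1 to the item with ID 1, so the result looks plausible but belongs to the wrong pair. Keeping the two lookups separate avoids it here; whether a fix in scanns itself would take that shape is an assumption.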

maksle Apr 18 '22 17:04