scanns
Wrong distance value sometimes when using a separate candidates pool
I realize this project is dead, but leaving this here in case it helps someone else.
When using `getAllNearestNeighbors` with different RDDs for items and candidates, I noticed that the distance column is often incorrect. This is because `updateHashBuckets` is called with the same `itemVectors` for both the items and the candidates, and it maintains a mapping from `itemId` to item vector. If the item and candidate RDDs have overlapping IDs, you end up with the distance between an item vector and another item vector (rather than a candidate vector) that happened to be in the same hash bucket and shares an ID with the candidate vector it was supposed to be matched with.
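To make the failure mode concrete, here is a minimal plain-Python sketch of the ID collision described above. It is not the scanns code (which is Scala on Spark) and does not reproduce `updateHashBuckets`; the data, the function names, and the use of Euclidean distance are all assumptions chosen for illustration. It only shows why a single ID-to-vector lookup built from the items gives the wrong distance when a candidate reuses an item's ID.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical pools: both use ID 1, but for different vectors.
items = {1: (0.0, 0.0), 2: (6.0, 8.0)}
candidates = {1: (6.0, 5.0)}


def distance_shared_lookup(item_id, candidate_id):
    # Simplified model of the problem: one id -> vector map, built from
    # the items only, is consulted for both sides of the pair.
    lookup = dict(items)
    return euclidean(lookup[item_id], lookup[candidate_id])


def distance_separate_lookup(item_id, candidate_id):
    # Each side is resolved against its own pool.
    return euclidean(items[item_id], candidates[candidate_id])


# Item 2 is supposed to be compared with candidate 1.
buggy = distance_shared_lookup(2, 1)      # actually item 2 vs item 1
correct = distance_separate_lookup(2, 1)  # item 2 vs candidate 1

print(buggy, correct)
```

In this toy setup the shared lookup silently returns the item-to-item distance instead of the item-to-candidate distance. Under the same assumptions, giving the two pools disjoint IDs would avoid the collision.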