
Allow using a different distance for the nearest neighbors in fuzzy join

Open jeromedockes opened this issue 1 year ago • 1 comment

Problem Description

At the moment we use NearestNeighbors with the l2 distance. If we could choose the distance, then using MinHash as the text encoder and "hamming" as the distance would approximate 1 - Jaccard similarity (i.e. the Jaccard distance), which I believe is a common choice for fuzzy joining.
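
To make the claim concrete, here is a minimal from-scratch sketch of why the normalized Hamming distance between MinHash signatures approximates 1 - Jaccard similarity. It is not skrub's MinHashEncoder: the helper names (`char_ngrams`, `minhash_signature`, `hamming`, `jaccard_distance`), the use of character 3-grams, and the seeded blake2b hashing are all assumptions made for illustration.

```python
import hashlib


def char_ngrams(text, n=3):
    """Set of character n-grams of `text` (assumes len(text) >= n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(grams, n_components=1024):
    """One component per seeded hash function: the minimum hash over the set."""
    signature = []
    for seed in range(n_components):
        prefix = f"{seed}:".encode()
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(prefix + g.encode(), digest_size=8).digest(),
                    "big",
                )
                for g in grams
            )
        )
    return signature


def hamming(sig_a, sig_b):
    """Fraction of components that differ (same convention as sklearn's "hamming")."""
    return sum(a != b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def jaccard_distance(grams_a, grams_b):
    """Exact 1 - |A ∩ B| / |A ∪ B|."""
    return 1 - len(grams_a & grams_b) / len(grams_a | grams_b)


a = char_ngrams("university of california")
b = char_ngrams("univ. of california")

approx = hamming(minhash_signature(a), minhash_signature(b))
exact = jaccard_distance(a, b)
print(f"hamming on MinHash: {approx:.3f}, exact Jaccard distance: {exact:.3f}")
```

For each hash function, the two minima coincide with probability equal to the Jaccard similarity of the two sets, so the fraction of mismatching components estimates the Jaccard distance, with an error that shrinks as the number of components grows.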

Feature Description

The Joiner would have a "metric" or "distance" parameter that would be forwarded to the metric parameter of NearestNeighbors.
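
A rough sketch of what forwarding the metric could look like, using scikit-learn's NearestNeighbors directly. The function `nearest_matches` and the toy vectors are hypothetical and only stand in for the Joiner's internal matching step; this is not the Joiner's actual code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def nearest_matches(main_vectors, aux_vectors, metric="euclidean"):
    """Hypothetical helper: index of the closest aux row for each main row.

    `metric` is passed straight through to NearestNeighbors, which is the
    behavior proposed for the Joiner.
    """
    nn = NearestNeighbors(n_neighbors=1, metric=metric)
    nn.fit(aux_vectors)
    distances, indices = nn.kneighbors(main_vectors)
    return distances.ravel(), indices.ravel()


# Toy signatures (made up): row 0 of `aux` agrees with the query on 3 of 4
# components but has one very different value; row 1 agrees on none but is
# numerically close everywhere.
main = np.array([[10.0, 20.0, 30.0, 40.0]])
aux = np.array([
    [10.0, 20.0, 30.0, 900.0],
    [11.0, 21.0, 31.0, 41.0],
])

ham_dist, ham_idx = nearest_matches(main, aux, metric="hamming")
l2_dist, l2_idx = nearest_matches(main, aux, metric="euclidean")
print("hamming picks row", ham_idx[0], "| l2 picks row", l2_idx[0])
```

The two metrics select different rows here, which is why the choice matters for hash-based encodings: only equality of components is meaningful for MinHash values, not their numerical difference.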

Alternative Solutions

No response

Additional Context

No response

jeromedockes avatar Dec 18 '23 15:12 jeromedockes

If we could choose the distance, then using MinHash as the text encoder and "hamming" as the distance would approximate 1 - Jaccard similarity (i.e. the Jaccard distance), which I believe is a common choice for fuzzy joining.

Let's first benchmark whether it is actually useful before implementing this.

GaelVaroquaux avatar Dec 18 '23 17:12 GaelVaroquaux