
Suggestion to implement range_search

Open lamberpj opened this issue 4 months ago • 1 comments

Hi All, again - wonderful package and just terrific work.

One possible extension you might one day consider would be using FAISS's range_search function, instead of search (see https://github.com/facebookresearch/faiss/wiki/Special-operations-on-indexes#range-search). This would allow for a "many-to-many" match in the more traditional sense, perhaps aligning the behaviour of the LT package to prior fuzzy matching packages.

The main drawback is that range_search is not GPU-friendly, but in my experience it runs efficiently on CPUs.
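For reference, here is a minimal sketch of what such a call might look like, assuming a flat inner-product index over L2-normalised embeddings (so the score is cosine similarity). The helper name `range_search_cosine` is hypothetical and not part of LT or FAISS, and the NumPy fallback is only there so the sketch runs when faiss is not installed.

```python
import numpy as np

try:
    import faiss  # optional; the fallback below mimics the same output layout
except ImportError:
    faiss = None


def range_search_cosine(queries, corpus, threshold):
    """Return (lims, D, I): every corpus vector with cosine similarity > threshold.

    Matches for query i are I[lims[i]:lims[i + 1]], with scores in the
    same slice of D. Results within a query are not sorted.
    """
    # L2-normalise so that the inner product equals cosine similarity.
    q = np.ascontiguousarray(queries, dtype="float32")
    c = np.ascontiguousarray(corpus, dtype="float32")
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)

    if faiss is not None:
        index = faiss.IndexFlatIP(c.shape[1])  # exact inner-product index (CPU)
        index.add(c)
        # For inner-product indexes, hits with score > threshold are kept.
        return index.range_search(q, threshold)

    # Brute-force fallback with the same (lims, D, I) layout.
    sims = q @ c.T
    mask = sims > threshold
    lims = np.concatenate([[0], np.cumsum(mask.sum(axis=1))])
    rows, cols = np.nonzero(mask)
    return lims, sims[rows, cols], cols
```

Unlike `search`, the number of hits per query is variable, which is what makes the many-to-many behaviour possible.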

FWIW, my use case is matching the universe of job postings to DnB establishments. I use range_search with your firm-name embeddings to build a dataset of all pairwise matches above a fairly low similarity threshold (0.5). This gives me a huge set of candidate matches. I then run an expectation-maximisation algorithm that considers both the similarity scores and other structured covariates (not necessarily exact-match criteria), such as industry codes and location distance, to resolve the best match from this candidate set.
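To illustrate the candidate-set step, here is a rough sketch of flattening the `(lims, D, I)` triple that range_search returns into one row per candidate pair. The function name `to_candidate_pairs` is hypothetical, and the downstream EM step is not shown.

```python
import numpy as np


def to_candidate_pairs(lims, D, I):
    """Expand range_search output into aligned (query_id, match_id, score) arrays.

    lims has length n_queries + 1; hits for query i sit in
    I[lims[i]:lims[i + 1]] with scores in the same slice of D.
    """
    lims = np.asarray(lims, dtype=np.int64)
    counts = np.diff(lims)  # number of hits per query
    # Repeat each query index once per hit so it lines up with I and D.
    query_ids = np.repeat(np.arange(len(counts)), counts)
    return query_ids, np.asarray(I), np.asarray(D)


# Toy input: query 0 has two hits, query 1 has none, query 2 has one.
q_ids, m_ids, scores = to_candidate_pairs(
    lims=[0, 2, 2, 3], D=[0.9, 0.6, 0.7], I=[4, 1, 0]
)
```

The three arrays can then be joined against the structured covariates (industry codes, locations, and so on) before resolving the best match.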

I would be happy to help implement this one day, if you feel it's something you would want to pursue.

Thanks again for all the great work, it's hugely appreciated by many!

lamberpj, Feb 25 '24 22:02