splink FYI: paper using combo of embeddings and comparisons

Open NickCrews opened this issue 1 year ago • 1 comments

Not sure how directly actionable this is, but I stumbled across it and found it really interesting. It seems very relevant for record linkage.

https://blog.research.google/2023/11/best-of-both-worlds-achieving.html?m=1

Relevant ideas:

- Perform blocking (using embeddings) and then comparisons iteratively, repeating the pair of steps multiple times as they do, instead of doing all the blocking first and then all the comparisons.
- Take their datasets and use them as benchmarks, since they seem to have labeled data, which is super valuable. They all seem to be embeddings-based, but perhaps we could use them for our purposes.
- Perform clustering using the KwikCluster algorithm they cite. It seems like a natural extension of, and more noise-resistant than, the current connected components algorithm.
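To make the KwikCluster idea concrete, here is a minimal sketch of the pivot-based procedure as I understand it: visit records in random order, and let each still-unassigned record (the pivot) absorb its still-unassigned matching neighbours. The function name `kwik_cluster`, the toy `nodes`/`edges`, and the edge-list input format are all assumptions for illustration; nothing here is splink API.

```python
import random

def kwik_cluster(nodes, match_edges, seed=0):
    """Pivot-based clustering sketch: each random pivot grabs its unassigned neighbours."""
    rng = random.Random(seed)
    # Build an undirected adjacency map from the predicted-match edges.
    neighbours = {n: set() for n in nodes}
    for a, b in match_edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Visiting nodes in a random permutation is equivalent to repeatedly
    # drawing a random pivot from the nodes that are still unassigned.
    order = list(nodes)
    rng.shuffle(order)

    assigned = set()
    clusters = []
    for pivot in order:
        if pivot in assigned:
            continue
        cluster = {pivot} | (neighbours[pivot] - assigned)
        assigned |= cluster
        clusters.append(cluster)
    return clusters

# Toy data (hypothetical): two true entities, each a triangle of matches,
# joined by one spurious edge a1-b1.
nodes = ["a1", "a2", "a3", "b1", "b2", "b3"]
edges = [
    ("a1", "a2"), ("a1", "a3"), ("a2", "a3"),
    ("b1", "b2"), ("b1", "b3"), ("b2", "b3"),
    ("a1", "b1"),  # the noisy link
]

print(kwik_cluster(nodes, edges, seed=0))
```

On this toy graph, connected components would merge all six records into one cluster because of the single noisy link. The pivot approach splits them into two clusters whichever pivot is drawn first, although an unlucky pivot (`a1` or `b1`) pulls one record across to the wrong side.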

Nov 04 '23 02:11 NickCrews

Another idea: use the clustering approach from one of the cited baselines. Take the k-nearest-neighbor graph (so at most k*N elements for N records), materialize it, and feed it through sklearn's `SpectralClustering`.
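A rough sketch of what that could look like with scikit-learn, assuming we already have one embedding vector per record. The random `embeddings`, the choice of `k`, and `n_clusters=2` are placeholders for illustration; in real record linkage the number of clusters is not known up front, which is a limitation of this approach.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.neighbors import kneighbors_graph

# Placeholder embeddings (assumption): two blobs of 20 records in 8 dimensions.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(20, 8)),
    rng.normal(loc=10.0, scale=1.0, size=(20, 8)),
])
n_records = embeddings.shape[0]
k = 5

# Materialize the sparse kNN graph: exactly k nonzeros per row, so k * N in total.
knn = kneighbors_graph(embeddings, n_neighbors=k, mode="connectivity", include_self=False)

# kNN is not symmetric, but spectral clustering expects a symmetric affinity,
# so keep an edge if either endpoint lists the other as a neighbour.
affinity = knn.maximum(knn.T)

model = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = model.fit_predict(affinity)
print(labels)
```

scikit-learn may warn that the graph is not fully connected when the blobs are well separated. `SpectralClustering` also offers `affinity="nearest_neighbors"`, which builds the kNN graph internally, but materializing it ourselves keeps the graph available for inspection or reuse.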

Nov 04 '23 03:11 NickCrews