
FYI: paper using combo of embeddings and comparisons

Open NickCrews opened this issue 1 year ago • 1 comments

Not sure how directly actionable this is, but I stumbled across it and found it really interesting. It seems very relevant for record linkage.

https://blog.research.google/2023/11/best-of-both-worlds-achieving.html?m=1

Relevant ideas:

  1. Interleave blocking (using embeddings) and comparisons, repeating the pair over multiple rounds as they do, instead of doing all blocking and then all comparisons.
  2. Take their datasets and use them as benchmarks, since they seem to include labeled data, which is super valuable. The datasets all seem to be embedding-based, but perhaps we could adapt them for our purposes.
  3. Perform clustering using the KwikCluster algorithm they cite. It seems like a natural extension of, and more noise-resistant than, the current connected components algorithm.
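
To make idea 3 concrete, here is a rough sketch of KwikCluster, assuming it refers to the randomized pivot algorithm for correlation clustering: pick a random unclustered record as pivot, group it with its still-unclustered match neighbors, and repeat. The function name `kwikcluster` and the `(a, b)` edge-list input are hypothetical and not part of splink; the edges are assumed to be pairs that already scored above a match threshold.

```python
import random


def kwikcluster(nodes, match_edges, seed=0):
    """Pivot-based clustering sketch (hypothetical helper, not a splink API).

    nodes: iterable of hashable record ids.
    match_edges: iterable of (a, b) pairs predicted to be matches.
    """
    rng = random.Random(seed)

    # Build an undirected adjacency map from the predicted matches.
    neighbors = {n: set() for n in nodes}
    for a, b in match_edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Visiting nodes in a random order is equivalent to repeatedly
    # drawing a random pivot from the records not yet clustered.
    order = list(neighbors)
    rng.shuffle(order)

    clustered = set()
    clusters = []
    for pivot in order:
        if pivot in clustered:
            continue
        # The pivot takes only its direct, still-unclustered neighbors.
        cluster = {pivot} | (neighbors[pivot] - clustered)
        clustered |= cluster
        clusters.append(cluster)
    return clusters
```

Unlike connected components, a single spurious edge between two groups does not necessarily merge them here, because a cluster only ever contains direct neighbors of its pivot. The trade-off is that the output depends on the random pivot order.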

NickCrews avatar Nov 04 '23 02:11 NickCrews

  1. Use the clustering approach from one of the cited baselines: take the k-nearest-neighbor graph (so at most k*N entries for N records), materialize it, and feed it through sklearn's SpectralClustering.
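
A minimal sketch of that baseline, assuming each record already has an embedding vector in a NumPy array. The function name `spectral_cluster_knn` and its parameters are hypothetical; `kneighbors_graph` and `SpectralClustering` are the scikit-learn calls.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.neighbors import kneighbors_graph


def spectral_cluster_knn(embeddings, k, n_clusters, seed=0):
    """Cluster records via a materialized k-NN graph (hypothetical helper)."""
    # Sparse N x N matrix with k nonzero entries per row (k*N in total).
    knn = kneighbors_graph(
        embeddings, n_neighbors=k, mode="connectivity", include_self=False
    )
    # k-NN relations are not symmetric; SpectralClustering expects a
    # symmetric affinity matrix, so average the graph with its transpose.
    affinity = 0.5 * (knn + knn.T)
    model = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed", random_state=seed
    )
    labels = model.fit_predict(affinity)
    return labels, affinity
```

One caveat: SpectralClustering needs `n_clusters` up front, and in deduplication the number of distinct entities is usually not known in advance, so this would need some way of estimating it.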

NickCrews avatar Nov 04 '23 03:11 NickCrews