splink
FYI: paper using combo of embeddings and comparisons
Not sure how directly actionable this is, but I stumbled across it and found it really interesting. It seems very relevant for record linkage.
https://blog.research.google/2023/11/best-of-both-worlds-achieving.html?m=1
Relevant ideas:
- Alternate between blocking (using embeddings) and comparisons over several rounds, as they do, instead of doing all the blocking and then all the comparisons
- Use their datasets as benchmarks, since they seem to have labeled data, which is super valuable. The datasets seem to be entirely embeddings based, but perhaps we could adapt them for our purposes
- Perform clustering using the KwikCluster algorithm they cite. It seems like a natural extension of, and more noise-resistant than, the current connected components algorithm
- Try the clustering approach from one of the cited baselines: take the k-nearest-neighbor graph (so at most k*N edges), materialize it, and feed it through sklearn's SpectralClustering
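
To make the KwikCluster idea a bit more concrete, here is a minimal sketch of pivot-based clustering as I understand it: pick a random unclustered record as a pivot, group it with its still-unclustered predicted matches, and repeat. This is not Splink code and not the paper's implementation. The function name `kwik_cluster`, its arguments, and the toy graph are all made up for illustration, and it assumes the pairwise predictions have already been thresholded into a list of match edges.

```python
import random


def kwik_cluster(nodes, match_edges, seed=None):
    """Pivot-based clustering sketch (hypothetical helper, not a Splink API).

    nodes: iterable of record ids.
    match_edges: iterable of (id_a, id_b) pairs predicted to match.
    Assumes every id in match_edges also appears in nodes.
    """
    rng = random.Random(seed)

    # Build an undirected adjacency map from the predicted-match pairs.
    neighbours = {n: set() for n in nodes}
    for a, b in match_edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Walking a shuffled list and skipping already-clustered nodes is one
    # way to pick a random pivot among the remaining nodes at each step.
    order = list(neighbours)
    rng.shuffle(order)

    unclustered = set(order)
    clusters = []
    for pivot in order:
        if pivot not in unclustered:
            continue
        # The pivot only absorbs its direct, still-unclustered neighbours.
        cluster = {pivot} | (neighbours[pivot] & unclustered)
        unclustered -= cluster
        clusters.append(cluster)
    return clusters


if __name__ == "__main__":
    # Two triangles joined by a single (presumably spurious) link c-d.
    toy_nodes = ["a", "b", "c", "d", "e", "f"]
    toy_edges = [
        ("a", "b"), ("b", "c"), ("a", "c"),
        ("d", "e"), ("e", "f"), ("d", "f"),
        ("c", "d"),
    ]
    print(kwik_cluster(toy_nodes, toy_edges, seed=0))
```

In the toy graph, connected components would merge all six records into one cluster because of the single c-d link. The pivot approach cannot do that here, since a pivot only pulls in its direct neighbours, although the exact split still depends on which pivots get picked.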