A different approach to "community detection"?

Open fgregg opened this issue 5 years ago • 10 comments

Currently, we use hierarchical agglomerative clustering (HAC) to decide which sets of records refer to the same entity.

This method has been relatively easy to implement incrementally, but it has a number of disadvantages.

The algorithm is O(N^3) in time. The current implementation also requires instantiating a form of the dense adjacency matrix in memory, which takes O(N^2) space. (It would be possible to implement the algorithm over a sparse representation.)

Because of these algorithmic properties, we typically cannot use HAC directly on all compared pairs. Instead, we treat the compared pairs as the edgelist of a graph, find the connected components, and then apply HAC to each connected component in turn. Additionally, if we get a connected component that is too large, we have to break it apart before feeding the parts to HAC.

This is a lot of complexity, and some of the steps are not very principled (in particular, the breaking apart of connected components that are too big for HAC to handle practically).
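
For concreteness, here is a rough sketch of the shape of that pipeline; `hac_cluster` and `max_component_size` are hypothetical placeholders for illustration, not dedupe's actual internals:

```python
# Rough sketch of the connected-components-then-HAC pipeline described above.
import networkx as nx

def cluster_components(scored_pairs, hac_cluster, max_component_size=30_000):
    """scored_pairs: iterable of ((record_a, record_b), score) tuples."""
    graph = nx.Graph()
    graph.add_weighted_edges_from(
        (a, b, score) for (a, b), score in scored_pairs
    )
    for component in nx.connected_components(graph):
        if len(component) > max_component_size:
            # The unprincipled step: an oversized component has to be
            # broken apart somehow before HAC can handle it.
            continue
        yield from hac_cluster(graph.subgraph(component))
```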

Furthermore, HAC is intended to work over a complete distance matrix of the elements we want to cluster. Because of blocking, we almost never compare every possible pair of records, so we impute a score of 0 for the pairs that we didn't compare. This is unprincipled.
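
A minimal sketch of that imputation over a toy similarity matrix (an illustration, not dedupe's actual internals):

```python
# Uncompared pairs silently default to a similarity of 0, i.e. maximal distance.
import numpy as np

n_records = 5
similarity = np.zeros((n_records, n_records))  # uncompared pairs stay at 0
compared = {(0, 1): 0.9, (1, 2): 0.4}          # scores surviving blocking
for (i, j), score in compared.items():
    similarity[i, j] = similarity[j, i] = score
distance = 1.0 - similarity  # HAC input: an imputed 0 score becomes distance 1.0
```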

Taking a step back, we should investigate if there are better ways of achieving our goal for reasonably partitioning the compared record pairs into groups of records that we say are co-referent.

Desirable properties include

  • Better than O(N^2) algorithmic complexity
  • Better than O(N^2) memory complexity

One reasonable formulation of the problem would be to say that we want to identify groups of records that generally score as more similar to each other than to records outside the group. In network terms, this translates to finding relatively dense subgraphs.

Identifying such subgraphs is what the literature on community detection is about. Many community detection algorithms have very favorable algorithmic properties.

We should investigate whether they can be fit to our purposes.
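
As a rough illustration of what a drop-in community-detection pass might look like (a sketch only; the toy edge data and the use of networkx's Louvain implementation are assumptions, not a proposal for dedupe's API):

```python
# Louvain community detection over a small weighted similarity graph
# (requires networkx >= 2.8).
import networkx as nx
from networkx.algorithms.community import louvain_communities

graph = nx.Graph()
graph.add_weighted_edges_from(
    [("a", "b", 0.95), ("b", "c", 0.90), ("c", "d", 0.15), ("d", "e", 0.85)]
)
communities = louvain_communities(graph, weight="weight", seed=1)
print(communities)  # e.g. [{'a', 'b', 'c'}, {'d', 'e'}]
```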

fgregg avatar May 28 '20 14:05 fgregg

cc @fjsj @vm-wylbur

fgregg avatar May 28 '20 14:05 fgregg

Hi @fgregg, recently I tried to find papers that use community detection for Entity Resolution, but I couldn't find any resources that apply it to pairwise classification results. I've only found resources that simply use connected components instead.

I could only find a blocking algorithm called Transitive Locality-Sensitive Hashing (TLSH) that uses community detection on a graph generated from LSH, but that's probably not useful for this issue.

A while ago, I experimented with using infomap and louvain instead of HAC. It seems to work, but I didn't have a good truth set to figure out which one performed best. There's also leidenalg; its paper argues it's better because "the Louvain algorithm may yield arbitrarily badly connected communities. In the worst case, communities may even be disconnected, especially when running the algorithm iteratively".
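
For reference, a minimal leidenalg sketch (the toy edge data is illustrative only):

```python
# Leiden community detection over a small weighted similarity graph.
import igraph as ig
import leidenalg

edges = [("a", "b", 0.95), ("b", "c", 0.90), ("c", "d", 0.15), ("d", "e", 0.85)]
graph = ig.Graph.TupleList(edges, weights=True)
partition = leidenalg.find_partition(
    graph, leidenalg.ModularityVertexPartition, weights="weight"
)
for community in partition:
    print([graph.vs[i]["name"] for i in community])
```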

Worth noting that I had lots of memory issues while treating the graph as a pure Python list of integer pair tuples. Pure Python integers/tuples/lists take much more memory than a numpy matrix.
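
A quick way to see that overhead (rough illustration; exact numbers vary by Python version and platform):

```python
# Compare the footprint of a pure-Python edge list with a numpy array.
import sys
import numpy as np

n = 1_000_000
edges = [(i, i + 1) for i in range(n)]
arr = np.asarray(edges, dtype=np.int64)

# This undercounts the list side (it ignores the int objects themselves),
# yet it still comes out several times larger than the numpy array.
tuple_bytes = sys.getsizeof(edges) + sum(sys.getsizeof(t) for t in edges)
print(tuple_bytes, arr.nbytes)
```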

In case you haven't checked yet, Christen's Data Matching book mentions other clustering techniques at "6.9 Clustering-Based Approaches".


EDIT: I've found this: "Vertex deduplication based on string similarity and community membership". But it's similar to TLSH, not very general.

fjsj avatar May 28 '20 17:05 fjsj

> A while ago, I experimented with using infomap and louvain instead of HAC. It seems to work, but I didn't have a good truth set to figure out which one performed best. There's also leidenalg; its paper argues it's better because "the Louvain algorithm may yield arbitrarily badly connected communities. In the worst case, communities may even be disconnected, especially when running the algorithm iteratively".

would love to see a branch for this

> Worth noting that I had lots of memory issues while treating the graph as a pure Python list of integer pair tuples. Pure Python integers/tuples/lists take much more memory than a numpy matrix.

I'm absolutely sure. But if we can find some reasonable algorithms, I'm sure we can figure this out.

fgregg avatar May 28 '20 19:05 fgregg

@fgregg here are the experiments with community detection graph algorithms: https://github.com/vintasoftware/dedupe-clustering-experiments/blob/7cb8099e54dd0c724c1fd7fd606b8e6bd9a99ffd/Dedupe%20Clustering%20experiments.ipynb

I've tested with a very small dataset, but there are interesting results. I've added a graphical way to debug the graph. Hope it's useful!

The main question for me is: how do we apply the clustering threshold in the context of community detection?

fjsj avatar Jun 02 '20 00:06 fjsj

> The main question for me is: how do we apply the clustering threshold in the context of community detection?

some community detection algos do have tunable parameters, but one option that is always available is to pre-filter the edgelist by the threshold before passing it to the community detection algorithm.
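
A minimal sketch of that pre-filtering (the `scored_pairs` format is an assumption for illustration):

```python
# Drop all edges scoring below the clustering threshold before running
# community detection on what remains.
scored_pairs = [(("a", "b"), 0.92), (("b", "c"), 0.35), (("c", "d"), 0.71)]
threshold = 0.5

filtered_edges = [
    (a, b, score) for (a, b), score in scored_pairs if score >= threshold
]
# filtered_edges -> [('a', 'b', 0.92), ('c', 'd', 0.71)]
```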

fgregg avatar Jun 02 '20 02:06 fgregg

Found two nice articles that discuss various techniques. They are graph-based, but not community-based:

fjsj avatar Jun 08 '20 14:06 fjsj

Are you going to explore any of these?

fgregg avatar Jul 01 '20 15:07 fgregg

Not yet; I would like to explore the general community detection algorithms first.

Can you suggest a dataset? Ideally one with the true pairs and varied cluster sizes. For example, this febrl one:

This data set contains 5000 records (2000 originals and 3000 duplicates), with a maximum of 5 duplicates based on one original record (and a Zipf distribution of duplicate records). Distribution of duplicates:

  • 168 original records have 5 duplicate records
  • 161 original records have 4 duplicate records
  • 212 original records have 3 duplicate records
  • 256 original records have 2 duplicate records
  • 368 original records have 1 duplicate record
  • 1835 original records have no duplicate record

fjsj avatar Jul 01 '20 19:07 fjsj

many of the examples in dedupe-examples have ground truths, particularly csv_example, patent_example, and record_linkage_example.

fgregg avatar Jul 02 '20 16:07 fgregg

the library https://github.com/sknetwork-team/scikit-network has a friendly numpy-like API and implements the Louvain algorithm.
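
For example, a minimal sketch (the toy adjacency matrix is an assumption, and API details may vary across scikit-network versions):

```python
# Louvain clustering over a small sparse similarity matrix with scikit-network.
from scipy.sparse import csr_matrix
from sknetwork.clustering import Louvain

# Records 0-1 are strongly linked, as are records 2-3.
adjacency = csr_matrix(
    [[0.0, 0.9, 0.1, 0.0],
     [0.9, 0.0, 0.0, 0.0],
     [0.1, 0.0, 0.0, 0.8],
     [0.0, 0.0, 0.8, 0.0]]
)
louvain = Louvain()
louvain.fit(adjacency)
print(louvain.labels_)  # one community label per record, e.g. [0 0 1 1]
```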

fjsj avatar Feb 16 '22 12:02 fjsj