
disk-based connected components

fgregg opened this issue 4 years ago • 18 comments

The usual scaling bottleneck is the connected components step in clustering: the current algorithm loads the entire edgelist into memory.

We need to move this to disk.

Things worth exploring:

  1. connected components in sqlite (a recursive-CTE sketch follows this list): https://stackoverflow.com/questions/33465859/a-number-of-connected-components-of-a-graph-in-sql

  2. look for or implement a specialized, lightweight graph store like https://link.springer.com/chapter/10.1007/978-3-319-01863-8_20
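
For option 1, a minimal sqlite3 sketch of the recursive-CTE idea from that Stack Overflow answer (my own illustration, not dedupe code; it assumes an edges(a, b) table stored in both directions and computes a full transitive closure, so it probably only works for modest graphs):

```python
import sqlite3

# toy edges table, stored symmetrically so a single recursive term suffices
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (a INTEGER, b INTEGER)")
pairs = [(1, 2), (2, 3), (4, 5)]
con.executemany("INSERT INTO edges VALUES (?, ?)",
                [(x, y) for a, b in pairs for x, y in ((a, b), (b, a))])

# each node's component is the smallest node id it can reach
components = con.execute("""
    WITH RECURSIVE reach(node, reachable) AS (
        SELECT a, a FROM edges
        UNION
        SELECT r.node, e.b
        FROM reach AS r JOIN edges AS e ON e.a = r.reachable
    )
    SELECT node, MIN(reachable) AS component
    FROM reach
    GROUP BY node
    ORDER BY node
""").fetchall()
print(components)  # [(1, 1), (2, 1), (3, 1), (4, 4), (5, 4)]
```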

fgregg avatar May 20 '20 21:05 fgregg

cc @fjsj, perhaps relevant to your interests.

fgregg avatar May 20 '20 21:05 fgregg

really interesting idea:

https://twitter.com/mccastelle/status/1263272272787320840

fgregg avatar May 21 '20 01:05 fgregg

https://arxiv.org/abs/1802.09478

fgregg avatar May 21 '20 15:05 fgregg

Thanks @fgregg, I'll investigate.

With the current implementation, I had memory errors at this line, where numpy needs to materialize part of the mmap-ed scored_pairs/edgelist array into main memory because of the fancy indexing:

https://github.com/dedupeio/dedupe/blob/20736bb693aeffa0ac6fb1990485fb39f60c0d7b/dedupe/clustering.py#L36

(Edit: Note the fancy/advanced indexing produces a copy in main memory (instead of a view) because that's the default numpy behavior for all arrays (not only mmap-ed ones). See: https://stackoverflow.com/a/18620619/145349)
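
For illustration, a tiny standalone check of that behaviour (not dedupe code):

```python
import numpy as np

# basic slicing of a memmapped array stays backed by the file,
# while fancy indexing materializes a new array in main memory
arr = np.memmap("example.dat", dtype="float64", mode="w+", shape=(1_000_000,))

window = arr[10:20]       # basic slice: a view onto the file-backed buffer
assert np.shares_memory(window, arr)

picked = arr[[1, 5, 7]]   # fancy indexing: a copy living in RAM
assert not np.shares_memory(picked, arr)
```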

I'm not 100% sure, but I think numpy is smart enough not to load the whole scored_pairs into main memory before that line; that line is where things break for me. So, do any of those papers describe a way of running connected components without materializing a large component in main memory?

Also, it's worth noting that most of my problems were solved when I stopped using non-integer IDs; that dramatically reduced the size of scored_pairs. Maybe it's worth adding a warning when sniff_id_type detects a non-integer ID type. Please let me know if you agree and I'll make a small PR.
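
Something like this hypothetical helper is what I have in mind (just a sketch, not tied to the real sniff_id_type signature):

```python
import warnings

def warn_if_non_integer_ids(ids):
    # hypothetical helper, not dedupe's actual sniff_id_type: nudge users
    # toward integer ids, since string ids force a much wider dtype for
    # the scored_pairs array
    if not all(isinstance(record_id, int) for record_id in ids):
        warnings.warn(
            "non-integer record ids detected; consider mapping your ids to "
            "integers to keep scored_pairs small",
            UserWarning,
        )
```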

fjsj avatar May 21 '20 17:05 fjsj

thanks, that's very helpful.

fgregg avatar May 21 '20 18:05 fgregg

@fjsj i think i have some ideas of what would help here. would it be possible for you to send me an edgelist for a job that ran into memory problems (scored_pairs['pairs'])?

fgregg avatar May 22 '20 13:05 fgregg

@fgregg Good to know! I can't share it because it's private data, but I'll try to generate one.

fjsj avatar May 22 '20 15:05 fjsj

don't need the underlying data, just the id pairs.

fgregg avatar May 22 '20 15:05 fgregg

I've addressed the memmap issue, but much of the union-find is still in memory. Let's see how much headroom the memmap fix buys us.

fgregg avatar May 26 '20 00:05 fgregg

superseded by #826

fgregg avatar May 28 '20 14:05 fgregg

this is still the bottleneck, and we haven't really made a lot of progress on #826

here's a potential SQL solution (basically merge-find): https://www.db-fiddle.com/f/2zoZgeAaKsXS9x6EyVCyVo/6, inspired by this paper: https://arxiv.org/abs/1802.09478
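
for reference, a minimal sqlite3 sketch of the general idea (iterative minimum-label propagation), not the exact merge-find scheme from the fiddle or the paper; it assumes an edges(a, b) table stored in both directions:

```python
import sqlite3

# toy symmetric edges table; in practice this would live on disk
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (a INTEGER, b INTEGER)")
pairs = [(1, 2), (2, 3), (4, 5)]
con.executemany("INSERT INTO edges VALUES (?, ?)",
                [(x, y) for a, b in pairs for x, y in ((a, b), (b, a))])

# every vertex starts out labeled with its own id
con.execute("CREATE TABLE labels AS SELECT DISTINCT a AS node, a AS label FROM edges")

while True:
    # pull each label down to the smallest label among the vertex's neighbours;
    # stop once a full pass changes nothing
    changed = con.execute("""
        UPDATE labels SET label = (
            SELECT MIN(l2.label)
            FROM edges AS e JOIN labels AS l2 ON l2.node = e.b
            WHERE e.a = labels.node)
        WHERE label > (
            SELECT MIN(l2.label)
            FROM edges AS e JOIN labels AS l2 ON l2.node = e.b
            WHERE e.a = labels.node)
    """).rowcount
    if changed == 0:
        break

print(con.execute("SELECT node, label FROM labels ORDER BY node").fetchall())
# [(1, 1), (2, 1), (3, 1), (4, 4), (5, 4)]
```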

but @mccc's idea of trying to do this in the memmapped array still makes a lot of sense.

what we do now is an iterated union-find. this means we create a union-find-ish data structure in memory, but it would be more efficient in both time and memory to do the simplest thing (sketched after this list):

  1. start at a random edge, then do DFS or BFS to find all the edges connected to it
  2. label all the visited edges with the same component label
  3. keep track of all the edges visited
  4. once the BFS or DFS terminates, pick an unvisited edge and repeat the procedure
  5. repeat until there are no unvisited edges
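
a minimal in-memory sketch of that procedure (assuming the edge pairs fit in an adjacency dict; the real goal would be to do these adjacency lookups against the memmapped array or disk instead):

```python
from collections import defaultdict, deque

def label_components(edgelist):
    # adjacency: vertex -> list of (neighbour, edge index)
    adjacency = defaultdict(list)
    for i, (a, b) in enumerate(edgelist):
        adjacency[a].append((b, i))
        adjacency[b].append((a, i))

    edge_labels = {}    # edge index -> component label
    visited = set()     # vertices already assigned to a component
    component = 0

    for start in adjacency:
        if start in visited:
            continue
        # BFS from an unvisited vertex; every edge reached gets this label
        queue = deque([start])
        visited.add(start)
        while queue:
            vertex = queue.popleft()
            for neighbour, edge in adjacency[vertex]:
                edge_labels[edge] = component
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append(neighbour)
        component += 1

    return edge_labels

# e.g. label_components([(1, 2), (2, 3), (4, 5)]) -> {0: 0, 1: 0, 2: 1}
```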

fgregg avatar Feb 02 '22 14:02 fgregg

@fgregg so even after avoiding the random access at sub_graph = edgelist[component] (with sub_graph = edgelist[start:stop]), does the bottleneck still exist? Are there any particular offending lines, or is the whole algorithm slow?

fjsj avatar Feb 02 '22 15:02 fjsj

you know, i'm actually not sure! i'm running into a problem with this, but i think it's with a version of the library from before this fix.

fgregg avatar Feb 02 '22 15:02 fgregg

if we have a super block, then https://github.com/dedupeio/dedupe/blob/8bfcfb094e0e38bcc55c2c7e87c9161a057c6e11/dedupe/clustering.py#L145 can still be a problem

fgregg avatar Feb 02 '22 15:02 fgregg

more on mmap connected components:

  • Paper: https://faculty.cc.gatech.edu/~dchau/papers/16-pkdd-mflash.pdf
  • Code: https://github.com/poloclub/m-flash-cpp

fgregg avatar Feb 05 '22 02:02 fgregg

right now our component dict keeps track of edgelist indices; it would probably be much more memory-efficient to track vertices
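
for illustration, a sketch of what tracking vertices could look like (union-find with path halving keyed by vertex, assuming the edgelist can be scanned twice; edges pick up their labels on a second sequential pass):

```python
def connected_components_by_vertex(edgelist):
    parent = {}   # vertex -> parent vertex; memory is O(vertices), not O(edges)

    def find(vertex):
        parent.setdefault(vertex, vertex)
        while parent[vertex] != vertex:
            parent[vertex] = parent[parent[vertex]]   # path halving
            vertex = parent[vertex]
        return vertex

    # first pass: union the endpoints of every edge
    for a, b in edgelist:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # second sequential pass: each edge takes the component of one endpoint,
    # so we never need random access from a vertex back to its edges
    return [(a, b, find(a)) for a, b in edgelist]
```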

fgregg avatar Feb 14 '22 14:02 fgregg

but how efficient would it be to grab the edges from the vertices? Seems like a memory-locality trade-off

fjsj avatar Feb 14 '22 16:02 fjsj

the graphframes library (Spark-based) uses this algorithm. Related issue here: https://github.com/graphframes/graphframes/issues/117

fjsj avatar Feb 16 '22 12:02 fjsj