
disk-based connected components

fgregg opened this issue 4 years ago • 18 comments

The usual scaling bottleneck is the connected components step in clustering: the current algorithm loads the entire edgelist into memory.

We need to move this to disk.

Things worth exploring:

  1. connected components in sqlite (a recursive-CTE sketch follows this list): https://stackoverflow.com/questions/33465859/a-number-of-connected-components-of-a-graph-in-sql

  2. look for or implement a specialized, lightweight graph store like https://link.springer.com/chapter/10.1007/978-3-319-01863-8_20
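
For option 1, a minimal sqlite3 sketch of the recursive-CTE idea from that Stack Overflow answer (my own illustration, not dedupe code; it assumes an edges(a, b) table stored in both directions and computes a full transitive closure, so it probably only works for modest graphs):

```python
import sqlite3

# toy edges table, stored symmetrically so a single recursive term suffices
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (a INTEGER, b INTEGER)")
pairs = [(1, 2), (2, 3), (4, 5)]
con.executemany("INSERT INTO edges VALUES (?, ?)",
                [(x, y) for a, b in pairs for x, y in ((a, b), (b, a))])

# each node's component is the smallest node id it can reach
components = con.execute("""
    WITH RECURSIVE reach(node, reachable) AS (
        SELECT a, a FROM edges
        UNION
        SELECT r.node, e.b
        FROM reach AS r JOIN edges AS e ON e.a = r.reachable
    )
    SELECT node, MIN(reachable) AS component
    FROM reach
    GROUP BY node
    ORDER BY node
""").fetchall()
print(components)  # [(1, 1), (2, 1), (3, 1), (4, 4), (5, 4)]
```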

fgregg avatar May 20 '20 21:05 fgregg

cc @fjsj, perhaps relevant to your interests.

fgregg avatar May 20 '20 21:05 fgregg

really interesting idea:

https://twitter.com/mccastelle/status/1263272272787320840

fgregg avatar May 21 '20 01:05 fgregg

https://arxiv.org/abs/1802.09478

fgregg avatar May 21 '20 15:05 fgregg

Thanks @fgregg, I'll investigate.

With the current implementation, I had memory errors at this line, where numpy needs to materialize part of the mmap-ed scored_pairs/edgelist array into main memory because of the fancy indexing:

https://github.com/dedupeio/dedupe/blob/20736bb693aeffa0ac6fb1990485fb39f60c0d7b/dedupe/clustering.py#L36

(Edit: Note the fancy/advanced indexing produces a copy in main memory (instead of a view) because that's the default numpy behavior for all arrays (not only mmap-ed ones). See: https://stackoverflow.com/a/18620619/145349)
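
For illustration, a tiny standalone check of that behaviour (not dedupe code):

```python
import numpy as np

# basic slicing of a memmapped array stays backed by the file,
# while fancy indexing materializes a new array in main memory
arr = np.memmap("example.dat", dtype="float64", mode="w+", shape=(1_000_000,))

window = arr[10:20]       # basic slice: a view onto the file-backed buffer
assert np.shares_memory(window, arr)

picked = arr[[1, 5, 7]]   # fancy indexing: a copy living in RAM
assert not np.shares_memory(picked, arr)
```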

I'm not 100% sure, but I think numpy is smart enough not to load the whole scored_pairs into main memory before that line; that line is where things break for me. So, do any of those papers describe a way of running connected components without materializing a large component in main memory?

Also, it's worth noting that most of my problems were solved when I stopped using non-integer IDs; that dramatically reduced the size of scored_pairs. Maybe it's worth adding a warning when sniff_id_type detects a non-integer ID type. Please let me know if you agree and I'll make a small PR.
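
Something like this hypothetical helper is what I have in mind (just a sketch, not tied to the real sniff_id_type signature):

```python
import warnings

def warn_if_non_integer_ids(ids):
    # hypothetical helper, not dedupe's actual sniff_id_type: nudge users
    # toward integer ids, since string ids force a much wider dtype for
    # the scored_pairs array
    if not all(isinstance(record_id, int) for record_id in ids):
        warnings.warn(
            "non-integer record ids detected; consider mapping your ids to "
            "integers to keep scored_pairs small",
            UserWarning,
        )
```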

fjsj avatar May 21 '20 17:05 fjsj

thanks, that's very helpful.

fgregg avatar May 21 '20 18:05 fgregg

@fjsj i think i have some ideas of what would help here. would it be possible for you to send me an edgelist for a job that ran into memory problems (scored_pairs['pairs'])?

fgregg avatar May 22 '20 13:05 fgregg

@fgregg Good to know! I can't share it because it's private data, but I'll try to generate one.

fjsj avatar May 22 '20 15:05 fjsj

don't need the underlying data, just the id pairs.

fgregg avatar May 22 '20 15:05 fgregg

I've addressed the memmap issue, but much of the union-find is still in memory. Let's see how much headroom the memmap fix buys us.

fgregg avatar May 26 '20 00:05 fgregg

superseded by #826

fgregg avatar May 28 '20 14:05 fgregg

this is still the bottleneck, and we haven't really made a lot of progress on #826

here's a potential SQL solution (basically merge-find): https://www.db-fiddle.com/f/2zoZgeAaKsXS9x6EyVCyVo/6, inspired by this paper: https://arxiv.org/abs/1802.09478
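
for reference, a minimal sqlite3 sketch of the general idea (iterative minimum-label propagation), not the exact merge-find scheme from the fiddle or the paper; it assumes an edges(a, b) table stored in both directions:

```python
import sqlite3

# toy symmetric edges table; in practice this would live on disk
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (a INTEGER, b INTEGER)")
pairs = [(1, 2), (2, 3), (4, 5)]
con.executemany("INSERT INTO edges VALUES (?, ?)",
                [(x, y) for a, b in pairs for x, y in ((a, b), (b, a))])

# every vertex starts out labeled with its own id
con.execute("CREATE TABLE labels AS SELECT DISTINCT a AS node, a AS label FROM edges")

while True:
    # pull each label down to the smallest label among the vertex's neighbours;
    # stop once a full pass changes nothing
    changed = con.execute("""
        UPDATE labels SET label = (
            SELECT MIN(l2.label)
            FROM edges AS e JOIN labels AS l2 ON l2.node = e.b
            WHERE e.a = labels.node)
        WHERE label > (
            SELECT MIN(l2.label)
            FROM edges AS e JOIN labels AS l2 ON l2.node = e.b
            WHERE e.a = labels.node)
    """).rowcount
    if changed == 0:
        break

print(con.execute("SELECT node, label FROM labels ORDER BY node").fetchall())
# [(1, 1), (2, 1), (3, 1), (4, 4), (5, 4)]
```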

but @mccc's idea of trying to do this in the memmapped array still makes a lot of sense.

what we do now is an iterated union-find. this means we create a union-find-ish data structure in memory, but it would be more efficient in both time and memory to do the simplest thing (sketched after this list):

  1. start at a random edge, then do DFS or BFS to find all the edges connected to it
  2. label all the visited edges with the same component label
  3. keep track of all the edges visited
  4. once the BFS or DFS terminates, pick an unvisited edge and repeat the procedure
  5. repeat until there are no unvisited edges
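
a minimal in-memory sketch of that procedure (assuming the edge pairs fit in an adjacency dict; the real goal would be to do these adjacency lookups against the memmapped array or disk instead):

```python
from collections import defaultdict, deque

def label_components(edgelist):
    # adjacency: vertex -> list of (neighbour, edge index)
    adjacency = defaultdict(list)
    for i, (a, b) in enumerate(edgelist):
        adjacency[a].append((b, i))
        adjacency[b].append((a, i))

    edge_labels = {}    # edge index -> component label
    visited = set()     # vertices already assigned to a component
    component = 0

    for start in adjacency:
        if start in visited:
            continue
        # BFS from an unvisited vertex; every edge reached gets this label
        queue = deque([start])
        visited.add(start)
        while queue:
            vertex = queue.popleft()
            for neighbour, edge in adjacency[vertex]:
                edge_labels[edge] = component
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append(neighbour)
        component += 1

    return edge_labels

# e.g. label_components([(1, 2), (2, 3), (4, 5)]) -> {0: 0, 1: 0, 2: 1}
```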

fgregg avatar Feb 02 '22 14:02 fgregg

@fgregg so even after avoiding the random access at sub_graph = edgelist[component] (with sub_graph = edgelist[start:stop]), does the bottleneck still exist? Are there any particular offending lines, or is the whole algorithm slow?

fjsj avatar Feb 02 '22 15:02 fjsj

you know, i'm actually not sure! i'm running into a problem with this, but i think it's with a version of the library from before this fix.

fgregg avatar Feb 02 '22 15:02 fgregg

if we have a super block, then https://github.com/dedupeio/dedupe/blob/8bfcfb094e0e38bcc55c2c7e87c9161a057c6e11/dedupe/clustering.py#L145 can still be a problem

fgregg avatar Feb 02 '22 15:02 fgregg

more on mmap connected components:

  • Paper: https://faculty.cc.gatech.edu/~dchau/papers/16-pkdd-mflash.pdf
  • Code: https://github.com/poloclub/m-flash-cpp

fgregg avatar Feb 05 '22 02:02 fgregg

right now our component dict keeps track of edgelist indices; it would probably be much more memory-efficient to track vertices
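
for illustration, a sketch of what tracking vertices could look like (union-find with path halving keyed by vertex, assuming the edgelist can be scanned twice; edges pick up their labels on a second sequential pass):

```python
def connected_components_by_vertex(edgelist):
    parent = {}   # vertex -> parent vertex; memory is O(vertices), not O(edges)

    def find(vertex):
        parent.setdefault(vertex, vertex)
        while parent[vertex] != vertex:
            parent[vertex] = parent[parent[vertex]]   # path halving
            vertex = parent[vertex]
        return vertex

    # first pass: union the endpoints of every edge
    for a, b in edgelist:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # second sequential pass: each edge takes the component of one endpoint,
    # so we never need random access from a vertex back to its edges
    return [(a, b, find(a)) for a, b in edgelist]
```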

fgregg avatar Feb 14 '22 14:02 fgregg

but how efficient would it be to grab the edges from the vertices? Seems like a memory-locality trade-off

fjsj avatar Feb 14 '22 16:02 fjsj

the graphframes library (Spark-based) uses this algorithm. Related issue here: https://github.com/graphframes/graphframes/issues/117

fjsj avatar Feb 16 '22 12:02 fjsj