Forest Gregg

Results 351 comments of Forest Gregg

@fjsj i think i have some ideas of what would help here. would it be possible for you to send me an edgelist for a job that ran into memory problems (`scored_pairs['pairs']`)?

don't need the underlying data, just the id pairs.

I've addressed the memmap issue, but much of union-find is still in-memory. Let's see how much headroom the memmap fix buys us.
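
For illustration only, here is a minimal sketch of what moving more of union-find out of memory could look like: the parent array lives in a disk-backed `np.memmap` instead of a Python dict. This is not dedupe's actual clustering code; `connected_components`, its parameters, and the assumption that vertices are already numbered `0..n_vertices-1` are all hypothetical.

```python
import os
import tempfile

import numpy as np


def connected_components(edges, n_vertices, path):
    """Label every vertex with the smallest vertex id in its component.

    Hypothetical helper: the union-find parent array is a disk-backed
    np.memmap, so the OS can page it in and out as needed.
    """
    parent = np.memmap(path, dtype=np.int64, mode="w+", shape=(n_vertices,))
    parent[:] = np.arange(n_vertices)

    def find(x):
        # First pass: walk up to the root.
        root = x
        while parent[root] != root:
            root = parent[root]
        # Second pass: point every node on the path directly at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return int(root)

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root to the smaller one so labels are stable.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return [find(v) for v in range(n_vertices)]


# Usage: two components {0, 1, 2} and {3, 4}, plus the isolated vertex 5.
with tempfile.TemporaryDirectory() as tmp:
    labels = connected_components(
        [(0, 1), (1, 2), (3, 4)], 6, os.path.join(tmp, "parent.dat")
    )
```

The edge iterable itself could also be a memmapped array; the sketch only assumes it yields `(a, b)` integer pairs.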

this is still the bottleneck, and we haven't really made a lot of progress on #826. here's a potential SQL solution (basically merge-find: https://www.db-fiddle.com/f/2zoZgeAaKsXS9x6EyVCyVo/6, inspired by this paper: https://arxiv.org/abs/1802.09478), but...
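
To make the idea of finding components inside a database concrete, here is a rough sketch using Python's `sqlite3`. It is not the query from the linked fiddle and not the algorithm from the paper; it uses plain minimum-label propagation, which is simpler but can take many passes on long chains. The table names and the `sql_components` function are made up for this example.

```python
import sqlite3


def sql_components(pairs):
    """Return {vertex: label}, where label is the smallest id in the component."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edges (a INTEGER, b INTEGER)")
    # Store each edge in both directions so labels can flow either way.
    con.executemany(
        "INSERT INTO edges VALUES (?, ?)",
        [(a, b) for a, b in pairs] + [(b, a) for a, b in pairs],
    )
    con.execute("CREATE INDEX edges_a ON edges (a)")
    con.execute("CREATE TABLE labels (vertex INTEGER PRIMARY KEY, label INTEGER)")
    # Every vertex starts out labelled with its own id.
    con.execute("INSERT INTO labels SELECT DISTINCT a, a FROM edges")

    lowest_neighbour = """
        SELECT MIN(n.label)
        FROM edges e JOIN labels n ON n.vertex = e.b
        WHERE e.a = labels.vertex
    """
    while True:
        # Lower each label to the smallest label among the vertex's neighbours.
        cur = con.execute(
            f"UPDATE labels SET label = ({lowest_neighbour}) "
            f"WHERE label > ({lowest_neighbour})"
        )
        if cur.rowcount == 0:  # nothing changed, so labels have converged
            break

    result = dict(con.execute("SELECT vertex, label FROM labels"))
    con.close()
    return result


components = sql_components([(1, 2), (2, 3), (10, 11)])
```

With an on-disk database instead of `:memory:`, the working set is managed by the database rather than by Python objects, which is the appeal of this direction.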

you know! i think i'm not sure! i'm running into a problem with this, but i think it's with a version of the library that's before this fix.

if we have a super block, then https://github.com/dedupeio/dedupe/blob/8bfcfb094e0e38bcc55c2c7e87c9161a057c6e11/dedupe/clustering.py#L145 can still be a problem

more on mmap connected components:

* Paper: https://faculty.cc.gatech.edu/~dchau/papers/16-pkdd-mflash.pdf
* https://github.com/poloclub/m-flash-cpp

right now our component dict keeps track of edgelist indices; it would probably be much more memory efficient to track vertices.
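
A rough sketch of the vertex-tracking idea, not the current dedupe code: each component is stored as a set of vertices rather than as a list of edge indices. A component with k vertices holds at most k entries this way, whereas it can contain up to k(k-1)/2 edges. `components_by_vertex` is a hypothetical name.

```python
from collections import defaultdict


def components_by_vertex(pairs):
    """Group vertices into components, storing vertices instead of edge indices."""
    root = {}

    def find(v):
        root.setdefault(v, v)
        while root[v] != v:
            root[v] = root[root[v]]  # path halving keeps the trees shallow
            v = root[v]
        return v

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            root[root_b] = root_a

    # One set of vertices per component, keyed by the component's root.
    components = defaultdict(set)
    for v in root:
        components[find(v)].add(v)
    return list(components.values())


# A fully connected triple has 3 edges but only 3 vertices to remember.
groups = components_by_vertex([("a", "b"), ("b", "c"), ("a", "c"), ("x", "y")])
```

If the edges of a component are needed later, they can be recovered by filtering the edgelist on membership in the vertex set.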