bigbang icon indicating copy to clipboard operation
bigbang copied to clipboard

Fix resolve_sender_entities to run in O(n log n)

Open sbenthall opened this issue 10 years ago • 1 comments

Since running in quadratic time has shown to be cumbersome, there's a hack in resolve_sender_entities that cuts running time down to O(.5 n(n+1))...which is still quadratic time and so isn't much of a help.

With a slight modification it should be possible to adjust this to run in O(n log n) time, which would be much better.

https://github.com/sbenthall/bigbang/blob/master/bigbang/process.py#L99

The way to do this would be to have the range around which the comparison window shrink exponentially.

sbenthall avatar Jun 02 '15 17:06 sbenthall

I would expect effective entity resolution will need to be O(n^2), because I don't think the similarity of names can be reduced to a single dimension along which you can sort. Currently we're using an edit distance measure, and even that doesn't have the sort of transitive properties that allow sort algorithms to be n log n, and subsequent comparison algorithms could be very different. (For example, I think we should compare the email address part and the name part separately and combine a distance score between them.)

Maybe I'm just pessimistic, but I'm guessing we'll have to accept some slowness and can close this issue.

npdoty avatar Sep 14 '17 01:09 npdoty