bigbang
bigbang copied to clipboard
Fix resolve_sender_entities to run in O(n log n)
Since running in quadratic time has shown to be cumbersome, there's a hack in resolve_sender_entities that cuts running time down to O(.5 n(n+1))...which is still quadratic time and so isn't much of a help.
With a slight modification it should be possible to adjust this to run in O(n log n) time, which would be much better.
https://github.com/sbenthall/bigbang/blob/master/bigbang/process.py#L99
The way to do this would be to have the range around which the comparison window shrink exponentially.
I would expect effective entity resolution will need to be O(n^2), because I don't think the similarity of names can be reduced to a single dimension along which you can sort. Currently we're using an edit distance measure, and even that doesn't have the sort of transitive properties that allow sort algorithms to be n log n, and subsequent comparison algorithms could be very different. (For example, I think we should compare the email address part and the name part separately and combine a distance score between them.)
Maybe I'm just pessimistic, but I'm guessing we'll have to accept some slowness and can close this issue.