consolidate entity resolution scripts into single module

Open sbenthall opened this issue 8 years ago • 8 comments

There are multiple "entity resolution" functions for mailing lists currently in BigBang.

These should be consolidated into a single module with the relevant differences documented.

sbenthall avatar May 25 '16 21:05 sbenthall

If I recall correctly, I wound up building my own entity resolver for performance reasons. Your (awesome) entity resolution requires an n^2 Levenshtein distance matrix which was prohibitively expensive for some longer lists (and combinations of lists, which is what I've spent some time looking at). https://sudoroom.org/pipermail/bigbang-dev/2016-May/000011.html

Out of curiosity, how big was the combined list so that O(n^2) was prohibitive? Naively, I would assume that entity resolution would necessarily require comparing every entity to every other entity, unless we reckon that a particular sorting on entities will put all duplicate entities near one another.

It could be that visualizing the Levenshtein distance matrix is way more computationally intensive than calculating it -- the operation itself should be super cheap, it's basically just a string comparison, yeah?
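
For a rough sense of the cost being discussed, here is a minimal sketch of a full pairwise comparison. It is not BigBang's code; `levenshtein` and `pairwise_distances` are hypothetical names used only for illustration. Each single comparison takes time proportional to the product of the two string lengths, and a list of n distinct names yields n(n-1)/2 pairs, so 10,000 names already mean 49,995,000 comparisons.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance using the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute
            ))
        previous = current
    return previous[-1]


def pairwise_distances(names):
    """Distance for every unordered pair of distinct names: n(n-1)/2 entries."""
    return {(x, y): levenshtein(x, y) for x, y in combinations(names, 2)}
```

The per-pair cost is small, but the number of pairs grows quadratically, which is where the expense for long or combined lists would come from.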

npdoty avatar May 25 '16 21:05 npdoty

@npdoty I wonder if you'd be willing to look this over and see whether you think it's still appropriate to try to include this change in the 0.2 release.

sbenthall avatar Oct 23 '16 22:10 sbenthall

To summarize:

  • @npdoty authored a pair-wise distance matrix (using Levenshtein distance) and a Jupyter notebook that visualizes the matrix and consolidates entities individually
  • @sbenthall authored process.resolve_sender_entities, which uses the same distance metric, but tries not to run it on every pair (just on a fraction that are closest to it when sorted alphabetically), and then does a nice little connected graph reduction
  • @Aryan-Barbarian authored a separate module for common entity resolution fixes on email addresses in Git commits
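
To make the second approach concrete, here is a minimal sketch of the idea as described above: sort the names, compare each one only to its next few alphabetical neighbours, link pairs within a threshold, and merge linked names into connected groups. This is not the actual `process.resolve_sender_entities` implementation; `resolve_entities`, `window`, and `threshold` are hypothetical names chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def resolve_entities(names, window=2, threshold=1):
    """Group names whose close alphabetical neighbours are within `threshold`."""
    ordered = sorted(set(names))
    parent = {name: name for name in ordered}

    def find(name):
        # Follow parent links to the group's representative (union-find).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, name in enumerate(ordered):
        # Only the next `window` names in sorted order are compared.
        for other in ordered[i + 1:i + 1 + window]:
            if levenshtein(name, other) <= threshold:
                parent[find(other)] = find(name)

    groups = {}
    for name in ordered:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())
```

In this sketch, `resolve_entities(["ann lee", "bob", "xann lee"], window=1)` leaves all three names separate even though "ann lee" and "xann lee" are one edit apart, because they are not adjacent after sorting; with `window=2` they are merged. That is the nearby-in-the-alphabet limitation in miniature.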

Regarding the first two, I've moved my code to use the resolve_sender_entities method, because I liked the graph/partitioning for consolidation. I parameterized that function to allow for a configurable threshold distance greater than 0; you can see that change in #281. However, I didn't actually notice the only-testing-against-nearby-in-the-alphabet limitation. I think we should remove that, or make it a configurable option if it's important for performance in some use cases for @sbenthall and others. @sbenthall if you think it's important, I'll take this as an action item to make that an option; if not, then I'll just submit a PR that removes that.

I don't know about merging the mailing list entity resolution and the git commit entity resolution. While some parts might be in common, the approach and the range of data we see is quite different, so I think these will need to remain separate for some time.

npdoty avatar Sep 14 '17 01:09 npdoty

This seems complex enough and has been inactive for long enough that it should not block the 0.2 release. I'm punting it to 0.3.

sbenthall avatar Apr 19 '18 15:04 sbenthall

I think my goal for this issue is just to consolidate the resolve_sender_entities method to do pairwise distance analysis with configurable cutoff (which I should have code for already) and remove alternatives.

In a later issue, I think there are very substantial improvements to be made. There are known algorithms (for example: lower-case, remove punctuation, split into tokens, alphabetize the tokens, then compare distance; compare email address and name fields separately) that could do a better job. And I've been working on a manual workflow which could also be a generally useful feature.
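
As an illustration of the normalization steps listed above (lower-case, remove punctuation, split into tokens, alphabetize the tokens), here is a minimal sketch. `normalize_name` is a hypothetical helper, not an existing BigBang function, and replacing punctuation with spaces rather than deleting it is an assumption made here so that "Last,First" still splits into two tokens.

```python
import string

# Map every ASCII punctuation character to a space.
_PUNCTUATION_TO_SPACE = str.maketrans(
    string.punctuation, " " * len(string.punctuation)
)


def normalize_name(name: str) -> str:
    """Lower-case, drop punctuation, tokenize, and sort the tokens."""
    cleaned = name.lower().translate(_PUNCTUATION_TO_SPACE)
    tokens = sorted(cleaned.split())
    return " ".join(tokens)
```

With this, "Doty, Nick" and "nick doty" both normalize to the same key, so a distance comparison would only need to run on names that still differ after normalization.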

npdoty avatar Apr 26 '18 21:04 npdoty

An update: in this notebook on tenure in IETF WGs, I did entity resolution a little differently.

  1. I've created new parse functions for normalizing email address and creating a normalized (tokenized and then lexicographically ordered) version of the name.
  2. I'm using pandas groupby and agg methods to do the actual combination of the data, which is useful because different columns need to be aggregated in different ways, depending on what the data is. In that particular notebook, I'm not just working on activity frames (which are day-by-day message counts) but on beginning and end dates. So we might need the entity resolution functionality to be flexible enough to handle the combination of the rows in different ways depending on what the dataframe actually is.
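
A minimal sketch of the second point, assuming a dataframe that already has a normalized entity key: message counts are summed while start dates take the minimum and end dates the maximum. The column names (`entity`, `messages`, `first_seen`, `last_seen`), the function name `combine_rows`, and the toy rows are all made up for illustration and are not taken from the notebook.

```python
import pandas as pd


def combine_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Merge rows that share an entity key, aggregating each column its own way."""
    return frame.groupby("entity").agg(
        messages=("messages", "sum"),       # counts add up
        first_seen=("first_seen", "min"),   # earliest start date wins
        last_seen=("last_seen", "max"),     # latest end date wins
    )


# Toy data: two rows resolve to the same (hypothetical) entity key.
example = pd.DataFrame({
    "entity": ["alice example", "alice example", "bob example"],
    "messages": [3, 5, 2],
    "first_seen": pd.to_datetime(["2015-01-10", "2014-06-01", "2016-03-05"]),
    "last_seen": pd.to_datetime(["2016-02-01", "2015-12-31", "2016-04-01"]),
})
resolved = combine_rows(example)
```

Passing the aggregation mapping in as a parameter instead of hard-coding it would be one way to get the flexibility described above for dataframes with different columns.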

npdoty avatar Feb 07 '19 18:02 npdoty

This issue is going to come up again in the context of parsing affiliation data (see #367) because organization names also need entity resolution (e.g. 'Cisco', 'Cisco Systems', 'cisco', etc.).

sbenthall avatar Feb 26 '20 17:02 sbenthall

I was wondering whether this might be helpful to filter out personal contributions, and this to find affiliations (as suggested during the IAB-AID workshop).

Christovis avatar Nov 29 '21 17:11 Christovis