
CorporationNameOrganization not identified as 'Corporation' in tag function

Open stevetracvc opened this issue 3 years ago • 0 comments

I've been chasing down a bug I'm encountering in Dedupe (which uses probablepeople) and I traced it to this line:

https://github.com/datamade/probablepeople/blob/672075cb23a86321d35b3b407b3f2d5e2dcadfa4/probablepeople/init.py#L141

For some reason, when using the "company" model/tagger, the parse function labels the name "12society" as "CorporationNameOrganization". I don't know the details of the trained model, but it emits labels other than "CorporationName", which is a problem: the tagged OrderedDict then has no "CorporationName" key, so the name type comes back as 'Person', even though it is definitely not a person!

I added this to the conditional to set the name_type to 'Corporation':

any("Corporation" in s for s in tagged)

I'd generate a PR but it's ugly and I don't know what other outputs to expect from the parse function. Any help is appreciated!
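
To make the idea concrete, here is a minimal standalone sketch of that check. It is not the library's actual code: `infer_name_type` is a hypothetical helper name, it only distinguishes 'Corporation' from 'Person', and it assumes any label containing the substring "Corporation" should count as corporate. The sample inputs are the tagged results from this report.

```python
from collections import OrderedDict


def infer_name_type(tagged):
    """Hypothetical helper: pick a name type from the tagged label keys."""
    # Substring match, so compound labels such as
    # 'CorporationNameOrganization' count as corporate, not only the
    # exact key 'CorporationName'.
    if any("Corporation" in label for label in tagged):
        return "Corporation"
    return "Person"


# Tagged results copied from this report.
company_tagged = OrderedDict([("CorporationNameOrganization", "12society")])
person_tagged = OrderedDict([("Surname", "12society")])

print(infer_name_type(company_tagged))
print(infer_name_type(person_tagged))
```

A substring test is deliberately broad; if the model can emit labels that contain "Corporation" but should not be treated as corporate, an explicit set of accepted labels would be safer.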

import probablepeople

probablepeople.tag("12society")
Out[2]: (OrderedDict([('Surname', '12society')]), 'Person')

probablepeople.tag("12society", "company")
Out[3]: (OrderedDict([('CorporationNameOrganization', '12society')]), 'Person')

# I then add my 'any(...)' code from above:

probablepeople.tag("12society", "company")
Out[4]: (OrderedDict([('CorporationNameOrganization', '12society')]), 'Corporation')

"P2 Science" is another name that pops up as CorporationNameOrganization

stevetracvc, Mar 20 '21 16:03