probablepeople: incorrectly labeled provider_name when it's a mix of person name and corporation_name

Hi, I have a database column called provider_name which can be a corporation name like 'Midtown Dental Miami', the name of a dentist like 'Alek Klebaner DDS', or a facility where 2 people work, in which case the name will be something like 'Dr. Jonathan Chang & Dr. Steven D. Chan'.

I want to use probablepeople to distinguish between the 3 cases.

I get RepeatedLabelError or misclassifications for the following examples:

1)

 name_str = 'Alek Klebaner DDS'
 pp.tag(name_str)
 Out[11]: (OrderedDict([('CorporationName', 'Alek Klebaner DDS')]), 'Corporation')

2)

 Collins Harrell, DMD - San Clemente Smiles
 (OrderedDict([('CorporationName', 'Collins Harrell, DMD San Clemente Smiles')]), 'Corporation')

3)

 name_str = 'Sasan Ahmadiyar DDS and Associates - Stafford'
 RepeatedLabelError:

 ERROR: Unable to tag this string because more than one area of the string has the same label

 ORIGINAL STRING: Sasan Ahmadiyar DDS and Associates - Stafford
 PARSED TOKENS: [('Sasan', 'CorporationName'), ('Ahmadiyar', 'CorporationName'), ('DDS', 'CorporationName'), ('and', 'CorporationNameAndCompany'), ('Associates', 'CorporationNameAndCompany'), ('Stafford', 'CorporationName')]
 UNCERTAIN LABEL: CorporationName

4)

 name_str = 'Dr. Jonathan Chang & Dr. Steven D. Chan'
 (OrderedDict([('PrefixOther', 'Dr.'), ('GivenName', 'Jonathan'), ('MiddleName', 'Chang'), ('And', '&'), ('SecondPrefixOther', 'Dr.'), ('SecondGivenName', 'Steven'), ('MiddleInitial', 'D.'), ('Surname', 'Chan')]), 'Household')

This last example gets almost everything right except that it classifies last_name of Jonathan Chang as middle_name.
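To make the RepeatedLabelError condition concrete ("more than one area of the string has the same label"), here is a minimal sketch that checks a list of (token, label) pairs for labels occurring in more than one contiguous run. It only illustrates the condition described in the error message and is not probablepeople's own implementation; the helper name `labels_in_multiple_runs` is made up for this example.

```python
from itertools import groupby


def labels_in_multiple_runs(parsed_tokens):
    """Return labels that occur in more than one contiguous run of tokens.

    parsed_tokens is a list of (token, label) pairs, in the same shape as
    the PARSED TOKENS shown in the error message.
    """
    seen = set()
    repeated = []
    # groupby yields one group per contiguous run of identical labels.
    for label, _run in groupby(parsed_tokens, key=lambda pair: pair[1]):
        if label in seen and label not in repeated:
            repeated.append(label)
        seen.add(label)
    return repeated


# Parsed tokens copied from the error output for example 3.
parsed = [
    ('Sasan', 'CorporationName'),
    ('Ahmadiyar', 'CorporationName'),
    ('DDS', 'CorporationName'),
    ('and', 'CorporationNameAndCompany'),
    ('Associates', 'CorporationNameAndCompany'),
    ('Stafford', 'CorporationName'),
]

print(labels_in_multiple_runs(parsed))  # ['CorporationName']
```

Here 'CorporationName' is interrupted by 'CorporationNameAndCompany' and then resumes at 'Stafford', which matches the UNCERTAIN LABEL reported in the error.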

Dec 11 '15 01:12 suhassatish

You can use this dataset to train your model to do better on this kind of mixed (noisy) data, which is quite common in reality: dental_provider_names.txt

Dec 11 '15 02:12 suhassatish

1. is clearly an error
How do you think that 2. should be labeled?
How do you think that 3. should be labeled?

Dec 11 '15 05:12 fgregg

Thanks @fgregg for your quick response.

I'd classify this as below:

 Collins Harrell, DMD - San Clemente Smiles
 (OrderedDict([('GivenName', 'Collins'), ('Surname', 'Harrell'), ('SuffixOther', 'DMD'), ('CorporationName', 'San Clemente Smiles')]), 'Person')

Since it's a single person's clinic, I'd classify this as a person. On the other hand, if there are 2 or more persons in the name at a facility/group, I'd classify it as a corporation.

 name_str = 'Sasan Ahmadiyar DDS and Associates - Stafford'
 (OrderedDict([('GivenName', 'Sasan'), ('Surname', 'Ahmadiyar'), ('SuffixOther', 'DDS'), ('CorporationNameAndCompany', 'and Associates'), ('CorporationNameBranchIdentifier', 'Stafford')]), 'Person')

Since it's a single person's name appearing in the clinic name, I'd classify it as a person. If you think it'd be easier to classify it as a corporation, I am fine with that too.
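The rule of thumb described here (one named individual means a person, two or more means a corporation) could be sketched as a post-processing step over (token, label) pairs. This is only an illustration of the proposed rule, not something probablepeople does; the function name `classify_provider` and the choice of 'GivenName' and 'SecondGivenName' as the labels that count individuals are assumptions for this example.

```python
def classify_provider(labeled_tokens):
    """Apply the rule of thumb: one individual -> 'Person', two or more -> 'Corporation'.

    labeled_tokens is a list of (token, label) pairs.
    """
    # Assumption: each named individual contributes exactly one given-name label.
    given_name_labels = {'GivenName', 'SecondGivenName'}
    people = sum(1 for _token, label in labeled_tokens if label in given_name_labels)
    has_corporation_part = any(
        label.startswith('Corporation') for _token, label in labeled_tokens
    )

    if people >= 2:
        return 'Corporation'
    if people == 1:
        return 'Person'
    return 'Corporation' if has_corporation_part else 'Unknown'


# The labeling proposed above for the single-dentist clinic.
single_dentist = [
    ('Collins', 'GivenName'), ('Harrell,', 'Surname'), ('DMD', 'SuffixOther'),
    ('San', 'CorporationName'), ('Clemente', 'CorporationName'),
    ('Smiles', 'CorporationName'),
]
print(classify_provider(single_dentist))  # Person
```

A name with no individual in it, such as 'Midtown Dental Miami' labeled entirely as CorporationName, falls through to 'Corporation'.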

Dec 11 '15 07:12 suhassatish

Seems to be a common pattern in your data, but is not a pattern I've really seen anywhere else. There are different entities here, a business and a person. What is the relation between the two? Is the person the owner of the business? Is there not a corporation of any type, and the business name is just a fictitious business name?

For 3., is Stafford really a branch identifier? Is there another location of "Sasan Ahmadiyar DDS And Associates"?

Dec 11 '15 14:12 fgregg

Sorry for the delayed response.

For 1), Collins Harrell, DMD - San Clemente Smiles: we have to resolve this entity against our database of providers, and we won't know beforehand whether it is going to be stored as a person or a corporation. We'd have to split it and classify it (using probablepeople) and then use the output to match on certain fields in our directory. In this particular case, we found a match to a person, "Harrell, Collins R.". It is not clear what the relation between the two is; most likely Collins Harrell practices at San Clemente Smiles, which seems to be a clinic name. In any case, it would be best to classify cases like this as a "person" while also attaching a "CorporationName" tag, as shown in my example above.

Maybe it makes sense to classify it as a corporation instead of an individual while also attaching the following tags: ('GivenName', 'Sasan'), ('Surname', 'Ahmadiyar'), ('SuffixOther', 'DDS'), ('CorporationNameAndCompany', 'Sasan Ahmadiyar DDS and Associates')

Dec 15 '15 02:12 suhassatish

@fgregg - I am trying to add additional labeled examples to the XML training file. I am looking at the instructions for the console labeler of parserator here http://parserator.readthedocs.org/en/latest/ but it expects me to create a new parser. Can I instead leverage the existing parser from probablepeople and generate an improved crfsuite model?

Dec 18 '15 18:12 suhassatish

yes - instructions here: https://github.com/datamade/probablepeople#for-the-nerds
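For anyone preparing labeled examples by hand, here is a rough sketch of turning (token, label) pairs into one XML element per name. The wrapper tag 'Name' and the one-child-element-per-token layout are assumptions about the training file; check them against the labeled XML in the repository and the linked instructions before using the output. The helper name `to_training_xml` is made up for this example.

```python
import xml.etree.ElementTree as ET


def to_training_xml(labeled_tokens, parent_tag='Name'):
    """Serialize (token, label) pairs as one XML element.

    Assumed layout: <Name><Label>token</Label> <Label>token</Label></Name>
    """
    parent = ET.Element(parent_tag)
    for token, label in labeled_tokens:
        child = ET.SubElement(parent, label)
        child.text = token
        child.tail = ' '  # single space between tokens
    if len(parent):
        parent[-1].tail = None  # no trailing space after the last token
    return ET.tostring(parent, encoding='unicode')


example = [('Collins', 'GivenName'), ('Harrell,', 'Surname'), ('DMD', 'SuffixOther')]
print(to_training_xml(example))
```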

Dec 18 '15 19:12 cathydeng

Thank you!

Dec 18 '15 20:12 suhassatish

I have encountered the same kind of issue. I am looking at Cal-Access data (from the CA state database of campaign finance and lobbying). An example that I found is:

 pp.parse('Resnick, Stewart A. and Affiliated Entities')
 [
     ('Resnick,', 'CorporationName'), 
     ('Stewart', 'CorporationName'), 
     ('A.', 'CorporationName'), 
     ('and', 'CorporationName'), 
     ('Affiliated', 'CorporationName'), 
     ('Entities', 'CorporationName')
  ]

It seems that this should be a Surname and a GivenName, and then the rest belongs to a corporation.

But I noticed that labeled.xml has no lines in which both a "Surname" and a "Corporation" appear. This seems odd.

Jan 30 '16 22:01 rkiddy

Another weirdness:

 >>> pp.parse('James W. Trimble CPA')
 [
     ('James', 'CorporationName'), 
     ('W.', 'CorporationName'), 
     ('Trimble', 'CorporationName'), 
     ('CPA', 'CorporationName')
 ]
 >>> 
 >>> pp.parse('James W. Trimble, CPA')
 [
     ('James', 'GivenName'), 
     ('W.', 'MiddleInitial'), 
     ('Trimble,', 'Surname'), 
     ('CPA', 'SuffixOther')
 ]
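Since the only difference between the two inputs above is the comma before 'CPA', one possible workaround is to normalize names before parsing by inserting a comma in front of a known credential. This is a sketch, not a fix confirmed by the maintainers: the thread only shows that the comma changes the result for this one string, and the credential list below is a hypothetical starting point.

```python
import re

# Hypothetical list of credentials; extend it for your own data.
CREDENTIALS = ('DDS', 'DMD', 'CPA')

# Match whitespace plus a credential, but only when the preceding character
# is a word character (so an existing comma is left alone).
_CREDENTIAL_PATTERN = re.compile(r'(?<=\w)\s+(' + '|'.join(CREDENTIALS) + r')\b')


def add_comma_before_credential(name):
    """Insert ', ' before a trailing-style credential that lacks a comma."""
    return _CREDENTIAL_PATTERN.sub(r', \1', name)


print(add_comma_before_credential('James W. Trimble CPA'))   # James W. Trimble, CPA
print(add_comma_before_credential('James W. Trimble, CPA'))  # James W. Trimble, CPA
```

The normalized string would then be passed to pp.parse or pp.tag as usual; whether the parse improves for other names would need to be checked case by case.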

Jan 31 '16 01:01 rkiddy

 Traceback (most recent call last):
   File "probable.py", line 13, in <module>
     entitytype=pp.tag(entity[1])
   File "/usr/local/lib/python2.7/dist-packages/probablepeople/__init__.py", line 129, in tag
     raise RepeatedLabelError(raw_string, parse(raw_string), label)
 probablepeople.RepeatedLabelError: ERROR: Unable to tag this string because more than one area of the string has the same label

 ORIGINAL STRING: RASPANTE SIMONE F DECD
 PARSED TOKENS: [('RASPANTE', 'Surname'), ('SIMONE', 'GivenName'), ('F', 'MiddleInitial'), ('DECD', 'Surname')]
 UNCERTAIN LABEL: Surname

When this error is raised, it's likely that either (1) the string is not a valid person/corporation name or (2) some tokens were labeled incorrectly

To report an error in labeling a valid name, open an issue at https://github.com/datamade/probablepeople/issues/new - it'll help us continue to improve probablepeople!

For more information, see the documentation at http://probablepeople.readthedocs.org/

Oct 07 '16 20:10 ghost
