incorrectly labeled provider_name when it's a mix of person name and corporation_name

Open suhassatish opened this issue 9 years ago • 11 comments

Hi, I have a database column called provider_name which can sometimes be a corporation name like 'Midtown Dental Miami', sometimes the name of a dentist like 'Alek Klebaner DDS', or a facility where 2 people work, with a name like 'Dr. Jonathan Chang & Dr. Steven D. Chan'.

I want to use probablepeople to distinguish between the 3 cases.

I get RepeatedLabelError or misclassifications for the following examples:

  1. name_str = 'Alek Klebaner DDS'

    pp.tag(name_str)
    Out[11]: (OrderedDict([('CorporationName', 'Alek Klebaner DDS')]), 'Corporation')

  2. name_str = 'Collins Harrell, DMD - San Clemente Smiles'

    (OrderedDict([('CorporationName', 'Collins Harrell, DMD San Clemente Smiles')]), 'Corporation')

  3. name_str = 'Sasan Ahmadiyar DDS and Associates - Stafford'

    RepeatedLabelError:

    ERROR: Unable to tag this string because more than one area of the string has the same label

    ORIGINAL STRING: Sasan Ahmadiyar DDS and Associates - Stafford
    PARSED TOKENS: [('Sasan', 'CorporationName'), ('Ahmadiyar', 'CorporationName'), ('DDS', 'CorporationName'), ('and', 'CorporationNameAndCompany'), ('Associates', 'CorporationNameAndCompany'), ('Stafford', 'CorporationName')]
    UNCERTAIN LABEL: CorporationName

  4. name_str = 'Dr. Jonathan Chang & Dr. Steven D. Chan'

    (OrderedDict([('PrefixOther', 'Dr.'), ('GivenName', 'Jonathan'), ('MiddleName', 'Chang'), ('And', '&'), ('SecondPrefixOther', 'Dr.'), ('SecondGivenName', 'Steven'), ('MiddleInitial', 'D.'), ('Surname', 'Chan')]), 'Household')

    This last example gets almost everything right, except that it classifies the last name of Jonathan Chang as MiddleName.
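Until the model handles these mixed names, one way to keep a batch job from stopping on RepeatedLabelError is to catch it and fall back to the token-level output of pp.parse. This is only a minimal sketch and not part of probablepeople: `tag_or_parse` and the 'Ambiguous' type are hypothetical names, and the tagger, parser and exception class are passed in as arguments so the snippet runs on its own.

```python
def tag_or_parse(name, tag_fn, parse_fn, error_cls=Exception):
    """Try the dict-style tagger first; fall back to token-level labels.

    tag_fn, parse_fn and error_cls are injected (an assumption of this
    sketch). With probablepeople they would be pp.tag, pp.parse and
    pp.RepeatedLabelError.
    """
    try:
        tagged, name_type = tag_fn(name)
        return {"type": name_type, "tagged": dict(tagged), "tokens": None}
    except error_cls:
        # Tagging failed: keep the raw (token, label) pairs for manual review.
        return {"type": "Ambiguous", "tagged": None, "tokens": list(parse_fn(name))}
```

With the library installed, the call would presumably look like `tag_or_parse(name_str, pp.tag, pp.parse, pp.RepeatedLabelError)`. Rows that come back as 'Ambiguous' can then be routed to a review queue instead of aborting the run.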

suhassatish avatar Dec 11 '15 01:12 suhassatish

You can use the attached dataset to train your model to do better on this kind of mixed (noisy) data, which is quite common in practice: dental_provider_names.txt

suhassatish avatar Dec 11 '15 02:12 suhassatish

Example 1 is clearly an error.
  • How do you think that 2. should be labeled?
  • How do you think that 3. should be labeled?

Example 4 is clearly an error.

fgregg avatar Dec 11 '15 05:12 fgregg

Thanks @fgregg for your quick response.

  2. I'd classify this as below - Collins Harrell, DMD - San Clemente Smiles

    (OrderedDict([('GivenName', 'Collins'), ('Surname', 'Harrell'), ('SuffixOther', 'DMD'), ('CorporationName', 'San Clemente Smiles')]), 'Person')

    Since it's a single person's clinic, I'd classify this as a person. On the other hand, if there are 2 or more persons in the name at a facility/group, I'd classify it as a corporation.

  3. name_str='Sasan Ahmadiyar DDS and Associates - Stafford'

    (OrderedDict([('GivenName', 'Sasan'), ('Surname', 'Ahmadiyar'), ('SuffixOther', 'DDS'), ('CorporationNameAndCompany', 'and Associates'), ('CorporationNameBranchIdentifier', 'Stafford')]), 'Person')

    Since it's a single person's name appearing in the clinic name, I'd classify it as a person. If you think it'd be easier to classify it as a corporation, I am fine with that too.
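One possible workaround on the data side, independent of how the model ends up labeling these: split the string on the ' - ' separator first and tag each part on its own. A minimal sketch, assuming the separator is always a hyphen with whitespace on both sides as in the examples above; `split_provider_name` is a hypothetical helper, not something probablepeople provides.

```python
import re

# Separator seen in the examples in this thread: a hyphen surrounded by whitespace.
# Hyphens inside a word (e.g. double-barrelled surnames) are left alone.
_SEPARATOR = re.compile(r"\s+-\s+")


def split_provider_name(raw):
    """Split 'Person - Clinic' style strings into parts to tag separately."""
    return [part.strip() for part in _SEPARATOR.split(raw) if part.strip()]


parts = split_provider_name("Collins Harrell, DMD - San Clemente Smiles")
# parts == ['Collins Harrell, DMD', 'San Clemente Smiles']
```

Each part could then be passed to pp.tag separately, and the "one person vs. two or more persons" rule applied to the combined results. Whether each part is then tagged correctly still depends on the model.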

suhassatish avatar Dec 11 '15 07:12 suhassatish

For 2., this seems to be a common pattern in your data, but it is not a pattern I've really seen anywhere else. There are different entities here, a business and a person. What is the relation between the two? Is the person the owner of the business? Is there not a corporation of any type, and the business name is just a fictitious business name?

For 3., is Stafford really a branch identifier? Is there another location of "Sasan Ahmadiyar DDS And Associates"?

fgregg avatar Dec 11 '15 14:12 fgregg

Sorry for the delayed response.

For 2), Collins Harrell, DMD - San Clemente Smiles: we have to resolve this entity to our database of providers, and we won't know beforehand if it's going to be stored as a person or a corporation. We'd have to split it and classify it (using probablepeople) and then use the output to match on certain fields in our directory. In this particular case, we found a match to a person, "Harrell, Collins R.". It is not clear what the relation between the two is here; most likely Collins Harrell practices at San Clemente Smiles, which seems like a clinic name. In any case, it would be best to classify cases like this as a "person" while also attaching a "CorporationName" tag as shown in my example above.

For 3), yes, there are 3 branches, as follows:

|provider_name|address|city|province|postal_code|
|---|---|---|---|---|
|Sasan Ahmadiyar DDS and Associates|10608 Leavells Road|Fredericksburg|VA|22407|
|Sasan Ahmadiyar DDS and Associates - Manassas|7806 Sudley Rd STE 210|Manassas|VA|20109|
|Sasan Ahmadiyar DDS and Associates - Stafford|385 Garrisonville Rd STE 108|Stafford|VA|22554|

Maybe it makes sense to classify it as a corporation instead of an individual, while also attaching the following tags: ('GivenName', 'Sasan'), ('Surname', 'Ahmadiyar'), ('SuffixOther', 'DDS'), ('CorporationNameAndCompany', 'Sasan Ahmadiyar DDS and Associates')

suhassatish avatar Dec 15 '15 02:12 suhassatish

@fgregg - I am trying to add additional labeled examples to the XML training file. I am looking at the instructions for the parserator console labeler here http://parserator.readthedocs.org/en/latest/ but it expects me to create a new parser. Can I instead leverage the existing parser from probablepeople and generate an improved crfsuite model?

suhassatish avatar Dec 18 '15 18:12 suhassatish

Yes, instructions here: https://github.com/datamade/probablepeople#for-the-nerds

cathydeng avatar Dec 18 '15 19:12 cathydeng

Thank you!

suhassatish avatar Dec 18 '15 20:12 suhassatish

I have encountered the same kind of issue. I am looking at Cal-Access data (from the CA state database of campaign finance and lobbying). An example that I found is:

 >>> pp.parse('Resnick, Stewart A. and Affiliated Entities')
 [
     ('Resnick,', 'CorporationName'),
     ('Stewart', 'CorporationName'),
     ('A.', 'CorporationName'),
     ('and', 'CorporationName'),
     ('Affiliated', 'CorporationName'),
     ('Entities', 'CorporationName')
 ]

It seems that this should be a Surname and a GivenName, with the rest labeled as a corporation.

But I noticed that labeled.xml has no lines in which both a "Surname" and a "Corporation" appear. This seems odd.
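For anyone who wants to repeat that check, here is a rough sketch of how the training file could be scanned for examples that carry a given set of labels. It assumes (not verified here) that every direct child of the XML root is one labeled example and that each token is wrapped in an element named after its label; `examples_with_labels` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def examples_with_labels(xml_text, required):
    """Return the text of each example whose child tags include all `required` labels.

    Assumed layout: <root><example><Label>token</Label> ...</example>...</root>
    """
    root = ET.fromstring(xml_text)
    wanted = set(required)
    hits = []
    for example in root:
        labels = {child.tag for child in example}
        if wanted <= labels:
            # Join the token text back into the original-looking string.
            hits.append("".join(example.itertext()).strip())
    return hits
```

Something like `examples_with_labels(open('labeled.xml').read(), ['Surname', 'CorporationName'])` would then list the mixed examples, if any; the path is whatever your checkout uses. An empty list would be consistent with the observation above.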

rkiddy avatar Jan 30 '16 22:01 rkiddy

Another weirdness:

 >>> pp.parse('James W. Trimble CPA')
 [
     ('James', 'CorporationName'), 
     ('W.', 'CorporationName'), 
     ('Trimble', 'CorporationName'), 
     ('CPA', 'CorporationName')
 ]
 >>> 
 >>> pp.parse('James W. Trimble, CPA')
 [
     ('James', 'GivenName'), 
     ('W.', 'MiddleInitial'), 
     ('Trimble,', 'Surname'), 
     ('CPA', 'SuffixOther')
 ]

rkiddy avatar Jan 31 '16 01:01 rkiddy

Traceback (most recent call last):
  File "probable.py", line 13, in <module>
    entitytype=pp.tag(entity[1])
  File "/usr/local/lib/python2.7/dist-packages/probablepeople/__init__.py", line 129, in tag
    raise RepeatedLabelError(raw_string, parse(raw_string), label)
probablepeople.RepeatedLabelError: ERROR: Unable to tag this string because more than one area of the string has the same label

ORIGINAL STRING: RASPANTE SIMONE F DECD
PARSED TOKENS: [('RASPANTE', 'Surname'), ('SIMONE', 'GivenName'), ('F', 'MiddleInitial'), ('DECD', 'Surname')]
UNCERTAIN LABEL: Surname

When this error is raised, it's likely that either (1) the string is not a valid person/corporation name or (2) some tokens were labeled incorrectly

To report an error in labeling a valid name, open an issue at https://github.com/datamade/probablepeople/issues/new - it'll help us continue to improve probablepeople!

For more information, see the documentation at http://probablepeople.readthedocs.org/
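Going by the message, the error fires when a label shows up in more than one separate run of tokens. A small sketch of that check on the output of pp.parse, which can help spot the offending label before calling pp.tag. `find_repeated_labels` is a hypothetical helper and a simplified reading of the behavior; it leaves out any special cases the library itself may apply.

```python
def find_repeated_labels(parsed):
    """Return labels that occur in more than one non-adjacent run of tokens.

    `parsed` is a list of (token, label) pairs, as returned by pp.parse.
    """
    seen = set()
    repeated = []
    previous = None
    for _token, label in parsed:
        if label != previous:
            # A new run starts here; a label seen in an earlier run is a repeat.
            if label in seen and label not in repeated:
                repeated.append(label)
            seen.add(label)
        previous = label
    return repeated


tokens = [('RASPANTE', 'Surname'), ('SIMONE', 'GivenName'),
          ('F', 'MiddleInitial'), ('DECD', 'Surname')]
print(find_repeated_labels(tokens))  # ['Surname']
```

Here 'DECD' (presumably "deceased") gets a second Surname label after the run was interrupted, which matches the UNCERTAIN LABEL reported above.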

ghost avatar Oct 07 '16 20:10 ghost