probablepeople
incorrectly labeled provider_name when it's a mix of person name and corporation_name
Hi, I have a database column called provider_name which can sometimes be a corporation_name like 'Midtown Dental Miami', sometimes the name of a dentist like 'Alek Klebaner DDS', or sometimes a facility where 2 people work, in which case the name will be something like 'Dr. Jonathan Chang & Dr. Steven D. Chan'.
I want to use probablepeople to distinguish between the 3 cases.
I get RepeatedLabelError or misclassifications for the following examples:

1) name_str = 'Alek Klebaner DDS'
pp.tag(name_str)
Out[11]: (OrderedDict([('CorporationName', 'Alek Klebaner DDS')]), 'Corporation')

2) Collins Harrell, DMD - San Clemente Smiles
(OrderedDict([('CorporationName', 'Collins Harrell, DMD San Clemente Smiles')]), 'Corporation')

3) name_str='Sasan Ahmadiyar DDS and Associates - Stafford'
RepeatedLabelError:
ERROR: Unable to tag this string because more than one area of the string has the same label
ORIGINAL STRING: Sasan Ahmadiyar DDS and Associates - Stafford
PARSED TOKENS: [('Sasan', 'CorporationName'), ('Ahmadiyar', 'CorporationName'), ('DDS', 'CorporationName'), ('and', 'CorporationNameAndCompany'), ('Associates', 'CorporationNameAndCompany'), ('Stafford', 'CorporationName')]
UNCERTAIN LABEL: CorporationName

4) name_str='Dr. Jonathan Chang & Dr. Steven D. Chan'
(OrderedDict([('PrefixOther', 'Dr.'), ('GivenName', 'Jonathan'), ('MiddleName', 'Chang'), ('And', '&'), ('SecondPrefixOther', 'Dr.'), ('SecondGivenName', 'Steven'), ('MiddleInitial', 'D.'), ('Surname', 'Chan')]), 'Household')
This last example gets almost everything right except that it classifies last_name of Jonathan Chang as middle_name.
You can use this dataset to train your model to do better on this kind of mixed (noisy) data, which is quite common in practice.
Attachment: dental_provider_names.txt
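For readers who want to route the three cases in code, here is a minimal sketch. It relies on `pp.tag` returning a `(fields, type)` pair, as in the outputs quoted above, and on `pp.RepeatedLabelError` being the exception raised in example 3. The function name `classify_provider` and the `"Unresolved"` marker are hypothetical, and the tagger is passed in as an argument so the sketch can be tried without the library installed.

```python
def classify_provider(name, tag, parse, repeated_label_error=Exception):
    """Classify one provider name.

    `tag` and `parse` are expected to behave like pp.tag and pp.parse.
    They are parameters (an assumption of this sketch) so that the
    function can be exercised with stand-ins.
    """
    try:
        # tag() returns (fields, kind); kind is e.g. 'Person',
        # 'Corporation' or 'Household' in the outputs quoted above.
        fields, kind = tag(name)
        return kind, dict(fields)
    except repeated_label_error:
        # tag() could not merge the tokens into one value per label,
        # so fall back to the token-level labels from parse().
        return "Unresolved", parse(name)
```

With probablepeople installed, the call would look like `classify_provider(name, pp.tag, pp.parse, pp.RepeatedLabelError)`. Names that come back as "Unresolved" can then be set aside for manual review or for labeling as new training examples.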
1. is clearly an error.
How do you think that 2. should be labeled?
How do you think that 3. should be labeled?
4. is clearly an error.
Thanks, @fgregg, for your quick response.
2. I'd classify this as below:
Collins Harrell, DMD - San Clemente Smiles
(OrderedDict([('GivenName', 'Collins'), ('Surname', 'Harrell'), ('SuffixOther', 'DMD'), ('CorporationName', 'San Clemente Smiles')]), 'Person')
Since it's a single person's clinic, I'd classify this as a person. On the other hand, if there are 2 or more persons in the name at a facility/group, I'd classify it as a corporation.
3. name_str='Sasan Ahmadiyar DDS and Associates - Stafford'
(OrderedDict([('GivenName', 'Sasan'), ('Surname', 'Ahmadiyar'), ('SuffixOther', 'DDS'), ('CorporationNameAndCompany', 'and Associates'), ('CorporationNameBranchIdentifier', 'Stafford')]), 'Person')
Since it's a single person's name appearing in the clinic name, I'd classify it as a person. If you think it'd be easier to classify it as a corporation, I am fine with that too.
2. seems to be a common pattern in your data, but it is not a pattern I've really seen anywhere else. There are two different entities here, a business and a person. What is the relation between the two? Is the person the owner of the business? Or is there no corporation of any type, and the business name is just a fictitious business name?
For 3., is Stafford really a branch identifier? Is there another location of "Sasan Ahmadiyar DDS And Associates"?
Sorry for the delayed response.
For 1), Collins Harrell, DMD - San Clemente Smiles: we have to resolve this entity against our database of providers, and we won't know beforehand whether it is stored as a person or a corporation. We'd have to split it and classify it (using probablepeople) and then use the output to match on certain fields in our directory. In this particular case, we found a match to a person, "Harrell, Collins R.". It is not clear what the relation between the 2 is; most likely Collins Harrell practices at San Clemente Smiles, which seems like a clinic name. In any case, it would be best to classify cases like this as a "person" while also attaching a "corporationName" tag, as shown in my example above.
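The "split it and classify it" step described above could be sketched roughly as follows. This is a pre-processing workaround, not something probablepeople does itself: it assumes that " - " separates the person part from the clinic or branch part in this particular dataset, and the helper name `tag_parts` is hypothetical. Whether the model then labels 'Collins Harrell, DMD' as a person is not verified here. The tagger is passed in so the sketch runs without the library.

```python
def tag_parts(name, tag, separator=" - "):
    """Tag each separator-delimited part of `name` on its own.

    `tag` is expected to behave like pp.tag and return (fields, kind).
    Splitting on " - " is an assumption about this dataset only.
    """
    results = []
    for part in name.split(separator):
        part = part.strip()
        if not part:
            continue  # skip empty pieces, e.g. from a trailing separator
        fields, kind = tag(part)
        results.append((part, kind, dict(fields)))
    return results
```

With the library installed this would be called as `tag_parts('Collins Harrell, DMD - San Clemente Smiles', pp.tag)`, and each part's type could then be matched against the person or corporation fields of the provider directory.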
For 3), yes, there are 3 branches, as follows:

|provider_name|address|city|province|postal_code|
|---|---|---|---|---|
|Sasan Ahmadiyar DDS and Associates|10608 Leavells Road|Fredericksburg|VA|22407|
|Sasan Ahmadiyar DDS and Associates - Manassas|7806 Sudley Rd STE 210|Manassas|VA|20109|
|Sasan Ahmadiyar DDS and Associates - Stafford|385 Garrisonville Rd STE 108|Stafford|VA|22554|
Maybe it makes sense to classify it as a corporation instead of an individual, while also attaching the following tags: ('GivenName', 'Sasan'), ('Surname', 'Ahmadiyar'), ('SuffixOther', 'DDS'), ('CorporationNameAndCompany', 'Sasan Ahmadiyar DDS and Associates')
@fgregg - I am trying to add additional labeled examples to the XML training file. I am looking at the instructions for the parserator console labeler here http://parserator.readthedocs.org/en/latest/ but it expects me to create a new parser. Can I instead leverage the existing parser from probablepeople and generate an improved crfsuite model?
yes - instructions here: https://github.com/datamade/probablepeople#for-the-nerds
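As a rough illustration of what a hand-labeled example might look like before it is added to the training data, the sketch below turns `(token, label)` pairs into an XML fragment. The layout (one `<Name>` element per example, one child element per token named after its label, tokens separated by spaces) is an assumption about the training file format and should be checked against the labeled XML in the repository. The helper name `to_training_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def to_training_xml(labeled_tokens):
    """Render (token, label) pairs as one <Name> element.

    The element layout is an assumption about the training format,
    not something confirmed in this thread.
    """
    name = ET.Element("Name")
    last = len(labeled_tokens) - 1
    for i, (token, label) in enumerate(labeled_tokens):
        child = ET.SubElement(name, label)
        child.text = token
        if i < last:
            child.tail = " "  # keep the original token spacing
    return ET.tostring(name, encoding="unicode")


example = [
    ("Collins", "GivenName"),
    ("Harrell,", "Surname"),
    ("DMD", "SuffixOther"),
]
print(to_training_xml(example))
```

The labels used here are the ones proposed earlier in the thread; the linked instructions remain the reference for how to feed labeled examples back into training.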
Thank you!
I have encountered the same kind of issue. I am looking at Cal-Access data (from the CA state database of campaign finance and lobbying). An example that I found is:
>>> pp.parse('Resnick, Stewart A. and Affiliated Entities')
[
('Resnick,', 'CorporationName'),
('Stewart', 'CorporationName'),
('A.', 'CorporationName'),
('and', 'CorporationName'),
('Affiliated', 'CorporationName'),
('Entities', 'CorporationName')
]
It seems that the first tokens should be labeled as a Surname and a GivenName, and the rest as part of a corporation name.
But I noticed that labeled.xml has no lines in which both a "Surname" and a "Corporation" appear. This seems odd.
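One way to check this observation is to scan the labeled file for examples that contain both a `Surname` element and any element whose tag starts with `Corporation`. The sketch below assumes the file has a root element whose children are individual examples, each with one child element per labeled token; that layout and the helper name `mixed_examples` are assumptions, not confirmed in this thread.

```python
import xml.etree.ElementTree as ET


def mixed_examples(xml_text):
    """Return the text of examples labeled with both person and corporation tags."""
    root = ET.fromstring(xml_text)
    mixed = []
    for example in root:
        labels = {child.tag for child in example}
        has_surname = "Surname" in labels
        has_corporation = any(label.startswith("Corporation") for label in labels)
        if has_surname and has_corporation:
            # itertext() walks token text and the whitespace between tokens
            mixed.append("".join(example.itertext()).strip())
    return mixed
```

Reading the file with `open(path, encoding="utf-8").read()` and passing the result to `mixed_examples` would list any such mixed examples; an empty list would match what is reported above.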
Another weirdness:
>>> pp.parse('James W. Trimble CPA')
[
('James', 'CorporationName'),
('W.', 'CorporationName'),
('Trimble', 'CorporationName'),
('CPA', 'CorporationName')
]
>>>
>>> pp.parse('James W. Trimble, CPA')
[
('James', 'GivenName'),
('W.', 'MiddleInitial'),
('Trimble,', 'Surname'),
('CPA', 'SuffixOther')
]
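Since the two outputs above differ only by the comma before "CPA", one possible stopgap is to insert that comma before parsing. The sketch below is a pre-processing heuristic, not a fix to the model: the `CREDENTIALS` list and the function name `add_credential_comma` are hypothetical, and the rule can misfire on names where such a token is not a credential.

```python
import re

# Hypothetical list of credential suffixes; extend it for your own data.
CREDENTIALS = ("CPA", "DDS", "DMD", "MD")

# Match whitespace followed by a credential, unless a comma (or more
# whitespace) already precedes that whitespace.
_CREDENTIAL = re.compile(r"(?<![,\s])\s+(" + "|".join(CREDENTIALS) + r")\b")


def add_credential_comma(name):
    """Insert ', ' before a trailing-style credential that lacks a comma."""
    return _CREDENTIAL.sub(r", \1", name)


print(add_credential_comma("James W. Trimble CPA"))
print(add_credential_comma("James W. Trimble, CPA"))
```

The first call adds the comma and the second leaves the string unchanged. Only the 'James W. Trimble' pair was checked in this thread, so how the parser treats other normalized names would need to be tested case by case.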
Traceback (most recent call last):
File "probable.py", line 13, in
ORIGINAL STRING: RASPANTE SIMONE F DECD
PARSED TOKENS: [('RASPANTE', 'Surname'), ('SIMONE', 'GivenName'), ('F', 'MiddleInitial'), ('DECD', 'Surname')]
UNCERTAIN LABEL: Surname
When this error is raised, it's likely that either (1) the string is not a valid person/corporation name or (2) some tokens were labeled incorrectly
To report an error in labeling a valid name, open an issue at https://github.com/datamade/probablepeople/issues/new - it'll help us continue to improve probablepeople!
For more information, see the documentation at http://probablepeople.readthedocs.org/
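To see in advance which strings would run into this error, the token-level output of `pp.parse` can be inspected for a label that occurs in more than one separate run of tokens, which appears to be the situation the message describes. The helper below is a sketch that works on a plain list of `(token, label)` pairs; the name `repeated_labels` is hypothetical.

```python
from itertools import groupby


def repeated_labels(parsed_tokens):
    """Return labels that appear in more than one non-adjacent run of tokens."""
    # Collapse consecutive tokens that share a label into a single run.
    runs = [label for label, _ in groupby(parsed_tokens, key=lambda pair: pair[1])]
    seen = set()
    repeated = []
    for label in runs:
        if label in seen and label not in repeated:
            repeated.append(label)
        seen.add(label)
    return repeated


tokens = [
    ("RASPANTE", "Surname"),
    ("SIMONE", "GivenName"),
    ("F", "MiddleInitial"),
    ("DECD", "Surname"),
]
print(repeated_labels(tokens))  # ['Surname']
```

For the tokens quoted in the error above, the only repeated label is 'Surname', which matches the UNCERTAIN LABEL line.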