ReCiter
Add first name likelihood scoring strategy
Background
For PMID = 34739873, Jeetayu Biswas (jeb9333) has a nameMatchFirstScore of 1.852.
For PMID = 23834756, John Moore (jpm2003) has a nameMatchFirstScore of 1.852.
In a sample set of ~4 million names in PubMed, of which 288,953 start with J:
- 1 name is Jeetayu. This is at the 0.8952 percentile.
- 10,981 names are John. This is at the 99.9519 percentile.
These matches are not scored optimally. A chance match on Jeetayu is far less likely than a chance match on John, so it should receive a higher score.
Accounting for these differences against the 250k records in WCM's dataset can improve overall accuracy, relatively speaking, by 6-7 percent, mainly by cutting down on false positives.
Low values indicate that a name is common, and high values indicate that a name is uncommon. Note that this approach measures likelihood within a given first initial. Q is a less common first initial, so "Qi" would carry a relatively higher penalty when compared against all Q's than, say, John does when compared against all J's.
Requirements
New DynamoDB table
Create a new DynamoDB table called "firstNameFrequency". Here is the file as JSON.
The file should live at ReCiter/src/main/resources/files/firstNameFrequency.json
To improve performance, the firstName value should be indexed in DynamoDB.
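As a rough illustration of the indexing requirement, the sketch below builds a table definition in which firstName is the partition key, so each lookup is a direct key read. It is written in Python for brevity, not in ReCiter's own Java, and the function name and the on-demand billing choice are assumptions, not part of this issue. The dictionary uses the parameter names accepted by the AWS DynamoDB CreateTable operation.

```python
def first_name_frequency_table_params(table_name: str = "firstNameFrequency") -> dict:
    """Build CreateTable parameters keyed on firstName (hypothetical helper)."""
    return {
        "TableName": table_name,
        # firstName is a string attribute and the only key attribute.
        "AttributeDefinitions": [
            {"AttributeName": "firstName", "AttributeType": "S"},
        ],
        # HASH marks firstName as the partition key, which is what makes
        # single-name lookups fast.
        "KeySchema": [
            {"AttributeName": "firstName", "KeyType": "HASH"},
        ],
        # Assumption: on-demand billing; the issue does not specify capacity.
        "BillingMode": "PAY_PER_REQUEST",
    }
```

With boto3 installed and credentials configured, these parameters could be passed as `boto3.client("dynamodb").create_table(**first_name_frequency_table_params())`.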
Create new values in application.properties
strategy.first.name.likelihood=true
strategy.nameMatchFirstLikelihoodScore.maximumScore=0.14
strategy.nameMatchFirstLikelihoodScore.weight=0.82
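For illustration only, here is a minimal Python sketch of reading these three settings from properties-style text. ReCiter itself would read them through its existing configuration mechanism; the `parse_properties` and `likelihood_settings` helpers are hypothetical.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def likelihood_settings(props: dict) -> dict:
    """Extract the three settings this strategy needs, with typed values."""
    prefix = "strategy.nameMatchFirstLikelihoodScore."
    return {
        "enabled": props.get("strategy.first.name.likelihood", "false") == "true",
        "maximum_score": float(props[prefix + "maximumScore"]),
        "weight": float(props[prefix + "weight"]),
    }


EXAMPLE = """
strategy.first.name.likelihood=true
strategy.nameMatchFirstLikelihoodScore.maximumScore=0.14
strategy.nameMatchFirstLikelihoodScore.weight=0.82
"""
settings = likelihood_settings(parse_properties(EXAMPLE))
```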
Create new strategy in code
Following existing design patterns, create a new scoring strategy in the code, first.name.likelihood. This is somewhat similar to the Gender Strategy in that it looks up values from DynamoDB and has to account for the possibility of multiple values.
Here's how it should work:
- Remove periods from institutionalAuthorNameFirstName
- Find all substrings, as delimited by a space, in institutionalAuthorNameFirstName.
- Exclude any substrings that are one character
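- Look up each remaining substring in the firstNameFrequency table (loaded from the firstNameFrequency.json file).
- If a lookup returns no result, use the value in strategy.nameMatchFirstLikelihoodScore.maximumScore.
- If more than one substring was looked up, average the results (see the test cases below).
- Multiply the resulting value by strategy.nameMatchFirstLikelihoodScore.weight.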
- The result is nameMatchFirstLikelihoodScore.
- Include this when computing nameScoreTotal.
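The steps above can be sketched as follows. This is a minimal Python illustration, not ReCiter's Java implementation. The function names are hypothetical, and several behaviors are assumptions that this issue does not settle: lookups are case-insensitive against lowercase keys, a missing substring inside a multi-part name falls back to maximumScore before averaging, and a name with no usable substrings returns no score.

```python
from statistics import mean
from typing import List, Mapping, Optional

MAXIMUM_SCORE = 0.14  # strategy.nameMatchFirstLikelihoodScore.maximumScore
WEIGHT = 0.82         # strategy.nameMatchFirstLikelihoodScore.weight


def name_tokens(first_name: str) -> List[str]:
    """Remove periods, split on whitespace, drop one-character substrings."""
    cleaned = first_name.replace(".", "")
    return [token for token in cleaned.split() if len(token) > 1]


def first_name_likelihood_score(
    first_name: str,
    frequency: Mapping[str, float],
    maximum_score: float = MAXIMUM_SCORE,
    weight: float = WEIGHT,
) -> Optional[float]:
    """Return nameMatchFirstLikelihoodScore, or None if nothing can be looked up."""
    tokens = name_tokens(first_name)
    if not tokens:
        # Assumption: the issue does not say what to do when only initials remain.
        return None
    # Assumption: keys are stored in lowercase; a missing name gets maximum_score.
    values = [frequency.get(token.lower(), maximum_score) for token in tokens]
    # Averaging multiple substrings follows the test cases in this issue.
    return mean(values) * weight
```

For example, with a hypothetical stored value of -0.0707 for "curtis", `first_name_likelihood_score("Curtis", {"curtis": -0.0707})` rounds to -0.058, and a name missing from the data scores 0.14 × 0.82 = 0.1148.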
To optimize performance, each distinct name should be looked up only once each time Feature Generator is run.
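One way to meet the lookup-once requirement is to memoize lookups for the duration of a run. The sketch below, in Python for illustration, wraps a stand-in data store with `functools.lru_cache`; the `CountingStore` class is a hypothetical substitute for the DynamoDB table, and building a fresh cached function per run is an assumption about how the cache would be scoped.

```python
from functools import lru_cache
from typing import Callable, Dict, Optional


class CountingStore:
    """Hypothetical stand-in for the firstNameFrequency table; counts reads."""

    def __init__(self, rows: Dict[str, float]) -> None:
        self.rows = rows
        self.reads = 0

    def get(self, name: str) -> Optional[float]:
        self.reads += 1
        return self.rows.get(name)


def make_cached_lookup(store: CountingStore) -> Callable[[str], Optional[float]]:
    """Return a lookup function that hits the store at most once per name.

    Create one per Feature Generator run so the cache is discarded afterwards.
    """

    @lru_cache(maxsize=None)
    def lookup(name: str) -> Optional[float]:
        return store.get(name)

    return lookup
```

Because `lru_cache` also caches `None` results, names that are missing from the table are not re-queried either.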
Output in Feature Generator API
Here's how this should look in the Feature Generator API output. See the last line.
"authorNameEvidence": {
    "institutionalAuthorName": {
        "firstName": "Curtis",
        "firstInitial": "C",
        "lastName": "Cole"
    },
    "articleAuthorName": {
        "firstName": "Curtis",
        "firstInitial": "C",
        "lastName": "Cole"
    },
    "nameScoreTotal": 3.31,
    "nameMatchFirstType": "full-exact",
    "nameMatchFirstScore": 1.852,
    "nameMatchMiddleType": "identityNull-MatchNotAttempted",
    "nameMatchMiddleScore": 0.794,
    "nameMatchLastType": "full-exact",
    "nameMatchLastScore": 0.664,
    "nameMatchModifierScore": 0,
    "nameMatchFirstLikelihoodScore": -0.058
},
Test cases
In each case, we are multiplying by strategy.nameMatchFirstLikelihoodScore.weight.
personID | pmid | name | logic |
---|---|---|---|
bas4003 | 34973498 | Barzan | This firstName is missing from JSON so use strategy.nameMatchFirstLikelihoodScore.maximumScore |
kpxu | 14700639 | Kangpu | This firstName is missing from JSON so use strategy.nameMatchFirstLikelihoodScore.maximumScore |
sky2001 | 27890427 | Sae hee | Break up into "Sae" and "hee". Look up individually. Average result. |
muh2006 | 35713518 | Mu ji | Break up into "Mu" and "ji". Look up individually. Average result. |
stf3001 | 2331227 | Steven g | Look up "Steven" only. |
bab2013 | 8069273 | A. bartley | Look up "bartley" only. |
als4033 | 36114352 | Alia mahmoud hassan | Break up into "Alia", "mahmoud", and "hassan". Look up individually. Average result. |
din9007 | 33631875 | Dilfuza | Look up "Dilfuza" |
aha4006 | 32206638 | Alanoud | Look up "Alanoud" |
ceg9018 | 12127811 | Cecily | Look up "Cecily" |
dis4002 | 32576946 | Dimitry | Look up "Dimitry" |
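The substrings that the table above expects to be looked up can be expressed as a small table-driven check. This is a Python sketch for illustration; `name_tokens`, `EXPECTED_TOKENS`, and `check_cases` are hypothetical names, and the check covers only the splitting rules, not the frequency values, which live in firstNameFrequency.json.

```python
from typing import Dict, List


def name_tokens(first_name: str) -> List[str]:
    """Remove periods, split on whitespace, drop one-character substrings."""
    cleaned = first_name.replace(".", "")
    return [token for token in cleaned.split() if len(token) > 1]


# First names from the test-case table, mapped to the substrings to look up.
EXPECTED_TOKENS: Dict[str, List[str]] = {
    "Barzan": ["Barzan"],
    "Kangpu": ["Kangpu"],
    "Sae hee": ["Sae", "hee"],
    "Mu ji": ["Mu", "ji"],
    "Steven g": ["Steven"],
    "A. bartley": ["bartley"],
    "Alia mahmoud hassan": ["Alia", "mahmoud", "hassan"],
    "Dilfuza": ["Dilfuza"],
    "Alanoud": ["Alanoud"],
    "Cecily": ["Cecily"],
    "Dimitry": ["Dimitry"],
}


def check_cases() -> List[str]:
    """Return the names whose tokenization differs from the expectation."""
    return [
        name
        for name, expected in EXPECTED_TOKENS.items()
        if name_tokens(name) != expected
    ]
```

An empty list from `check_cases()` means every name in the table splits as described.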