
Add first name likelihood scoring strategy

Open paulalbert1 opened this issue 1 year ago • 0 comments

Background

For PMID = 34739873, Jeetayu Biswas (jeb9333) has a nameMatchFirstScore of 1.852.

For PMID = 23834756, John Moore (jpm2003) has a nameMatchFirstScore of 1.852.

In a sample set of ~4 million names in PubMed, of which 288,953 start with J:

  • 1 name is Jeetayu. This is at the 0.8952 percentile.
  • 10,981 names are John. This is at the 99.9519 percentile.

These matches are not scored optimally. A chance match on Jeetayu is far less likely than a chance match on John, so a match on Jeetayu should receive a higher score.

Accounting for these differences against the 250k records in WCM's dataset can improve overall accuracy by a relative 6-7 percent, mainly by cutting down on false positives.

Low values indicate that a name is common, and high values indicate that a name is uncommon. Note that this approach accounts for likelihood within a given first letter. Q is a less common first initial, so "Qi", compared against all Q's, would have a relatively higher penalty against it than, say, John compared against all J's.
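To make the per-letter comparison concrete, here is a minimal sketch in Python. The issue does not say how the percentiles were computed, so the definition used here (share of names with the same first initial whose count is at or below this name's count) is an assumption. The counts for John and Jeetayu come from the figures above; the other counts are invented for illustration.

```python
from collections import defaultdict


def percentile_by_initial(name_counts):
    """Rank each name only against names that share its first initial.

    Assumed definition: percentage of same-initial names whose count is
    less than or equal to this name's count.
    """
    groups = defaultdict(list)
    for name, count in name_counts.items():
        groups[name[0].upper()].append(count)

    percentiles = {}
    for name, count in name_counts.items():
        peers = groups[name[0].upper()]
        at_or_below = sum(1 for c in peers if c <= count)
        percentiles[name] = 100.0 * at_or_below / len(peers)
    return percentiles


# John and Jeetayu counts are from the issue; the rest are made up.
toy_counts = {
    "John": 10981,
    "James": 9000,
    "Jia": 500,
    "Jeetayu": 1,
    "Qi": 300,
    "Quentin": 40,
}
ranks = percentile_by_initial(toy_counts)
```

In this toy data, "Qi" is ranked only against the other Q names, so its position is independent of how many J names exist.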

Requirements

New DynamoDB table

Create a new DynamoDB table called "firstNameFrequency". Here is the file as JSON.

The file should live at ReCiter/src/main/resources/files/firstNameFrequency.json

To improve performance, the firstName value should be indexed in DynamoDB.
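One way to get indexed lookups on firstName is to make it the table's partition key. The sketch below, in Python, only builds the table definition as a plain dictionary. The choice of partition key and the on-demand billing mode are assumptions, and ReCiter's own table setup code may look different.

```python
def first_name_frequency_table_spec():
    """Build a DynamoDB table definition keyed on firstName.

    Assumptions: firstName is a string partition key, and on-demand
    billing is acceptable. Neither is stated in the issue.
    """
    return {
        "TableName": "firstNameFrequency",
        "AttributeDefinitions": [
            {"AttributeName": "firstName", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "firstName", "KeyType": "HASH"},
        ],
        "BillingMode": "PAY_PER_REQUEST",
    }


spec = first_name_frequency_table_spec()
```

With boto3 installed and AWS credentials configured, a definition like this could be passed as `boto3.client("dynamodb").create_table(**spec)`.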

Create new values in application.properties

strategy.first.name.likelihood=true

strategy.nameMatchFirstLikelihoodScore.maximumScore=0.14
strategy.nameMatchFirstLikelihoodScore.weight=0.82

Create new strategy in code

Following existing design patterns, create a new scoring strategy in the code, first.name.likelihood. This is somewhat similar to the Gender Strategy in that it looks up values from DynamoDB and has to account for the possibility of multiple values.

Here's how it should work:

  • Remove periods from institutionalAuthorNameFirstName
  • Find all substrings, as delimited by a space, in institutionalAuthorNameFirstName.
  • Exclude any substrings that are one character
  • Look up each remaining substring in the firstNameFrequency.json data. If there is more than one substring, look each one up individually and average the results.
  • If there is no result, we go with the value in strategy.nameMatchFirstLikelihoodScore.maximumScore.
  • Multiply whatever you retrieve by strategy.nameMatchFirstLikelihoodScore.weight.
  • The result is nameMatchFirstLikelihoodScore.
  • Include this when computing nameScoreTotal.
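The steps above can be sketched as follows, in Python for brevity. This is an illustration under assumptions, not the intended implementation: the shape of the frequency data (a mapping from lowercase name to value), the case-insensitive lookup, the per-substring fallback to the maximum score, and the handling of a name with no usable substring are all guesses, and the frequency values are invented.

```python
MAXIMUM_SCORE = 0.14  # strategy.nameMatchFirstLikelihoodScore.maximumScore
WEIGHT = 0.82         # strategy.nameMatchFirstLikelihoodScore.weight


def name_tokens(first_name):
    """Remove periods, split on whitespace, drop one-character pieces."""
    return [t for t in first_name.replace(".", "").split() if len(t) > 1]


def first_likelihood_score(first_name, frequency,
                           maximum_score=MAXIMUM_SCORE, weight=WEIGHT):
    """Compute a nameMatchFirstLikelihoodScore for one institutional first name."""
    tokens = name_tokens(first_name)
    if not tokens:
        # Assumption: no usable substring is treated like a missing name.
        return maximum_score * weight
    # Assumption: keys are lowercase, and each substring that is missing
    # from the data falls back to maximum_score before averaging.
    values = [frequency.get(t.lower(), maximum_score) for t in tokens]
    return (sum(values) / len(values)) * weight


# Hypothetical frequency values, for illustration only.
toy_frequency = {"steven": -0.05, "sae": 0.10, "hee": 0.06}

steven = first_likelihood_score("Steven g", toy_frequency)
sae_hee = first_likelihood_score("Sae hee", toy_frequency)
barzan = first_likelihood_score("Barzan", toy_frequency)
```

Here "Steven g" uses only the value for "steven", "Sae hee" averages two values, and "Barzan" is missing from the toy data, so it falls back to the maximum score. All three are then multiplied by the weight.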

To optimize performance, each distinct name should be looked up only once each time Feature Generator is run.
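One simple way to avoid repeated lookups is to memoize them for the duration of a run. The Python sketch below uses an in-memory dictionary as a stand-in for the data store. The function names and the stored value are hypothetical, and reading "once per run" as memoization is an assumption.

```python
from functools import lru_cache

lookup_calls = []  # records every trip to the stand-in data store

FAKE_TABLE = {"curtis": -0.07}  # invented value, for illustration only


def fetch_from_store(name):
    """Stand-in for a DynamoDB read; returns None when the name is absent."""
    lookup_calls.append(name)
    return FAKE_TABLE.get(name)


@lru_cache(maxsize=None)
def cached_frequency(name):
    """Memoized lookup; misses (None) are cached as well."""
    return fetch_from_store(name)


# Simulate one run in which the same names come up repeatedly.
for candidate in ["curtis", "curtis", "kangpu", "curtis", "kangpu"]:
    cached_frequency(candidate)
```

Calling `cached_frequency.cache_clear()` at the start of each run would keep the cache scoped to a single run.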

Output in Feature Generator API

Here's how this should look in the Feature Generator API output. See the last line.

        "authorNameEvidence": {
          "institutionalAuthorName": {
            "firstName": "Curtis",
            "firstInitial": "C",
            "lastName": "Cole"
          },
          "articleAuthorName": {
            "firstName": "Curtis",
            "firstInitial": "C",
            "lastName": "Cole"
          },
          "nameScoreTotal": 3.31,
          "nameMatchFirstType": "full-exact",
          "nameMatchFirstScore": 1.852,
          "nameMatchMiddleType": "identityNull-MatchNotAttempted",
          "nameMatchMiddleScore": 0.794,
          "nameMatchLastType": "full-exact",
          "nameMatchLastScore": 0.664,
          "nameMatchModifierScore": 0,
          "nameMatchFirstLikelihoodScore": -0.058
        },
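In the sample above, the nameScoreTotal of 3.31 equals the sum of the four existing component scores. Assuming the total remains a plain sum once the new score is included (the issue does not spell this out), the arithmetic would look like this Python sketch:

```python
evidence = {
    "nameMatchFirstScore": 1.852,
    "nameMatchMiddleScore": 0.794,
    "nameMatchLastScore": 0.664,
    "nameMatchModifierScore": 0,
    "nameMatchFirstLikelihoodScore": -0.058,
}

EXISTING_KEYS = [
    "nameMatchFirstScore",
    "nameMatchMiddleScore",
    "nameMatchLastScore",
    "nameMatchModifierScore",
]


def name_score_total(scores, include_likelihood=True):
    """Sum the component scores; assumes nameScoreTotal is a simple sum."""
    total = sum(scores[key] for key in EXISTING_KEYS)
    if include_likelihood:
        total += scores["nameMatchFirstLikelihoodScore"]
    return round(total, 3)


total_without = name_score_total(evidence, include_likelihood=False)
total_with = name_score_total(evidence)
```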

Test cases

In each case, we are multiplying by strategy.nameMatchFirstLikelihoodScore.weight.

| personID | pmid | name | logic |
|---|---|---|---|
| bas4003 | 34973498 | Barzan | This firstName is missing from the JSON, so use strategy.nameMatchFirstLikelihoodScore.maximumScore. |
| kpxu | 14700639 | Kangpu | This firstName is missing from the JSON, so use strategy.nameMatchFirstLikelihoodScore.maximumScore. |
| sky2001 | 27890427 | Sae hee | Break up into "Sae" and "hee". Look up individually. Average result. |
| muh2006 | 35713518 | Mu ji | Break up into "Mu" and "ji". Look up individually. Average result. |
| stf3001 | 2331227 | Steven g | Look up "Steven" only. |
| bab2013 | 8069273 | A. bartley | Look up "bartley" only. |
| als4033 | 36114352 | Alia mahmoud hassan | Break up into "Alia", "mahmoud", and "hassan". Look up individually. Average result. |
| din9007 | 33631875 | Dilfuza | Look up "Dilfuza". |
| aha4006 | 32206638 | Alanoud | Look up "Alanoud". |
| ceg9018 | 12127811 | Cecily | Look up "Cecily". |
| dis4002 | 32576946 | Dimitry | Look up "Dimitry". |
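The name-splitting part of these test cases can be checked with a small table-driven sketch in Python. The function name is hypothetical, and only the splitting rules (remove periods, split on spaces, drop one-character pieces) are exercised here, not the lookups themselves.

```python
def lookup_names(first_name):
    """Return the substrings that would be looked up for a first name."""
    return [p for p in first_name.replace(".", "").split() if len(p) > 1]


# Expected substrings, taken from the "logic" column of the test cases.
CASES = {
    "Sae hee": ["Sae", "hee"],
    "Mu ji": ["Mu", "ji"],
    "Steven g": ["Steven"],
    "A. bartley": ["bartley"],
    "Alia mahmoud hassan": ["Alia", "mahmoud", "hassan"],
    "Dilfuza": ["Dilfuza"],
}

# Collect any case where the split differs from the expectation.
mismatches = {
    name: lookup_names(name)
    for name, expected in CASES.items()
    if lookup_names(name) != expected
}
```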

paulalbert1, Mar 30 '23 17:03