
Fix targetAuthor selection bugs

Open paulalbert1 opened this issue 5 years ago • 1 comment

Suggested changes

I recommend the following changes to improve how accurately we infer the target author.

1. Before attempting a match, first temporarily transform all article and identity names to the same case (i.e., lower or upper).

  • example: haa2019, 31970682
  • example: haa2019, 33056991

2. Before attempting a match, first permanently transform both article email and identity email to lower case. This should improve the accuracy of the email scoring strategy as well.

3. Before attempting a match, confirm we are removing accents and diacritical marks. This example has a name that should be a direct match: "Ivan" (first name in both article and identity) and "Diaz munoz" (last name in both article and identity).

  • example: ild2005, 23778512

4. Check that the most favorable match against the available identities is used in cases where someone has multiple identity names.

  • example: cah2024 has identity data of "Caitlin Hill" and "Hill Caitlin". It would appear for a good chunk of cah2024's pubs (22655857, 23727091, 27264359, 27076424, 26697387) that ReCiter is trying to match against "Hill Caitlin."
  • example: markskr has identity data of "Kristen M Debusschere" and "Kristen Marie Marks". It would appear for a good chunk of markskr's pubs (31427059, 31249846, 31220220, 32667919, 32934969) that ReCiter is trying to match against "Kristen M Debusschere."

5. If there is more than one match (e.g., email match in step 7), look through the remaining tests to find the first discrepancy.

  • Logic:
    • if 0 results, go to next.
    • if 1 result, stop.
    • if > 1 result, assign remaining authors as false, go to next.
  • The alternative is to hard code certain additional tests. (For example, start with strict first name article/identity. Then do strict last name article/identity. Then check whether last name article is a substring of last name identity.)
  • example: rbd2001 has 39 accepted pubs and NONE have a target author assigned. I suspect this is because there are redundant names in identity: "Robert B. Darnell, Robert B Darnell, Robert B Darnnell". Note that one of the alternate names has an extra N in it: Robert B Darnnell vs. Robert B Darnell. Still, shouldn't this match on first name?
  • example: lgr2002 matches two authors (L Roth, M Roth) for 30861425. Another: mah4006 matches two authors (M Hidalgo, G Medina) for 10378674. These can be accurately disambiguated by looking for first name identity initial matching first name article initial.
  • example: erp9042 27098310 is similar to above
  • example: emc2013, 28192528
  • example: mah4006, 29217526
  • example: mpz2001, 27511963
  • example: jat2021, 26700621
  • example: alg9037, 16377652
<Author ValidYN="Y">
<LastName>Della Valle</LastName>
<ForeName>A Gonzalez</ForeName>
<Initials>AG</Initials>
</Author>

"alternateNames": [
    {
      "firstName": "Alejandro",
      "firstInitial": "A",
      "lastName": "Gonzalez della valle"
    }
  ],

6. Add attempted step as step 13.

  • see if there's a case where both of these conditions are true:
    • article first name is a substring of identity first name, or vice versa
    • article last name is a substring of identity last name, or vice versa
  • example: mgs4001 28058071 28746303

7. Add attempted step as step 14: try matching identity middle name to article last name

  • example: bce2003, 9038376

8. Add attempted step as step 15: try matching first to last name, and last to first name

  • example: lid9035, 24680142

9. Add attempted step as step 16: check to see if article firstName is substring of institution firstName.

  • example: mat2041, 30146489

10. Add attempted step as step 17: try matching identity last name to article first name, and identity first name initial to article last name initial

  • example: bgharvey, 12349828

11. Add attempted step as step 18: try matching identity first name initial to article first name initial, and identity last name initial to article last name initial

  • example: ark2011, 8292578
  • example: _csied77, 30674652, 30978303, 21415408

12. Add attempted step as step 19: try matching article first name as substring of identity middle name.

  • example: mel2023, 19416831
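
A minimal sketch of the normalization proposed in items 1 to 3, written in Python for brevity. The helper names `normalize_name` and `normalize_email` are hypothetical and are not taken from the ReCiter code base. Names are compared on a temporary normalized copy, while emails are simply lower-cased.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Return a comparison key: accents stripped, lower-cased, whitespace collapsed."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(base_only.lower().split())


def normalize_email(email: str) -> str:
    """Lower-case an email address so that email scoring is case-insensitive."""
    return email.strip().lower()


# "Díaz Muñoz" is a hypothetical accented spelling of the ild2005 last name.
same = normalize_name("Díaz Muñoz") == normalize_name("Diaz munoz")
```

Letters that have no decomposition (for example the Turkish dotless "ı") pass through unchanged, so a real implementation may need an extra mapping table.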

Will not fix for now

Case 1: Email is assigned to wrong author

e.g., kivanova, 22364404


Case 2: Multiple identical author names

This query returns 850 results out of our corpus of several hundred thousand. Most of these are publisher errors.

select * from
(select distinct  personIdentifier, pmid, authorFirstName, authorLastName, count(*)
from personArticleAuthor 
group by personIdentifier, pmid, authorFirstName, authorLastName
having count(*) > 1) x
where char_length(authorFirstName) > 3

e.g., jod9184, 30804071


Logic for identifying cases where no targetAuthor is assigned

select distinct p.cwid, p.surname, p.middleName, p.givenName, a.pmid, e.rank, totalAuthorCount, 
case
when e.rank = '1' then 'first'
when e.rank = totalAuthorCount then 'last'
end as firstLast,
publicationTypeCanonical, publicationDateStandardized, datePublicationAddedToEntrez, articleTitle, journalTitleVerbose, JournalImpactFactor, primaryTitle, primaryAcademicDepartment, primaryAcademicDivision, primaryProgram, fullTimeFaculty, studentMD, studentMDPhD, studentPhDTriI, studentPhDWeill,
case 
when 
mid(publicationDateStandardized, 1, 4) >= startYearWCM then  '✔'
else null
end as authoredWhileAtWCM 
from identity p
inner join personArticle a on a.personIdentifier = p.cwid
left outer join impactFactor2019 i on i.issn = a.issn
left join personArticleAuthor e on e.personIdentifier = a.personIdentifier and a.pmid = e.pmid
left join (select pmid, max(rank) as totalAuthorCount
from personArticleAuthor 
group by pmid) x  on x.pmid = a.pmid
where a.pmid is not null
and userAssertion = 'ACCEPTED'
and datePublicationAddedToEntrez > '2019-11-24' 
and targetAuthor is null 
and (fullTimeFaculty = '✔' OR studentMD =  '✔' OR studentMDPhD =  '✔' OR studentPhDTriI =  '✔' OR studentPhDWeill =  '✔')
order by datePublicationAddedToEntrez desc

Logic for identifying cases where multiple target authors assigned for given pmid, cwid

select p.pmid, p.personIdentifier, targetAuthor
from personArticle p
left join personArticleAuthor e on e.personIdentifier = p.personIdentifier and p.pmid = e.pmid
where targetAuthor = 1 and userAssertion = 'ACCEPTED'
group by p.pmid, p.personIdentifier, targetAuthor
having count(*) > 1

pmid      personIdentifier  targetAuthor
16377652  alg9037           1
26869656  eaf2006           1
29683946  HAF9018           1
23705560  jaa2014           1
23574623  jat2021           1
23871589  jat2021           1
23978474  jat2021           1
24424657  jat2021           1
24651014  jat2021           1
25380486  jat2021           1
25623554  jat2021           1
26700621  jat2021           1
28169114  jat2021           1
32014047  jat2021           1
26787755  jih2002           1
10378674  mah4006           1

Examples

haa2019, 31970682

No author assigned because of lower case.

lid9035, 24680142

Invert first name and last name.
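
A small sketch of the inverted-name check for this case, assuming Python and a hypothetical helper `is_swapped_match`. The names come from the lid9035 data in this issue (identity "Litong" / "Du", article firstName "Du", lastName "Litong").

```python
def _norm(value: str) -> str:
    """Lower-case and collapse whitespace so the comparison ignores case."""
    return " ".join(value.lower().split())


def is_swapped_match(identity_first: str, identity_last: str,
                     article_first: str, article_last: str) -> bool:
    """True when identity first/last equal article last/first respectively."""
    id_first, id_last = _norm(identity_first), _norm(identity_last)
    if not id_first or not id_last:
        return False  # never match on empty names
    return id_first == _norm(article_last) and id_last == _norm(article_first)


swapped = is_swapped_match("Litong", "Du", article_first="Du", article_last="Litong")
```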


bce2003, 9038376

Match on maiden name.

        {
            "firstName": "Barbara",
            "firstInitial": "B",
            "middleName": "Coren",
            "middleInitial": "C",
            "lastName": "Egan"
        }
                    "rank": 3,
                    "lastName": "Coren",
                    "firstName": "B A",
                    "initials": "B",
                    "targetAuthor": false
                },
...

            "evidence": {
                "acceptedRejectedEvidence": {
                    "feedbackScoreAccepted": 3
                },
                "authorNameEvidence": {
                    "institutionalAuthorName": {
                        "firstName": "Barbara",
                        "firstInitial": "B",
                        "lastName": "Egan"
                    },
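
One way the maiden-name case could be handled, sketched in Python. The function name and the `require_first_initial` guard are assumptions added for illustration; the issue itself only asks for identity middle name to be compared with article last name.

```python
from typing import Dict, Optional


def _norm(value: Optional[str]) -> str:
    return " ".join((value or "").lower().split())


def middle_name_matches_last(identity: Dict[str, str], article: Dict[str, str],
                             require_first_initial: bool = True) -> bool:
    """Compare identity middle name with article last name (e.g. a maiden name)."""
    middle = _norm(identity.get("middleName"))
    if not middle or middle != _norm(article.get("lastName")):
        return False
    if require_first_initial:
        # Extra guard (an assumption): first initials must also agree.
        id_first = _norm(identity.get("firstName"))
        art_first = _norm(article.get("firstName"))
        return bool(id_first) and bool(art_first) and id_first[0] == art_first[0]
    return True


identity = {"firstName": "Barbara", "middleName": "Coren", "lastName": "Egan"}
article = {"firstName": "B A", "lastName": "Coren"}
matched = middle_name_matches_last(identity, article)
```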

mat2041, 30146489

article firstName is substring of institution firstName.

        "authorNameEvidence": {
          "institutionalAuthorName": {
            "firstName": "Maria virginia",
            "firstInitial": "M",
            "lastName": "Teijeiro"
          },
          "articleAuthorName": {
            "firstName": "Virginia",
            "firstInitial": "V",
            "lastName": "Teijeiro"
          },
          "nameScoreTotal": -4,
          "nameMatchFirstType": "full-conflictingEntirely",
          "nameMatchFirstScore": -7,
          "nameMatchMiddleType": "identityNull-MatchNotAttempted",
          "nameMatchMiddleScore": 1,
          "nameMatchLastType": "full-exact",
          "nameMatchLastScore": 2,
          "nameMatchModifierScore": 0
        },
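
A sketch of the substring check for this case, in Python. The helper name and the `min_len` guard (which keeps a bare initial such as "V" from matching) are assumptions, not part of the issue.

```python
def _norm(value: str) -> str:
    return " ".join(value.lower().split())


def first_name_contained(identity_first: str, identity_last: str,
                         article_first: str, article_last: str,
                         min_len: int = 3) -> bool:
    """True when last names agree and article first name occurs inside identity first name."""
    art_first = _norm(article_first)
    if len(art_first) < min_len:
        return False  # too short to be a meaningful substring
    return (_norm(identity_last) == _norm(article_last)
            and art_first in _norm(identity_first))


contained = first_name_contained("Maria virginia", "Teijeiro", "Virginia", "Teijeiro")
```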

bgharvey, 12349828


haa2019, 33056991


ark2011, 8292578

Same initials here. Identity = Arzu Kovanlikaya; Article = A. Kovanhkaya

Maybe we should do one last step where we try to match on initials.


cah2024

For cah2024, the following papers should have Caitlin Hill selected as a targetAuthor but don't.

22655857
23727091
27264359
27076424
26697387
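
A sketch of picking the most favorable identity name, in Python. The scoring weights and function names are hypothetical, and the article author shown is an assumed spelling, since the screenshot data is not reproduced in this issue. The two identity names are the ones reported for cah2024.

```python
from typing import Dict, List


def _norm(value: str) -> str:
    return " ".join(value.lower().split())


def score(identity: Dict[str, str], article: Dict[str, str]) -> int:
    """Toy score: 2 points for an exact last name, 1 for an exact first name."""
    points = 0
    if _norm(identity["lastName"]) == _norm(article["lastName"]):
        points += 2
    if _norm(identity["firstName"]) == _norm(article["firstName"]):
        points += 1
    return points


def best_identity_name(identity_names: List[Dict[str, str]],
                       article: Dict[str, str]) -> Dict[str, str]:
    """Try every identity name and keep the one that scores highest."""
    return max(identity_names, key=lambda name: score(name, article))


identity_names = [
    {"firstName": "Hill", "lastName": "Caitlin"},
    {"firstName": "Caitlin", "lastName": "Hill"},
]
article_author = {"firstName": "Caitlin", "lastName": "Hill"}  # assumed spelling
best = best_identity_name(identity_names, article_author)
```
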

markskr, 31427059, 31249846, 31220220, 32667919, 32934969

31427059


_csied77, 30674652


_csied77, 30978303


_csied77, 21415408


rbd2001, 21742260

rbd2001 has authored 37 papers. 36 of them have no targetAuthor of any kind despite there being an exact match.


What's especially strange is that the one targetAuthor identified (PMID = 23060188) is for a different Robert.


jat2021, 28169114


Jeremie Arash rafii Tabrizi, Jeremie A Rafii tabrizi, Jeremie Arash Rafii tabrizi

mgs4001 28058071 28746303

Identity Name

  • First Name: Mohamed
  • Middle Name: Gamal Eldin
  • Last Name: Sayed Ahmed

Article Name

  • First Name: Mohamed Nadeem
  • Last Name: Ahmed

Partial first name match and partial last name match.
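
A sketch of this two-sided partial match, in Python, using the mgs4001 names above. The helper names are hypothetical.

```python
def _norm(value: str) -> str:
    return " ".join(value.lower().split())


def contains_either_way(a: str, b: str) -> bool:
    """True when one normalized string is a substring of the other."""
    a, b = _norm(a), _norm(b)
    return bool(a) and bool(b) and (a in b or b in a)


def partial_name_match(identity_first: str, identity_last: str,
                       article_first: str, article_last: str) -> bool:
    """Both first names and last names must overlap, in either direction."""
    return (contains_either_way(identity_first, article_first)
            and contains_either_way(identity_last, article_last))


partial = partial_name_match("Mohamed", "Sayed Ahmed", "Mohamed Nadeem", "Ahmed")
```

Plain substring tests are permissive (a short name such as "Al" would match many longer names), so a minimum-length or whole-token rule may be worth adding.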


ild2005, 23778512

ild2005
{
  "firstName": "Ivan",
  "firstInitial": "I",
  "middleName": "Leonardo",
  "middleInitial": "L",
  "lastName": "Diaz munoz"
}

mel2023, 19416831


<LastName>Caldas-Lopes</LastName>
<ForeName>Eloisi</ForeName>

"alternateNames": [
    {
      "firstName": "Maria",
      "firstInitial": "M",
      "middleName": "Eloisi caldas",
      "middleInitial": "E",
      "lastName": "Lopes vazquez"
    },
    {
      "firstName": "Maria",
      "firstInitial": "M",
      "lastName": "Lopes vazquez"
    }
  ],

mah4006, 29217526

(Screenshot: wrong targetAuthor for 29217526)

mpolanec, 18767068

Update targetAuthor matching to attempt strict first name and first initial of last name match

Attempt strict first name and first initial of last name match.

  • if 0 results, go to next.
  • if 1 result, stop.
  • if > 1 result, assign remaining authors as false, go to next.
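
The 0 / 1 / > 1 rule above can be sketched as a generic cascade, shown here in Python. `select_target_author` and the two test functions are hypothetical. The Roth authors are the pair reported for lgr2002, the third author is made up, and the identity first initial "L" is an assumption.

```python
from typing import Callable, Dict, List, Optional

Author = Dict[str, str]


def select_target_author(authors: List[Author],
                         tests: List[Callable[[Author], bool]]) -> Optional[Author]:
    """Run tests in order, narrowing the candidate set until one author remains."""
    candidates = list(authors)
    for test in tests:
        hits = [a for a in candidates if test(a)]
        if len(hits) == 1:
            return hits[0]      # 1 result: stop
        if len(hits) > 1:
            candidates = hits   # > 1 result: the rest are false, go to next test
        # 0 results: keep the current candidates and go to next test
    return None                 # no test produced a unique author


authors = [
    {"firstName": "L", "lastName": "Roth"},
    {"firstName": "M", "lastName": "Roth"},
    {"firstName": "J", "lastName": "Smith"},  # made-up third author
]
tests = [
    lambda a: a["lastName"].lower() == "roth",     # strict last name
    lambda a: a["firstName"][:1].lower() == "l",   # first initial (assumed "L")
]
target = select_target_author(authors, tests)
```

Returning `None` when several candidates survive every test keeps the step from assigning more than one target author to the same article.
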

mpz2001, 27511963


paulalbert1 · Jun 14 '19 12:06

@sarbajitdutta - I re-ran the analysis for everyone and can confirm all the changes work as expected, except for the cases below, where there is no improvement:

  1. Add attempted step as step 14: try matching identity middle name to article last name
     • example: bce2003, 9038376
        {
          "rank": 3,
          "lastName": "Coren",
          "firstName": "B A",
          "initials": "B",
          "targetAuthor": false
        },
        {

        "authorNameEvidence": {
          "institutionalAuthorName": {
            "firstName": "Barbara",
            "firstInitial": "B",
            "lastName": "Egan"
          },
          "nameScoreTotal": 0,
          "nameMatchFirstType": "nullTargetAuthor-MatchNotAttempted",
          "nameMatchFirstScore": 0,
          "nameMatchMiddleType": "nullTargetAuthor-MatchNotAttempted",
          "nameMatchMiddleScore": 0,
          "nameMatchLastType": "nullTargetAuthor-MatchNotAttempted",
          "nameMatchLastScore": 0,
          "nameMatchModifierScore": 0
        },
  1. Add attempted step as step 15: try matching first to last name, and last to first name
     • example: lid9035, 24680142
        },
        {
          "rank": 4,
          "lastName": "Litong",
          "firstName": "Du",
          "initials": "D",
          "targetAuthor": false
        }
      ],
        "authorNameEvidence": {
          "institutionalAuthorName": {
            "firstName": "Litong",
            "firstInitial": "L",
            "lastName": "Du"
          },
          "nameScoreTotal": 0,
          "nameMatchFirstType": "nullTargetAuthor-MatchNotAttempted",
          "nameMatchFirstScore": 0,
          "nameMatchMiddleType": "nullTargetAuthor-MatchNotAttempted",
          "nameMatchMiddleScore": 0,
          "nameMatchLastType": "nullTargetAuthor-MatchNotAttempted",
          "nameMatchLastScore": 0,
          "nameMatchModifierScore": 0
        },
  1. Add attempted step as step 17: try matching identity last name to article first name, and identity first name initial to article last name initial
     • example: bgharvey, 12349828
"reCiterArticleAuthorFeatures": [
        {
          "rank": 1,
          "lastName": "Ben-Gary",
          "firstName": "Harvey",
          "initials": "H",
          "email": "[email protected]",
          "targetAuthor": false
        },
        {
          "rank": 2,
          "lastName": "McKinney",
          "firstName": "Robin L",
          "initials": "R",
          "targetAuthor": false
        },
        {
          "rank": 3,
          "lastName": "Rosengart",
          "firstName": "Todd",
          "initials": "T",
          "targetAuthor": false
        },
        {
          "rank": 4,
          "lastName": "Lesser",
          "firstName": "Martin L",
          "initials": "M",
          "targetAuthor": false
        },
        {
          "rank": 5,
          "lastName": "Crystal",
          "firstName": "Ronald G",
          "initials": "R",
          "targetAuthor": false
        }
      ],
        "authorNameEvidence": {
          "institutionalAuthorName": {
            "firstName": "Bengary",
            "firstInitial": "B",
            "lastName": "Harvey"
          },
          "nameScoreTotal": 0,
          "nameMatchFirstType": "nullTargetAuthor-MatchNotAttempted",
          "nameMatchFirstScore": 0,
          "nameMatchMiddleType": "nullTargetAuthor-MatchNotAttempted",
          "nameMatchMiddleScore": 0,
          "nameMatchLastType": "nullTargetAuthor-MatchNotAttempted",
          "nameMatchLastScore": 0,
          "nameMatchModifierScore": 0
        },
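
For reference, a sketch of what step 17 would need to do for the bgharvey data above, in Python. The function name is hypothetical. With identity "Bengary" / "Harvey" and article firstName "Harvey", lastName "Ben-Gary", the check should select the rank 1 author.

```python
def _norm(value: str) -> str:
    return " ".join(value.lower().split())


def last_as_first_match(identity_first: str, identity_last: str,
                        article_first: str, article_last: str) -> bool:
    """Identity last name equals article first name, and the initials of
    identity first name and article last name agree."""
    id_first, id_last = _norm(identity_first), _norm(identity_last)
    art_first, art_last = _norm(article_first), _norm(article_last)
    if not (id_first and id_last and art_first and art_last):
        return False
    return id_last == art_first and id_first[0] == art_last[0]


step17 = last_as_first_match("Bengary", "Harvey", "Harvey", "Ben-Gary")
```
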
  1. Add attempted step as step 18: try matching identity first name initial to article first name initial, and identity last name initial to article last name initial
     • example: ark2011, 8292578
     • example: _csied77, 30674652, 30978303, 21415408
      "reCiterArticleAuthorFeatures": [
        {
          "rank": 1,
          "lastName": "Akgür",
          "firstName": "F M",
          "initials": "F",
          "targetAuthor": false
        },
        {
          "rank": 2,
          "lastName": "Aktuğ",
          "firstName": "T",
          "initials": "T",
          "targetAuthor": false
        },
        {
          "rank": 3,
          "lastName": "Kovanhkaya",
          "firstName": "A",
          "initials": "A",
          "targetAuthor": false
        },
        {
          "rank": 4,
          "lastName": "Erdağ",
          "firstName": "G",
          "initials": "G",
          "targetAuthor": false
        },
        {
          "rank": 5,
          "lastName": "Olguner",
          "firstName": "M",
          "initials": "M",
          "targetAuthor": false
        },
        {
          "rank": 6,
          "lastName": "Hoşgör",
          "firstName": "M",
          "initials": "M",
          "targetAuthor": false
        },
        {
          "rank": 7,
          "lastName": "Obuz",
          "firstName": "O",
          "initials": "O",
          "targetAuthor": false
        }
      ],
        "authorNameEvidence": {
          "institutionalAuthorName": {
            "firstName": "Arzu",
            "firstInitial": "A",
            "lastName": "Kovanlikaya"
          },
          "nameScoreTotal": 0,
          "nameMatchFirstType": "nullTargetAuthor-MatchNotAttempted",
          "nameMatchFirstScore": 0,
          "nameMatchMiddleType": "nullTargetAuthor-MatchNotAttempted",
          "nameMatchMiddleScore": 0,
          "nameMatchLastType": "nullTargetAuthor-MatchNotAttempted",
          "nameMatchLastScore": 0,
          "nameMatchModifierScore": 0
        },        
  1. Add attempted step as step 19: try matching article first name as substring of identity middle name.
     • example: mel2023, 19416831
      "reCiterArticleAuthorFeatures": [
        {
          "rank": 1,
          "lastName": "Caldas-Lopes",
          "firstName": "Eloisi",
          "initials": "E",
          "targetAuthor": false
        },
        {
          "rank": 2,
          "lastName": "Cerchietti",
          "firstName": "Leandro",
          "initials": "L",
          "targetAuthor": false
        }
      ],
      "volume": "106",
      "issue": "20",
      "pages": "8368-73",
      "evidence": {
        "acceptedRejectedEvidence": {
          "feedbackScoreAccepted": 1.5
        },
        "authorNameEvidence": {
          "institutionalAuthorName": {
            "firstName": "Maria",
            "firstInitial": "M",
            "lastName": "Lopes vazquez"
          },
          "nameScoreTotal": 0,
          "nameMatchFirstType": "nullTargetAuthor-MatchNotAttempted",
          "nameMatchFirstScore": 0,
          "nameMatchMiddleType": "nullTargetAuthor-MatchNotAttempted",
          "nameMatchMiddleScore": 0,
          "nameMatchLastType": "nullTargetAuthor-MatchNotAttempted",
          "nameMatchLastScore": 0,
          "nameMatchModifierScore": 0
        },
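
For reference, a sketch of the step 19 check for the mel2023 data, in Python. The function name and the `min_len` guard are assumptions. The identity middle name "Eloisi caldas" comes from the alternateNames listed earlier in this issue, and the article first name is "Eloisi".

```python
from typing import Optional


def _norm(value: Optional[str]) -> str:
    return " ".join((value or "").lower().split())


def first_in_identity_middle(identity_middle: Optional[str], article_first: str,
                             min_len: int = 3) -> bool:
    """True when the article first name occurs inside the identity middle name."""
    middle, first = _norm(identity_middle), _norm(article_first)
    if not middle or len(first) < min_len:
        return False  # no middle name on this identity, or first name too short
    return first in middle


step19 = first_in_identity_middle("Eloisi caldas", "Eloisi")
```

Note that only one of the two alternate names carries a middle name, so this step would have to be tried against every identity name rather than only the first one.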

paulalbert1 · Aug 12 '21 18:08