Downweight cases where org unit doesn't match

Open paulalbert1 opened this issue 1 year ago • 0 comments

Background

In a number of cases, a user has org units in their profile that don't come close to matching the org unit in an article's affiliation. Until now, we've ignored such cases. But maybe we can use this data to cut down on false positives.

An example is personIdentifier = sue2002 and PMID = 36630615. Psychiatry (sue2002's org unit) is very different from Cell and Developmental Biology, the department in the article's affiliation.

For our data set, I estimate this will improve accuracy by 0.5%, by reducing the number of false positives. But given our use of organizational synonyms, the only way to tell for certain would be to run this for everyone.

Requirements

This Java file outputs, among other values, strategy.orgUnitScoringStrategy.organizationalUnitDepartmentMatchingScore, which is the score for a positive departmental match. I want to update the code so that it also outputs an organizationalUnitDepartmentNegativeMatchingScore when all of the following hold:

  1. identity.getOrganizationalUnits() != null
  2. articleAffiliation != null
  3. The articleAffiliation string contains a label such as "Department of " or "Division of ", but the match against the identity's org units fails.
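
The three conditions above can be sketched roughly as follows. This is a minimal illustration, not ReCiter's actual code: the class name, method name, regular expression, and the plain substring comparison are all assumptions made for the example. The real strategy also relies on organizational synonyms, which this sketch ignores, and the negative weight is passed in as a parameter rather than fixed.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; none of these names come from the ReCiter code base. */
final class OrgUnitNegativeMatchSketch {

    // Captures the unit name that follows "Department of " or "Division of ",
    // up to the next comma, period, or semicolon. Other labels are not handled.
    private static final Pattern UNIT_LABEL =
            Pattern.compile("\\b(?:Department|Division) of ([^,.;]+)", Pattern.CASE_INSENSITIVE);

    /**
     * Returns negativeScore when the affiliation names a department or division
     * that matches none of the identity's org units; returns 0.0 otherwise.
     */
    static double negativeMatchScore(List<String> identityOrgUnits,
                                     String articleAffiliation,
                                     double negativeScore) {
        // Conditions 1 and 2: both sides must have data.
        // Treating an empty list like null is an assumption of this sketch.
        if (identityOrgUnits == null || identityOrgUnits.isEmpty() || articleAffiliation == null) {
            return 0.0;
        }
        Matcher matcher = UNIT_LABEL.matcher(articleAffiliation);
        boolean sawLabel = false;
        while (matcher.find()) {
            sawLabel = true;
            String articleUnit = matcher.group(1).trim().toLowerCase(Locale.ROOT);
            for (String unit : identityOrgUnits) {
                // Naive substring check standing in for the real synonym-aware match.
                if (unit != null && unit.toLowerCase(Locale.ROOT).contains(articleUnit)) {
                    return 0.0; // positive match, so no downweight
                }
            }
        }
        // Condition 3: a label was present but nothing matched.
        return sawLabel ? negativeScore : 0.0;
    }
}
```

With the org unit "Payne Whitney (Psychiatry)", the affiliation "Department of Cell and Developmental Biology, University College London, London, UK.", and a weight of -0.4, this sketch returns -0.4. An affiliation with no "Department of " or "Division of " label returns 0.0, so such cases stay neutral.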

See this PR. It hasn't been "tested" and it probably doesn't "work," but I think it's on the right track.

Here's how a particular downweight affects the number of true / false positives / negatives. This is from a set of ~200,000 articles.

Downweight  Error count  False negative  False positive  True negative  True positive
0           7657         3779            3878            11094          26427
0.1         7560         3976            3584            11388          26230
0.2         7442         4193            3249            11723          26013
0.3         7279         4445            2834            12138          25761
0.4         7303         4675            2628            12344          25531
0.5         7374         5051            2323            12649          25155
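
The arithmetic behind these figures can be checked with a short sketch. It assumes that the error count is the sum of false negatives and false positives, which agrees with every figure reported above. The class and method names are hypothetical, and the counts are copied from this issue.

```java
/** Illustrative arithmetic over the counts reported in this issue; names are hypothetical. */
final class DownweightSweep {

    static final double[] DOWNWEIGHTS = {0.0, 0.1, 0.2, 0.3, 0.4, 0.5};

    // Each row: {false negatives, false positives, true negatives, true positives}.
    static final int[][] COUNTS = {
        {3779, 3878, 11094, 26427},
        {3976, 3584, 11388, 26230},
        {4193, 3249, 11723, 26013},
        {4445, 2834, 12138, 25761},
        {4675, 2628, 12344, 25531},
        {5051, 2323, 12649, 25155},
    };

    /** Errors are the misclassified pairs: false negatives plus false positives. */
    static int errorCount(int[] row) {
        return row[0] + row[1];
    }

    /** Total number of classified pairs in a row. */
    static int total(int[] row) {
        return row[0] + row[1] + row[2] + row[3];
    }

    /** Accuracy is the share of correctly classified pairs. */
    static double accuracy(int[] row) {
        return (double) (row[2] + row[3]) / total(row);
    }

    /** Index of the downweight with the fewest errors (first one wins on ties). */
    static int bestIndex() {
        int best = 0;
        for (int i = 1; i < COUNTS.length; i++) {
            if (errorCount(COUNTS[i]) < errorCount(COUNTS[best])) {
                best = i;
            }
        }
        return best;
    }
}
```

Every row sums to 45,178 classified pairs, and the error count is lowest at a downweight of 0.3 (7,279 errors). Beyond that point, the added false negatives outweigh the false positives removed.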

Test case

The combination of personIdentifier = sue2002 and PMID = 36630615 should return this...

        "organizationalUnitEvidence": [
          {
            "identityOrganizationalUnit": "Payne Whitney (Psychiatry)",
            "articleAffiliation": "Department of Cell and Developmental Biology, University College London, London, UK.",
            "organizationalUnitType": "DEPARTMENT",
            "organizationalUnitMatchingScore": -0.4,
            "organizationalUnitModifierScore": 0
          }
        ],

paulalbert1, Nov 07 '23 23:11