ReCiter
Downweight cases where org unit doesn't match
Background
There are a number of cases where a user's profile lists org units that don't come close to matching the org unit in an article's affiliation. Until now, we've ignored such cases. But maybe we can use this data to cut down on false positives.
An example is personIdentifier = sue2002 and PMID = 36630615. Psychiatry (sue2002's org unit) is very different from Cell and Developmental Biology (the department in the article's affiliation).
For our data set, I estimate this will improve accuracy by 0.5%, by reducing the number of false positives. But given our use of organizational synonyms, the only way to tell for certain would be to run this for everyone.
Requirements
This Java file outputs, among other values, strategy.orgUnitScoringStrategy.organizationalUnitDepartmentMatchingScore, which is the score for a positive departmental match. I want to update the code so it also outputs an organizationalUnitDepartmentNegativeMatchingScore when all of these conditions hold:
- identity.getOrganizationalUnits() != null
- articleAffiliation != null
- The words "Department of ", "Division of ", etc. exist in the articleAffiliation string, but the unit they name fails to match any of the identity's org units.
See this PR. It hasn't been "tested" and it probably doesn't "work," but I think it's on the right track.
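Independent of that PR, the three conditions above could be sketched roughly as follows. This is only an illustration: the class NegativeOrgUnitMatchSketch, the method negativeMatchScore, the regular expression, and the substring comparison are all hypothetical, not ReCiter code. In particular, the sketch ignores organizational synonyms, which the real matching relies on, and it only recognizes the two labels named in the issue.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of ReCiter.
final class NegativeOrgUnitMatchSketch {

    // Labels that signal the affiliation names an org unit. Only these two
    // appear in the issue; whatever "etc." covers would need to be added here.
    private static final Pattern UNIT_LABEL = Pattern.compile(
            "\\b(?:Department|Division) of ([^,;.]+)", Pattern.CASE_INSENSITIVE);

    // Returns -downweight when the affiliation names a unit that matches none
    // of the identity's org units, and 0.0 in every other case.
    static double negativeMatchScore(List<String> identityOrgUnits,
                                     String articleAffiliation,
                                     double downweight) {
        // Conditions 1 and 2: both inputs must be present.
        // Treating an empty list like null is an assumption of this sketch.
        if (identityOrgUnits == null || identityOrgUnits.isEmpty()
                || articleAffiliation == null) {
            return 0.0;
        }
        Matcher matcher = UNIT_LABEL.matcher(articleAffiliation);
        boolean sawLabel = false;
        while (matcher.find()) {
            sawLabel = true;
            String articleUnit = matcher.group(1).trim().toLowerCase(Locale.ROOT);
            for (String unit : identityOrgUnits) {
                if (unit == null || unit.isBlank()) {
                    continue;
                }
                String identityUnit = unit.toLowerCase(Locale.ROOT);
                // Naive substring comparison standing in for the real matcher.
                if (identityUnit.contains(articleUnit) || articleUnit.contains(identityUnit)) {
                    return 0.0; // positive match; scored by the existing logic
                }
            }
        }
        // Condition 3: a label was present but nothing matched.
        return sawLabel ? -downweight : 0.0;
    }

    public static void main(String[] args) {
        double score = negativeMatchScore(
                List.of("Payne Whitney (Psychiatry)"),
                "Department of Cell and Developmental Biology, University College London, London, UK.",
                0.4);
        System.out.println(score); // prints -0.4
    }
}
```

Returning 0.0 when no label is found keeps the current behavior for affiliations that say nothing about a department or division.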
Here's how a particular downweight affects the number of true / false positives / negatives. This is from a set of ~200,000 articles.
Downweight  Error count (FN + FP)  False negative  False positive  True negative  True positive
0           7657                   3779            3878            11094          26427
0.1         7560                   3976            3584            11388          26230
0.2         7442                   4193            3249            11723          26013
0.3         7279                   4445            2834            12138          25761
0.4         7303                   4675            2628            12344          25531
0.5         7374                   5051            2323            12649          25155
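The arithmetic behind the figures above can be reproduced with a short sketch. The counts are copied from this issue; the class and method names (DownweightErrorSketch, errorCount, accuracy, indexOfFewestErrors) are hypothetical. It treats the error count as false negatives plus false positives, which is consistent with every row listed, and accuracy as the share of true results among the four counts.

```java
// Hypothetical helper for illustration; not part of ReCiter.
final class DownweightErrorSketch {

    static final double[] DOWNWEIGHTS = {0.0, 0.1, 0.2, 0.3, 0.4, 0.5};

    // Columns: false negative, false positive, true negative, true positive.
    static final int[][] COUNTS = {
        {3779, 3878, 11094, 26427},
        {3976, 3584, 11388, 26230},
        {4193, 3249, 11723, 26013},
        {4445, 2834, 12138, 25761},
        {4675, 2628, 12344, 25531},
        {5051, 2323, 12649, 25155},
    };

    // Error count = false negatives + false positives.
    static int errorCount(int[] c) {
        return c[0] + c[1];
    }

    // Accuracy = (true negatives + true positives) / all four counts.
    static double accuracy(int[] c) {
        int total = c[0] + c[1] + c[2] + c[3];
        return (c[2] + c[3]) / (double) total;
    }

    // Index of the downweight with the fewest errors (first one wins on ties).
    static int indexOfFewestErrors() {
        int best = 0;
        for (int i = 1; i < COUNTS.length; i++) {
            if (errorCount(COUNTS[i]) < errorCount(COUNTS[best])) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        for (int i = 0; i < COUNTS.length; i++) {
            System.out.printf("downweight=%.1f errors=%d accuracy=%.4f%n",
                    DOWNWEIGHTS[i], errorCount(COUNTS[i]), accuracy(COUNTS[i]));
        }
        System.out.println("fewest errors at downweight " + DOWNWEIGHTS[indexOfFewestErrors()]);
    }
}
```

On these figures the error count is lowest at a downweight of 0.3 (7279), with 0.4 close behind (7303). Past that point the added false negatives outnumber the false positives removed.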
Test case
The combination of personIdentifier = sue2002 and PMID = 36630615 should return this...
"organizationalUnitEvidence": [
    {
        "identityOrganizationalUnit": "Payne Whitney (Psychiatry)",
        "articleAffiliation": "Department of Cell and Developmental Biology, University College London, London, UK.",
        "organizationalUnitType": "DEPARTMENT",
        "organizationalUnitMatchingScore": -0.4,
        "organizationalUnitModifierScore": 0
    }
],