codemetapy
Add support for ORCIDs
Authors are best identified by their ORCID. Ideally we need a way of resolving user e-mails to ORCIDs automatically (does their API offer such a function?).
Yes, it does: https://info.orcid.org/faq/how-do-i-find-orcid-record-holders-at-my-institution/ BUT (this is what I figured could go wrong): users' e-mails are by default not visible to the outside; a member has to change the visibility of each e-mail address to either internal or public. Only if people have done this do you have a chance of finding them by e-mail via an authorized query to the API. I think most people do not change the default, so I expect this approach to find only about 10% of people. (test query https://pub.orcid.org/v3.0/csv-search/?q=affiliation-org-name:ORCID&fl=orcid,given-names,family-name,current-institution-affiliation-name,email)
A better way could be to find people by name plus affiliation, i.e. institution name or identifier. Here codemetapy probably only has a chance if the institution is given or if it can get it from the metadata already there... How to do this I do not know, since contributors can be from anywhere; maybe a first step would be to allow for a list to try.
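As a rough sketch of the name-plus-affiliation idea: the snippet below only builds a query URL for the same public csv-search endpoint as the test query above, and does not send it. The helper name `orcid_search_url` is hypothetical, reading "a list to try" as a list of candidate institution names is just one interpretation, and it assumes `given-names` and `family-name` can be used as search fields in the same way as `affiliation-org-name`.

```python
from urllib.parse import urlencode

# Public csv-search endpoint, as used in the test query in this thread.
ORCID_CSV_SEARCH = "https://pub.orcid.org/v3.0/csv-search/"


def _quote(value: str) -> str:
    """Wrap a value in double quotes for the query, escaping embedded quotes."""
    return '"' + value.replace('"', '\\"') + '"'


def orcid_search_url(given: str, family: str, affiliations: list[str]) -> str:
    """Build (but do not send) a search URL for a person by name and affiliation.

    `affiliations` is a list of institution names to try; they are OR-ed together.
    """
    clauses = [f"given-names:{_quote(given)}", f"family-name:{_quote(family)}"]
    if affiliations:
        orgs = " OR ".join(f"affiliation-org-name:{_quote(a)}" for a in affiliations)
        clauses.append(f"({orgs})")
    params = {
        "q": " AND ".join(clauses),
        "fl": "orcid,given-names,family-name,current-institution-affiliation-name",
    }
    return ORCID_CSV_SEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    print(orcid_search_url("Jane", "Doe", ["Example University", "Example Institute"]))
```

Any hit would still need a manual check, since names are not unique and affiliations change over time.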
Let me know if you plan to work on this. I have a layout of what I want, but I have not implemented anything yet and it is currently not on my to-do list.
In terms of code out there, I found these, which are old and may or may not work: https://github.com/ORCID/python-orcid https://github.com/scholrly/orcid-python
emails of users are per default not visible to the outside, a member has to upgrade this to either internal or public on a per email level. So only if people have done this you have a chance to find them via an authorized query to the API by email. I think most people do not change the default, so i expect this way to yield 10%.
Too bad, this would be the ideal method but if it yields only 10% it's not very useful indeed.
A better way could be to find people over name, plus affiliation, i.e. institution name or identifier.
That sounds viable yes, though one issue with affiliations is that people tend to come and go in institutions.
..maybe a first thing would be to allow for a list to try.
Like explicitly passing a tsv file to codemetapy with say emails and orcids? That would work yes, though it isn't as fully automated as we'd want ideally.
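A minimal sketch of what reading such a file could look like. The two-column layout (e-mail, then ORCID), the `#` comment convention and the function name `load_orcid_map` are all assumptions for illustration; codemetapy does not define this format. The ORCID in the sample is the fictitious example identifier from ORCID's own documentation.

```python
import csv
import io


def load_orcid_map(tsv_file) -> dict[str, str]:
    """Read a two-column TSV (e-mail, ORCID) into a dict keyed by lower-cased e-mail.

    Bare identifiers are expanded to full https://orcid.org/ URIs.
    Blank lines, one-column lines and lines starting with '#' are skipped.
    """
    mapping = {}
    for row in csv.reader(tsv_file, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue
        email, orcid = row[0].strip().lower(), row[1].strip()
        if not orcid.startswith("http"):
            orcid = "https://orcid.org/" + orcid
        mapping[email] = orcid
    return mapping


if __name__ == "__main__":
    sample = io.StringIO(
        "# email\torcid\n"
        "Jane.Doe@example.org\t0000-0002-1825-0097\n"
    )
    print(load_orcid_map(sample))
```

Keying on the lower-cased e-mail is a design choice here, so that lookups do not depend on how the address was capitalised in the commit history.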
An add-on to this: codemetapy parses the CITATION.cff file, but it does not use the ORCIDs in there as author/contributor IDs; instead it uses the GitLab ID (account page): "@id": "https://iffgit.fz-juelich.de/fleur/fleur/person/cmax347".
Ideally one would keep both pieces of information, i.e. record somewhere that the ORCID and the git ID are the same.
Also in that context, the familyName and givenName parsing is not optimal if the link of the person does not contain the name, for example:
{
"@id": "https://iffgit.fz-juelich.de/fleur/fleur/person/cmax347",
"@type": "Person",
"email": "[email protected]@gmail.com",
"familyName": "",
"givenName": "cMax347",
"position": 71
},
{
"@id": "https://iffgit.fz-juelich.de/fleur/fleur/person/christian-roman-gerhorst",
"@type": "Person",
"email": "[email protected]",
"familyName": "Gerhorst",
"givenName": "Christian-Roman",
"position": 72
}
So it also has problems with middle names. I would assume that these would be easier to parse from a CITATION.cff file.
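To illustrate what keeping both identifiers could look like, here is a small sketch that turns one CITATION.cff author entry into a Person node. It is not codemetapy's implementation: the function name `person_from_cff_author` is hypothetical, and using `sameAs` to hold the git-derived identifier is only one possible choice. The input keys (`given-names`, `family-names`, `email`, `orcid`) are the ones used for authors in CITATION.cff files.

```python
from typing import Optional


def person_from_cff_author(author: dict, fallback_id: Optional[str] = None) -> dict:
    """Build a schema.org-style Person from one CITATION.cff author entry.

    If the entry has an ORCID, it becomes the @id and the fallback identifier
    (e.g. one derived from the git history) is kept under sameAs.
    Without an ORCID, the fallback identifier is used as the @id.
    """
    person = {"@type": "Person"}
    orcid = author.get("orcid")
    if orcid:
        person["@id"] = orcid
        if fallback_id and fallback_id != orcid:
            person["sameAs"] = fallback_id
    elif fallback_id:
        person["@id"] = fallback_id
    # Copy over the name and e-mail fields that are present and non-empty.
    for cff_key, schema_key in (
        ("given-names", "givenName"),
        ("family-names", "familyName"),
        ("email", "email"),
    ):
        if author.get(cff_key):
            person[schema_key] = author[cff_key]
    return person


if __name__ == "__main__":
    author = {
        "given-names": "Jane",
        "family-names": "Doe",
        "orcid": "https://orcid.org/0000-0002-1825-0097",
    }
    print(person_from_cff_author(author, "https://git.example.org/project/person/jdoe"))
```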
An add on to this. codemetapy parses the Citation.cff file, but it does not use the orcids in there for authors/contributors Ids
Hmm.. Agreed, if there are ORCIDs then they shouldn't be overwritten. I wonder if it's an issue in codemetapy or in https://github.com/citation-file-format/cff-converter-python, we don't do the CITATION.cff parsing ourselves.
but instead the gitlab id (account page) "@id": "https://iffgit.fz-juelich.de/fleur/fleur/person/cmax347".
(it's not the gitlab id, see #34)
Ideally one would keep both information... i.e that the orcid and the git id are same as somewhere.
also in that context the familyName and givenName parsing is also not optimal if the link of the person does not contain the name, example:
{ "@id": "https://iffgit.fz-juelich.de/fleur/fleur/person/cmax347", "@type": "Person", "email": "[email protected]@gmail.com", "familyName": "", "givenName": "cMax347", "position": 71 },
Yes, we'd better just use schema:name if we can't decipher given and family names; this needs some fine-tuning. That e-mail looks malformed too.
For the actual name parsing from arbitrary strings I'm using nameparser
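The fallback could be sketched like this, independent of which parser produced the name parts. The function name `name_properties` is hypothetical and this is not the codemetapy code: it only shows the decision of emitting `givenName`/`familyName` when both parts were found, and a plain `name` otherwise.

```python
def name_properties(raw: str, given: str = "", family: str = "") -> dict:
    """Choose which name properties to emit for a person.

    `raw` is the original string; `given` and `family` are whatever a name
    parser extracted from it (either may be empty).
    """
    given, family = given.strip(), family.strip()
    if given and family:
        return {"givenName": given, "familyName": family}
    # Could not decipher both parts: keep the string as-is under `name`.
    return {"name": raw.strip()}


if __name__ == "__main__":
    print(name_properties("cMax347", given="cMax347"))
    print(name_properties("Christian-Roman Gerhorst", "Christian-Roman", "Gerhorst"))
```

With this, a bare username such as cMax347 would no longer end up as a givenName with an empty familyName.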
I've been giving this some more thought and there are some challenges to solve, mostly related to 'affiliations':
- In the current implementation, whenever an author appears in multiple software metadata projects (or even multiple times in the same one), there is a high risk of properties getting conflated if they are not consistently named. The most notable one is 'affiliation': if an author has different affiliations at various points (or even the same one, but not consistently named), then these will all be propagated to all instances when the full graph of multiple software projects is loaded.
- Related to the above: 'affiliation' is a property of a schema:Person. But that means it is no longer attached to any specific software project, meaning we can't differentiate between affiliations at the time of the software project or later/before. We'd always get all of them, which may be less informative than desired. It's common for people to have (had) multiple affiliations throughout their career. We do use schema:producer to tie software projects to institutions directly, so at least that is expressible (relates to codemeta/codemeta#286).
- We already ascertained that automatically going from names or e-mails to ORCIDs is hard. We probably need a custom mapping as input (like a tsv file).
- The reverse, going from ORCIDs to all the names/emails/urls, is fairly easy: we can just query orcid.org and request application/ld+json to get a schema.org representation that is compatible with codemeta. Some caveats there:
  * It does not contain the e-mail, even if it is public. The turtle output, however, does (it uses a completely different vocabulary than the JSON-LD serialisation).
  * The JSON-LD output lists all affiliations it knows (including those that have ended, but that information is not outputted). The turtle output lists no affiliations at all.
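The ORCID-to-metadata direction from the last point could look roughly like this, using only the standard library. The function names are hypothetical; the sketch relies on orcid.org honouring the `Accept: application/ld+json` header as described above, and leaves the affiliation caveats untouched.

```python
import json
import urllib.request


def orcid_jsonld_request(orcid: str) -> urllib.request.Request:
    """Prepare a request for the JSON-LD (schema.org) view of an ORCID record.

    Accepts either a bare identifier or a full https://orcid.org/ URI.
    """
    url = orcid if orcid.startswith("https://orcid.org/") else "https://orcid.org/" + orcid
    return urllib.request.Request(url, headers={"Accept": "application/ld+json"})


def fetch_orcid_person(orcid: str, timeout: float = 10.0) -> dict:
    """Fetch and decode the record; needs network access."""
    with urllib.request.urlopen(orcid_jsonld_request(orcid), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Fictitious example identifier from ORCID's documentation.
    person = fetch_orcid_person("0000-0002-1825-0097")
    print(json.dumps(person, indent=2))
```

Since the result lists past affiliations as well, it is probably safer to take only names and URLs from it and leave 'affiliation' to the project's own metadata.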
Possibly relevant: ORCID profiles can be tied to GitHub accounts. If the GitHub API exposes this, it provides a nice way to find ORCIDs.
See https://scicomm.xyz/@ORCID_Org/112282433046701907