idr-metadata
Study publication: metadata unification
Status
The gallery UI work carried out in `prod67` (see https://github.com/openmicroscopy/design/issues/100 and image.sc post) also drove the re-annotation of published IDR studies. In particular, the `Study Type` and `Study Public Release Date` metadata fields were reviewed across all studies and a new `Sample Type` field was added to classify each study as `cell` or `tissue`.
Metadata that was discussed but not fixed/rationalized in `prod67` was the `Publication Authors`. At the moment, we support different naming schemes and downstream consumers like the gallery UI need to handle these variants.
Proposal
All IDR studies with an associated peer-reviewed publication have a PubMed ID. A natural proposal would be to unify the author naming scheme to comply with what PubMed stores.
To minimize the impact on submitters, templates should be updated with the recommended formatting for `Study Author List` values as `LastName1 Initials1, LastName2 Initials2,...`. The author list should be stored as a comma-separated list of authors, e.g.
Walther N, Hossain MJ, Politi AZ, Koch B, Kueblbeck M, Ødegård-Fougner Ø, Lampe M, Ellenberg J
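As a rough illustration of the recommended format, the check below is a minimal sketch of how a submitted value could be tested against the `LastName Initials` pattern. The function name `check_author_list` and the regular expression are assumptions made for illustration and are not part of the IDR tooling. In particular, the pattern assumes that initials are uppercase letters and that last names may contain spaces, hyphens and apostrophes.

```python
import re

# Hypothetical pattern: a last name made of letters (optionally joined by
# spaces, hyphens or apostrophes), one space, then a run of letters that is
# later required to be uppercase (the initials).
AUTHOR_PATTERN = re.compile(
    r"^(?P<last>[^\W\d_]+(?:[ '\-][^\W\d_]+)*) (?P<initials>[^\W\d_]+)$"
)


def check_author_list(value):
    """Return the entries of a comma-separated author list that do not match."""
    invalid = []
    for author in value.split(","):
        author = author.strip()
        match = AUTHOR_PATTERN.match(author)
        # Reject entries with no match or with non-uppercase "initials"
        if match is None or not match.group("initials").isupper():
            invalid.append(author)
    return invalid


example = ("Walther N, Hossain MJ, Politi AZ, Koch B, Kueblbeck M, "
           "Ødegård-Fougner Ø, Lampe M, Ellenberg J")
print(check_author_list(example))
print(check_author_list("Walther N, Nike Walther"))
```

A check like this could only flag formatting problems; whether the names agree with the PubMed record is a separate question.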
Validation
The NCBI API can be used for validating a lot of the publication metadata (title, authors, PMC and DOI if applicable) given a PubMed ID:
+import requests
+
+    def validate_publications(self):
+        # NCBI E-utilities ESummary endpoint, queried once per PubMed ID
+        URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
+        QUERY = "?db=pubmed&id=%s&retmode=json"
+        articleids_map = {"pmc": "PMC ID", "doi": "DOI"}
+
+        for publication in self.study["Publications"]:
+            if "PubMed ID" not in publication:
+                continue
+            response = requests.get(URL + QUERY % publication["PubMed ID"])
+            result = response.json()['result'][publication["PubMed ID"]]
+
+            self.log.debug("Validating publication title")
+            assert publication["Title"] == result['title'], "%s != %s" % (
+                publication["Title"], result['title'])
+
+            self.log.debug("Validating publication author")
+            # Assumes the study stores authors under "Author List" and that
+            # the ESummary record lists them as result['authors'][i]['name']
+            authors = ", ".join(a['name'] for a in result['authors'])
+            assert publication["Author List"] == authors, "%s != %s" % (
+                publication["Author List"], authors)
+
+            # Validate PMC ID and DOI if present
+            for articleid in result['articleids']:
+                if articleid['idtype'] in articleids_map:
+                    study_key = articleids_map[articleid['idtype']]
+                    self.log.debug("Validating %s" % study_key)
+                    assert publication[study_key] == articleid['value'], (
+                        "%s != %s" % (
+                            publication[study_key], articleid['value']))
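An author comparison could follow the same pattern as the title check. The sketch below is a hedged illustration that works on an already-decoded ESummary record, so it runs without network access. It assumes the JSON record exposes an `authors` list whose entries carry a `name` field in the `LastName Initials` form. The helper names `pubmed_author_list` and `authors_match` are hypothetical, and the sample record is hand-written rather than real API output.

```python
def pubmed_author_list(result):
    """Join the author names of one ESummary record into a single string."""
    return ", ".join(author["name"] for author in result.get("authors", []))


def authors_match(study_authors, result):
    """Compare a stored author list with the record, ignoring spacing."""
    stored = [a.strip() for a in study_authors.split(",")]
    expected = [a["name"].strip() for a in result.get("authors", [])]
    return stored == expected


# Hand-written sample record for illustration only
sample = {"authors": [{"name": "Walther N"}, {"name": "Hossain MJ"}]}
print(pubmed_author_list(sample))
print(authors_match("Walther N,Hossain MJ", sample))
print(authors_match("Nike Walther, M. Julius Hossain", sample))
```

Comparing name by name rather than as one string makes the check tolerant of spacing differences around the commas while still catching a different naming scheme.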
Database and UI representation
At the moment, publications are included in the `idr.openmicroscopy.org/study/info` annotation as an ordered list of key/value pairs (Title, Authors, PubMed ID, PMC ID if applicable, DOI if applicable), one per publication.

In order for the gallery or any downstream application to consume this metadata effectively, we might need to rethink how to store and expose the publication metadata:
- should authors be listed as one key/value pair with comma-separated authors or one key/value pair per author?
- should publications be moved to their own map annotation with an `idr.openmicroscopy.org/study/publication` namespace? Should multiple publications be combined or kept as separate map annotations?
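To make the first question concrete, the sketch below builds both candidate layouts from the same authors. Modelling a map annotation as a list of `(key, value)` tuples and using `Publication Authors` as the key name are assumptions made for illustration; the function names are hypothetical.

```python
AUTHORS_KEY = "Publication Authors"  # assumed key name


def authors_as_single_pair(authors):
    """One key/value pair holding a comma-separated author list."""
    return [(AUTHORS_KEY, ", ".join(authors))]


def authors_as_pair_per_author(authors):
    """One key/value pair per author, repeating the same key."""
    return [(AUTHORS_KEY, author) for author in authors]


authors = ["Walther N", "Hossain MJ", "Politi AZ"]
print(authors_as_single_pair(authors))
print(authors_as_pair_per_author(authors))
```

With one pair per author, a consumer can match a value exactly without splitting strings. With a single pair, the authors of one publication stay grouped in a single value, which may matter when several publications share one annotation.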
If a PubMed ID is supplied, could we dispense with a lot of the other related metadata and pull it out automatically using the PubMed API?
For authors, I think either:
- multiple K-V pairs, one K-V per author
- one MapAnn per author containing separate additional fields e.g. ORCID for that author.
If a PubMed ID is supplied, I would minimally update the parser to ensure the metadata is consistent with the PubMed API. I am less sure about dispensing with the other metadata, though, especially as most studies are submitted before peer-reviewed acceptance anyway.
The main problem I see with one map annotation per author is the case of studies with multiple publications (like the one above) as you lose the author/publication relationship.
> The main problem I see with one map annotation per author is the case of studies with multiple publications (like the one above) as you lose the author/publication relationship.

True, but the purpose of the IDR is to publish datasets, not publications. I think it's reasonable to say that the reason for including individual authors is so you can look up a dataset associated with them. I can't think of a good use case where someone would want to go author ⇔ publication, as opposed to author ⇔ dataset / publication ⇔ dataset, in the IDR.
Extensively discussed the relationship between study and authors this morning with @jburel @jrswedlow @francesw @dominikl @pwalczysko and @will-moore . Below is a summary of the current IDR model:
- each study is associated with an arbitrary number of publications. The majority of published studies have one associated publication but some have none (`idr0018`) and others have many (`idr0004`, `idr0016`)
- our metadata templates/study files include a `Study Authors List` concept. So far, this metadata field has been (mis?)used to capture the authors for each publication associated with the studies as tab-separated author lists
- the study file includes the concept of a `Study Copyright` associated with the licensing of the dataset (usually CC-BY)
- the UI representation currently allows searching for all publication authors including potential duplicates e.g. https://github.com/IDR/idr-metadata/pull/380#discussion_r314968526
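The tab-separated storage and the duplicate issue described in this summary can be illustrated with a short sketch. It assumes one comma-separated author list per publication, with the lists separated by tabs. The function names and the sample value are hypothetical.

```python
def split_publication_authors(value):
    """Return one list of author names per publication."""
    return [
        [name.strip() for name in publication.split(",") if name.strip()]
        for publication in value.split("\t")
    ]


def unique_authors(value):
    """Flatten all publications, dropping repeats but keeping first-seen order."""
    seen = {}
    for authors in split_publication_authors(value):
        for name in authors:
            seen.setdefault(name, None)
    return list(seen)


# Two publications that share one author (made-up sample value)
value = "Walther N, Ellenberg J\tPoliti AZ, Ellenberg J"
print(split_publication_authors(value))
print(unique_authors(value))
```

Deduplicating in this way only works if every publication spells an author identically, which is one practical argument for a unified naming scheme.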
From the discussion, there is general agreement on the value of modelling, capturing and representing the concept of Study Authors. In a large majority of the studies, this might be similar to the authors of the associated publication but this needs more design. A few immediate questions:
- study file specification: support both `Study Authors List` and `Study Publication Authors List`? Optional or mandatory? Upgrade path for all studies if we decide to introduce a new field?
- key/value pair representation: add another key/value to the map annotation? Replace publication authors? Create separate key/value pairs per author?
- searching ability: allow searching by `Study Authors` as well as `Publication Authors`?