idr-metadata Study publication: metadata unification

Status

The gallery UI work carried out in prod67 (see https://github.com/openmicroscopy/design/issues/100 and the image.sc post) also drove the re-annotation of published IDR studies. In particular, the Study Type and Study Public Release Date metadata fields were reviewed across all studies, and a new Sample Type field was added to classify each study as cell or tissue.

The Publication Authors metadata was discussed but not fixed or rationalized in prod67. At the moment, we support different naming schemes, and downstream consumers like the gallery UI need to handle these variants.

Proposal

All IDR studies with an associated peer-reviewed publication have a PubMed ID. A natural proposal would be to unify the author naming scheme to comply with what PubMed stores.

To minimize the impact on submitters, templates should be updated with the recommended formatting for Study Author List values: LastName1 Initials1, LastName2 Initials2, .... The author list should be stored as a comma-separated list of authors, e.g.

Walther N, Hossain MJ, Politi AZ, Koch B, Kueblbeck M, Ødegård-Fougner Ø, Lampe M, Ellenberg J
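
A minimal sketch of how a submitted author list could be checked against the recommended format. The helper names (`split_authors`, `is_pubmed_style`, `invalid_authors`) are hypothetical, and the rule used here (one or more name words followed by a block of uppercase initials) is an assumption; it does not cover every form PubMed may store, such as name suffixes.

```python
def split_authors(author_list):
    """Split a comma-separated author list into stripped author strings."""
    return [a.strip() for a in author_list.split(",") if a.strip()]


def is_pubmed_style(author):
    """Return True if `author` looks like 'LastName Initials'."""
    parts = author.split()
    if len(parts) < 2:
        return False
    *name_words, initials = parts
    # Initials: letters only, all uppercase (str methods are Unicode-aware)
    if not (initials.isalpha() and initials.isupper()):
        return False
    # Name words: letters, optionally joined by hyphens or apostrophes
    return all(
        w.replace("-", "").replace("'", "").isalpha() for w in name_words
    )


def invalid_authors(author_list):
    """Return the authors that do not match the assumed format."""
    return [a for a in split_authors(author_list) if not is_pubmed_style(a)]
```

Calling `invalid_authors` on the example list above would return an empty list, while an entry such as `Jan Ellenberg` would be flagged because its last token is not a block of uppercase initials.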

Validation

The NCBI E-utilities API (esummary) can be used to validate much of the publication metadata (title, authors, PMC ID and DOI if applicable) given a PubMed ID:

+    def validate_publications(self):
+       URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
+       QUERY = "?db=pubmed&id=%s&retmode=json" 
+
+       for publication in self.study["Publications"]:
+           if "PubMed ID" not in publication:
+               continue
+           # Use a name that does not shadow the json module
+           pubmed_id = publication["PubMed ID"]
+           summary = requests.get(URL + QUERY % pubmed_id).json()
+           result = summary['result'][pubmed_id]
+
+           self.log.debug("Validating publication title")
+           assert publication["Title"] == result['title'], "%s != %s" % (
+               publication["Title"], result['title'])
+
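+           self.log.debug("Validating publication authors")
+           # esummary lists authors as {"name": ...} entries; the "Authors"
+           # key of the study publication is an assumption
+           authors = ", ".join(a['name'] for a in result['authors'])
+           assert publication["Authors"] == authors, "%s != %s" % (
+               publication["Authors"], authors)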
+
+           # Validate PMC ID and DOI if present
+           for articleid in result['articleids']:
+               articleids_map = {"pmc": "PMC ID", 'doi': "DOI"}
+               if articleid['idtype'] in articleids_map.keys():
+                   study_key = articleids_map[articleid['idtype']]
+                   self.log.debug("Validating %s" % study_key)
+                   assert publication[study_key] == articleid['value'], (
+                       "%s != %s" % (
+                       publication[study_key], articleid['value']))
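
As an alternative to asserting on the first mismatch, the comparison could collect every differing field so that a study can be reported on in one pass. The sketch below is only an illustration under stated assumptions: `compare_publication` is a hypothetical helper, the `Authors` key is assumed to hold the comma-separated list proposed in this issue, and `result` is the per-PubMed-ID entry of the esummary JSON response, as used in the snippet above.

```python
# Mapping from esummary article id types to study publication keys
ARTICLE_IDS = {"pmc": "PMC ID", "doi": "DOI"}


def compare_publication(publication, result):
    """Return {field: (study value, PubMed value)} for each mismatch."""
    expected = {
        "Title": result["title"],
        # Assumed layout: a list of entries each carrying a "name"
        "Authors": ", ".join(a["name"] for a in result["authors"]),
    }
    for articleid in result["articleids"]:
        key = ARTICLE_IDS.get(articleid["idtype"])
        if key is not None:
            expected[key] = articleid["value"]

    mismatches = {}
    for key, value in expected.items():
        # A key missing from the study is reported with None as its value
        if publication.get(key) != value:
            mismatches[key] = (publication.get(key), value)
    return mismatches
```

An empty dictionary means the study metadata agrees with PubMed for all compared fields; anything else could be logged or turned into a validation error.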

Database and UI representation

At the moment, publications are included in the idr.openmicroscopy.org/study/info annotation as an ordered list of key/value pairs (Title, Authors, PubMed ID, PMC ID if applicable, DOI if applicable), one per publication:

In order for the gallery or any downstream application to consume this metadata effectively, we might need to rethink how to store and expose the publication metadata:

Should authors be listed as one key/value pair with comma-separated authors, or as one key/value pair per author?
Should publications be moved to their own map annotation with an idr.openmicroscopy.org/study/publication namespace? Should multiple publications be combined or stored as separate map annotations?
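
To make the first question concrete, the two candidate layouts for authors can be sketched as plain lists of key/value pairs. The helper names and the default `Publication Authors` key are assumptions chosen for illustration only; no OMERO API is used here.

```python
def authors_as_single_pair(authors, key="Publication Authors"):
    """One key/value pair holding all authors, comma-separated."""
    return [[key, ", ".join(authors)]]


def authors_as_pairs(authors, key="Publication Authors"):
    """One key/value pair per author, all sharing the same key."""
    return [[key, author] for author in authors]
```

With one pair per author, a consumer can match exactly on the value; with a single pair, it has to split the value on commas or fall back to substring matching.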

Jun 18 '19 13:06 sbesson

If a PubMed ID is supplied, could we dispense with a lot of the other related metadata and pull it out automatically using the PubMed API?

For authors I think either:

multiple K-V pairs, one K-V per author
one MapAnn per author containing separate additional fields, e.g. ORCID, for that author.

Jun 18 '19 13:06 manics

If a PubMed ID is supplied, I would minimally update the parser to ensure the metadata is consistent with the PubMed API. I am unsure about dispensing with it though, especially as most studies are submitted prior to peer-reviewed acceptance anyway.

The main problem I see with one map annotation per author is the case of studies with multiple publications (like the one above) as you lose the author/publication relationship.

Jun 18 '19 14:06 sbesson

The main problem I see with one map annotation per author is the case of studies with multiple publications (like the one above) as you lose the author/publication relationship.

True, but the purpose of the IDR is to publish datasets, not publications. I think it's reasonable to say that the reason for including individual authors is so you can look up a dataset associated with them. I can't think of a good use case where someone would want to go author ⇔ publication, as opposed to author ⇔ dataset / publication ⇔ dataset, in the IDR.

Jun 18 '19 14:06 manics

Extensively discussed the relationship between study and authors this morning with @jburel @jrswedlow @francesw @dominikl @pwalczysko and @will-moore. Below is a summary of the current IDR model:

each study is associated with an arbitrary number of publications. The majority of published studies have one associated publication, but some have none (idr0018) and others have several (idr0004, idr0016)
our metadata templates/study files include a Study Authors List concept. So far, this metadata field has been (mis?)used to capture the authors of each publication associated with the study as tab-separated author lists
the study file includes the concept of a Study Copyright associated with the licensing of the dataset (usually CC-BY)
the UI representation currently allows searching for all publication authors, including potential duplicates, e.g. https://github.com/IDR/idr-metadata/pull/380#discussion_r314968526
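
The duplicate issue in the last point can be illustrated with a short sketch that collects the distinct authors across all publications of a study. It assumes each publication carries a comma-separated `Authors` string as proposed in this issue; `unique_authors` is a hypothetical helper, and it only removes exact-match duplicates, not spelling variants of the same name.

```python
def unique_authors(publications):
    """Return distinct authors across publications, in first-seen order."""
    seen = set()
    ordered = []
    for publication in publications:
        # "Authors" is an assumed key; missing values are treated as empty
        for author in publication.get("Authors", "").split(","):
            author = author.strip()
            if author and author not in seen:
                seen.add(author)
                ordered.append(author)
    return ordered
```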

From the discussion, there is general agreement on the value of modelling, capturing and representing the concept of Study Authors. In a large majority of the studies, these might be the same as the authors of the associated publication, but this needs more design. A few immediate questions:

study file specification: support both Study Authors List and Study Publication Authors List? optional or mandatory? upgrade path for all studies if we decide to introduce a new field?
key/value pair representation - add another key/value to the map annotation? replace publication authors? create separate key/value pairs per author?
searching ability: allow searching by Study Authors as well as Publication Authors?

Aug 19 '19 14:08 sbesson
