science-on-schema.org recommendation for indicating authoritative copy of dataset

Many datasets are present in multiple catalogs, including those of the original provider, the current host of the dataset, and multiple aggregator sites that might maintain landing pages (e.g., DataONE, data.gov, Cinergi). Aggregators like Google Dataset Search harvest these entries from their multiple landing pages and show where the dataset might be accessed in their listing. However, there is no indication of which of the sites maintains the authoritative copy of the dataset. For example, here's a view that shows three locations, but shows the DataCite logo even though Arctic Data Center is the authoritative holder of these data.

While our dataset schema.org entry can specify includedInDataCatalog, that doesn't indicate which is the authoritative catalog/repository for the dataset. There is also some ambiguity over what the meaning of publisher is for these entries when the same data set can be published by multiple organizations. I'm also unclear which fields are used to generate the "Dataset provided by" display on Google Dataset Search, which sometimes lists one of the locations and sometimes lists several. In the example above, only two of the three replica locations are shown. I suggest that we need a specific field that indicates authoritativePublisher or authoritativeRepository, unless there is an existing term that plays that role. What is our recommendation for this concept?

Nov 20 '19 11:11 mbjones

This is a good topic to bring up.

> There is also some ambiguity over what the meaning of publisher is for these entries when the same data set can be published by multiple organizations.

There are tricky semantics here, especially given Schema.org's definition is pretty slim:

http://schema.org/publisher: The publisher of the creative work.

But as a note, DataCite does a nice job in their JSON-LD (which Google seems to actually ignore here). In the JSON-LD for your example, available at https://api.datacite.org/dois/application/vnd.schemaorg.ld+json/10.18739/a2dz03215, I see:

  "publisher": {
    "@type": "Organization",
    "name": "Arctic Data Center"
  },
  "provider": {
    "@type": "Organization",
    "name": "DataCite"
  }

So maybe publisher is a good fit for describing the authority and provider can be used for any copies?
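To make that reading concrete, here is a minimal Python sketch, assuming the convention floated above (publisher names the authority, provider names a holder of a copy). The function name `authoritative_name` is made up for illustration, and nothing here is an agreed recommendation.

```python
import json

# Dataset record trimmed to the two properties under discussion.
# Assumed convention (not settled): "publisher" = authoritative repository,
# "provider" = an organization holding or serving a copy.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "publisher": {"@type": "Organization", "name": "Arctic Data Center"},
    "provider": {"@type": "Organization", "name": "DataCite"},
}


def authoritative_name(record):
    """Return the publisher's name, read here as the authoritative source."""
    publisher = record.get("publisher")
    if isinstance(publisher, dict):
        return publisher.get("name")
    return publisher  # a plain string, or None when the property is absent


if __name__ == "__main__":
    print(json.dumps(dataset, indent=2))
    print(authoritative_name(dataset))
```

A harvester that adopted this convention could prefer the publisher when choosing which organization to display, though nothing in schema.org itself requires that.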

Nov 26 '19 04:11 amoeba

The trick is binding the URL for accessing the resource to the publisher (authoritative, if we decide to make that the convention) or provider. This could be done by putting the information in distribution/DataDownload/publisher or distribution/DataDownload/provider (different distributions) to indicate 'authoritative' and 'alternate' sources.
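A rough Python sketch of that idea follows, assuming one DataDownload per access URL, with publisher marking the authoritative source and provider marking an alternate one. The contentUrl values and the helper name `split_sources` are hypothetical, and the choice of DataONE as the alternate holder is only for illustration.

```python
# Each distribution carries its own organization, so the URL and the
# authoritative/alternate role are bound together in one object.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "distribution": [
        {
            "@type": "DataDownload",
            "contentUrl": "https://repository.example.org/data.csv",  # hypothetical URL
            "publisher": {"@type": "Organization", "name": "Arctic Data Center"},
        },
        {
            "@type": "DataDownload",
            "contentUrl": "https://mirror.example.org/data.csv",  # hypothetical URL
            "provider": {"@type": "Organization", "name": "DataONE"},
        },
    ],
}


def split_sources(record):
    """Return (authoritative_urls, alternate_urls) under the assumed convention."""
    distributions = record.get("distribution", [])
    if isinstance(distributions, dict):  # a single distribution given without a list
        distributions = [distributions]
    authoritative, alternate = [], []
    for dist in distributions:
        url = dist.get("contentUrl")
        if url is None:
            continue
        if "publisher" in dist:
            authoritative.append(url)
        elif "provider" in dist:
            alternate.append(url)
    return authoritative, alternate


if __name__ == "__main__":
    print(split_sources(dataset))
```

Both publisher and provider are defined on CreativeWork, so they are available on DataDownload; whether consumers would interpret them this way is an open question.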

Nov 26 '19 17:11 smrgeoinfo

see related BoF: https://docs.google.com/document/d/17hrcLpxcAA3_U3MZ3sWHrbaeNVa1k8yaigBzbjmxFHk/edit

Dec 19 '19 20:12 ashepherd

@ashepherd invite ESIP Data Citation WG to weigh in on this issue.

Apr 06 '20 21:04 ashepherd

As a data point, a paper on how Google Dataset Search handles the publisher field provides the following:

Providers: There is some ambiguity in schema.org on how to specify the source of a dataset. We use the so#publisher and so#provider properties to identify the organization that provided the dataset. As with other properties, the value may be a string or an Organization object. Wherever possible, we reconcile the organization to the corresponding entity in the Google Knowledge Graph.

See:

@inproceedings{49385,
  title = {Google Dataset Search by the Numbers},
  author = {Omar Benjelloun and Shiyu Chen and Natasha Noy},
  year = {2020},
  URL = {https://arxiv.org/abs/2006.06894},
  booktitle = {International Semantic Web Conference (ISWC-2020), In-Use Track}
}
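The quoted passage says the publisher and provider values may each be a string or an Organization object. The Python sketch below shows one way a harvester might normalize both forms into a list of names. It is not Google's implementation; the function name `organization_names` is invented here, and the Knowledge Graph reconciliation step is left out entirely.

```python
def organization_names(record):
    """Collect organization names from publisher and provider, in that order.

    Accepts a plain string, an Organization-style dict with a "name" key,
    or a list mixing the two. Duplicate names are dropped.
    """
    names = []
    for prop in ("publisher", "provider"):
        value = record.get(prop)
        if value is None:
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, str):
                name = item
            elif isinstance(item, dict):
                name = item.get("name")
            else:
                name = None
            if name and name not in names:
                names.append(name)
    return names


if __name__ == "__main__":
    example = {
        "publisher": "Arctic Data Center",
        "provider": {"@type": "Organization", "name": "DataCite"},
    }
    print(organization_names(example))
```

Treating the two properties as one pool, as this sketch does, would explain why a listing can show a copy holder alongside or instead of the authoritative repository.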

Nov 04 '20 23:11 mbjones

The display of the original dataset that I used as an example has now changed at Google Dataset Search, and it correctly lists the Arctic Data Center as the provider. We didn't change our schema.org metadata, so maybe something changed on the Google harvesting end.

May 06 '21 02:05 mbjones

TO-DOs (ESIP Summer Meeting)

Clarify the meaning of the publisher and provider terms, then look at use cases where two different copies of the data are actually the same dataset and how to represent that in schema.org markup
Review paper: https://datascience.codata.org/articles/10.5334/dsj-2021-012/
In the guidelines, make clear the difference between copies of the same dataset and derived datasets, which need the provenance relationships for derivation

Jul 23 '21 20:07 ashepherd

Notes from 8/26/2021 meeting:

Proposing we not delay v1.3 for this issue
[Doug, Nick] Is this just the cited identifier minted by the original publisher?
Doug will try to talk to N. Noy about how she approaches this
[Adam, Nick, Doug, Chantelle] Think that publisher and provider are adequate from the perspective of the schema.org document maker for what they might possibly know about a dataset. Publisher and provider provide adequate credit/attribution to curators and service provider(s) known to this schema.org document maker. It's not clear how this will be used.
[Doug, Adam] Google is making this assertion based on how they interpret their harvested graph. Maybe this is the responsibility of the harvester, based on the information they have.
OceanInfoHub harvesting strategy is to use social constructs within the community to govern issues of attribution/credit.

Aug 28 '21 01:08 ashepherd