dataverse
Improve/update Schema.org JSON-LD export
In a meeting with folks from the FAIRsFAIR group (namely @kitchenprinzessin3880) who are building and testing tools to assess the "FAIRness" of datasets in Dataverse repositories (https://www.fairsfair.eu/fairsfair-data-object-assessment-metrics-request-comments), some changes were recommended for the metadata that Dataverse includes in the Schema.org JSON-LD metadata it exports for datasets. I said I'd open a GitHub issue so we could record and explain these changes.
For the license property, use the @type "CreativeWork" and use "name" instead of "text":
As of Dataverse 5.1.1, the @type for the Schema.org property "license" is "Dataset". Here's an example of what that looks like:
license: {
  @type: "Dataset",
  text: "CC0",
  url: "https://creativecommons.org/publicdomain/zero/1.0/"
}
or if CC0 is waived:
license: {
  @type: "Dataset",
  text: "Text the depositor entered in the Terms of Use field"
}
Google's guide for describing datasets with Schema.org says to use the "CreativeWork" @type for license and to use "name".
Here's an example of what the license metadata in the Schema.org export might look like when a pull request for this issue is merged (after the "multiple license" work described at https://github.com/IQSS/dataverse/issues/7440 and https://github.com/IQSS/dataverse/issues/7742 is also merged):
If the dataset depositor chooses a license from the list of licenses:
license: {
  @type: "CreativeWork",
  name: "CC0",
  url: "https://creativecommons.org/publicdomain/zero/1.0/"
}

license: {
  @type: "CreativeWork",
  name: "CC BY",
  url: "https://creativecommons.org/licenses/by/4.0/"
}
Or if no license is chosen and a custom license is entered:
license: {
  @type: "CreativeWork",
  name: "Text entered in the 'Dataset Terms' fields"
}
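The license mapping described above could be sketched roughly as follows. This is only an illustration, not Dataverse's export code: the function name `build_license` and its parameters are hypothetical, and the sketch uses the Schema.org type name "CreativeWork".

```python
from typing import Optional


def build_license(name: Optional[str] = None,
                  url: Optional[str] = None,
                  custom_terms: Optional[str] = None) -> dict:
    """Build a Schema.org "license" object typed as CreativeWork.

    Hypothetical helper: `name`/`url` stand for a license chosen from the
    list of licenses, `custom_terms` for text entered by the depositor.
    """
    if name:
        license_obj = {"@type": "CreativeWork", "name": name}
        if url:
            license_obj["url"] = url
        return license_obj
    if custom_terms:
        # No predefined license: the custom text goes in "name", without a URL.
        return {"@type": "CreativeWork", "name": custom_terms}
    raise ValueError("either a license name or custom terms is required")


cc0 = build_license("CC0", "https://creativecommons.org/publicdomain/zero/1.0/")
custom = build_license(custom_terms="Text entered in the 'Dataset Terms' fields")
```

Here `cc0` carries `@type`, `name`, and `url`, while `custom` carries only `@type` and `name`, matching the two cases in the examples above.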
For files (in the "distribution" property): As of Dataverse 5.1.1, here's an example of what the file metadata in the Schema.org export looks like:
distribution: [
  {
    @type: "DataDownload",
    name: "cases_by_infection.tab",
    fileFormat: "text/tab-separated-values",
    contentSize: 56377,
    description: "",
    @id: "https://doi.org/10.70122/FK2/LQKU61/SCM19X",
    identifier: "https://doi.org/10.70122/FK2/LQKU61/SCM19X",
    contentUrl: "https://demo.dataverse.org/api/access/datafile/1653135"
  },
  {
    @type: "DataDownload",
    name: "DatasetDiagram.png",
    fileFormat: "image/png",
    contentSize: 84006,
    description: "",
    @id: "https://doi.org/10.70122/FK2/LQKU61/SBJENW",
    identifier: "https://doi.org/10.70122/FK2/LQKU61/SBJENW"
  }
]
Here are the changes related to file metadata being proposed in this GitHub issue:
- Use "encodingFormat" instead of "fileFormat": Google's guide for describing datasets with Schema.org says to use the property "encodingFormat" (it doesn't mention the "fileFormat" property).
- contentUrl should always be added: As of Dataverse 5.1.1, Dataverse puts each file's "download URL" in Schema.org's contentUrl property only when the file isn't restricted and its dataset has no guestbook or Terms of Use metadata. (See details about the current logic at https://github.com/IQSS/dataverse/issues/4371#issuecomment-436762935)
  Instead, Dataverse should always include every file's "download URL" in Schema.org's contentUrl property. Then if the file is restricted or its dataset has a guestbook or Terms of Access metadata, the download URL will return the same access-restricted error that it returns now.
- Add conditionsOfAccess to declare that a file is open or restricted: @kitchenprinzessin3880 pointed to two vocabularies whose terms we might consider using as values for conditionsOfAccess, to indicate how accessible the file is: https://guidelines.openaire.eu/en/latest/literature/field_accesslevel.html and http://vocabularies.coar-repositories.org/documentation/access_rights.
  Each vocabulary defines four terms. I've written in https://github.com/IQSS/dataverse/issues/5920 about current problems Dataverse has with using the Access Rights terms from the info:eu-repo namespace, so I'm hesitant to use those terms. To put it briefly, Dataverse has files that are restricted using Dataverse's file restriction feature and have the "File Request" feature disabled, but whose depositors use a process outside of Dataverse to manage access to the file. So the file is restricted, not "closedAccess," even though people aren't able to request access to the file through Dataverse's "File Request" feature. Most of the datasets in Harvard Dataverse's Murray collections are like this (e.g. there's a process outside of the Dataverse software for requesting access to restricted files in https://doi.org/10.7910/DVN/0PMZC6). Maybe we can discuss that in this issue.
Here's an example of what the file metadata in the Schema.org export might look like when a pull request for this issue is merged:
distribution: [
  {
    @type: "DataDownload",
    name: "cases_by_infection.tab",
    encodingFormat: "text/tab-separated-values",
    contentSize: 56377,
    description: "",
    @id: "https://doi.org/10.70122/FK2/LQKU61/SCM19X",
    identifier: "https://doi.org/10.70122/FK2/LQKU61/SCM19X",
    contentUrl: "https://demo.dataverse.org/api/access/datafile/1653135",
    conditionsOfAccess: (to be determined)
  },
  {
    @type: "DataDownload",
    name: "DatasetDiagram.png",
    encodingFormat: "image/png",
    contentSize: 84006,
    description: "",
    @id: "https://doi.org/10.70122/FK2/LQKU61/SBJENW",
    identifier: "https://doi.org/10.70122/FK2/LQKU61/SBJENW",
    contentUrl: "https://demo.dataverse.org/api/access/datafile/26",
    conditionsOfAccess: (to be determined)
  }
]
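To make the first two proposed file-level changes concrete, here is a minimal Python sketch that rewrites one "distribution" entry from the current shape into the proposed one. The function name `upgrade_data_download` and its `download_url` parameter are hypothetical, and this is not how Dataverse builds its export. The conditionsOfAccess value is deliberately left out because it is still to be determined.

```python
def upgrade_data_download(entry: dict, download_url: str) -> dict:
    """Return a copy of a DataDownload entry in the proposed shape.

    Hypothetical helper: renames "fileFormat" to "encodingFormat" and always
    sets "contentUrl", even for files that are restricted.
    """
    updated = dict(entry)  # shallow copy so the caller's entry is untouched
    if "fileFormat" in updated:
        updated["encodingFormat"] = updated.pop("fileFormat")
    # Always include the download URL; restricted files would simply answer
    # with the access-restricted error when the URL is requested.
    updated["contentUrl"] = download_url
    return updated


old_entry = {
    "@type": "DataDownload",
    "name": "DatasetDiagram.png",
    "fileFormat": "image/png",
    "contentSize": 84006,
    "@id": "https://doi.org/10.70122/FK2/LQKU61/SBJENW",
    "identifier": "https://doi.org/10.70122/FK2/LQKU61/SBJENW",
}
new_entry = upgrade_data_download(
    old_entry, "https://demo.dataverse.org/api/access/datafile/26"
)
```

After the call, `new_entry` has "encodingFormat" and "contentUrl" and no "fileFormat", while `old_entry` is unchanged.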
When adding conditionsOfAccess to the file metadata, should the values be binary, e.g. open and closed, like the following?
- open (or a term like it) means there are no barriers to programmatic access to the file, e.g. the contentUrl works
- closed (or a term like it) means there are barriers to programmatic access to the file, e.g. the contentUrl does not work
@jggautier if you plan to use binary, maybe this property is more appropriate? https://schema.org/isAccessibleForFree
That makes sense to me! I think that if we use that property this way, since you've been following this issue closely, you (and the tools you're helping develop) will know what it means for a file to be "isAccessibleForFree". Hopefully others who need to use this metadata will also be able to figure out how it's being used.
The Google Research group writes on page 3 of their "Google Dataset Search by the Numbers" article that the property "is a boolean value that indicates whether or not the dataset requires a payment", but then they describe how Google Dataset Search interprets a True value to mean "open" and similar to any of the "Creative Commons and open government licenses". So I think it's fair to expect that their interpretation, applied at the dataset level, should be applied at the file level, too, right? So it shouldn't be hard for others who need to use this metadata to figure out that a file flagged as "isAccessibleForFree" is open to some degree, although the exact degree (programmatic access to the file) might not be apparent by just looking at the metadata.
> So it shouldn't be hard for others who need to use this metadata to figure out that a file flagged as "isAccessibleForFree" is open to some degree, although the exact degree (programmatic access to the file) might not be apparent by just looking at the metadata.
What is the @type at the file level? As long as it is a subtype of CreativeWork, the property can be applied.
btw, the tool accepts both schema.org properties (isAccessibleForFree, conditionsOfAccess), either of which may be used to indicate the access-level metadata of a dataset.
Isn't "DataDownload" the @type at the file level? That's what's used in this issue's first comment. https://schema.org/DataDownload lists isAccessibleForFree, so I think it can be applied then?
I meant more that if I were looking to use the metadata to build a tool or query the repository and saw isAccessibleForFree: True (or False) in the datasets' Schema.org metadata, I wouldn't know exactly what that means. For example, you mentioned earlier that PANGAEA uses isAccessibleForFree and I can see it in the Schema.org metadata for this dataset, but to figure out what that means, I'd have to find information that's not present in the metadata itself. The page for that PANGAEA dataset says I need to be logged in to download the data, but PANGAEA says elsewhere that downloading most of their datasets' files doesn't require login, like the dataset at https://doi.pangaea.de/10.1594/PANGAEA.921541, whose Schema.org metadata has isAccessibleForFree: True. So now I'm thinking that isAccessibleForFree is True for PANGAEA datasets if I don't have to log in to download the data. But I can't determine this by just looking at the Schema.org metadata.
Does this make the metadata less FAIR? The definition of the isAccessibleForFree property doesn't define what "free" means. But maybe it's okay to expect people who need to programmatically determine a file's access level to do a little investigation into what "free" means in this context. Or, if it's already common practice to use isAccessibleForFree the way we've proposed (PANGAEA, and maybe other repositories, seem to be using it this way already), maybe it's okay to expect people to assume that when data repositories use isAccessibleForFree for data files, it means either that there are one or more barriers to accessing the file (isAccessibleForFree: False) or that there are no barriers (isAccessibleForFree: True).
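As a rough sketch of the binary interpretation discussed here, the flag could be derived from whether any barrier to downloading exists. Which conditions count as barriers is an assumption: the three used below (file restriction, guestbook, Terms of Access) are taken from the contentUrl discussion in this issue's first comment, and the function names and parameters are hypothetical.

```python
def is_accessible_for_free(restricted: bool,
                           has_guestbook: bool,
                           has_terms_of_access: bool) -> bool:
    """True only when nothing stands between a client and the file.

    Assumption: any one of the three conditions counts as a barrier.
    """
    return not (restricted or has_guestbook or has_terms_of_access)


def annotate_access(entry: dict, restricted: bool,
                    has_guestbook: bool, has_terms_of_access: bool) -> dict:
    """Return a copy of a DataDownload entry with isAccessibleForFree set."""
    annotated = dict(entry)
    annotated["isAccessibleForFree"] = is_accessible_for_free(
        restricted, has_guestbook, has_terms_of_access
    )
    return annotated


open_file = annotate_access({"@type": "DataDownload"}, False, False, False)
gated_file = annotate_access({"@type": "DataDownload"}, False, True, False)
```

Under these assumptions `open_file` is flagged True and `gated_file` is flagged False, but a consumer still can't tell from the flag alone which barrier applies.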
> Isn't "DataDownload" the @type at the file level? That's what's used in this issue's first comment. https://schema.org/DataDownload lists isAccessibleForFree, so I think it can be applied then?
yup, Thing > CreativeWork > MediaObject > DataDownload, so the property can be used with DataDownload.
> Does this make the metadata less FAIR? The definition of the isAccessibleForFree property doesn't define what free means. But maybe it's okay to expect people who need to programmatically determine a file's access level to do a little investigation into what free means in this context, or, if it's already common practice to use isAccessibleForFree the way we've proposed (PANGAEA, and maybe other repositories, seem to be using it this way already) it's okay to expect that people should assume that when data repositories use isAccessibleForFree for data files, that means either there are one or more barriers to accessing the file (false) or there are no barriers (true).
For PANGAEA, all public datasets are set with isAccessibleForFree = True; the remaining, restricted datasets (embargoed or requiring login) are set to False. In addition to this property, we also use the 'conditionsOfAccess' property to communicate the data access level. I agree that the property 'isAccessibleForFree' is loosely defined and mainly specified for general search, so it is not 100% applicable to scientific datasets. Let me check with other data repositories....
@ashepherd, can you please let us know how you specify the data access level at science-on-schema.org?
Speaking of science-on-schema.org, RDA's Research Metadata Schemas WG announced updated guidelines from the ESIP Schema.org cluster for using Schema.org to describe data. It's at https://github.com/ESIPFed/science-on-schema.org (and is summarized in the RDA WG's own report). Guidelines for describing datasets specifically are at https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/Dataset.md.
From a quick look, it seems the guide includes ways to add metadata that Dataverse isn't mapping to its Schema.org export, and it uses different elements and structures to include more metadata. When we tackle this issue (updating Dataverse's Schema.org export), I think we should learn how well these guidelines align with FAIRsFAIR's testing tools.
@jggautier I skimmed through the guidelines. The fields they recommend are already considered by F-UJI when evaluating a dataset, except for (1) catalog and (2) linking physical samples to a dataset. In any case, I will cross-check the schema.org mappings captured in the tool against the recommendations from ESIP. @huberrob
DataONE hosted a community call on "Science on Schema.org Guidelines and Experiences" (https://www.dataone.org/community-calls/soso/). Collaborative notes from the meeting are posted at https://github.com/DataONEorg/community-calls/blob/master/notes/20210401_call_notes.md.
The upcoming "multiple license" work (https://github.com/IQSS/dataverse/issues/7440, https://github.com/IQSS/dataverse/issues/7742) will change how license metadata is mapped to Dataverse repositories' Schema.org exports (as well as the other metadata exports), so I updated this issue's first comment to reflect those changes.
Just adding that the license's "@type" should be "CreativeWork", not "Dataset", based on our Rich Results Test.
https://support.google.com/webmasters/thread/146534613?hl=en&msgid=146553381#action=helpful
Some additional things that we're finding based on Google's validation:
- Creator is missing a type (and thus doesn't display on Google): https://github.com/IQSS/dataverse/issues/5029
- Description needs to be truncated at 5,000 characters
- Related publications need to have either a name or a URL
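Two of the findings above (the description length limit and related publications needing a name or a URL) lend themselves to a small post-processing sketch. This is only an illustration under stated assumptions: the helper name `tidy_for_google` is hypothetical, the sketch assumes the description is a single string and that related publications appear under a "citation" list, and dropping incomplete entries is just one possible way to handle them. The creator @type question is left out because it is tracked separately in #5029.

```python
MAX_DESCRIPTION_LENGTH = 5000  # limit reported by Google's validation


def tidy_for_google(dataset: dict) -> dict:
    """Return a copy of the dataset metadata adjusted for two of the findings.

    Assumptions: "description" is a plain string and related publications
    live in a "citation" list of objects; both key choices are hypothetical.
    """
    cleaned = dict(dataset)

    description = cleaned.get("description")
    if isinstance(description, str):
        # Slicing is a no-op for descriptions already within the limit.
        cleaned["description"] = description[:MAX_DESCRIPTION_LENGTH]

    citations = cleaned.get("citation")
    if isinstance(citations, list):
        # Keep only related publications that have a name or a URL.
        cleaned["citation"] = [
            c for c in citations
            if isinstance(c, dict) and (c.get("name") or c.get("url"))
        ]
    return cleaned


example = tidy_for_google({
    "description": "x" * 6000,
    "citation": [
        {"@type": "CreativeWork", "name": "A related article"},
        {"@type": "CreativeWork"},
    ],
})
```

In `example`, the description is cut to 5,000 characters and only the named publication remains.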
We'd be interested in working on all of these. I think the only contentious one is #5029, so if we could come to a decision on what to do there, we could wrap this all in one PR.
@jggautier and I have what we think is a good way forward on #5029, so I think this is pretty doable and we'll try to put it onto our roadmap at QDR.
Related (possibly a duplicate or sub-issue):
- #7574