How to create an ODIS node for (Harvard) Dataverse and search for all ocean data sets in it via ODIS?
A prototypical application would be: search Dataverse through ODIS to find sizable, regularly formatted data sets for a given ocean region (e.g. the coastal ocean off New England, US).
Below I just document the bits and pieces we looked at today while discussing this idea with @pbuttigieg.
- example of a data set I host on dataverse : https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CAGYQL
- a check we did to verify that dataverse uses json-ld and schema.org : https://validator.schema.org/#url=https%3A%2F%2Fdataverse.harvard.edu%2Fdataset.xhtml%3FpersistentId%3Ddoi%3A10.7910%2FDVN%2FCAGYQL
- Julia package to interact with Dataverse from Julia : https://github.com/gdcc/Dataverse.jl
- Examples of large ocean data sets already interfaced to Julia : https://juliaocean.github.io/OceanRobots.jl/dev/
- A community that uses Julia, with folks likely to be interested in this : https://aircentre.github.io/JuliaEO25/
- https://discourse.julialang.org/t/how-to-connect-to-sparql-endpoint/50415/6
ping @pdurbin , @atrisovic
the JSON
{
"@context": {
"@language": "en",
"@vocab": "https://schema.org/",
"citeAs": "cr:citeAs",
"column": "cr:column",
"conformsTo": "dct:conformsTo",
"cr": "http://mlcommons.org/croissant/",
"rai": "http://mlcommons.org/croissant/RAI/",
"data": {
"@id": "cr:data",
"@type": "@json"
},
"dataType": {
"@id": "cr:dataType",
"@type": "@vocab"
},
"dct": "http://purl.org/dc/terms/",
"examples": {
"@id": "cr:examples",
"@type": "@json"
},
"extract": "cr:extract",
"field": "cr:field",
"fileProperty": "cr:fileProperty",
"fileObject": "cr:fileObject",
"fileSet": "cr:fileSet",
"format": "cr:format",
"includes": "cr:includes",
"isLiveDataset": "cr:isLiveDataset",
"jsonPath": "cr:jsonPath",
"key": "cr:key",
"md5": "cr:md5",
"parentField": "cr:parentField",
"path": "cr:path",
"recordSet": "cr:recordSet",
"references": "cr:references",
"regex": "cr:regex",
"repeated": "cr:repeated",
"replace": "cr:replace",
"sc": "https://schema.org/",
"separator": "cr:separator",
"source": "cr:source",
"subField": "cr:subField",
"transform": "cr:transform",
"wd": "https://www.wikidata.org/wiki/"
},
"@type": "sc:Dataset",
"conformsTo": "http://mlcommons.org/croissant/1.0",
"name": "Ocean Heat Content",
"url": "https://doi.org/10.7910/DVN/CAGYQL",
"creator": [
{
"@type": "Person",
"givenName": "Gael",
"familyName": "Forget",
"affiliation": {
"@type": "Organization",
"name": "Massachusetts Institute of Technology"
},
"name": "Forget, Gael"
}
],
"description": "Estimates (OCCA2, ECCO4) of global ocean heat content (OHC) anomaly from 2004-2006 climatology. ECCO4 is a closed heat budget estimate. ECCO4 release 5 is used here that covers 1992-2019. OCCA2 was derived by 1. extending ECCO4 (r2) to 1980-2022 and 2. adding a gridded adjustment to Argo over 2004-2022. The 2004-2006 climatologies were subtracted separately before combining anomalies over 1992-2019.",
"keywords": [
"Earth and Environmental Sciences",
"ocean",
"climate",
"warming"
],
"license": "http://creativecommons.org/publicdomain/zero/1.0",
"datePublished": "2024-03-07",
"dateModified": "2024-03-08",
"includedInDataCatalog": {
"@type": "DataCatalog",
"name": "Harvard Dataverse",
"url": "https://dataverse.harvard.edu"
},
"publisher": {
"@type": "Organization",
"name": "Harvard Dataverse"
},
"version": "1.1",
"citeAs": "@data{DVN/CAGYQL_2024,author = {Forget, Gael},publisher = {Harvard Dataverse},title = {Ocean Heat Content},year = {2024},url = {https://doi.org/10.7910/DVN/CAGYQL}}",
"citation": [
{
"@type": "CreativeWork",
"name": "Forget, G.: Energy Imbalance in the Sunlit Ocean Layer (submitted)"
}
],
"distribution": [
{
"@type": "cr:FileObject",
"@id": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
"name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
"encodingFormat": "application/x-netcdf",
"md5": "6578a2fa4f30bdb277b8b4581de9bb6b",
"contentSize": "14705",
"description": "Global ocean heat anomalies, in ZJoule, computed from 2004-2006 climatology for OCCA2 (release 1) and ECCO4 (release 5)",
"contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954362"
},
{
"@type": "cr:FileObject",
"@id": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.png",
"name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.png",
"encodingFormat": "image/png",
"md5": "81dbe65ed124c315ab7db4b0bf680186",
"contentSize": "39385",
"description": "Visualization of global OHC anomaly, computed from 2004-2006 climatology, for OCCA2 (release 1) and ECCO4 (release 5)",
"contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954363"
}
]
}
The Croissant semantics break interoperability at the moment, without much gain. But most of the record is immediately useful.
@gaelforget I'll generate some suggestions for improved metadata based on the example above.
In the meantime, setting up the Node (even with the current form of the metadata) can begin by following https://book.odis.org/gettingStarted.html
I'd set up a dedicated sitemap for ocean-related content (of any kind: socio-economic, physical, biological, ...) and use that as the value of your ODIS-Arch URL in the ODISCat entry.
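As a rough illustration of the dedicated sitemap suggested above, here is a minimal Python sketch that writes a standard sitemaps.org file from a list of dataset landing pages. The function name build_sitemap and the single-entry list are hypothetical; the URL and modification date are the ones from the example record in this thread. How Dataverse would select "ocean-related" datasets is not addressed here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML for an iterable of (url, lastmod) pairs.

    lastmod is an ISO date string, or None to omit the element.
    (Hypothetical helper, not part of any ODIS or Dataverse tooling.)
    """
    ET.register_namespace("", SITEMAP_NS)  # serialize with a default xmlns
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url, lastmod in entries:
        node = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(node, f"{{{SITEMAP_NS}}}loc").text = url
        if lastmod:
            ET.SubElement(node, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Landing page and dateModified of the example dataset from this thread.
ocean_records = [
    ("https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CAGYQL",
     "2024-03-08"),
]
sitemap_xml = build_sitemap(ocean_records)
print(sitemap_xml)
```

Each loc entry should be a page that embeds the JSON-LD record, so that a harvester following the sitemap can pick it up.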
@fils this is an opportunity to figure out how to handle Croissant semantics and types in a smart way. I'm thinking of using additionalType for non-schema.org (sdo) types. That would also allow Croissant properties in the stanzas.
@gaelforget hi! @atrisovic and I are at a conference but my first recommendation is to
- enable the "geospatial" metadata block for the collection that is the parent of your dataset (or its parent)
- for the datasets, fill in the geospatial bounding boxes and republish
- then try our geospatial search: https://guides.dataverse.org/en/6.4/user/find-use-data.html#geospatial-search
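Once bounding boxes are filled in, a query for a region such as the coastal ocean off New England could be composed as in this small Python sketch. It assumes the geo_point and geo_radius Search API parameters described in the guide linked above; the helper name geo_search_url, the coordinates, and the radius are illustrative choices, and no request is sent here.

```python
from urllib.parse import urlencode


def geo_search_url(base_url, query, lat, lon, radius_km):
    """Compose a Dataverse Search API URL with a geospatial filter.

    Assumption: the installation supports the geo_point ("lat,lon") and
    geo_radius (kilometers) parameters of the Search API.
    """
    params = {
        "q": query,
        "type": "dataset",
        "geo_point": f"{lat},{lon}",
        "geo_radius": radius_km,
    }
    return f"{base_url}/api/search?{urlencode(params)}"


# Illustrative point off the Massachusetts coast, 200 km search radius.
url = geo_search_url("https://dataverse.harvard.edu", "ocean", 42.3, -70.5, 200)
print(url)
```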
Also, you're welcome to kick off a thread in our Zulip! https://dataverse.zulipchat.com
I'll post a comment for each component that is currently preventing compatibility with existing schema.org systems. We'll start with the distribution property and its value space, which currently throws a validation error:
Distribution
Status quo
"distribution": [
{
"@type": "cr:FileObject",
"@id": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
"name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
"encodingFormat": "application/x-netcdf",
"md5": "6578a2fa4f30bdb277b8b4581de9bb6b",
"contentSize": "14705",
"description": "Global ocean heat anomalies, in ZJoule, computed from 2004-2006 climatology for OCCA2 (release 1) and ECCO4 (release 5)",
"contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954362"
},
{
"@type": "cr:FileObject",
"@id": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.png",
"name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.png",
"encodingFormat": "image/png",
"md5": "81dbe65ed124c315ab7db4b0bf680186",
"contentSize": "39385",
"description": "Visualization of global OHC anomaly, computed from 2004-2006 climatology, for OCCA2 (release 1) and ECCO4 (release 5)",
"contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954363"
}
]
Proposed change
- Restores the expected DataDownload type in the distribution value space, and thus the validator doesn't complain
- Retains Croissant metadata and typing using additionalType. As an alternative, you can also include the Croissant type in an array, alongside DataDownload (see below).
- Removes @ids which don't resolve to a JSON node
Additional changes that may be useful:
- Put in a unit (KB, MB) for contentSize; the schema.org definition is ambiguous: "File size in (mega/kilo)bytes."
- If reference node @ids are to be included, ensure they point to either a JSON-LD file or something that delivers one (e.g. an embed in HTML).
"distribution": [
{
"@type": "DataDownload",
"additionalType": "cr:FileObject",
"name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
"encodingFormat": "application/x-netcdf",
"md5": "6578a2fa4f30bdb277b8b4581de9bb6b",
"contentSize": "14705",
"description": "Global ocean heat anomalies, in ZJoule, computed from 2004-2006 climatology for OCCA2 (release 1) and ECCO4 (release 5)",
"contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954362"
},
{
"@type": "DataDownload",
"additionalType": "cr:FileObject",
"name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.png",
"encodingFormat": "image/png",
"md5": "81dbe65ed124c315ab7db4b0bf680186",
"contentSize": "39385",
"description": "Visualization of global OHC anomaly, computed from 2004-2006 climatology, for OCCA2 (release 1) and ECCO4 (release 5)",
"contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954363"
}
]
The alternative using an array for types:
"distribution": [
{
"@type": ["DataDownload", "cr:FileObject"],
"name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
"encodingFormat": "application/x-netcdf",
"md5": "6578a2fa4f30bdb277b8b4581de9bb6b",
"contentSize": "14705",
"description": "Global ocean heat anomalies, in ZJoule, computed from 2004-2006 climatology for OCCA2 (release 1) and ECCO4 (release 5)",
"contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954362"
}
]
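The two variants above can be produced mechanically from the status quo record. Below is a minimal Python sketch of such a patch; patch_distribution and its use_type_array flag are hypothetical names, and the sketch assumes that none of the existing @id values resolve to a JSON node, so it drops them all.

```python
import copy


def patch_distribution(record, use_type_array=False):
    """Return a copy of record whose distribution entries are DataDownload.

    Hypothetical helper. Assumes each entry's "@type" is a single string
    (as in the status quo record) and that its "@id" does not resolve.
    """
    patched = copy.deepcopy(record)  # leave the caller's record untouched
    for entry in patched.get("distribution", []):
        original_type = entry.get("@type")
        if original_type == "DataDownload":
            continue  # nothing to change
        if use_type_array and original_type is not None:
            # Variant 2: keep both types in an array.
            entry["@type"] = ["DataDownload", original_type]
        else:
            # Variant 1: move the Croissant type to additionalType.
            entry["@type"] = "DataDownload"
            if original_type is not None:
                entry["additionalType"] = original_type
        entry.pop("@id", None)
    return patched


# Trimmed version of the first distribution entry from the record above.
status_quo = {
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
            "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
            "encodingFormat": "application/x-netcdf",
            "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954362",
        }
    ]
}
with_additional_type = patch_distribution(status_quo)
with_type_array = patch_distribution(status_quo, use_type_array=True)
```

Working on a deep copy keeps the harvested record available for comparison, which is handy while the preferred variant is still under discussion.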
Verify validation
@gaelforget
The change to distribution described above fixes the validation errors and - in principle - should make the record fine for discoverability in ODIS. What we'll then need is a sitemap pointing to all the records you wish to share over ODIS, and your registration in OceanExpert and ODISCat. All described here.
That being said, it seems Croissant semantics are introducing some "noise" in addition to their very useful extensions of the base schema.org context. As mentioned, we'll likely write some guidance on how to best merge the two, without duplication / reinvention of things that vanilla schema.org already does.
We'll start with the distribution property and its value space, which currently throws a validation error
This is a known issue: "http://mlcommons.org/croissant/FileObject is not a known valid target type for the distribution property" shows up as a validation error. Please see this issue:
- https://github.com/mlcommons/croissant/issues/725
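To spot affected records before sending them to the validator, a small local check can flag distribution entries that lack the DataDownload type. This Python sketch only mimics that one rule and is in no way a substitute for validator.schema.org; the function name and the accepted spellings of the type are assumptions.

```python
# Spellings of the schema.org type that this sketch accepts (an assumption).
ACCEPTED = {"DataDownload", "sc:DataDownload", "https://schema.org/DataDownload"}


def distribution_type_problems(record):
    """List (index, types) for distribution entries without DataDownload."""
    problems = []
    for index, entry in enumerate(record.get("distribution", [])):
        types = entry.get("@type", [])
        if isinstance(types, str):
            types = [types]  # normalize a single type to a list
        if not ACCEPTED.intersection(types):
            problems.append((index, list(types)))
    return problems


before = {"distribution": [{"@type": "cr:FileObject"}]}
after = {"distribution": [{"@type": ["DataDownload", "cr:FileObject"]}]}
print(distribution_type_problems(before))
print(distribution_type_problems(after))
```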
I just added a json_ld.get function in https://github.com/gdcc/Dataverse.jl/pull/30 that:
- extracts the ld+json part of a Dataset's html page from Harvard Dataverse.
- patches it as described above to ensure compatibility with https://validator.schema.org/.
The Dataset discussed in this thread is used for the demo, selected via its DOI.
Below is a code snippet and printout of variables.
julia> using Dataverse
julia> j=json_ld.get("10.7910/DVN/CAGYQL")
Dict{String, Any} with 17 entries:
"publisher" => Dict{String, Any}("name"=>"Harvard Dataverse", "@type"=>"Organization")
"keywords" => Any["Earth and Environmental Sciences", "ocean", "climate", "warming"]
"citeAs" => "@data{DVN/CAGYQL_2024,author = {Forget, Gael},publisher = {Harvard Dataverse},title = {Ocea…
"name" => "Ocean Heat Content"
"distribution" => Any[Dict{Any, Any}("name"=>"OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc", "additionalType"=>…
"description" => "Estimates (OCCA2, ECCO4) of global ocean heat content (OHC) anomaly from 2004-2006 climatol…
"version" => "1.1"
"@context" => Dict{String, Any}("dct"=>"http://purl.org/dc/terms/", "replace"=>"cr:replace", "dataType"=>D…
"creator" => Any[Dict{String, Any}("name"=>"Forget, Gael", "givenName"=>"Gael", "familyName"=>"Forget", "…
"datePublished" => "2024-03-07"
"citation" => Any[Dict{String, Any}("name"=>"Forget, G.: Energy Imbalance in the Sunlit Ocean Layer (submi…
"url" => "https://doi.org/10.7910/DVN/CAGYQL"
"conformsTo" => "http://mlcommons.org/croissant/1.0"
"includedInDataCatalog" => Dict{String, Any}("name"=>"Harvard Dataverse", "@type"=>"DataCatalog", "url"=>"https://datav…
"license" => "http://creativecommons.org/publicdomain/zero/1.0"
"dateModified" => "2024-03-08"
"@type" => "sc:Dataset"
julia> j["distribution"]
2-element Vector{Any}:
Dict{Any, Any}("name" => "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc", "additionalType" => "cr:FileObject", "encodingFormat" => "application/x-netcdf", "md5" => "6578a2fa4f30bdb277b8b4581de9bb6b", "contentUrl" => "https://dataverse.harvard.edu/api/access/datafile/8954362", "contentSize" => "14705", "@type" => "DataDownload", "description" => "Global ocean heat anomalies, in ZJoule, computed from 2004-2006 climatology for OCCA2 (release 1) and ECCO4 (release 5)")
Dict{Any, Any}("name" => "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.png", "additionalType" => "cr:FileObject", "encodingFormat" => "image/png", "md5" => "81dbe65ed124c315ab7db4b0bf680186", "contentUrl" => "https://dataverse.harvard.edu/api/access/datafile/8954363", "contentSize" => "39385", "@type" => "DataDownload", "description" => "Visualization of global OHC anomaly, computed from 2004-2006 climatology, for OCCA2 (release 1) and ECCO4 (release 5)")
julia> json_ld.get("10.7910/DVN/CAGYQL",to_file=true)
"/var/folders/.../jl_CZclqzCHgV.json"
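For readers outside Julia, the extraction step can be sketched in Python with the standard library alone. This is not the Dataverse.jl implementation, just an illustration of pulling the application/ld+json script out of a dataset page; the class and function names are hypothetical, and the sample HTML is a stand-in for a page that would normally be downloaded first.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)  # script text may arrive in pieces

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_json_ld(html_text):
    """Return a list with one parsed JSON-LD document per script block."""
    parser = JsonLdExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.blocks


# Stand-in for a downloaded dataset page (fetching is left out on purpose).
sample_html = """<html><head>
<script type="application/ld+json">
{"@type": "sc:Dataset", "name": "Ocean Heat Content"}
</script>
</head><body></body></html>"""
records = extract_json_ld(sample_html)
print(records[0]["name"])
```

The resulting dictionaries can then be patched and written to file, in the same spirit as the to_file=true call shown above.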