How to create an ODIS node for (Harvard) Dataverse and search for all ocean data sets in it via ODIS?

Open gaelforget opened this issue 1 year ago • 10 comments

A prototypical application would be: searching Dataverse through ODIS to find sizable, regularly formatted data sets for a given ocean region (e.g. the coastal ocean off New England, US).

Below I just document the bits and pieces we looked at today while discussing this idea with @pbuttigieg.

  • Example of a data set I host on Dataverse: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CAGYQL
  • A check we did to verify that Dataverse uses JSON-LD and schema.org: https://validator.schema.org/#url=https%3A%2F%2Fdataverse.harvard.edu%2Fdataset.xhtml%3FpersistentId%3Ddoi%3A10.7910%2FDVN%2FCAGYQL
  • Julia package to interact with Dataverse from Julia: https://github.com/gdcc/Dataverse.jl
  • Examples of large ocean data sets already interfaced to Julia: https://juliaocean.github.io/OceanRobots.jl/dev/
  • A community that uses Julia, with folks likely to be interested in this: https://aircentre.github.io/JuliaEO25/
  • Julia Discourse thread on connecting to a SPARQL endpoint: https://discourse.julialang.org/t/how-to-connect-to-sparql-endpoint/50415/6

gaelforget avatar Nov 14 '24 15:11 gaelforget

ping @pdurbin , @atrisovic

gaelforget avatar Nov 14 '24 15:11 gaelforget

The JSON-LD for the example dataset:


{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "citeAs": "cr:citeAs",
    "column": "cr:column",
    "conformsTo": "dct:conformsTo",
    "cr": "http://mlcommons.org/croissant/",
    "rai": "http://mlcommons.org/croissant/RAI/",
    "data": {
      "@id": "cr:data",
      "@type": "@json"
    },
    "dataType": {
      "@id": "cr:dataType",
      "@type": "@vocab"
    },
    "dct": "http://purl.org/dc/terms/",
    "examples": {
      "@id": "cr:examples",
      "@type": "@json"
    },
    "extract": "cr:extract",
    "field": "cr:field",
    "fileProperty": "cr:fileProperty",
    "fileObject": "cr:fileObject",
    "fileSet": "cr:fileSet",
    "format": "cr:format",
    "includes": "cr:includes",
    "isLiveDataset": "cr:isLiveDataset",
    "jsonPath": "cr:jsonPath",
    "key": "cr:key",
    "md5": "cr:md5",
    "parentField": "cr:parentField",
    "path": "cr:path",
    "recordSet": "cr:recordSet",
    "references": "cr:references",
    "regex": "cr:regex",
    "repeated": "cr:repeated",
    "replace": "cr:replace",
    "sc": "https://schema.org/",
    "separator": "cr:separator",
    "source": "cr:source",
    "subField": "cr:subField",
    "transform": "cr:transform",
    "wd": "https://www.wikidata.org/wiki/"
  },
  "@type": "sc:Dataset",
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "name": "Ocean Heat Content",
  "url": "https://doi.org/10.7910/DVN/CAGYQL",
  "creator": [
    {
      "@type": "Person",
      "givenName": "Gael",
      "familyName": "Forget",
      "affiliation": {
        "@type": "Organization",
        "name": "Massachusetts Institute of Technology"
      },
      "name": "Forget, Gael"
    }
  ],
  "description": "Estimates (OCCA2, ECCO4) of global ocean heat content (OHC) anomaly from 2004-2006 climatology. ECCO4 is a closed heat budget estimate. ECCO4 release 5 is used here that covers 1992-2019. OCCA2 was derived by 1. extending ECCO4 (r2) to 1980-2022 and 2. adding a gridded adjustment to Argo over 2004-2022. The 2004-2006 climatologies were subtracted separately before combining anomalies over 1992-2019.",
  "keywords": [
    "Earth and Environmental Sciences",
    "ocean",
    "climate",
    "warming"
  ],
  "license": "http://creativecommons.org/publicdomain/zero/1.0",
  "datePublished": "2024-03-07",
  "dateModified": "2024-03-08",
  "includedInDataCatalog": {
    "@type": "DataCatalog",
    "name": "Harvard Dataverse",
    "url": "https://dataverse.harvard.edu"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Harvard Dataverse"
  },
  "version": "1.1",
  "citeAs": "@data{DVN/CAGYQL_2024,author = {Forget, Gael},publisher = {Harvard Dataverse},title = {Ocean Heat Content},year = {2024},url = {https://doi.org/10.7910/DVN/CAGYQL}}",
  "citation": [
    {
      "@type": "CreativeWork",
      "name": "Forget, G.: Energy Imbalance in the Sunlit Ocean Layer (submitted)"
    }
  ],
  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
      "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
      "encodingFormat": "application/x-netcdf",
      "md5": "6578a2fa4f30bdb277b8b4581de9bb6b",
      "contentSize": "14705",
      "description": "Global ocean heat anomalies, in ZJoule, computed from 2004-2006 climatology for OCCA2 (release 1) and ECCO4 (release 5)",
      "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954362"
    },
    {
      "@type": "cr:FileObject",
      "@id": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.png",
      "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.png",
      "encodingFormat": "image/png",
      "md5": "81dbe65ed124c315ab7db4b0bf680186",
      "contentSize": "39385",
      "description": "Visualization of global OHC anomaly, computed from 2004-2006 climatology, for OCCA2 (release 1) and ECCO4 (release 5)",
      "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954363"
    }
  ]
}

pbuttigieg avatar Nov 14 '24 16:11 pbuttigieg

The Croissant semantics break interoperability at the moment, without much gain. But most of the metadata is immediately useful.

pbuttigieg avatar Nov 14 '24 16:11 pbuttigieg

@gaelforget I'll generate some suggestions for improved metadata based on the example above.

In the meantime, setting up the Node (even with the current form of the metadata) can begin by following https://book.odis.org/gettingStarted.html

I'd set up a dedicated sitemap for ocean-related content (of any kind: socio-economic, physical, biological, ...) and use that as the value of your ODIS-Arch URL in the ODISCat entry.
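
A minimal sketch of what generating such a dedicated sitemap could look like, assuming you already have a list of landing-page URLs for the ocean-related datasets. The function name build_sitemap and the (url, lastmod) input format are hypothetical; only the sitemap namespace and the example dataset URL and date come from the sitemap protocol and this thread.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(records):
    """Return a sitemap XML string for an iterable of (url, lastmod) pairs.

    Hypothetical helper: `records` is assumed to hold the landing pages of
    the ocean-related datasets you want ODIS to harvest.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in records:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        if lastmod:
            ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Example entry taken from the dataset discussed in this thread.
records = [
    (
        "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CAGYQL",
        "2024-03-08",
    )
]
sitemap_xml = build_sitemap(records)
print(sitemap_xml)
```

The resulting file would be hosted at a stable URL, and that URL is what goes into the ODISCat entry.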

pbuttigieg avatar Nov 14 '24 16:11 pbuttigieg

@fils this is an opportunity to figure out how to handle Croissant semantics and types in a smart way. I'm thinking of using additionalType for non-schema.org stuff. That would also allow Croissant properties in the stanzas.

pbuttigieg avatar Nov 14 '24 16:11 pbuttigieg

@gaelforget hi! @atrisovic and I are at a conference, but my first recommendation is to:

  • enable the "geospatial" metadata block for the collection that is the parent of your dataset (or its parent)
  • for the datasets, fill in the geospatial bounding boxes and republish
  • then try our geospatial search: https://guides.dataverse.org/en/6.4/user/find-use-data.html#geospatial-search
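
Once bounding boxes are filled in, the same search can in principle be scripted. Below is a hedged sketch that only builds a Search API URL. It assumes the geo_point ("latitude,longitude") and geo_radius (kilometers) parameters as documented in the Dataverse Search API guide, and that the target installation has geospatial search available; please check both against the guide for your version. The helper name geo_search_url and the example coordinates are hypothetical.

```python
from urllib.parse import urlencode


def geo_search_url(base_url, lat, lon, radius_km, query="*"):
    """Build a Dataverse Search API URL restricted to a circle around a point.

    Assumption: `geo_point` and `geo_radius` behave as described in the
    Dataverse Search API guide; this function does not contact any server.
    """
    params = {
        "q": query,
        "type": "dataset",
        "geo_point": f"{lat},{lon}",
        "geo_radius": radius_km,
    }
    return f"{base_url}/api/search?{urlencode(params)}"


# Illustrative point and radius roughly off New England (hypothetical values).
url = geo_search_url("https://dataverse.harvard.edu", 42.3, -71.1, 150)
print(url)
```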

Also, you're welcome to kick off a thread in our Zulip! https://dataverse.zulipchat.com

pdurbin avatar Nov 14 '24 20:11 pdurbin

I'll post a comment for each component that is currently preventing compatibility with existing schema.org systems. We'll start with the distribution property and its value space, which currently throws a validation error:

[image: screenshot of the schema.org validator error]

Distribution

Status quo

"distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
      "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
      "encodingFormat": "application/x-netcdf",
      "md5": "6578a2fa4f30bdb277b8b4581de9bb6b",
      "contentSize": "14705",
      "description": "Global ocean heat anomalies, in ZJoule, computed from 2004-2006 climatology for OCCA2 (release 1) and ECCO4 (release 5)",
      "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954362"
    },
    {
      "@type": "cr:FileObject",
      "@id": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.png",
      "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.png",
      "encodingFormat": "image/png",
      "md5": "81dbe65ed124c315ab7db4b0bf680186",
      "contentSize": "39385",
      "description": "Visualization of global OHC anomaly, computed from 2004-2006 climatology, for OCCA2 (release 1) and ECCO4 (release 5)",
      "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954363"
    }
  ]
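
To make the complaint concrete, here is a small sketch that mimics just this one rule: every distribution entry should declare a DataDownload type. It is not the schema.org validator, only an approximation of the single check discussed here; the function name and the set of accepted spellings are assumptions.

```python
# Spellings of the schema.org type accepted by this sketch (assumption).
ACCEPTED = {"DataDownload", "sc:DataDownload", "https://schema.org/DataDownload"}


def distribution_type_problems(record):
    """Return (index, declared_types) for entries lacking a DataDownload type."""
    problems = []
    for index, entry in enumerate(record.get("distribution", [])):
        declared = entry.get("@type", [])
        if isinstance(declared, str):
            declared = [declared]  # normalise a single type to a list
        if not ACCEPTED.intersection(declared):
            problems.append((index, declared))
    return problems


# Trimmed-down version of the status quo shown above.
status_quo = {
    "distribution": [
        {
            "@type": "cr:FileObject",
            "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
        }
    ]
}
problems = distribution_type_problems(status_quo)
print(problems)
```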

Proposed change

  • Restores the expected DataDownload type in the distribution value space, so the validator no longer complains.
  • Retains Croissant metadata and typing using additionalType. As an alternative, you can also include the Croissant type in an array, alongside DataDownload (see below).
  • Removes @ids which don't resolve to a JSON node.

Additional changes that may be useful:

  • Put in a unit (KB, MB) for contentSize. The schema.org definition is ambiguous: "File size in (mega/kilo)bytes."
  • If reference node @ids are to be included, ensure they point to either a JSON-LD file or something that delivers one (e.g. an embed in HTML).

"distribution": [
    {
      "@type": "DataDownload",
      "additionalType": "cr:FileObject",
      "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
      "encodingFormat": "application/x-netcdf",
      "md5": "6578a2fa4f30bdb277b8b4581de9bb6b",
      "contentSize": "14705",
      "description": "Global ocean heat anomalies, in ZJoule, computed from 2004-2006 climatology for OCCA2 (release 1) and ECCO4 (release 5)",
      "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954362"
    },
    {
      "@type": "DataDownload",
      "additionalType": "cr:FileObject",
      "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.png",
      "encodingFormat": "image/png",
      "md5": "81dbe65ed124c315ab7db4b0bf680186",
      "contentSize": "39385",
      "description": "Visualization of global OHC anomaly, computed from 2004-2006 climatology, for OCCA2 (release 1) and ECCO4 (release 5)",
      "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954363"
    }
  ]

The alternative using an array for types:

"distribution": [
    {
      "@type": ["DataDownload", "cr:FileObject"],
      "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
      "encodingFormat": "application/x-netcdf",
      "md5": "6578a2fa4f30bdb277b8b4581de9bb6b",
      "contentSize": "14705",
      "description": "Global ocean heat anomalies, in ZJoule, computed from 2004-2006 climatology for OCCA2 (release 1) and ECCO4 (release 5)",
      "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954362"
    }
  ]
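
Both variants shown above can be expressed as one small transformation. This is a sketch under stated assumptions: the function name patch_distribution and the use_type_array flag are hypothetical, and it drops every @id in the distribution entries, whereas the proposal only asks to remove those that don't resolve.

```python
import copy


def patch_distribution(record, use_type_array=False):
    """Return a copy of `record` whose distribution entries are DataDownloads.

    Sketch only: all "@id" values in distribution entries are removed,
    without testing whether they resolve.
    """
    patched = copy.deepcopy(record)  # leave the caller's record untouched
    for entry in patched.get("distribution", []):
        if entry.get("@type") == "cr:FileObject":
            if use_type_array:
                # Alternative: keep both types in one array.
                entry["@type"] = ["DataDownload", "cr:FileObject"]
            else:
                # Proposed change: schema.org type plus additionalType.
                entry["@type"] = "DataDownload"
                entry["additionalType"] = "cr:FileObject"
        entry.pop("@id", None)
    return patched


original = {
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
            "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
            "contentSize": "14705",
        }
    ]
}
with_additional_type = patch_distribution(original)
with_type_array = patch_distribution(original, use_type_array=True)
print(with_additional_type["distribution"][0])
print(with_type_array["distribution"][0])
```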

Verify validation

[image: screenshot of the schema.org validator result]

pbuttigieg avatar Nov 18 '24 09:11 pbuttigieg

@gaelforget

The change to distribution described above fixes the validation errors and - in principle - should make the record fine for discoverability in ODIS. What we'll then need is a sitemap pointing to all the records you wish to share over ODIS, and your registration in OceanExpert and ODISCat. All described here.

That being said, it seems Croissant semantics are introducing some "noise" in addition to their very useful extensions of the base schema.org context. As mentioned, we'll likely write some guidance on how to best merge the two, without duplication / reinvention of things that vanilla schema.org already does.

pbuttigieg avatar Nov 18 '24 09:11 pbuttigieg

"We'll start with the distribution property and its value space, which currently throws a validation error"

Seeing "http://mlcommons.org/croissant/FileObject is not a known valid target type for the distribution property" as a validation error is a known issue. Please see this issue:

  • https://github.com/mlcommons/croissant/issues/725

pdurbin avatar Nov 18 '24 14:11 pdurbin

I just added a json_ld.get function in https://github.com/gdcc/Dataverse.jl/pull/30 that:

  • extracts the ld+json part of a Dataset's HTML page from Harvard Dataverse.
  • patches it as described above to ensure compatibility with https://validator.schema.org/.

The Dataset referred to in the thread above is used for the demo, selected via its DOI.

Below is a code snippet and printout of variables.

julia> using Dataverse

julia> j=json_ld.get("10.7910/DVN/CAGYQL")
Dict{String, Any} with 17 entries:
  "publisher"             => Dict{String, Any}("name"=>"Harvard Dataverse", "@type"=>"Organization")
  "keywords"              => Any["Earth and Environmental Sciences", "ocean", "climate", "warming"]
  "citeAs"                => "@data{DVN/CAGYQL_2024,author = {Forget, Gael},publisher = {Harvard Dataverse},title = {Ocea…
  "name"                  => "Ocean Heat Content"
  "distribution"          => Any[Dict{Any, Any}("name"=>"OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc", "additionalType"=>…
  "description"           => "Estimates (OCCA2, ECCO4) of global ocean heat content (OHC) anomaly from 2004-2006 climatol…
  "version"               => "1.1"
  "@context"              => Dict{String, Any}("dct"=>"http://purl.org/dc/terms/", "replace"=>"cr:replace", "dataType"=>D…
  "creator"               => Any[Dict{String, Any}("name"=>"Forget, Gael", "givenName"=>"Gael", "familyName"=>"Forget", "…
  "datePublished"         => "2024-03-07"
  "citation"              => Any[Dict{String, Any}("name"=>"Forget, G.: Energy Imbalance in the Sunlit Ocean Layer (submi…
  "url"                   => "https://doi.org/10.7910/DVN/CAGYQL"
  "conformsTo"            => "http://mlcommons.org/croissant/1.0"
  "includedInDataCatalog" => Dict{String, Any}("name"=>"Harvard Dataverse", "@type"=>"DataCatalog", "url"=>"https://datav…
  "license"               => "http://creativecommons.org/publicdomain/zero/1.0"
  "dateModified"          => "2024-03-08"
  "@type"                 => "sc:Dataset"

julia> j["distribution"]
2-element Vector{Any}:
 Dict{Any, Any}("name" => "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc", "additionalType" => "cr:FileObject", "encodingFormat" => "application/x-netcdf", "md5" => "6578a2fa4f30bdb277b8b4581de9bb6b", "contentUrl" => "https://dataverse.harvard.edu/api/access/datafile/8954362", "contentSize" => "14705", "@type" => "DataDownload", "description" => "Global ocean heat anomalies, in ZJoule, computed from 2004-2006 climatology for OCCA2 (release 1) and ECCO4 (release 5)")
 Dict{Any, Any}("name" => "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.png", "additionalType" => "cr:FileObject", "encodingFormat" => "image/png", "md5" => "81dbe65ed124c315ab7db4b0bf680186", "contentUrl" => "https://dataverse.harvard.edu/api/access/datafile/8954363", "contentSize" => "39385", "@type" => "DataDownload", "description" => "Visualization of global OHC anomaly, computed from 2004-2006 climatology, for OCCA2 (release 1) and ECCO4 (release 5)")

julia> json_ld.get("10.7910/DVN/CAGYQL",to_file=true)
"/var/folders/.../jl_CZclqzCHgV.json"
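
For readers outside Julia, the extraction step can be sketched with the Python standard library. This is not the Dataverse.jl implementation, just an illustration of pulling application/ld+json script blocks out of a page; the class and function names are hypothetical, and the sample HTML is a stand-in for a real dataset page, so no network request is made.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_json_ld(html):
    """Return a list of JSON-LD documents embedded in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# Stand-in page; a real dataset page would be fetched separately.
sample_html = """<html><head>
<script type="application/ld+json">{"@type": "sc:Dataset", "name": "Ocean Heat Content"}</script>
</head><body></body></html>"""
documents = extract_json_ld(sample_html)
print(documents[0]["name"])
```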

gaelforget avatar Mar 16 '25 22:03 gaelforget