
Dereference URIs while harvesting

letmaik opened this issue 10 years ago · 0 comments

Currently (correct me if I'm wrong) the DCAT harvester reads exactly one file. With the advent of JSON-LD and the exposure of such catalogs as actual, simple Web APIs, it will often be the case that not all DCAT entries live in a single file. For example:

The following may live at http://my.domain/datasets (when requested with the proper content type):

{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "title":
    {
      "@id": "dct:title"
    },
    "datasets":
    {
      "@id": "dcat:dataset",
      "@type": "@id"
    }
  },
  "@id": "http://my.domain/datasets",
  "@type": "dcat:Catalog",
  "title": "My datasets",
  "datasets": [
    "http://my.domain/datasets/1",
    "http://my.domain/datasets/2",
    "http://my.domain/datasets/3"
  ]
}

And the actual dcat:Dataset entries are accessible by following the given URIs. So at http://my.domain/datasets/1 you might find:

{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "locn": "http://www.w3.org/ns/locn#",
    "geometry": { "@id": "locn:geometry", "@type": "gsp:wktLiteral" },
    "gsp": "http://www.opengis.net/ont/geosparql#",
    "schema": "http://schema.org/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "startDate": { "@id": "schema:startDate", "@type": "xsd:date" },
    "endDate": { "@id": "schema:endDate", "@type": "xsd:date" },
    "title": { "@id": "dct:title" },
    "description": { "@id": "dct:description" },
    "issued": { "@id": "dct:issued", "@type": "http://www.w3.org/2001/XMLSchema#dateTime" },
    "spatial": { "@id": "dct:spatial" },
    "temporal": { "@id": "dct:temporal" },
    "distributions": { "@id": "dcat:distribution" },
    "accessURL": { "@id": "dcat:accessURL", "@type": "@id" },
    "downloadURL": { "@id": "dcat:downloadURL", "@type": "@id" },
    "mediaType": { "@id": "dcat:mediaType" }
  },
  "@id": "http://my.domain/datasets/1",
  "@type": "dcat:Dataset",
  "title": "My first dataset",
  "description": "This is a dataset.",
  "issued": "2015-06-02",
  "spatial": {
    "@type": "dct:Location",
    "geometry": "POLYGON((-10.58 70.09,34.59 70.09,34.59 34.56,-10.58 34.56, -10.58 70.09))"
  },
  "temporal": {
    "@type": "dct:PeriodOfTime",
    "startDate": "2005-12-31",
    "endDate": "2006-12-31"
  },
  "distributions": [
    {
      "@type": "dcat:Distribution",
      "title": "GeoSPARQL endpoint",
      "accessURL": "http://my.domain/datasets/1/geosparql",
      "mediaType": "application/sparql-query"
    },
    {
      "@type": "dcat:Distribution",
      "title": "OpenDAP endpoint",
      "accessURL": "http://my.domain/datasets/1/opendap",
      "mediaType": "application/vnd.opendap.org.capabilities+json"
    }
  ]
}

So although there might be value in being able to provide a dump of everything, that may not always be feasible. A harvester should support both approaches and dereference at least the relevant DCAT terms (obviously not every URI it encounters). Does that make sense?
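To illustrate the catalog side of this, here is a minimal Python sketch of splitting a parsed dcat:Catalog document into URIs still to be dereferenced and datasets already embedded inline. `extract_dataset_entries` is a hypothetical helper, and the sketch reads the compacted `"datasets"` key literally; a real harvester would expand the JSON-LD context and look for `dcat:dataset` instead:

```python
import json

def extract_dataset_entries(catalog):
    """Split a parsed dcat:Catalog into (URIs to dereference, inline datasets).

    Simplified: assumes the compacted "datasets" key from the example
    above; real code would expand the JSON-LD context first.
    """
    entries = catalog.get("datasets", [])
    uris = [e for e in entries if isinstance(e, str)]      # bare URIs: fetch these
    inline = [e for e in entries if isinstance(e, dict)]   # already embedded
    return uris, inline

# The catalog document from the example above (context elided):
catalog = json.loads("""{
  "@id": "http://my.domain/datasets",
  "@type": "dcat:Catalog",
  "title": "My datasets",
  "datasets": [
    "http://my.domain/datasets/1",
    "http://my.domain/datasets/2",
    "http://my.domain/datasets/3"
  ]
}""")

uris, inline = extract_dataset_entries(catalog)
# Each URI in `uris` would then be fetched (with an Accept header
# requesting JSON-LD) and parsed as an individual dcat:Dataset.
```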

EDIT: I guess the same applies if you have a small version of the datasets inlined (some fields missing) but provide the full version only when following the dataset URL ("@id" field). I'm not sure how a crawler would know whether the embedded dataset is complete or not; it's a bit tricky.
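One pragmatic (if costly) heuristic: always dereference the embedded dataset's "@id" and let the fetched record override the inline fields. A hedged sketch, where `fetch_jsonld` is a hypothetical callable standing in for an HTTP GET with the proper Accept header:

```python
def resolve_dataset(embedded, fetch_jsonld):
    """Dereference an embedded dataset's "@id" and merge the full
    record over the (possibly partial) inline fields.

    `fetch_jsonld` is a hypothetical callable: URI -> parsed JSON-LD dict.
    If the entry has no "@id", the inline data is all we have.
    """
    uri = embedded.get("@id")
    if not uri:
        return embedded
    merged = dict(embedded)           # start from the inline fields
    merged.update(fetch_jsonld(uri))  # dereferenced record wins on conflicts
    return merged

# Usage with a stubbed fetcher:
inline_entry = {"@id": "http://my.domain/datasets/1",
                "title": "My first dataset"}
full_record = {"@id": "http://my.domain/datasets/1",
               "title": "My first dataset",
               "description": "This is a dataset."}
resolved = resolve_dataset(inline_entry, lambda uri: full_record)
```

The trade-off is one extra request per dataset even when the inline copy was already complete, which is exactly the ambiguity described above.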

letmaik · Jun 03 '15 13:06