Dereference URIs while harvesting
Currently (correct me if I'm wrong) the DCAT harvester reads exactly one file. With the advent of JSON-LD and the exposure of such catalogs as actual, simple Web APIs, it will increasingly be the case that not all DCAT entries live in a single file. For example:
The following may live at http://my.domain/datasets (when requested with the proper content type):
```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "title": { "@id": "dct:title" },
    "datasets": { "@id": "dcat:dataset", "@type": "@id" }
  },
  "@id": "http://my.domain/datasets",
  "@type": "dcat:Catalog",
  "title": "My datasets",
  "datasets": [
    "http://my.domain/datasets/1",
    "http://my.domain/datasets/2",
    "http://my.domain/datasets/3"
  ]
}
```
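Fetching such a catalog would rely on content negotiation. A minimal sketch in Python, assuming the server returns JSON-LD when asked for `application/ld+json` (the catalog URL is the example one from above):

```python
import json
import urllib.request

CATALOG_URL = "http://my.domain/datasets"  # example URL from above

def build_request(url):
    """Build a request asking for JSON-LD via content negotiation."""
    return urllib.request.Request(url, headers={"Accept": "application/ld+json"})

# The actual fetch would then be (network call, not executed here):
# with urllib.request.urlopen(build_request(CATALOG_URL)) as resp:
#     catalog = json.load(resp)
```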
And the actual dcat:Dataset entries are accessible by following the given URIs. So at http://my.domain/datasets/1 you might find:
```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "locn": "http://www.w3.org/ns/locn#",
    "gsp": "http://www.opengis.net/ont/geosparql#",
    "schema": "http://schema.org/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "geometry": { "@id": "locn:geometry", "@type": "gsp:wktLiteral" },
    "startDate": { "@id": "schema:startDate", "@type": "xsd:date" },
    "endDate": { "@id": "schema:endDate", "@type": "xsd:date" },
    "title": { "@id": "dct:title" },
    "description": { "@id": "dct:description" },
    "issued": { "@id": "dct:issued", "@type": "xsd:date" },
    "spatial": { "@id": "dct:spatial" },
    "temporal": { "@id": "dct:temporal" },
    "distributions": { "@id": "dcat:distribution" },
    "accessURL": { "@id": "dcat:accessURL", "@type": "@id" },
    "downloadURL": { "@id": "dcat:downloadURL", "@type": "@id" },
    "mediaType": { "@id": "dcat:mediaType" }
  },
  "@id": "http://my.domain/datasets/1",
  "@type": "dcat:Dataset",
  "title": "My first dataset",
  "description": "This is a dataset.",
  "issued": "2015-06-02",
  "spatial": {
    "@type": "dct:Location",
    "geometry": "POLYGON((-10.58 70.09,34.59 70.09,34.59 34.56,-10.58 34.56,-10.58 70.09))"
  },
  "temporal": {
    "@type": "dct:PeriodOfTime",
    "startDate": "2005-12-31",
    "endDate": "2006-12-31"
  },
  "distributions": [
    {
      "@type": "dcat:Distribution",
      "title": "GeoSPARQL endpoint",
      "accessURL": "http://my.domain/datasets/1/geosparql",
      "mediaType": "application/sparql-query"
    },
    {
      "@type": "dcat:Distribution",
      "title": "OpenDAP endpoint",
      "accessURL": "http://my.domain/datasets/1/opendap",
      "mediaType": "application/vnd.opendap.org.capabilities+json"
    }
  ]
}
```
So although there might be value in being able to provide a dump of everything, that is not always easily possible. A harvester should support both approaches and follow at least the relevant DCAT terms (not everything, obviously). Does that make sense?
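To illustrate supporting both approaches: a hypothetical helper (names are mine, not from any harvester codebase) could split the `dcat:dataset` entries into inline objects and bare URIs that still need dereferencing:

```python
def collect_dataset_refs(catalog):
    """Split dcat:dataset entries into inline objects and URIs to fetch.

    `catalog` is assumed to be the parsed JSON-LD document, with datasets
    compacted under the "datasets" key as in the catalog example above.
    """
    inline, to_fetch = [], []
    for entry in catalog.get("datasets", []):
        if isinstance(entry, str):
            to_fetch.append(entry)   # bare URI: dereference it later
        else:
            inline.append(entry)     # embedded dataset object
    return inline, to_fetch
```

The harvester would then process `inline` directly and fetch each URI in `to_fetch` with the same content negotiation used for the catalog itself.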
EDIT: I guess the same applies if you have a small version of the datasets inlined (some fields missing) but provide the full version only when following the dataset URL ("@id" field). I'm not sure how a crawler would know whether the embedded dataset is complete or not; that part is a bit tricky.
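One pragmatic (and admittedly imperfect) heuristic: treat an embedded dataset as partial if it lacks terms the harvester cares about, and re-fetch its `"@id"` in that case. The `REQUIRED_TERMS` set below is an assumption for illustration, not something any spec defines:

```python
# Assumed set of terms this particular harvester considers essential.
REQUIRED_TERMS = {"title", "description", "issued", "distributions"}

def needs_dereference(dataset):
    """Return True if an embedded dataset looks incomplete and has a URI to follow."""
    missing = REQUIRED_TERMS - dataset.keys()
    return bool(missing) and "@id" in dataset
```

This still cannot distinguish a genuinely sparse dataset from a truncated preview, which is exactly the tricky part noted above.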