ckanext-dcat
_object_value and _object_value_list return BNode identifiers
While reviewing the scheming PR #281, I found a couple of places where the DCAT RDF Harvester has trouble with in-the-wild DCAT-AP 2.1.1 feeds in JSON-LD format, specifically an ESRI AGOL INSPIRE feed: https://opendata-ifigeo.hub.arcgis.com/api/feed/dcat-ap/2.1.1.json
This doesn't appear to be related to the PR, so I'm filing it separately here.
Generally, _object_value and _object_value_list return the string value of the node. When the node has a type and something other than a direct value (that is, when it is a BNode with its own properties), the string value is the BNode's internal node ID.
For example, with this (not terribly useful, but syntactically representative) provenance:
"dct:provenance": {
  "@type": "dct:ProvenanceStatement",
  "@label": {
    "@value": ""
  }
},
We extract: 'provenance', ('extras', 19, 'value'): 'Nc0c0162afbe140a5afa2736468e1da4c'.
Similarly, the theme:
"dcat:theme": {
  "@type": "skos:Concept",
  "skos:prefLabel": "Geospatial"
},
also returns an internal node ID. This is almost never a useful result, because BNode identifiers are ephemeral and only valid while the graph is in memory.
I'm not clear on the best course of action here, but I see a few options:
- Potentially pull out all of the items that are themselves alternate types (e.g. provenance is a dct:ProvenanceStatement) and handle them one at a time.
- Have a generic RDF.type == SKOS.Concept handler, but in some cases that will want to pull out an ID, and in some cases a prefLabel in the appropriate language. Sometimes we're going to have an enforced vocabulary from the EU, and sometimes it's going to be site defined (e.g. theme is probably going to be site dependent, while HVD Category is going to be EU wide).
- Have a generic "best string we can get" and keep adding to it as a fallback.