ckanext-dcat
Support for multilingual RDF
Right now, neither the parsers nor the serializers take multilingual metadata into account.
For instance, given the following document, one of the available titles will be picked arbitrarily at parse time:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
<http://data.london.gov.uk/dataset/Abandoned_Vehicles>
    a dcat:Dataset ;
    dct:title "Abandoned Vehicles"@en ;
    dct:title "Vehículos Abandonados"@es ;
    adms:versionNotes "Some version notes"@en ;
    adms:versionNotes "Notas de la versión"@es ;
    ...
Parsing
The standard way of dealing with this seems to be to have the parser create metadata that ckanext-fluent can handle when creating or updating the datasets. This essentially means storing a dict instead of a string, with the keys being the language codes:
{
    "version_notes": {
        "en": "Some version notes",
        "es": "Notas de la versión"
    }
    ...
}
For core fields like title or notes, we need to add an extra field suffixed with _translated:
"title": "",
"title_translated": {
    "en": "Abandoned Vehicles",
    "es": "Vehículos Abandonados"
}
...
TODO: what to put in title?
To support this, we can probably add a variant of _object_value that handles the language tags and returns a dict accordingly (RDFLib will return a separate triple for each language).
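As a rough illustration of that idea, here is a minimal sketch of how language-tagged literals could be collapsed into a dict keyed by language code. It is not the ckanext-dcat implementation: `multilang_value` and `TaggedLiteral` are hypothetical names, `TaggedLiteral` is only a stand-in for an RDFLib literal (a string that carries a `language` attribute), and falling back to a default language for untagged literals is an assumption.

```python
class TaggedLiteral(str):
    """Hypothetical stand-in for an RDFLib literal: a str with a language tag."""

    def __new__(cls, value, lang=None):
        obj = super().__new__(cls, value)
        obj.language = lang
        return obj


def multilang_value(objects, default_lang="en"):
    """Collapse language-tagged literals into a {lang: text} dict.

    Assumption: literals without a language tag are stored under default_lang.
    """
    result = {}
    for obj in objects:
        lang = getattr(obj, "language", None) or default_lang
        # Keep the first value seen for each language
        result.setdefault(lang, str(obj))
    return result


# The objects of adms:versionNotes from the example document
version_notes = multilang_value([
    TaggedLiteral("Some version notes", "en"),
    TaggedLiteral("Notas de la versión", "es"),
])
print(version_notes)
```

Because the function only relies on `str()` and a `language` attribute, the same logic should apply to the objects returned when iterating over the triples of a predicate in a real graph.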
Serializing
Similarly, the serializing code could check whether the fields marked as multilingual hold a string or a dict and create triples accordingly, probably via a helper function.
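A minimal sketch of such a helper, under stated assumptions: `multilang_triples` is a hypothetical name, and it returns plain `(subject, predicate, text, lang)` tuples rather than adding to a graph, so that it stays independent of RDFLib. Skipping empty translations is also an assumption.

```python
def multilang_triples(subject, predicate, value, default_lang=None):
    """Return (subject, predicate, text, lang) tuples for a str or {lang: text} value."""
    if isinstance(value, dict):
        # Multilingual field: one tuple per language, skipping empty translations
        return [
            (subject, predicate, text, lang)
            for lang, text in sorted(value.items())
            if text
        ]
    if value:
        # Plain string field: a single tuple, optionally tagged with default_lang
        return [(subject, predicate, value, default_lang)]
    return []


dataset = "http://data.london.gov.uk/dataset/Abandoned_Vehicles"
triples = multilang_triples(
    dataset,
    "dct:title",
    {"en": "Abandoned Vehicles", "es": "Vehículos Abandonados"},
)
for triple in triples:
    print(triple)
```

In a serializer, each tuple would then become one triple whose object is a literal carrying the language tag, for example via RDFLib's `Literal(text, lang=lang)`.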
Things to think about:
- Should this be the default or enabled via config option?
- This will probably require using ckanext-scheming as well, otherwise multilingual fields won't be properly stored (#56).
@wardi does that sound right? Also see the TODO above, does it matter what we put in there?
@amercader We are starting to implement this for DCAT-AP Switzerland, I'll keep you posted. We currently use the ckanext-fluent approach.
Fantastic @metaodi! Let me know if you want me to help with some spec or discussion
Btw: here is the implementation of our multilingual DCAT-AP Switzerland profile: https://github.com/ogdch/ckanext-switzerland/blob/01652937c8f31f46d8560ab9527826a3c1523c06/ckanext/switzerland/dcat/profiles.py
Behind the scenes we use ckanext-scheming for validation/schema.
The main change to the "original" is the new parameter multilang in the _object_value method. We simply use this for all values where we expect multilingual values.
Note there are two ongoing PRs with initial implementations:
- RDF -> CKAN (Parsing): https://github.com/ckan/ckanext-dcat/pull/124
- CKAN -> RDF (Serializing): https://github.com/ckan/ckanext-dcat/pull/240