cubiql
Multi lingual dataset support
RDF supports lang strings, and there's a possibility of multi-lingual datasets.
We may want to add support for this as part of OGI.
I agree. At OGI there are multi-lingual datasets.
We may consider using JSON-LD (@language) to express the language used
I'm no longer sure we can use JSON-LD, but am curious about the requirements for multiple languages.
For example, would a multilingual client want to list all labels in all languages? Or should it only ever get back a single requested (or default) language?
You could imagine changing the language at the outermost field for the whole subtree, e.g.:
{
  datasets(language:"fr") {
    title
    dimensions {
      values {
        label
      }
    }
  }
}
Obviously we could also let you query for what languages are currently in the system, e.g.
{
  languages {
    country_code
  }
}
Other alternatives are to expand every string field into two sub fields of lang and value, which seems pretty heavy handed. Or to generate fields in the schema for every language in the system e.g. title_fr, title_en, title_gb.
I think a single requested or default language is enough. So something like datasets(language:"fr"){..} is OK.
It is preferable to get the available languages for a specific dataset not for the whole system because different datasets may have different available languages. e.g.
{
  languages(dataset: "http://statistics.gov.scot/data/earnings") {
    country_code
  }
}
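For what it's worth, a per-dataset list like this could be derived from the language tags on each dataset's labels. A minimal sketch in Python (not the project's own language); the function name and the (dataset, text, lang) tuple shape are assumptions for illustration, not something cubiql provides:

```python
from collections import defaultdict


def languages_by_dataset(labels):
    """Map each dataset URI to the sorted language tags found on its labels.

    `labels` is an iterable of (dataset_uri, text, lang) tuples (an assumed
    shape). `lang` is None for plain xsd:string literals, which carry no tag.
    """
    langs = defaultdict(set)
    for dataset, _text, lang in labels:
        langs[dataset]  # register the dataset even if none of its labels are tagged
        if lang is not None:
            langs[dataset].add(lang)
    return {dataset: sorted(tags) for dataset, tags in langs.items()}
```

Called with labels for two datasets, one with English and Greek titles and one with only untagged strings, this would give two languages for the first and an empty list for the second.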
👍
Some issues related to the language:
- greek labels were not supported
- language tag (e.g. @en) causes errors
Specifically the current problem with language strings is that they cause exceptions during schema generation by failing the following spec (from issue #53):
In: [0 :objects :dataset_vehicles_cube 1 :description] val: #grafter.rdf.protocols.LangString{:string "Vehicles Cube", :lang :en} fails spec: :com.walmartlabs.lacinia.schema/description at: [:args :schema :objects 1 :description] predicate: string?
@zeginis I think it would be desirable to keep the graphql schema simple here and avoid having to represent multiple languages in the schema at this stage, i.e. we should avoid doing things like this for every label/title:
{
  title {
    title # the real title string
    language
  }
}
i.e. I think I'd rather keep the schema for labels flat like this:
{
  title
}
This will probably mean in the cases of multiple languages setting a default to use everywhere throughout the API; we could potentially allow toggling the default at the top of the query.
@zeginis Does that sound like an acceptable compromise? The limitation is that within a single request you won't be able to see things like the title for a dataset in both English and Greek.
It is ok to define the language at the top of the query and thus get results only in one language
One other question @zeginis, would it be acceptable to not let you set this at the top of the query; but to supply it as a configuration option to the server itself? i.e. no schema representation at all?
@RickMoynihan this solution is not applicable at OGI since we will have cubes from many pilots at the same server that will have labels in different languages e.g. Greek, English.
So it is preferable to define the language at the top of the query. Any idea how to do this?
It's not currently supported; if you're asking about how I think it should be implemented, then I'd suggest:
- We should introduce a new root cubiql node to support various parameters such as this for all subtree schemas. The idea being that parameters set at the root affect those parts of the query within its lexical scope, i.e. we would probably have to change it to do this, so lang_preference affects not just datasets but specific dataset schemas, and any others we add too:
{
  cubiql(lang_preference: "gr") {
    datasets {
      title
      description
    }
  }
}
- The lang_preference attribute should specify a language tag preference, not a hard constraint, i.e. if you have a dataset with a dcterms:title of "School"^^xsd:string and "σχολείο"@gr it should select the Greek title. If however for description it only has :school-ds dcterms:description "Numbers of schools by area"^^xsd:string, then it should fall back and return the xsd:string.
In terms of implementation I don't think there is a good way to express this priority on labels in SPARQL in a performant and simple enough way. So I think the best way to implement this is to make sure we implement all these queries as CONSTRUCTs, and then implement the priority filtering on all returned data. The algorithm roughly would be to group the local graph of ?s ?p ?o by ?s ?p, then for each ?s ?p where ?o is DATATYPE xsd:string || rdf:langString return only ?o where the ?o matches lang_preference, or failing that return an xsd:string, and failing that return any other rdf:langString.
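Very roughly, the grouping and priority filtering could look like the sketch below. It is Python rather than the project's own language, and Literal, rank and filter_by_lang are made-up names for illustration; it assumes the CONSTRUCT result has already been parsed into (s, p, o) tuples:

```python
from collections import defaultdict
from typing import NamedTuple, Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal in a parsed CONSTRUCT result."""
    value: str
    datatype: str = XSD_STRING
    lang: Optional[str] = None


def is_string_literal(obj):
    """True for xsd:string and rdf:langString literals only."""
    return isinstance(obj, Literal) and obj.datatype in (XSD_STRING, RDF_LANGSTRING)


def rank(obj, lang_preference):
    """Lower is better: preferred language, then xsd:string, then any other langString."""
    if obj.datatype == RDF_LANGSTRING and obj.lang == lang_preference:
        return 0
    if obj.datatype == XSD_STRING:
        return 1
    return 2


def filter_by_lang(triples, lang_preference):
    """Group triples by (s, p) and keep a single string-valued object per group."""
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[(s, p)].append(o)

    result = []
    for (s, p), objects in groups.items():
        strings = [o for o in objects if is_string_literal(o)]
        # Non-string objects (URIs, numbers, ...) pass through untouched.
        result.extend((s, p, o) for o in objects if not is_string_literal(o))
        if strings:
            best = min(strings, key=lambda o: rank(o, lang_preference))
            result.append((s, p, best))
    return result
```

With the example above, asking for "gr" would keep "σχολείο"@gr for the title and fall back to the xsd:string for the description. When several candidates tie, this sketch just keeps the first one it saw.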
Is something like this what you were thinking of implementing?
Yes this is what I was thinking to implement. I realize that it is not as simple as I expected.
Do you think there is a way to temporarily overcome the exceptions (#88) caused by the language tags even if we do not fully support filtering by language?
That's a good question @zeginis. I suspect it's a pretty trivial fix to make that specific error go away, as it's probably not much more than calling str on the language tagged string before returning it.
However there's still the expectation that there's only ONE value for a lot of these fields. So this would likely only really work for string properties with a cardinality of 1; as to retain the schema you'll need to pick just one string; and then you're into the territory of the above suggestion.
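To illustrate the cardinality problem, here is a small sketch of the quick fix in Python (not the project's own language). LangString is a hypothetical stand-in for the grafter record in the error above, and to_plain and single_description are invented names. Dropping the tag gets past the string check, but with more than one value something still has to choose:

```python
from typing import NamedTuple


class LangString(NamedTuple):
    """Hypothetical stand-in for the LangString record seen in the spec error."""
    string: str
    lang: str


def to_plain(value):
    """Drop the language tag so the value passes a plain string check."""
    return value.string if isinstance(value, LangString) else str(value)


def single_description(values):
    """Return the one description, refusing to guess between several."""
    plain = [to_plain(v) for v in values]
    if len(plain) != 1:
        raise ValueError(f"expected exactly one value, got {len(plain)}")
    return plain[0]
```

This works for a dataset with one description; a dataset with both an English and a Greek description raises instead of silently picking one.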
I could be wrong but I'm not sure this hacky solution is worth doing, because you either need to implement the prioritisation logic above, or return a random string (unacceptable as datasets would render with mixed languages), or hack your data so you only ever have one string for these fields (either an rdf:langString or xsd:string would work, but not both or more than one of each, i.e. no multi-lingual datasets). My feeling is if you have to hack your data to remove the strings you don't want, you might as well have just hacked your data to make them xsd:strings.
The only counter-argument I can see to this (in support of implementing the str hack) is that it does mean cubiql will support a more correct subset of a larger cube, i.e. it's marginally better to allow "σχολείο"@gr in preference to "σχολείο"^^xsd:string, as you're not downgrading information; you're just loading a subset into cubiql's endpoint.
Practically speaking though, I'm not sure this correctness argument holds much weight, as you'll still need to hack your data to guarantee it works... it's just that the hack is a tiny bit less hacky.
@RickMoynihan any update on this?
Are you going to fix this, or should we go ahead with the "quick fix" option, i.e. call str on the language tagged string before returning it?