[BUG] Unexpected behaviour of language config
Describe the bug
When configuring languages in a .properties file, a user can get unexpected behavior from Wikidata translations by capitalizing part of a hyphenated language code, e.g. zh-Hans instead of zh-hans. The codes are passed through to the SPARQL query in wikidata.java with their capitalization intact, so they do not match any labels in Wikidata. This is easy to run into, because hyphenated codes found online are commonly written with capitals.
To Reproduce
Try running the following SPARQL query against the Wikidata Query Service:
SELECT ?label ?lang WHERE {
# For specific Wikidata entities
VALUES ?id { wd:Q1726 }
# Get the labels (names) in various languages
?id rdfs:label ?label .
# Only fetch labels in the languages we care about
FILTER(LANG(?label) IN ("zh-Hans", "zh-Hant"))
# Extract the language code from the label
BIND(LANG(?label) as ?lang)
}
The query above matches nothing. Compare it with the same query using lowercase codes:
SELECT ?label ?lang WHERE {
# For specific Wikidata entities
VALUES ?id { wd:Q1726 }
# Get the labels (names) in various languages
?id rdfs:label ?label .
# Only fetch labels in the languages we care about
FILTER(LANG(?label) IN ("zh-hans", "zh-hant"))
# Extract the language code from the label
BIND(LANG(?label) as ?lang)
}
Because the case of the languages in the config list also determines the attribute name in the tileset, I propose that we simply add toLowerCase() when formatting the SPARQL query in wikidata.java. I'm happy to open a PR for this.
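To illustrate the idea, here is a minimal sketch of lowercasing the configured codes only at the point where the FILTER clause is built, so the original casing stays available for tileset attribute names. The class and method names are hypothetical and are not taken from wikidata.java.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not the actual code in wikidata.java.
final class LanguageFilterSketch {

  // Builds the SPARQL FILTER clause, lowercasing each configured language code.
  static String languageFilter(List<String> configuredLanguages) {
    String quoted = configuredLanguages.stream()
        // Locale.ROOT avoids locale-specific case rules (e.g. the Turkish dotless i).
        .map(lang -> lang.toLowerCase(Locale.ROOT))
        // Drop duplicates such as "zh-Hans" and "zh-hans" both being configured.
        .distinct()
        .map(lang -> "\"" + lang + "\"")
        .collect(Collectors.joining(", "));
    return "FILTER(LANG(?label) IN (" + quoted + "))";
  }
}
```

With this sketch, a config list of zh-Hans and zh-Hant would produce the same FILTER line as the lowercase query above, while the config values themselves are left untouched.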
Good catch! I think automatically lowercasing these language tags should be fine. @1ec5, do you see any issue with this? If not, feel free to open a PR for this.
Yes, I think lowercasing the language tag should be safe. Mixed case is strongly recommended by ISO and IETF, but technically everything is supposed to handle language tags case-insensitively as well. I think Wikidata lowercases the codes for consistency with their subdomains. It seems like there’s little appetite for moving Wikidata to mixed-case language codes, but you could bake LCASE() into the query to be extra safe.
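As a rough sketch of the "extra safe" variant, the query could lowercase both sides of the comparison: the configured codes in Java and the label's language tag via LCASE() in SPARQL. The class and method names here are hypothetical, and the query shape is adapted from the reproduction above rather than copied from wikidata.java.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical sketch; the real query construction in wikidata.java may differ.
final class LcaseQuerySketch {

  // Builds a label query that compares language tags case-insensitively.
  static String labelQuery(String qid, List<String> languages) {
    String quoted = languages.stream()
        .map(lang -> "\"" + lang.toLowerCase(Locale.ROOT) + "\"")
        .collect(Collectors.joining(", "));
    return String.join("\n",
        "SELECT ?label ?lang WHERE {",
        "  VALUES ?id { wd:" + qid + " }",
        "  ?id rdfs:label ?label .",
        // LCASE() normalizes the tag even if Wikidata ever returns mixed case.
        "  BIND(LCASE(LANG(?label)) AS ?lang)",
        "  FILTER(?lang IN (" + quoted + "))",
        "}");
  }
}
```

One consequence of this design is that ?lang always comes back lowercase, so any code that maps results back to the configured names would need to compare case-insensitively too.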