virtuoso-opensource
Error in handling of Unicode characters with SPARQL CONCAT function
There is an issue with handling of Unicode characters with combination of SPARQL CONCAT and ENCODE_FOR_URI functions.
When used like this: BIND(CONCAT("https://c/é/", ENCODE_FOR_URI("Á")) as ?c), the resulting literal is https://c/\u00E9/\u00C3\u0081%00 which, when decoded, is https://c/é/Ã, which is wrong.
When used like this (note that in the first string in CONCAT, I replaced é with e): BIND(CONCAT("https://d/e/", ENCODE_FOR_URI("Á")) as ?d), the resulting literal is https://d/e/\u00C1%00, which, when decoded, is https://d/e/Á, which is correct. I am not sure whether the problem is in the CONCAT or the ENCODE_FOR_URI function.
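One possible reading of the faulty literal, offered as a hypothesis rather than a confirmed diagnosis: the code points \u00C3\u0081 are exactly what results when the UTF-8 bytes of Á (C3 81) are decoded as Latin-1. The Python sketch below reproduces that pattern outside Virtuoso and, for comparison, shows the percent-encoding that Python's urllib produces for the same character. The variable names are illustrative only.

```python
from urllib.parse import quote

original = "Á"  # U+00C1, the argument passed to ENCODE_FOR_URI in the report

# UTF-8 encodes U+00C1 as the two bytes C3 81.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as Latin-1 maps each byte to one code point,
# giving U+00C3 U+0081, the sequence seen in the faulty literal.
misdecoded = utf8_bytes.decode("latin-1")

# For comparison: percent-encoding of the UTF-8 bytes, treating no
# character as safe.
percent_encoded = quote(original, safe="")

print(utf8_bytes.hex())                           # c381
print([f"U+{ord(ch):04X}" for ch in misdecoded])  # ['U+00C3', 'U+0081']
print(percent_encoded)                            # %C3%81
```

If this reading is right, the string is being treated as UTF-8 in one place and as single-byte text in another once a non-ASCII character appears in the first CONCAT argument. The sketch only shows that the observed code points fit that pattern; it does not locate the fault.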
This query can be run on https://data.gov.cz/sparql or https://dev.nkod.opendata.cz/sparql:
PREFIX dcat: <http://www.w3.org/ns/dcat#>
CONSTRUCT {
  ?a ?b ?c, ?d .
}
WHERE {
  ?test a dcat:Dataset .
  BIND(IRI(CONCAT("https://a/é/", ENCODE_FOR_URI("Á"))) AS ?a)
  BIND(IRI(CONCAT("https://b/e/", ENCODE_FOR_URI("Á"))) AS ?b)
  BIND(CONCAT("https://c/é/", ENCODE_FOR_URI("Á")) AS ?c)
  BIND(CONCAT("https://d/e/", ENCODE_FOR_URI("Á")) AS ?d)
}
Note that ?test a dcat:Dataset . is not necessary and can be replaced by any pattern that matches something in the graph. It could be omitted, but that triggers this 6.5-year-old issue: https://github.com/openlink/virtuoso-opensource/issues/231 when the query is run directly on the Virtuoso SPARQL endpoint.
When run in Yasgui (https://api.triplydb.com/s/8E0WDV550), it works even without this pattern.
@jakubklimek -- Note that you can drop the ?test a dcat:Dataset . pattern and run the query against both of your listed endpoints, if you either un-tick the box for "Strict checking of void variables" on the SPARQL query form (as noted in the comments on #231) or insert define sql:signal-void-variables 0 before CONSTRUCT in your query. (The define option also works through saved URLs, as on data.gov.cz (query, results) or dev.nkod.opendata.cz (query, results).)
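As a rough illustration of the define workaround, the following Python sketch prepends the pragma to a query and builds a GET URL for one of the endpoints named above. It assumes the endpoint accepts the query in the standard SPARQL Protocol `query` parameter. The helper names are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

ENDPOINT = "https://data.gov.cz/sparql"  # one of the endpoints named in this thread

# The test query without the "?test a dcat:Dataset ." pattern.
QUERY = """\
CONSTRUCT {
  ?a ?b ?c, ?d .
}
WHERE {
  BIND(IRI(CONCAT("https://a/é/", ENCODE_FOR_URI("Á"))) AS ?a)
  BIND(IRI(CONCAT("https://b/e/", ENCODE_FOR_URI("Á"))) AS ?b)
  BIND(CONCAT("https://c/é/", ENCODE_FOR_URI("Á")) AS ?c)
  BIND(CONCAT("https://d/e/", ENCODE_FOR_URI("Á")) AS ?d)
}
"""


def with_void_variable_check_disabled(query: str) -> str:
    """Prepend the Virtuoso pragma mentioned above (hypothetical helper)."""
    return "define sql:signal-void-variables 0\n" + query


def build_request_url(endpoint: str, query: str) -> str:
    """Build a GET URL; urlencode percent-encodes the query text as UTF-8."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, with_void_variable_check_disabled(QUERY))
print(url)
```

The resulting URL can be pasted into a browser or passed to any HTTP client. Result-format parameters are left out because they vary by endpoint.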
Is there a reason you're using a CONSTRUCT query to test, instead of a SELECT? (At a quick glance, the encoding issue appears to happen in both; I just want to be sure I'm not missing something.)
@smalinin @pkleef @iv-an-ru -- Please take a look at this.
@TallTed thanks, I knew there was a workaround for this somewhere.
I used CONSTRUCT just because that is how I discovered the bug before minimizing the example; there is no other reason.
I ran into this issue even without ENCODE_FOR_URI. It therefore seems to be confined to CONCAT. Whenever a non-ASCII Unicode character is used in CONCAT, the result is badly encoded:
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT ?changed WHERE {
  ?dataset a dcat:Dataset .
  BIND(CONCAT("ě", ?dataset) AS ?changed)
}
LIMIT 1
— produces ěhttps://data.gov.cz/zdroj/datové-sady/https---isdv.upv.cz-opendata-upv-package_show-id-vz20210307diff
while —
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT ?changed WHERE {
  ?dataset a dcat:Dataset .
  BIND(CONCAT("e", ?dataset) AS ?changed)
}
LIMIT 1
— produces ehttps://data.gov.cz/zdroj/datové-sady/https---isdv.upv.cz-opendata-upv-package_show-id-vz20210307diff
(note the first character, and then datové-sady vs datové-sady)
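The garbled output above fits the pattern of UTF-8 bytes being read with a single-byte codec. The Python sketch below illustrates that pattern under the assumption that the rendering matches Windows-1252, which maps byte 9B to the visible character ›. Whether Virtuoso actually emits U+203A or the control character U+009B cannot be told from this thread, and the function names are hypothetical.

```python
def misread_utf8(text: str, wrong_codec: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


def undo_misread(text: str, wrong_codec: str = "cp1252") -> str:
    """Reverse misread_utf8; useful for checking that a string fits the pattern."""
    return text.encode(wrong_codec).decode("utf-8")


# The literal passed to CONCAT and a path segment from the dataset IRI.
print(misread_utf8("ě"))            # Ä›
print(misread_utf8("datové-sady"))  # datovÃ©-sady

# Round trip: the damaged form maps back to the intended text.
damaged = misread_utf8("ěhttps://data.gov.cz/zdroj/datové-sady/")
print(undo_misread(damaged))        # ěhttps://data.gov.cz/zdroj/datové-sady/
```

Both the literal and the IRI part of the damaged result fit the pattern, which is consistent with the whole concatenated string being re-encoded once any argument contains a non-ASCII character. This is a diagnostic aid only, not a confirmed explanation of the Virtuoso behaviour.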
Still happening in https://github.com/openlink/virtuoso-opensource/commit/8baf8a90afc842c52b7d2f44af0ca99c88d85b68
Still happening in a7b01eced76532f1fa36fdf665f9f836531bdae0
@smalinin @pkleef @iv-an-ru @hughwilliams @openlink -- Any estimate of when this will be investigated, if not resolved? It seems likely to be causing trouble if not blocking a good number of deployments where Unicode is in broader use.
@pkleef any chance of looking into this while you are dealing with Unicode-related issues? :)