Error in handling of Unicode characters with SPARQL CONCAT function

Open jakubklimek opened this issue 4 years ago • 7 comments

There is an issue with the handling of Unicode characters when the SPARQL CONCAT and ENCODE_FOR_URI functions are combined.

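When used like this: BIND(CONCAT("https://c/é/", ENCODE_FOR_URI("Á")) as ?c), the resulting literal is https://c/\u00E9/\u00C3\u0081%00 which, when decoded, is https://c/é/Ã followed by the control character U+0081 (the two UTF-8 bytes of Á, C3 81, each read as a separate character), which is wrong.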

When used like this (note that in the first string in CONCAT, I replaced é with e): BIND(CONCAT("https://d/e/", ENCODE_FOR_URI("Á")) as ?d), the resulting literal is https://d/e/\u00C1%00, which, when decoded, is https://d/e/Á, which is correct. I am not sure whether the problem is in the CONCAT or the ENCODE_FOR_URI function.
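To make the difference between the two reported literals concrete, here is a minimal Python sketch. It assumes (this is a guess, not something confirmed for Virtuoso) that the UTF-8 bytes of Á are being read as one character per byte when the other CONCAT argument contains a non-ASCII character. The helper names are made up for illustration, and the trailing %00 from the reported output is left out.

```python
def escape_non_ascii(text: str) -> str:
    """Render every non-ASCII character as a \\uXXXX escape, like the reported output."""
    return "".join(ch if ord(ch) < 128 else f"\\u{ord(ch):04X}" for ch in text)


def misread_utf8_as_latin1(text: str) -> str:
    """Encode to UTF-8, then (wrongly) decode each byte as one Latin-1 character."""
    return text.encode("utf-8").decode("latin-1")


# Case d: plain ASCII prefix, Á survives as a single character.
correct = escape_non_ascii("https://d/e/" + "Á")

# Case c (assumed mechanism): Á arrives as two bytes, each treated as a character.
broken = escape_non_ascii("https://c/é/" + misread_utf8_as_latin1("Á"))

print(correct)  # https://d/e/\u00C1
print(broken)   # https://c/\u00E9/\u00C3\u0081
```

Under that assumption the sketch yields the same escapes as the two literals reported above, which would point at a byte/character mix-up rather than at URI encoding as such.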

This query can be run on https://data.gov.cz/sparql or https://dev.nkod.opendata.cz/sparql:

CONSTRUCT {
  ?a ?b ?c, ?d  .
}
WHERE {
  ?test a dcat:Dataset .
  BIND(IRI(CONCAT("https://a/é/", ENCODE_FOR_URI("Á"))) as ?a)
  BIND(IRI(CONCAT("https://b/e/", ENCODE_FOR_URI("Á"))) as ?b)
  BIND(CONCAT("https://c/é/", ENCODE_FOR_URI("Á")) as ?c)
  BIND(CONCAT("https://d/e/", ENCODE_FOR_URI("Á")) as ?d)
}

Note that ?test a dcat:Dataset . is not necessary and can be replaced by anything that matches something in the graph. It could be omitted, but that triggers this 6.5-year-old issue: https://github.com/openlink/virtuoso-opensource/issues/231 when run directly on the Virtuoso SPARQL endpoint.

When run in Yasgui (https://api.triplydb.com/s/8E0WDV550) it works even without this pattern.

jakubklimek avatar Jan 26 '21 15:01 jakubklimek

@jakubklimek -- Note that you can drop the ?test a dcat:Dataset . pattern and run the query against both of your listed endpoints, if you either un-tick the box for "Strict checking of void variables" on the SPARQL query form (as noted in the comments on #231) or insert define sql:signal-void-variables 0 before CONSTRUCT in your query. (The define option also works through saved URLs, as on data.gov.cz (query, results) or dev.nkod.opendata.cz (query, results).)
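For anyone who wants to script this, here is a rough Python sketch of building such a saved URL with the define line in front of CONSTRUCT. It only assembles the URL and sends nothing. It assumes the endpoint accepts the query through the standard SPARQL Protocol `query` parameter on a GET request; `build_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Endpoint taken from the report above.
ENDPOINT = "https://data.gov.cz/sparql"

# The test query with the ?test pattern dropped and the define line prepended.
QUERY = """define sql:signal-void-variables 0
CONSTRUCT {
  ?a ?b ?c, ?d .
}
WHERE {
  BIND(IRI(CONCAT("https://a/é/", ENCODE_FOR_URI("Á"))) AS ?a)
  BIND(IRI(CONCAT("https://b/e/", ENCODE_FOR_URI("Á"))) AS ?b)
  BIND(CONCAT("https://c/é/", ENCODE_FOR_URI("Á")) AS ?c)
  BIND(CONCAT("https://d/e/", ENCODE_FOR_URI("Á")) AS ?d)
}"""


def build_query_url(endpoint: str, query: str) -> str:
    """Percent-encode the query (as UTF-8) into the `query` parameter."""
    # Assumption: GET with a `query` parameter, as in the SPARQL Protocol.
    return f"{endpoint}?{urlencode({'query': query})}"


url = build_query_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL can be pasted into a browser or passed to any HTTP client. The non-ASCII characters in the query are percent-encoded as UTF-8 on the way out, so the request itself should not add further encoding ambiguity.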

Is there a reason you're using a CONSTRUCT query to test, instead of a SELECT? (At a quick glance, the encoding issue appears to happen in both; I just want to be sure I'm not missing something.)

@smalinin @pkleef @iv-an-ru -- Please take a look at this.

TallTed avatar Jan 26 '21 17:01 TallTed

@TallTed thanks, I knew there was a workaround for this somewhere.

I used CONSTRUCT just because that is how I discovered the bug and went on minimizing the example, no other reason.

jakubklimek avatar Jan 26 '21 17:01 jakubklimek

I ran into this issue even without ENCODE_FOR_URI. It therefore seems to be confined to CONCAT. Whenever a Unicode character is used in CONCAT, the result is badly encoded:

PREFIX dcat: <http://www.w3.org/ns/dcat#>

SELECT ?changed WHERE {
  ?dataset a dcat:Dataset .
  BIND(CONCAT("ě", ?dataset) AS ?changed)
}
LIMIT 1

— produces ěhttps://data.gov.cz/zdroj/datové-sady/https---isdv.upv.cz-opendata-upv-package_show-id-vz20210307diff while —

PREFIX dcat: <http://www.w3.org/ns/dcat#>

SELECT ?changed WHERE {
  ?dataset a dcat:Dataset .
  BIND(CONCAT("e", ?dataset) AS ?changed)
}
LIMIT 1

— produces ehttps://data.gov.cz/zdroj/datové-sady/https---isdv.upv.cz-opendata-upv-package_show-id-vz20210307diff

(note the first character and then datové-sady vs datové-sady)

jakubklimek avatar Apr 22 '21 05:04 jakubklimek

Still happening in https://github.com/openlink/virtuoso-opensource/commit/8baf8a90afc842c52b7d2f44af0ca99c88d85b68

jakubklimek avatar May 01 '21 04:05 jakubklimek

Still happening in a7b01eced76532f1fa36fdf665f9f836531bdae0

jakubklimek avatar Jul 23 '21 11:07 jakubklimek

@smalinin @pkleef @iv-an-ru @hughwilliams @openlink -- Any estimate of when this will be investigated, if not resolved? It seems likely to be causing trouble if not blocking a good number of deployments where Unicode is in broader use.

TallTed avatar Jul 23 '21 14:07 TallTed

@pkleef any chance of looking into this while you are dealing with Unicode-related issues? :)

jakubklimek avatar Apr 12 '22 12:04 jakubklimek