Quotes not escaped in result for CONSTRUCT queries

Open hannahbast opened this issue 1 year ago • 7 comments

The following CONSTRUCT query on the olympics dataset yields an invalid triple, where the literal contains unescaped quotes:

curl -s https://qlever.cs.uni-freiburg.de/api/olympics -H "Accept: text/tab-separated-values" -H "Content-type: application/sparql-query" --data "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX athlete: <http://wallscope.co.uk/resource/olympics/athlete/> CONSTRUCT { ?athlete rdfs:label ?athlete_name } WHERE { VALUES ?athlete { athlete:GabrielleMarieGabbyAdcockWhite } ?athlete rdfs:label ?athlete_name }"

Here is the result (a single line):

<http://wallscope.co.uk/resource/olympics/athlete/GabrielleMarieGabbyAdcockWhite>	<http://www.w3.org/2000/01/rdf-schema#label>	"  Gabrielle Marie "Gabby" Adcock (White-)"@en

hannahbast avatar Sep 23 '22 16:09 hannahbast

@RobinTF @joka921 Is this an old bug or did we introduce this sometime after the PRs for the CONSTRUCT queries?

hannahbast avatar Sep 23 '22 16:09 hannahbast

@hannahbast I'm pretty sure I considered this when implementing this originally (see #536) and the mechanism seems to still be in place: https://github.com/ad-freiburg/qlever/blob/dd61a6aa620b6e6b00e9f7eff6900dd8b4de7859/src/parser/RdfEscaping.cpp#L265-L278 However, here's my guess on what's going on: The TSV format is not as well specified as CSV, so there are fewer rules regarding escaping: https://github.com/ad-freiburg/qlever/blob/dd61a6aa620b6e6b00e9f7eff6900dd8b4de7859/src/parser/RdfEscaping.h#L76-L83

So for the TSV format there is no need to add escaping: the tabs are unambiguous, so it is clear where each value starts and stops. But " Gabrielle Marie "Gabby" Adcock (White-)"@en is not a regular RDF literal, it is a quoted literal (see https://www.w3.org/TR/turtle/#turtle-literals ). It looks like this quoted literal has not been properly escaped, either before inserting it into the knowledge base or after reading it from the knowledge base and turning it into a quoted literal (not sure how it is stored internally). So the string should really print:

<http://wallscope.co.uk/resource/olympics/athlete/GabrielleMarieGabbyAdcockWhite>	<http://www.w3.org/2000/01/rdf-schema#label>	'  Gabrielle Marie "Gabby" Adcock (White-)'@en

or

<http://wallscope.co.uk/resource/olympics/athlete/GabrielleMarieGabbyAdcockWhite>	<http://www.w3.org/2000/01/rdf-schema#label>	'''  Gabrielle Marie "Gabby" Adcock (White-)'''@en

or

<http://wallscope.co.uk/resource/olympics/athlete/GabrielleMarieGabbyAdcockWhite>	<http://www.w3.org/2000/01/rdf-schema#label>	"""  Gabrielle Marie "Gabby" Adcock (White-)"""@en

But this issue is unrelated to the TSV format (when using CSV it should escape the quotes, but after unescaping the sequence would again be invalid just like before escaping). So I presume this is an old bug, potentially present for SELECT queries too.
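To illustrate the point about CSV, here is a minimal Python sketch. It does not use QLever's escaping code; it uses Python's standard `csv` module as a stand-in for the outer escaping layer, and `csv_round_trip` is a hypothetical helper name. It shows that a correct outer layer returns exactly the string it was given, so a literal that was invalid before CSV escaping is still invalid after unescaping.

```python
import csv
import io

# The literal as returned by the endpoint: the inner quotes are not escaped.
broken_literal = '"  Gabrielle Marie "Gabby" Adcock (White-)"@en'


def csv_round_trip(field: str) -> str:
    """Write one field as CSV, then parse it back (hypothetical helper)."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow([field])  # the writer doubles embedded quotes
    encoded = buffer.getvalue()
    return next(csv.reader(io.StringIO(encoded)))[0]


recovered = csv_round_trip(broken_literal)
# The outer layer is lossless, so the inner problem survives unchanged.
print(recovered == broken_literal)
```

The outer layer can only be asked to preserve its input; repairing the literal has to happen where the literal is assembled.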

RobinTF avatar Sep 23 '22 17:09 RobinTF

@RobinTF Thanks a lot for the quick answer. I just realized that I made a slight mistake when I posted the issue. I wanted to paste the curl request with text/turtle accept header. This gives the same mistake, which means that the result is not valid Turtle. I guess, the question then is why the code you quoted is not activated.

curl -s https://qlever.cs.uni-freiburg.de/api/olympics -H "Accept: text/turtle" -H "Content-type: application/sparql-query" --data "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX athlete: <http://wallscope.co.uk/resource/olympics/athlete/> CONSTRUCT { ?athlete rdfs:label ?athlete_name } WHERE { VALUES ?athlete { athlete:GabrielleMarieGabbyAdcockWhite } ?athlete rdfs:label ?athlete_name }"
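The claim that the returned literal is not valid Turtle can be checked with a small Python sketch. The regular expression below is my own transcription of the `STRING_LITERAL_QUOTE`, `ECHAR` and `UCHAR` productions from the Turtle grammar plus a simplified language tag; it covers only short double-quoted literals and is not a full Turtle parser. The name `is_valid_short_literal` is hypothetical.

```python
import re

# Short double-quoted Turtle literal with an optional language tag.
# Inside the quotes, ", \, LF and CR may only appear as escape sequences.
STRING_LITERAL_QUOTE = re.compile(
    r'"(?:[^"\\\n\r]|\\[tbnrf"\'\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*"'
    r'(?:@[a-zA-Z]+(?:-[a-zA-Z0-9]+)*)?'
)


def is_valid_short_literal(text: str) -> bool:
    """Return True if the whole string is one short double-quoted literal."""
    return STRING_LITERAL_QUOTE.fullmatch(text) is not None


# Literal as returned by the CONSTRUCT query (inner quotes unescaped).
returned = '"  Gabrielle Marie "Gabby" Adcock (White-)"@en'
# The same literal with the inner quotes escaped.
escaped = r'"  Gabrielle Marie \"Gabby\" Adcock (White-)"@en'

print(is_valid_short_literal(returned))  # False
print(is_valid_short_literal(escaped))   # True
```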

Here is what the literal looks like in the original data set (which is valid N-Triples and valid Turtle):

"  Gabrielle Marie \"Gabby\" Adcock (White-)"@en

hannahbast avatar Sep 23 '22 22:09 hannahbast

@hannahbast

> I guess, the question then is why the code you quoted is not activated.

My escaping code is only designed to deliver RDF values via CSV or TSV so that a parser gets the original values back. What's happening here, though, is that the data is already invalid RDF in the first place. I don't know where exactly in the code the RDF literals are created, but the step that creates them is not properly applying the escaping that the RDF standard defines.

TL;DR My code assumes valid RDF and would escape it properly, but for invalid RDF my code does not help and is not supposed to help either.

RobinTF avatar Sep 24 '22 00:09 RobinTF

@RobinTF Are you saying that " Gabrielle Marie \"Gabby\" Adcock (White-)"@en (this is what the literal looks like in the input data) is an invalid RDF literal?

hannahbast avatar Sep 24 '22 00:09 hannahbast

@hannahbast " Gabrielle Marie \"Gabby\" Adcock (White-)"@en is a valid RDF quoted literal because the escape sequences are used properly. What I'm saying is that the string Gabrielle Marie "Gabby" Adcock (White-) is most likely stored in internal memory, and when turning the string back into a literal the code forgets to escape the string again and simply adds quotes to both ends. This has nothing to do with the actual export of the data; the piece of code that assembles the quoted RDF literal is the problem (which I haven't touched ever, so no idea where it is). The "export" step just adds an additional escape layer on top, depending on the output format. The outer escaping layer (which I implemented) is fine, the inner layer has this bug.
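A minimal Python sketch of the suspected inner-layer behaviour, under the assumption that the unescaped content is what is stored internally. None of these functions exist in QLever; `to_literal_buggy`, `to_literal_fixed` and `escape_for_turtle_literal` are hypothetical names used only to contrast "just add quotes" with "escape, then add quotes".

```python
def escape_for_turtle_literal(content: str) -> str:
    """Escape the characters that may not appear raw in a short "..." literal."""
    # Backslash must come first, otherwise the backslashes added for the
    # other characters would themselves be doubled.
    replacements = [("\\", "\\\\"), ('"', '\\"'), ("\n", "\\n"), ("\r", "\\r")]
    for raw, escaped in replacements:
        content = content.replace(raw, escaped)
    return content


def to_literal_buggy(content: str, lang: str) -> str:
    # Suspected behaviour: quotes are added, but the content is not re-escaped.
    return f'"{content}"@{lang}'


def to_literal_fixed(content: str, lang: str) -> str:
    # Re-escape the stored content before wrapping it in quotes.
    return f'"{escape_for_turtle_literal(content)}"@{lang}'


stored = '  Gabrielle Marie "Gabby" Adcock (White-)'  # assumed internal form
print(to_literal_buggy(stored, "en"))  # matches the invalid output reported above
print(to_literal_fixed(stored, "en"))  # inner quotes come out as \"
```

With the fixed variant, the output is again the escaped form found in the input data, and the outer CSV or TSV layer can then be applied on top without further changes.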

RobinTF avatar Sep 24 '22 01:09 RobinTF

To my surprise, this bug is still there. I thought it had been fixed by #843.

hannahbast avatar May 28 '23 15:05 hannahbast