Handling of escaped characters (ECHAR in the SPARQL/Turtle grammar) is wrong for the SparqlParser, the TurtleParser and the regex FILTER parser

Open hannahbast opened this issue 5 years ago • 3 comments

The following query takes 164 seconds on http://qlever.informatik.uni-freiburg.de/Wikidata_Full :

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX schema: <http://schema.org/>
SELECT ?x WHERE {
  ?x rdf:type schema:Article .
  FILTER regex(?x, "^<https://en.wikipedia.org/wiki/Albert_Ein")
}

Here are the specs from the execution tree:

FILTER ?x regex ^<https://en.wikipedia.org/wiki/Albert_Ein
Size: 29 x 1
Cols: ?x
Time: 162,932ms

INDEX SCAN ?x <Article>
Size: 67,677,598 x 1
Cols: ?x
Time: 248ms

I thought that a prefix regex FILTER was implemented by doing one or two binary searches on the sorted IDs and then materializing strings only for the result IDs (only 29 in this case).

However, the high query time indicates that the strings are looked up for all 67,677,598 IDs. Why?
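
To make the expected behavior concrete, here is a minimal sketch of a prefix filter done with binary searches. It is not QLever's code: the names `prefix_range` and `filter_ids_by_prefix` are hypothetical, and it assumes that an ID is simply the position of a string in a vocabulary sorted by code point.

```python
from bisect import bisect_left


def prefix_range(sorted_vocab, prefix):
    """Return (lo, hi) so that sorted_vocab[lo:hi] are exactly the entries
    that start with `prefix`. Assumes code-point ordering (hypothetical)."""
    if not prefix:
        return 0, len(sorted_vocab)
    lo = bisect_left(sorted_vocab, prefix)
    # Smallest string that is greater than every string with this prefix:
    # increment the last character. (Assumes the last character is not the
    # maximum code point.)
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect_left(sorted_vocab, upper)
    return lo, hi


def filter_ids_by_prefix(sorted_ids, sorted_vocab, prefix):
    """Keep only the IDs whose string starts with `prefix`.
    Two binary searches on the vocabulary, two on the sorted ID column;
    no string is looked up for IDs outside the result."""
    lo, hi = prefix_range(sorted_vocab, prefix)
    start = bisect_left(sorted_ids, lo)
    end = bisect_left(sorted_ids, hi)
    return sorted_ids[start:end]
```

With this scheme the cost is logarithmic in the size of the input, independent of how many of the 67,677,598 IDs fail the filter.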

hannahbast avatar Dec 14 '19 21:12 hannahbast

Fixed via #295

joka921 avatar Dec 24 '19 11:12 joka921

@joka921 I don't understand how this is fixed by #295 (which was about the pattern trick not being used in some cases). The problem for the query above is that the prefix FILTER takes forever, although it could be fast.

I just tried the query again on the current version (where #295 has been incorporated) and the problem is still there.

hannahbast avatar Dec 24 '19 18:12 hannahbast

I had a look at this and found the following:

  • The actual problem is simple: "^<https://en.wikipedia.org/wiki/Albert_Ein" is not a simple prefix regex, because it contains an unescaped . which means "match any character". So the actual behavior in your case is correct.

  • You probably wanted to escape the . characters. To my understanding, this requires two backslashes, one for SPARQL and one for the regex engine: FILTER regex(?x, "^<https://en\\.wikipedia\\.org/wiki/Albert_Ein")

  • This escaping is broken on many levels in the current parsing: the actual lexing regex is wrong, the handling of escapes in the regex filter parser is wrong, and SPARQL escape handling currently does not exist. I will have a closer look at this.
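
The two escaping layers described above can be sketched as follows. This is only an illustration under stated assumptions, not the QLever parser: `unescape_sparql_string` and `literal_prefix` are hypothetical names, the first resolves the ECHAR escapes of the SPARQL grammar (`\t \b \n \r \f \" \' \\`), and the second decides whether the resulting regex is a plain prefix that a binary search could handle.

```python
# ECHAR in the SPARQL grammar: a backslash followed by one of t b n r f \ " '
_ECHAR = {"t": "\t", "b": "\b", "n": "\n", "r": "\r", "f": "\f",
          '"': '"', "'": "'", "\\": "\\"}

# Characters with a special meaning in a regex (hypothetical, simplified set).
_REGEX_META = set(".^$*+?()[]{}|\\")


def unescape_sparql_string(body):
    """Layer 1: resolve ECHAR escapes in the body of a SPARQL string
    literal (surrounding quotes already stripped)."""
    out, i = [], 0
    while i < len(body):
        if body[i] == "\\":
            if i + 1 >= len(body) or body[i + 1] not in _ECHAR:
                raise ValueError("invalid escape at position %d" % i)
            out.append(_ECHAR[body[i + 1]])
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


def literal_prefix(pattern):
    """Layer 2: if `pattern` is '^' followed only by literal characters and
    backslash-escaped metacharacters, return the literal prefix; else None."""
    if not pattern.startswith("^"):
        return None
    out, i = [], 1
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # Only an escaped metacharacter counts as a literal; classes
            # such as \d or a trailing backslash do not.
            if i + 1 < len(pattern) and pattern[i + 1] in _REGEX_META:
                out.append(pattern[i + 1])
                i += 2
            else:
                return None
        elif ch in _REGEX_META:
            return None  # e.g. an unescaped '.', which matches any character
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# The literal as written in the query text (two backslashes per dot).
as_written = r"^<https://en\\.wikipedia\\.org/wiki/Albert_Ein"
regex = unescape_sparql_string(as_written)  # one backslash per dot remains
print(literal_prefix(regex))
print(literal_prefix("^<https://en.wikipedia.org/wiki/Albert_Ein"))
```

Under these assumptions the doubly escaped pattern is recognized as a plain prefix, while the original pattern with unescaped dots is not and would have to go through the regex engine.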

joka921 avatar Jan 03 '20 12:01 joka921