Write a proper SPARQL parser using ANTLR4

Open joka921 opened this issue 5 years ago • 9 comments

This subsumes many of the other issues below.

  • First step:
    • Implement a SPARQL parser that supports exactly the same subset as the current one, but with a better structure and correct, automated lexing.

joka921 avatar Apr 16 '19 10:04 joka921

@manonthegithub might also be interested in your progress on this, so we don't duplicate the work.

niklas88 avatar Apr 16 '19 10:04 niklas88

@joka921 @manonthegithub on the internal fork we do have an ANTLR SPARQL grammar for the completion script that could be used for this.

niklas88 avatar Apr 16 '19 10:04 niklas88

I have seen this grammar and am already using it

joka921 avatar Apr 16 '19 10:04 joka921

Adding to this: the current SPARQL parser also breaks if there isn't a space before the . at the end of a triple, which is often the case for Wikidata examples.

niklas88 avatar May 15 '19 08:05 niklas88

Ok I tried quickfixing the . issue because it just happens so often. Turns out SPARQL is quite weird here because the . may appear inside literals, prefixed names and IRIs.

For example the following query works (in Blazegraph):

SELECT ?item WHERE {
  ?item wdt:P31 wd:Q2934.?item wdt:P39 wd:Q41240317
}

Reversing the second triple with ^, the following also works:

SELECT ?item WHERE {
  ?item wdt:P31 wd:Q2934. wd:Q41240317 ^wdt:P39 ?item
}

However, removing the space after the . breaks parsing, even though no space is needed at the same position when the ? disambiguates. So yeah, we really should use a proper parser that naturally handles this weirdness.
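One plausible explanation for this behaviour comes from the SPARQL 1.1 grammar: the local part of a prefixed name may contain . and : in the middle but may not end with a ., and a lexer takes the longest match. The following is a minimal sketch in Python, not QLever's or Blazegraph's actual lexer. The token patterns are ASCII-only simplifications of the grammar rules, and the names `TOKEN` and `tokenize` are made up for illustration.

```python
import re

# Simplified, ASCII-only approximations of a few SPARQL 1.1 tokens.
# PNAME: the local part may contain '.' and ':' in the middle,
# but its last character must not be '.'.
TOKEN = re.compile(r"""
    (?P<VAR>[?$][A-Za-z0-9_]+)
  | (?P<PNAME>[A-Za-z][A-Za-z0-9_-]*:[A-Za-z0-9_:](?:[A-Za-z0-9_.:-]*[A-Za-z0-9_:-])?)
  | (?P<CARET>\^)
  | (?P<DOT>\.)
  | (?P<WS>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Split text into (kind, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


# '.' directly followed by '?' cannot stay inside the prefixed name,
# so it becomes a separate DOT token.
print(tokenize("wd:Q2934.?item"))

# '.' directly followed by another prefixed name is swallowed:
# the whole string lexes as a single PNAME.
print(tokenize("wd:Q2934.wd:Q41240317"))
```

Under these assumptions, the . before ?item is split off because a variable cannot continue a prefixed name, while without the space `wd:Q2934.wd:Q41240317` is one long (and unintended) prefixed name, so the triple separator disappears.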

niklas88 avatar Jun 06 '19 08:06 niklas88

@joka921 note that the current ANTLR grammar doesn't support the predicate paths that #244 will soon add. I'll look into this so beware there will be some changes.

niklas88 avatar Jun 06 '19 09:06 niklas88

@floriankramer just a note that this would also add # comments which aren't supported by the new lexer either.

niklas88 avatar Aug 14 '19 12:08 niklas88

@niklas88 Although adding those into the lexer would be relatively easy (simply consume everything up to and including the next newline when a # is found outside of another token type).
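A minimal sketch of that idea in Python, not QLever's actual lexer: IRIs and string literals are matched first so that a # inside them is left alone, and any other # starts a comment that runs up to and including the next newline. The patterns are simplified (short string literals only), and the names `COMMENT_AWARE` and `strip_comments` are hypothetical.

```python
import re

# Alternatives are tried at the leftmost position first, so a '#'
# inside an IRI or a string literal is consumed as part of that token.
COMMENT_AWARE = re.compile(r"""
    (?P<IRI><[^<>"{}|^`\\\x00-\x20]*>)
  | (?P<STRING>"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*')
  | (?P<COMMENT>\#[^\n]*\n?)
""", re.VERBOSE)


def strip_comments(query):
    """Remove '#' comments that occur outside IRIs and string literals."""
    def keep_or_drop(match):
        if match.lastgroup == "COMMENT":
            # The comment consumes its newline; emit one line break
            # so the neighbouring tokens stay separated.
            return "\n" if match.group().endswith("\n") else ""
        return match.group()
    return COMMENT_AWARE.sub(keep_or_drop, query)


query = (
    'SELECT ?x WHERE { # find things\n'
    '  ?x <http://example.org/p#q> "a # b" . # trailing\n'
    '}'
)
print(strip_comments(query))
```

In a real lexer the comment would simply produce no token rather than being rewritten in the input string; the string-rewriting form is used here only to keep the example short.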

floriankramer avatar Aug 14 '19 13:08 floriankramer

Update:

We are finally making progress on this. We already have a complete grammar, and the issue is now assigned to @Qup42.

joka921 avatar May 15 '22 14:05 joka921

This has been done and it was indeed a milestone for QLever.

hannahbast avatar May 28 '23 14:05 hannahbast