
EBNF exception symbol support

Open eerohele opened this issue 8 years ago • 5 comments

I'd like to use Instaparse to parse XPath expressions. I have an EBNF grammar that works otherwise (well, I think so, at least), but there are two rules that don't work:

NCName          ::=     Name - (Char* ':' Char*)    /* An XML Name, minus the ":" */

And:

CommentContents ::=     (Char+ - (Char* ('(:' | ':)') Char*))

Where Char is:

Char            ::=     #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]  /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

The rule for Name is a bit longer so I won't copy-paste it here, but it's available here.

When parsing my EBNF file, Instaparse throws this error:

java.lang.RuntimeException: - occurs on the right-hand side of your grammar, but not on the left

If I understand correctly, Instaparse doesn't support the EBNF exception symbol. If so, are there any plans to support it, or is my best bet to try to rewrite those rules using regular expressions? I'm just asking because the rules for Char and Name are pretty hefty, so I'm not sure what the best approach here is.

eerohele, Nov 05 '15 07:11

There are a few different versions of the EBNF standard floating around, and whatever version I originally consulted didn't have a reference to the exception symbol, so this is the first I'm hearing about it. I just looked it up, though, so I know what you're talking about.

It is certainly a goal of instaparse to make it possible to just paste in standard EBNF grammars with little to no modification, so now that I know about it, I'd like to eventually investigate this and get it included. It appears that the standard severely restricts what can come after the - symbol. To avoid problematic recursion, it looks like the spec says the right-hand side needs to expand to something simple (like an alternation of plain symbols), so I'm not sure your examples for NCName and CommentContents would even fall within the scope of the spec's definition of the exception symbol.

In the meantime, it seems to me that negative lookahead should be a viable substitute for the exception symbol. Simply translate A - B to (!B) A and I think that should work.
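One difference between the two operators is worth keeping in mind. The sketch below uses plain Java regular expressions as a stand-in, so instaparse itself is not involved, and the NAME pattern is a simplified, ASCII-only assumption rather than the real XML Name rule. A - B accepts a string matched by A unless B matches that same string, whereas a negative lookahead refuses to start whenever B matches any prefix of the remaining input. With B of the form Char* ':' Char*, the lookahead can therefore also react to a colon that lies beyond the end of the name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExceptionVsLookahead {
    // Simplified ASCII-only stand-in for the XML Name rule (assumption).
    static final String NAME = "[A-Za-z_:][A-Za-z0-9_:.-]*";

    // "A - B" reading: take the Name at the start of the input,
    // then reject it if that same text contains a colon.
    static String exceptionStyle(String input) {
        Matcher m = Pattern.compile(NAME).matcher(input);
        if (!m.lookingAt()) {
            return null;
        }
        String name = m.group();
        return name.matches(".*:.*") ? null : name;
    }

    // "(!B) A" reading: refuse to start if B matches a prefix of
    // everything that is still left in the input.
    static String lookaheadStyle(String input) {
        Matcher m = Pattern.compile("(?!.*:.*)" + NAME).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"foo", "foo:bar", "foo bar:baz"}) {
            System.out.println(s + " -> exception: " + exceptionStyle(s)
                + ", lookahead: " + lookaheadStyle(s));
        }
    }
}
```

In this sketch both readings agree on "foo" and "foo:bar", but for "foo bar:baz" the exception reading still yields "foo" while the lookahead reading finds nothing, because the colon further along satisfies B. Restricting the exclusion to one character at a time avoids that effect.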

However, you'll get the best performance if you can translate these rules into regexes. Also, your rule for Char should probably be a regex, since Instaparse's EBNF mode only supports character ranges through regexes. (Instaparse's ABNF mode does directly support character ranges, but with a slightly different syntax; see https://github.com/Engelberg/instaparse/blob/master/docs/ABNF.md).

Engelberg, Nov 05 '15 08:11

Thanks for the quick reply!

I will give negative lookaheads a go. Regarding character ranges, I've already translated them into regexps, although I'm not quite sure whether I've got the syntax 100% right. For example, I changed Char to:

Char ::= #"\\u9"
       | #"\\uA"
       | #"\\uD"
       | #"[\\u20-\\uD7FF]"
       | #"[\\uE000-\\uFFFD]"
       | #"[\\u10000-\\u10FFFF]"

eerohele, Nov 05 '15 09:11

A few things I noticed:

  1. You need \x{123456} instead of \u123456 for Unicode code points that aren't 4 digits.
  2. Make sure you know how many backslashes you want for those escape characters. You need only ONE backslash if you're reading the grammar from a file, but you need TWO if you're writing it as a string in Clojure code.
  3. The most performant option is to combine all the char ranges into one regex.

Here is my edited version, with those three points taken into consideration (assuming you want single backslash):

Char ::= #"[\x{9}\x{A}\x{D}\x{20}-\uD7FF\uE000-\uFFFD\x{10000}-\x{10FFFF}]"
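The points above can be checked directly against java.util.regex, which is the regex engine these patterns end up in. The following is only a sketch: the class and helper names are made up for illustration, and the doubled backslashes are there because the patterns sit inside Java string literals, so the engine still receives the single-backslash form shown above.

```java
import java.util.regex.Pattern;

class XPathCharRegex {
    // Combined Char class from the thread. Each escape carries two
    // backslashes only because this is a Java string literal.
    static final Pattern CHAR = Pattern.compile(
        "[\\x{9}\\x{A}\\x{D}\\x{20}-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // The earlier five-digit attempt. The four-digit escape consumes
    // only "1000", so the class becomes: U+1000, the range from the
    // digit zero up to U+10FF, and two literal F characters.
    static final Pattern MISREAD = Pattern.compile("[\\u10000-\\u10FFFF]");

    // True if the pattern matches the single given code point.
    static boolean matches(Pattern p, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        int[] samples = {0x8, 0x9, 0x41, 0xD7FF, 0xFFFE, 0x10000, 0x10FFFF};
        for (int cp : samples) {
            System.out.println(String.format("U+%04X", cp)
                + " CHAR=" + matches(CHAR, cp)
                + " MISREAD=" + matches(MISREAD, cp));
        }
    }
}
```

With the combined class, U+0009 and U+10000 are accepted while U+0008 and U+FFFE are rejected. The five-digit form compiles without complaint but accepts "A" and rejects U+10000, which is why the braces syntax is needed for code points above U+FFFF.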

aengelberg, Nov 05 '15 22:11

@aengelberg: Many thanks for the suggestions! Your version of Char works great, and I can use the information you provided to fix the other rules, too.

I haven't yet quite managed to wrangle CommentContents and NCName into regexps that work perfectly, but that's due to my lacking regexp-fu, not Instaparse. I'll keep working on it and post the EBNF I end up with here in case someone else finds it useful.
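For CommentContents, one possible regex shape is a per-character negative lookahead: accept a Char only if neither "(:" nor ":)" starts at that position. The sketch below tests that idea with java.util.regex; it is an assumption about how the rule could be translated, not something confirmed for the final grammar, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentContentsRegex {
    // The combined Char class from the thread, as a Java string literal.
    static final String CHAR =
        "[\\x{9}\\x{A}\\x{D}\\x{20}-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]";

    // One or more Chars, none of which begins "(:" or ":)".
    static final Pattern COMMENT_CONTENTS =
        Pattern.compile("(?:(?!\\(:|:\\))" + CHAR + ")+");

    // True if the whole string is valid comment contents.
    static boolean isCommentContents(String s) {
        return COMMENT_CONTENTS.matcher(s).matches();
    }

    // Run of comment contents at the start of the input, or null.
    static String leadingContents(String s) {
        Matcher m = COMMENT_CONTENTS.matcher(s);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(isCommentContents("a : b ( c )"));
        System.out.println(isCommentContents("has (: inside"));
        System.out.println(leadingContents("note :) rest"));
    }
}
```

A lone colon or parenthesis is still allowed; only the two-character delimiters stop the match, so the run ends just before a closing ":)". In a grammar file the same pattern would be written with single backslashes. NCName could probably be handled along the same lines by leaving ':' out of the Name character classes.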

In the meantime, you can close this issue as far as I'm concerned, unless you want to keep it open for tracking the exception symbol issue.

eerohele, Nov 09 '15 07:11

Glad to hear you're on the right track now. I'm going to keep the issue open for the exception symbol.

Engelberg, Nov 10 '15 22:11