EBNF "special sequence" causes parse error

Open pmonks opened this issue 7 months ago • 4 comments

An EBNF "special sequence" (? ... ?) does not appear to be parsed correctly by instaparse. For example, the EBNF grammar loaded in the steps below fails with an error at line 162, column 28 (the question mark that opens the first of its two special sequences).

Steps to reproduce

  1. Start a REPL with instaparse in the classpath
  2. Run this code:
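(require '[clojure.string :as str])
(require '[instaparse.core :as ip])

;; Download the grammar file, which also contains a trailing examples section
(def ebnf-with-examples (slurp "https://raw.githubusercontent.com/dariusz-wozniak/fuzzy-dates/refs/heads/main/grammar/fuzzy-date.ebnf"))
;; Keep only the grammar text that precedes the examples marker
(def ebnf-only (first (str/split ebnf-with-examples #"\Q(* --- Examples --- *)\E")))
;; Throws the grammar parse error shown under "Actual result"
(def p (ip/parser ebnf-only))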

Expected result

p contains an instaparse parser for this EBNF grammar.

Actual result

Execution error at instaparse.util/throw-runtime-exception (util.clj:7).
Error parsing grammar specification:
Parse error at line 162, column 28:
character_in_calendar_id = ? any printable character except ')' and newline ? ;
                           ^
Expected one of:
!
&
ε
eps
EPSILON
epsilon
Epsilon
<
(
{
[
#"#\"[^\"\\]*(?:\\.[^\"\\]*)*\"(?x) #Double-quoted regexp"
#"#'[^'\\]*(?:\\.[^'\\]*)*'(?x) #Single-quoted regexp"
#"\"[^\"\\]*(?:\\.[^\"\\]*)*\"(?x) #Double-quoted string"
#"'[^'\\]*(?:\\.[^'\\]*)*'(?x) #Single-quoted string"
(*
#"[^, \r\t\n<>(){}\[\]+*?:=|'"#&!;./]+(?x) #Non-terminal"

Other considerations

I realise that properly supporting EBNF special sequences opens a can of worms around how their contents are to be interpreted, but at a minimum a more specific error message would be valuable.

pmonks avatar Jun 18 '25 03:06 pmonks

A workaround that applies to this specific grammar only:

(def ebnf (-> ebnf-only
              (str/replace "? any printable character except newline ?" "#\"[\\p{Print}&&[^\\n]]*\"")
              (str/replace "? any printable character except ')' and newline ?" "#\"[\\p{Print}&&[^\\n\\)]]*\"")))
(def f (ip/parser ebnf))

This then results in another parse error; however, that one is due to a bug in the grammar itself.

pmonks avatar Jun 18 '25 03:06 pmonks

This part of the grammar, character_in_calendar_id = ? any printable character except ')' and newline ? ;, doesn't really look to me like something that is meant to be executable as a precise specification. It appears to be just a comment to human readers, which would then need to be interpreted. In the context of instaparse, you'd need to rewrite it as a regex, which it looks like you've already figured out how to do.

Engelberg avatar Jun 18 '25 22:06 Engelberg

I understand that sometimes in descriptive EBNFs, ? characters are used to demarcate these human-readable text sections that need to be interpreted, but instaparse has no ability to interpret or assign meaning to this sort of free-form textual description, so any such section would always have to be rewritten. There's no reasonable way to support these "special sequences", which are, by definition, all the things that need to be described in human language because they are too imprecise to be a formal, actionable specification. I instead focused on the standard use of ? to mean optional (i.e., 0 or 1), which is documented.

Engelberg avatar Jun 18 '25 22:06 Engelberg

The ?-delimited sections ("special sequences") are officially part of the EBNF specification (sections 4.19 and 4.20), but as you say, knowing what to do with the contents of a special sequence is beyond what instaparse can do by itself.

IMHO an MVP would be to parse special sequences as per the spec but then return / throw a specific error, perhaps something like: 'Your EBNF grammar contains a special sequence (<rule name>: "<content of sequence>"). EBNF special sequences are not supported by instaparse.'

A valuable middle ground might be to accept a predefined set of special sequence values e.g. "all printable characters", POSIX regexes, etc. Anything else would still return / throw some kind of "unsupported special sequence value" error.

A more sophisticated hypothetical future solution might involve providing an affordance for callers to specify replacements for special sequence content - e.g. as a map of replacements, or via a callback, or whatever.

pmonks avatar Jun 19 '25 16:06 pmonks
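
The replacement-map idea from the last comment can be approximated today as a preprocessing step that runs before the grammar reaches instaparse. The following is only a minimal sketch under stated assumptions: `replace-special-sequences` is a hypothetical helper, not part of instaparse, and it assumes that `?` appears in the grammar solely as a special-sequence delimiter (never as the "optional" suffix, and never inside quoted strings or comments). The two regexes are the ones from the workaround earlier in this thread.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper (NOT an instaparse API).
;; Rewrites every EBNF special sequence `? ... ?` in `grammar` using
;; `replacements`, a map from the trimmed description text to the grammar
;; text that should stand in for it. Unknown descriptions throw a specific
;; error instead of being passed on to the grammar parser.
;; Assumption: `?` is used only to delimit special sequences.
(defn replace-special-sequences
  [grammar replacements]
  (str/replace grammar
               #"\?([^?]*)\?"
               (fn [[_ content]]
                 (let [description (str/trim content)]
                   (if-let [replacement (get replacements description)]
                     replacement
                     (throw (ex-info (str "Unsupported EBNF special sequence: \""
                                          description "\"")
                                     {:special-sequence description})))))))

;; Replacement table for the two special sequences reported in this issue.
(def replacements
  {"any printable character except newline"
   "#\"[\\p{Print}&&[^\\n]]*\""
   "any printable character except ')' and newline"
   "#\"[\\p{Print}&&[^\\n\\)]]*\""})

(def sample-rule
  "character_in_calendar_id = ? any printable character except ')' and newline ? ;")

;; Returns the rule with the special sequence swapped for a regex terminal:
;; character_in_calendar_id = #"[\p{Print}&&[^\n\)]]*" ;
(replace-special-sequences sample-rule replacements)
```

Calling the helper with an empty map makes every special sequence throw, which roughly corresponds to the "specific error" MVP described above, although it reports only the sequence content and not the rule name. The rewritten string could then be passed to `ip/parser` in place of `ebnf-only`.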