rascal icon indicating copy to clipboard operation
rascal copied to clipboard

Syntax holes that range over regular expressions (except lists) do not parse correctly

Open jurgenvinju opened this issue 2 years ago • 8 comments

Describe the bug

Reported by @stevenKlusener; thanks!

Simplified as follows:

module SequenceMatch

syntax To = "TO" ("a" ",");

layout WS = [\ \t\n\r]*;

bool main() {
    return (To) `TO <("a" ",") _x>` 
        := (To) `TO a,`;
}
  • This will give a parse error in the concrete syntax pattern with the sequence hole <("a" ",") _x>
  • More complex code with a list around the sequence, or with non-terminals instead of literals in the sequence, will fail in the same way.
  • It is expected that a subtle bug in the generation of the grammar for the holes is the cause of this issue.

jurgenvinju avatar Aug 15 '23 07:08 jurgenvinju

This is the rule that is generated for the sequence hole:

prod(
          label(
            "$MetaHole",
            seq([
                lit("a"),
                layouts("L"),
                lit(",")
              ])),
          [
            \char-class([range(0,0)]),
            lit("seq([lit(\"a\"),layouts(\"L\"),lit(\",\")])"),
            lit(":"),
            iter(\char-class([range(48,57)])),
            \char-class([range(0,0)])
          ],
          {tag("holeType"(seq([
                    lit("a"),
                    layouts("L"),
                    lit(",")
                  ])))})
      })

jurgenvinju avatar Aug 15 '23 07:08 jurgenvinju

This is the hole string that is generated for the pattern:

seq([lit("a"),lit(",")]):0

jurgenvinju avatar Aug 15 '23 07:08 jurgenvinju

So this parse error is caused by the missing ,layouts(\"L\"), in this generated string.

jurgenvinju avatar Aug 15 '23 07:08 jurgenvinju

  • the code needs to know both if a layout non-terminal must be intertwined, and which layout non-terminal this would be.
  • but that information is not (yet) present in the simple code that parses the concrete fragments.

jurgenvinju avatar Aug 15 '23 07:08 jurgenvinju

What would happen if we remove layout from the syntax of the hole? Would that become potentially ambiguous? I have to think about this.

jurgenvinju avatar Aug 15 '23 07:08 jurgenvinju

It could become ambiguous, but only in a very convoluted case where the original grammar would have a problematic ambiguity itself already:

In one module:

layout X = ...;
syntax A = "A" ("a" "b");

and then in another, exactly this:

layout Y = ...;
syntax A = "A" ("a" "b");

With the current solution, a hole <("a" "b") _> would be different, one for X and one for Y. With the proposed solution it would be the same and the variable type would not be able to disambiguate between the two cases anymore.

jurgenvinju avatar Aug 15 '23 07:08 jurgenvinju

Aha.. interesting:

private Symbol denormalize(Symbol s) = visit (s) { 
  case \lex(n) => \sort(n)
  case \iter-seps(u,[layouts(_),t,layouts(_)]) => \iter-seps(u,[t])
  case \iter-star-seps(u,[layouts(_),t,layouts(_)]) => \iter-star-seps(u,[t])
  case \iter-seps(u,[layouts(_)]) => \iter(u)
  case \iter-star-seps(u,[layouts(_)]) => \iter-star(u)
  // TODO: add rule for seq
};

:-P

jurgenvinju avatar Aug 15 '23 07:08 jurgenvinju

102ad2d2d89a6d6a0da1eb4f2903f928471a8304 makes sure that the grammar is correct for parsing concrete syntax fragments that have the sequence symbol in them. However, the parser can not parse these new alternatives for regular expressions because for all regulars the parser has specialized stack classes that do not use the additionally generated alternatives.

This is a broader problem that affects all regular symbols except lists. The reason is that list variables have been modeled as list elements in the grammar.

jurgenvinju avatar Aug 15 '23 12:08 jurgenvinju