rascal
rascal copied to clipboard
Syntax holes that range over regular expressions (except lists) do not parse correctly
Describe the bug
Reported by @stevenKlusener; thanks!
Simplified as follows:
module SequenceMatch
syntax To = "TO" ("a" ",");
layout WS = [\ \t\n\r]*;
bool main() {
return (To) `TO <("a" ",") _x>`
:= (To) `TO a,`;
}
- This will give a parse error in the concrete syntax pattern with the sequence hole
<("a" ",") _x> - More complex code with a list around the sequence, or with non-terminals instead of literals in the sequence, will fail in the same way.
- It is expected that a subtle bug in the generation of the grammar for the holes is the cause of this issue.
This is the rule that is generated for the sequence hole:
prod(
label(
"$MetaHole",
seq([
lit("a"),
layouts("L"),
lit(",")
])),
[
\char-class([range(0,0)]),
lit("seq([lit(\"a\"),layouts(\"L\"),lit(\",\")])"),
lit(":"),
iter(\char-class([range(48,57)])),
\char-class([range(0,0)])
],
{tag("holeType"(seq([
lit("a"),
layouts("L"),
lit(",")
])))})
})
This is the hole string that is generated for the pattern:
seq([lit("a"),lit(",")]):0
So this parse error is caused by the missing ,layouts(\"L\"), in this generated string.
- the code needs to know both if a layout non-terminal must be intertwined, and which layout non-terminal this would be.
- but that information is not (yet) present in the simple code that parses the concrete fragments.
What would happen if we remove layout from the syntax of the hole? Would that become potentially ambiguous? I have to think about this.
It could become ambiguous, but only in a very convoluted case where the original grammar would have a problematic ambiguity itself already:
In one module:
layout X = ...;
syntax A = "A" ("a" "b");
and then in another, exactly this:
layout Y = ...;
syntax A = "A" ("a" "b");
With the current solution, a hole <("a" "b") _> would be different, one for X and one for Y. With the proposed solution it would be the same and the variable type would not be able to disambiguate between the two cases anymore.
Aha.. interesting:
private Symbol denormalize(Symbol s) = visit (s) {
case \lex(n) => \sort(n)
case \iter-seps(u,[layouts(_),t,layouts(_)]) => \iter-seps(u,[t])
case \iter-star-seps(u,[layouts(_),t,layouts(_)]) => \iter-star-seps(u,[t])
case \iter-seps(u,[layouts(_)]) => \iter(u)
case \iter-star-seps(u,[layouts(_)]) => \iter-star(u)
// TODO: add rule for seq
};
:-P
102ad2d2d89a6d6a0da1eb4f2903f928471a8304 makes sure that the grammar is correct for parsing concrete syntax fragments that have the sequence symbol in them. However, the parser can not parse these new alternatives for regular expressions because for all regulars the parser has specialized stack classes that do not use the additionally generated alternatives.
This is a broader problem that affects all regular symbols except lists. The reason is that list variables have been modeled as list elements in the grammar.