lilt icon indicating copy to clipboard operation
lilt copied to clipboard

Generate passing/failing test cases for lilt specifications

Open Quelklef opened this issue 7 years ago • 1 comments

Sometimes when planning syntax and writing parsers, you overlook some kind of ambiguity in the parser or other issue which could be a major problem down the road.

For instance, a language may allow user-defined operators such as #%, @@, *&**^!, etc. This language may then introduce special syntax for ranges, as follows: (a .. b), (a .. b], [a .. b) and [a .. b]. Now, if a user were to define the .. operator, (a .. b) becomes ambiguous: is it (.. A B), or (RANGE A B)?

One way to remove this ambiguity would be to check through all Choice nodes in the Lilt AST and verify that there does not exist a piece of text such that two or more of the Choice's children match it.

That is, we may define Expr: Parens | Range | Operator | ... and Parens: "(" Expr ")". Then, we'd note that (a .. b) works for both Operator and Parens <- Range, so Expr is technically ambiguous.

However, I think not doing this is desirable for another reason. We can instead say that Choices will match whichever of their children match first. This way, both OpCall: Identifier Operator Identifier and Reference: Identifier can be in Expr: OpCall | Reference | .... As long as OpCall comes before Reference in Expr, it will take precedence and allow for both syntaxes, even though, in a sense, it's ambiguous*.

For this reasons I think it's best to have Choice work by precedence and not disallow ambiguity*. Perhaps it would be useful to mark two rules in a Choice which we want to match the same things, and disallow the rest from being ambiguous, thus getting the best of both worlds. This is perhaps a good project for the future.

* Technically, they're not ambiguous because the ambiguity is removed by making a choice, that is by choosing the first matching rule. I use the term "ambiguous" here to get the point across, even if it's technically wrong.

Another possible solution is as follows: Given a Lilt specification, we can generate example bits of code which either matches or does not match it. Take, for example:

ident: +<abcdefghijklmnopqrstuvwxyz>
args: ?ident *[", " ident]
funcdecl: "function " ident "(" args ");"

We could then generate the following example cases for passing:

function asjnegjw(ksfs, eginwi, askdna, gwn);
function l(op, inwgeegnewwon, sigiooig);
function aaaaaaaa(me, you, them);
etc.

and the following test cases for failing:

function (jasfk, gubw, ge);
function asjf()
estsa(inegniwn, isin, asngi);
functionmajs(ag, hnir, wigwn);
etc.

By doing this, the user is able to, at a glance, see what their Lilt specification amounts to. If they see a positive or negative case that should be in the other category, they know the spec is wrong. I think this would be especially useful to see issues with whitespace in the spec.

Of course, neither sets of test cases should be generated blindly. There should be a tendency to favor edge cases, and OnePlus rules shouldn't go on for too long. (I meant to mention something else here but now I've forgotten)

By showing the generated ASTs corresponding to the positive test cases, this method could act as a solution for the problem presented earlier.

Quelklef avatar Jan 15 '18 16:01 Quelklef

Much better to do an analysis on the AST and report ambiguous Choices.

Quelklef avatar Feb 23 '18 23:02 Quelklef