Ellipsis token quirkiness?

Open bathos opened this issue 6 years ago • 3 comments

The ellipsis token has a unique property: it’s the only terminal whose source text isn’t matched by any of the seven regular expressions given at the start of the grammar section.

This isn’t, to my knowledge, an error: the regular expressions only describe the ‘named terminals,’ which are considered distinct from the unnamed terminals that are given in literal teletype throughout the grammar and which take precedence:

If the longest possible match could match one of the above named terminal symbols or one of the other terminal symbols from the grammar, it must be tokenized as the latter.

That said, this has big footgun energy imo. It’s tempting to lex using the regular expressions as your goals and then refine the result by changing the token type to 'unnamed' if the value is a member of that set. This initially appears to be possible because every unnamed terminal can be matched as one of the named terminals first ... except the ellipsis. I think it’s pretty easy to miss that when there’s one exception out of 83.
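A minimal sketch of that lex-then-refine strategy, to show where it goes wrong. The other pattern is the one quoted in this issue; the identifier and whitespace patterns, the tiny UNNAMED set, and the naive_lex helper are simplified assumptions for illustration, not the spec's full grammar.

```python
import re

# Assumed, simplified stand-ins for three of the named terminals.
NAMED = [
    ("identifier", re.compile(r"[_-]?[A-Za-z][0-9A-Z_a-z-]*")),
    ("whitespace", re.compile(r"[\t\n\r ]+")),
    ("other", re.compile(r"[^\t\n\r 0-9A-Za-z]")),
]

# Illustrative subset of the unnamed (literal) terminals, not the full list.
UNNAMED = {"(", ")", ",", "...", "long", "optional"}


def naive_lex(source):
    """Lex with the named-terminal regexes only, then relabel matches
    whose text happens to be an unnamed terminal."""
    tokens = []
    pos = 0
    while pos < len(source):
        # Pick the longest match among the named terminals.
        best = None
        for name, pattern in NAMED:
            m = pattern.match(source, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no named terminal matches at offset {pos}")
        name, m = best
        pos = m.end()
        if name == "whitespace":
            continue
        text = m.group()
        # Refinement step: relabel if the text is a literal terminal.
        if text in UNNAMED:
            name = "unnamed"
        tokens.append((name, text))
    return tokens


tokens = naive_lex("(long... args)")
# The ellipsis never reaches the refinement step as a single token:
# it comes out as three separate ("other", ".") tokens.
```

Every other literal in the input is relabelled correctly, which is what makes the one failure easy to overlook.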

My suggestion would be to change the definition of the named terminal other in order to restore the every-terminal-is-a-member-of-the-set-of-strings-described-by-these-regex-patterns property:

/[^\t\n\r 0-9A-Za-z]/ -> /\.{3}|[^\t\n\r 0-9A-Za-z]/
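A quick check of what the change buys, as a sketch: both patterns are copied from the line above, while UNNAMED_SAMPLE is a hand-picked, hypothetical sample of punctuation terminals rather than the complete set.

```python
import re

OTHER_CURRENT = re.compile(r"[^\t\n\r 0-9A-Za-z]")
OTHER_PROPOSED = re.compile(r"\.{3}|[^\t\n\r 0-9A-Za-z]")

# Hypothetical sample of punctuation terminals; not the complete set.
UNNAMED_SAMPLE = ["(", ")", ",", "...", "-", "."]

# Terminals whose full text is not matched by the pattern.
uncovered_current = [t for t in UNNAMED_SAMPLE if not OTHER_CURRENT.fullmatch(t)]
uncovered_proposed = [t for t in UNNAMED_SAMPLE if not OTHER_PROPOSED.fullmatch(t)]

print(uncovered_current)   # ['...']
print(uncovered_proposed)  # []
```

Because the `\.{3}` alternative comes first, a regex engine with ordered alternation (such as Python's) tries the three-dot match before falling back to a single character.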

It’s possible this doesn’t matter to other folks — the spec isn’t actually ambiguous here or anything — in which case feel free to close this, but it also seems possible this terminal being unique in this regard was unintentional to begin with.

bathos avatar Oct 05 '19 06:10 bathos

@heycam

bzbarsky avatar Oct 05 '19 06:10 bzbarsky

There's a similar issue with (, [, { etc. The grammar spec says "Note: The Other non-terminal matches any single terminal symbol except for (, ), [, ], {, } and ,." but the Other non-terminal includes the other terminal, which is defined as /[^\t\n\r 0-9A-Za-z]/, and so at first glance Other definitely does match (, [, { etc.

Like you say, this is handwaved away by the "...one of the other terminal symbols from the grammar, it must be tokenized as the latter" text, but it's a big footgun.
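To make the mismatch concrete, here is a small sketch. The regex and the seven excluded symbols come from the text above; the is_other_token helper is a hypothetical name, and it models the precedence rule only for single-character input.

```python
import re

OTHER = re.compile(r"[^\t\n\r 0-9A-Za-z]")

# The symbols the spec note says Other does not match.
EXCLUDED_BY_NOTE = ["(", ")", "[", "]", "{", "}", ","]

# Read in isolation, the regex accepts every one of them.
matched_by_regex = [c for c in EXCLUDED_BY_NOTE if OTHER.fullmatch(c)]


def is_other_token(text, literal_terminals):
    """Hypothetical helper: apply the precedence rule by rejecting any
    text that is also a literal terminal of the grammar."""
    return OTHER.fullmatch(text) is not None and text not in literal_terminals


literals = set(EXCLUDED_BY_NOTE)
print(matched_by_regex == EXCLUDED_BY_NOTE)  # True
print(is_other_token("(", literals))         # False
print(is_other_token("@", literals))         # True
```

The note only holds once the set of literal terminals is consulted, which is information the regex itself does not carry.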

cscott avatar Jan 27 '21 22:01 cscott

Yeah, the spooky-action-at-a-distance — that any terminal literal appearing in the syntactic grammar is implicitly excised from the set of strings belonging to any of the regexp-defined terminal languages — has historically led to repeated issues with Other I think, since it probably isn’t super obvious that by using any new literal terminal somewhere you’re actually altering the lexical language “globally” and not just the syntactic language “locally”.

This seems to happen in reverse too where Other can end up with vestiges. Looking at it right now I can see that "." and "-" appear in Other as alternatives. This has the effect of making them unique terminals which other (lowercase) doesn’t match. But ... because they appear nowhere else, and because Other includes other and nothing else does, and because no alternatives of Other have defined semantics (that being the idea), this is a tautology — "." and "-" existing there has the same effect as them not existing, they are pure complications.

bathos avatar Mar 29 '21 08:03 bathos