Keywords match if they are prefixes of terminals using unicode flag
Langium version: 3.2.0
Steps To Reproduce
Use the following grammar:
grammar HelloWorld
entry Model:
'keyword' value=ID;
hidden terminal WS: /\s+/;
terminal ID: /[A-Za-z]+/;
and apply it to the following text:
keyword keywordSomeSuffix
This should show no errors. Now add a Unicode Flag ("u") behind the last slash of the terminal ID line. Now an error should show that complains that behind the keyword at the start of the text, an ID is expected, but another keyword was found.
Link to code example: https://langium.org/playground?grammar=OYJwhgthYgBAEgUwDbIPYHU0mQEwFD6IB2ALiAJ6wCyauKAXPrC7AOQDWiFA7trm1gA3MMgCuiALwBJACIBuQgAsAlrnrFYpRCAgrio2BgDKDWAHoAOgGcA1OcXbd%2Bw3LPmA2gEEAtAC0wHwAvAF17MXkgA&content=NYUwng7g9gTgJgAlJWcDKUC2I0FcBm%2BAlgB5A
The current behavior
An error is shown that complains that behind the keyword, an ID is expected, but another keyword was found.
The expected behavior
The keyword should not be matched as it is not expected from the current rule and not isolated
Hm, that one is pretty annoying. For our tokenizer, we need to explicitly supply the LONGER_ALT property, see:
https://github.com/eclipse-langium/langium/blob/516fe4cc0a520d9a88e53409f5effd1aa3b6f2eb/packages/langium/src/parser/token-builder.ts#L139
Computing that property can be done for RegExp (and we do it) however, adding the u flag transforms the RegExp into a function matcher pattern, which is no longer subject to the LONGER_ALT evaluation.
Fixing this in your own language is just a small matter of overriding the TokenBuilder and adding the token to the LONGER_ALT list manually. Fixing this on a generic level will be a bit more difficult.
Thank you for the help!
For anyone else encountering this issue in the meantime, I solved it by following the idea and ended up with a new TokenBuilder class like this:
import { TokenType } from "chevrotain";
import { DefaultTokenBuilder } from "langium";
interface Keyword {
value: string;
}
export class TokenBuilder extends DefaultTokenBuilder {
override findLongerAlt(keyword: Keyword, terminalTokens: TokenType[]): TokenType[] {
const result = super.findLongerAlt(keyword as any, terminalTokens);
if (terminalTokens && keyword.value.match(/[A-Za-z]+/)) {
const idToken = terminalTokens.find(it => it.name == "ID");
if (!idToken) {
throw new Error("ID token not found");
}
result.push(idToken);
}
return result;
}
}
The Keyword interface is a workaround since I could not figure out how to access the real Keyword type used in findLongerAlt from the library. In my case, the terminal that uses the Unicode flag is named ID. I add it to the return value of super.findLongerAlt whenever the keyword consists only of letters. Theoretically I should not just use [A-Za-z] in the regular expression, but match on all Unicode characters, but since my keywords do not contain any non Unicode letters, this should be fine.
@c-classen The real Keyword type can be found on the GrammarAST namespace (importable from langium):
import type { GrammarAST } from "langium";
export class TokenBuilder extends DefaultTokenBuilder {
protected override findLongerAlt(keyword: GrammarAST.Keyword, terminalTokens: TokenType[]): TokenType[] {
// ...
}
}