langium Keywords match if they are prefixes of terminals using unicode flag

Langium version: 3.2.0

Steps To Reproduce

Use the following grammar:

grammar HelloWorld

entry Model:
    'keyword' value=ID;

hidden terminal WS: /\s+/;
terminal ID: /[A-Za-z]+/;

and apply it to the following text:

keyword keywordSomeSuffix

This should show no errors. Now add a Unicode Flag ("u") behind the last slash of the terminal ID line. Now an error should show that complains that behind the keyword at the start of the text, an ID is expected, but another keyword was found.

Link to code example: https://langium.org/playground?grammar=OYJwhgthYgBAEgUwDbIPYHU0mQEwFD6IB2ALiAJ6wCyauKAXPrC7AOQDWiFA7trm1gA3MMgCuiALwBJACIBuQgAsAlrnrFYpRCAgrio2BgDKDWAHoAOgGcA1OcXbd%2Bw3LPmA2gEEAtAC0wHwAvAF17MXkgA&content=NYUwng7g9gTgJgAlJWcDKUC2I0FcBm%2BAlgB5A

The current behavior

An error is shown that complains that behind the keyword, an ID is expected, but another keyword was found.

The expected behavior

The keyword should not be matched as it is not expected from the current rule and not isolated

Oct 29 '24 14:10 c-classen

Hm, that one is pretty annoying. For our tokenizer, we need to explicitly supply the LONGER_ALT property, see:

https://github.com/eclipse-langium/langium/blob/516fe4cc0a520d9a88e53409f5effd1aa3b6f2eb/packages/langium/src/parser/token-builder.ts#L139

Computing that property can be done for RegExp (and we do it) however, adding the u flag transforms the RegExp into a function matcher pattern, which is no longer subject to the LONGER_ALT evaluation.

Fixing this in your own language is just a small matter of overriding the TokenBuilder and adding the token to the LONGER_ALT list manually. Fixing this on a generic level will be a bit more difficult.

Oct 29 '24 14:10 msujew

Thank you for the help!

For anyone else encountering this issue in the meantime, I solved it by following the idea and ended up with a new TokenBuilder class like this:

import { TokenType } from "chevrotain";
import { DefaultTokenBuilder } from "langium";

interface Keyword {
  value: string;
}

export class TokenBuilder extends DefaultTokenBuilder {

  override findLongerAlt(keyword: Keyword, terminalTokens: TokenType[]): TokenType[] {
    const result = super.findLongerAlt(keyword as any, terminalTokens);
    if (terminalTokens && keyword.value.match(/[A-Za-z]+/)) {
      const idToken = terminalTokens.find(it => it.name == "ID");
      if (!idToken) {
        throw new Error("ID token not found");
      }
      result.push(idToken);
    }
    return result;
  }
}

The Keyword interface is a workaround since I could not figure out how to access the real Keyword type used in findLongerAlt from the library. In my case, the terminal that uses the Unicode flag is named ID. I add it to the return value of super.findLongerAlt whenever the keyword consists only of letters. Theoretically I should not just use [A-Za-z] in the regular expression, but match on all Unicode characters, but since my keywords do not contain any non Unicode letters, this should be fine.

Nov 01 '24 13:11 c-classen

@c-classen The real Keyword type can be found on the GrammarAST namespace (importable from langium):

import type { GrammarAST } from "langium";

export class TokenBuilder extends DefaultTokenBuilder {
    protected override findLongerAlt(keyword: GrammarAST.Keyword, terminalTokens: TokenType[]): TokenType[] {
        // ...
    }
}

Nov 04 '24 12:11 aabounegm