langium icon indicating copy to clipboard operation
langium copied to clipboard

Improve lexing for terminals

Open spoenemann opened this issue 3 years ago • 3 comments

The default lexer of Chevrotain takes the first match according to the order in which the terminals are defined. This is different to the default lexer of ANTLR, which takes the longest match (IIRC).

The main problem with this is that a terminal can be shadowed by another terminal defined further above in the langium grammar, and you won't get any feedback about this in the editor nor in the CLI. We should think about how to improve the user experience here: add static analysis of terminals to the grammar validator? To the CLI? Switch to a different lexer?

spoenemann avatar Nov 03 '21 10:11 spoenemann

Would you guys be willing to change the parser and base it on a Typescript implementation of ANTLR 4?

I was just in a discussion with the creator of a4tstool about this.

IMHO, ANTLR 4 is a very powerful parser generator and the combination of it with Langium would be awesome!

It generates ALL(*) parsers here's it's paper, but I believe it's abstract may have enough to consider this option:

Abstract

Despite the advances made by modern parsing strategies such as PEG, LL(*), GLR, and GLL, parsing is not a solved problem. Existing approaches suffer from a number of weaknesses, including difficulties supporting side-effecting embedded actions, slow and/or unpredictable performance, and counterintuitive matching strategies.

This paper introduces the ALL(*) parsing strategy that combines the simplicity, efficiency, and predictability of conventional top-down LL(k) parsers with the power of a GLR-like mechanism to make parsing decisions.

The critical innovation is to move grammar analysis to parse-time, which lets ALL(*) handle any non-left-recursive context-free grammar. ALL(*) is O(n4) in theory but consistently performs linearly on grammars used in practice, outperforming general strategies such as GLL and GLR by orders of magnitude. ANTLR 4 generates ALL(*) parsers and supports direct left-recursion through grammar rewriting. Widespread ANTLR 4 use (5000 downloads/month in 2013) provides evidence that ALL(*) is effective for a wide variety of applications

svallory avatar Jan 13 '23 21:01 svallory

@svallory thanks for the input. I wrote a blog post on this topic recently that goes in depth on why we chose Chevrotain for Langium instead of ANLTR4.

TL;DR: It doesn't require an installed Java runtime, it's way faster and for 99% (in our experience) of grammars Chevrotains behavior is "good enough". Note that Langium allows to exchange the lexer implementation for a different one already by overriding the Lexer service. We plan to do the same in the future for the parser.

msujew avatar Jan 13 '23 22:01 msujew

@msujew I wouldn't use or depend on Java either :) That's the whole point of antlr 4 typescript tool...

The port is a self-contained tool, which does not need anything beside Node.js.

It's great to hear that you guys will make the parse exchangeable. The grammar for the language I'm designing would be very hard (if possible) to write in LL(k).

While that is not available directly from Langium, would it be possible to plug another parser by writing an adaptor?

EDIT:

I just checked the article and its title mentions ALL(*) in Chevrotain. Are there any advantages ANTLR 4 would provide considering that? (I'm new to this whole thing)

svallory avatar Jan 14 '23 04:01 svallory