
[Feature request] Additional parsing functions for the 'yue' module

Open SkyyySi opened this issue 8 months ago • 3 comments

It would be nice to have the following functions be added to the yue module:

  • yue.to_tokens(code: string): string[]: Turns a chunk of YueScript code into a list of tokens, so that local even_nums = [i for i = 1, 10 when (i % 2) == 0] would become something like [ "local", " ", "even_nums", " ", "=", " ", "[", "i", " ", "for", " ", "i", " ", "=", " ", "1", ",", " ", "10", " ", "when", " ", "(", "i", " ", "%", " ", "2", ")", " ", "==", " ", "0", "]" ]. This would be particularly useful for macros, because macro input might not make sense when parsed into a YueScript AST.
  • yue.to_cst(code: string): yue.CstNode: A CST (concrete syntax tree) is similar to an AST, but unlike an AST, it contains all information required to reconstruct the exact source code that it was generated from. An AST would discard information like whitespace and comments, while a CST would keep them. A CST should also ideally be a strict superset of an AST, so that a node visitor for an AST can also operate on the same node when used with a CST.
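To illustrate the relationship between the two proposals: if a CST keeps whitespace and comments as leaf nodes, then yue.to_tokens falls out of yue.to_cst almost for free by walking the leaves in order. This is only a sketch; the node shapes and field names here are made up for illustration, not part of the yue module.

```python
from dataclasses import dataclass, field

# Hypothetical node shape -- the real yue module does not expose such a type.
@dataclass
class CstNode:
    kind: str                        # e.g. "Keyword", "Whitespace", "Comment"
    text: str = ""                   # exact source text, set on leaf nodes
    children: list["CstNode"] = field(default_factory=list)

def flatten(node: CstNode) -> list[str]:
    """Recover the token list by visiting the CST leaves left to right."""
    if not node.children:
        return [node.text]
    tokens: list[str] = []
    for child in node.children:
        tokens.extend(flatten(child))
    return tokens

# A tiny fragment for `local x = 1`, keeping the whitespace an AST would drop:
tree = CstNode("LocalStatement", children=[
    CstNode("Keyword", "local"),
    CstNode("Whitespace", " "),
    CstNode("Name", "x"),
    CstNode("Whitespace", " "),
    CstNode("Operator", "="),
    CstNode("Whitespace", " "),
    CstNode("Number", "1"),
])

# Concatenating the flattened leaves reproduces the original source exactly.
assert "".join(flatten(tree)) == "local x = 1"
```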

As a side note: it may be worth considering replacing the current macro semantics with something closer to Rust's approach of taking a list of tokens and returning a list of tokens, instead of working with source code strings. That would be both easier to use and more reliable than trying to monkey-patch strings with Lua patterns / regular expressions.

SkyyySi avatar Mar 18 '25 19:03 SkyyySi

YueScript uses a PEG parser, so its compiler doesn't have a separate tokenization phase while parsing. Maybe we can implement a standalone tokenizer for your purpose (perhaps it wouldn't take long with help from an LLM?).

I like the Rust-style macro system together with the paste crate. Maybe something similar could be done by tweaking the YueScript C++ PEG parser combinators with a switch that enables the extra macro identifiers and syntax, but that's a lot of work.

pigpigyyy avatar Mar 19 '25 14:03 pigpigyyy

I see, yeah, that's going to be difficult. It might be easier to address the CST parser first (which could be built by extending the existing AST parser). Since a CST contains everything needed to reconstruct the original source, it should be possible to get a list of tokens by simply flattening the CST.

Maybe it would also be possible to use the start and end information that the AST parser already provides? Assuming that it only discards whitespace and comments, we could try to just take the smallest possible slices that the AST mentions (usually, those would be leaf nodes) and use those to take slices of the original source code string. Everything left would then be either whitespace or comments, and the latter could easily be trimmed with a regular expression.
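The slicing idea above can be sketched as follows. The leaf spans and the comment pattern are assumptions for the sake of the example; the real parser's offsets and node layout may differ.

```python
import re

source = "local x = 1 -- set x"

# Hypothetical (start, end) byte offsets that an AST with position info
# might report for its leaf nodes: "local", "x", "=", "1".
leaf_spans = [(0, 5), (6, 7), (8, 9), (10, 11)]

tokens: list[str] = []
pos = 0
for start, end in leaf_spans:
    gap = source[pos:start]        # text between leaves: whitespace or comments
    if gap:
        tokens.append(gap)
    tokens.append(source[start:end])
    pos = end
if pos < len(source):
    tokens.append(source[pos:])    # trailing gap: here it holds the comment

# Drop comment-only gaps (Lua-style line comments start with "--"):
code_tokens = [t for t in tokens if not re.match(r"^\s*--", t)]
```

Everything not covered by a leaf span comes out as a "gap" token, which by assumption is either whitespace or a comment, so a single pattern is enough to separate the two.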

SkyyySi avatar Mar 19 '25 15:03 SkyyySi

@pigpigyyy Would it be a lot of work to make each parser node also capture its start and end index in the source code string? I think that should be a simpler way to reach this goal, since, from that, the source code tokens can fairly simply be reconstructed.

SkyyySi avatar Jun 12 '25 06:06 SkyyySi