redscript icon indicating copy to clipboard operation
redscript copied to clipboard

Error recovery using lexer parser separation

Open ProphetLamb opened this issue 3 years ago • 11 comments

Currently, the compilation is unable to produce more than one error, even if this is a recoverable error. This PR implements a diagnostic pipeline that allows definition of arbitrary diagnostics and logging during paring.

Error recovery using a single stage parser is not really feasible. So we split between parser and lexer. The lexer allocates memory when unescaping string literals, otherwise only a stream of slices is produced. I am currently looking into if error recovery is possible in PEG without producing a bad grammar. For now, all solution I could find are ugly, because PEG doesn't allow us to access the underlying stream. So I am probably going to port the Backus Naur like PEG syntax to a parser combinator (nom).

How to report diagnostics

the diag_report! macro accepts either a Span or a Range<Span> to report an error. Diagnostic messages must be predefined, as constants.

fn str_char_invalid(is: Span) -> IResult<Option<char>> {
    let (i, c) = preceded(char('\\'), anychar)(is.clone())?;
    diag_report!((&is..&i), ERR_INVALID_ESCAPE, c);
    Ok((i, None))
}

diag produces error messages similar to rust.

error[ELN0001]: Invalid integer `02439853427592345284395923845320459823457324953247640359104519845834634538`, number too large to fit in target type
--> test.reds:1:0 to l1:74
| 02439853427592345284395923845320459823457324953247640359104519845834634538

TODO

  • [x] Is PEG Error Recovery possible ? implement : port to nom
  • [ ] Actually integrate the ported AST into the compiler pipeline
  • [ ] Dont use a boxed AST, instead use a pooled AST via context references. (needed for syntax receivers for additional diagnostics and quick fixes)

ProphetLamb avatar Nov 14 '22 16:11 ProphetLamb

Hi, this looks like a nice addition! Some suggestions/ideas:

  • I haven't used nom much, so I don't know how difficult it'd be, but if we're introducing a new parsing lib, I'd rather have it replace peg entirely rather than mix the two to avoid bloat.
  • I'd avoid changing the AST to prevent this from spilling all over the codebase. I think this change should focus on the parser only and remain compatible with the old AST. If a new AST is really crucial, you can add the new parser in a new workspace crate and provide a conversion into the existing AST, so that a potential switch can be made in steps (and maybe old AST can be abandoned one day if it works out).

Your change coincided with me switching the main branch to one with preliminary work for the next minor release, which is mostly not tested yet, apologies for that!

jac3km4 avatar Nov 14 '22 20:11 jac3km4

@jac3km4 The changes to the AST mostly refer to replacing Box slices with vectors created using bump allocation. Bump allocation creates a lifetime buffer that you can create a vector in with the similar speed as a stack alloc, bc it is effectively a stack alloc. Growing a vec that is not on top of this bump stack, moves it to the heap, without clearing the reserved space. This is the best case for allocation a parse tree as flat as possible

https://github.com/jac3km4/redscript/pull/80/commits/71a7580ef14685f23261c7965ff10af8eb7b4887#diff-d951465f240179b5ea43bf25dbc628cb985688c0574305fa4fc71a3eae81a00dR8-R19

You can see this illustrated in the example rule linked. Note that these are allocator vectors from bumpalo::collections::Vec, and have the same lifetime as the parse tree.

Alternatively, we can use the allocator api for std::vec::Vec instead, the bumpalo vec is just a reexport of a specific version of the std lib known to work.

ProphetLamb avatar Nov 15 '22 20:11 ProphetLamb

@ProphetLamb I've experimented with using a pooled approach early on in the project and I found there was no discernable difference in the CPU profile, it's far from being a bottleneck, and it was more complicated so I dropped the idea

jac3km4 avatar Nov 15 '22 20:11 jac3km4

@ProphetLamb I've experimented with using a pooled approach early on in the project and I found there was no discernable difference in the CPU profile, it's far from being a bottleneck, and it was more complicated so I dropped the idea

Luckily, bumpallo does all the work for us. And Box<[T<'ast>]> should be entirely replaceable with Vec<'ast, T<'ast>>. Using a vec also allows for syntax analyzers modifying the ast post parse for features like metaprogramming, instead of hardcoded ModuleExists annotations, bc we can allocate into 'ast at any time.

ProphetLamb avatar Nov 15 '22 20:11 ProphetLamb

@ProphetLamb I think bumpalo is the one I've tried, but I can't remember at the moment - I'd be open to this kind of change if it's shown to make a difference in performance, but that's not what I've seen

jac3km4 avatar Nov 15 '22 20:11 jac3km4

Interesting. I have transitioned the script lang for the business logic used in the company I work at to pooled allocation, we got around 30% lower compilation times. But that one was written in C++. Maybe rust has just overall better at memory mgmt, idk

ProphetLamb avatar Nov 15 '22 21:11 ProphetLamb

At the moment the compiler is pretty highly IO-bound because it reads and writes pretty big cache files. These files contain bytecode for ~500k LoC written by the original game devs. I think there are a number of optimizations that could be applied there to significantly cut the compilation time. Maybe then this would be more easily visible in the profile. I'd be open to bring in bump allocation, but I think it needs to be based on some observable benefit, because it does complicate the code a little bit in many places.

jac3km4 avatar Nov 15 '22 21:11 jac3km4

That totally makes sense. If it does complicate things it is better to use the native approach, tho this is not the case using nom parsers, since they allow accessing the underlying stream context

ProphetLamb avatar Nov 15 '22 21:11 ProphetLamb

I think bump allocation in the code that parses the binaries could be worth a shot, and maybe zero-copy parsing strings, which I avoided initially because the data loaded from these files gets mutated and mixed in with data produced by the compiler, but there are ways around that like using Cow But IMO all of this is out of scope for the change you're working on and should be considered as a separate thing

jac3km4 avatar Nov 15 '22 21:11 jac3km4

That is a good point for maybe another PR. Should I create an issue for that?

ProphetLamb avatar Nov 15 '22 21:11 ProphetLamb

sure!

jac3km4 avatar Nov 15 '22 21:11 jac3km4