HVM Replace parser, use LALRPOP parser generator

The current parser has several problems:

It's too verbose for users to read it as a language reference
It's complex, so hard to maintain
It uses a hand-written rather than an off-the-shelf parser combinator library, so is likely to have more bugs than average

This PR replaces the current parser with an implementation based on the LALRPOP parser generator.

The new parser is supposed to accept the same set of programs as the old parser, i.e. this PR does not change the syntax of HVM.

I have tested this with both the cargo build and tests/test_cli.py test suites.

Oct 07 '22 09:10 lawcho

Closing this for more work since I found some syntax not accepted by the new parser (or covered by the test suite):

Dots in identifiers, e.g. IO.get_line
String literals, e.g. Main = "Hello World"

Oct 11 '22 08:10 lawcho

I've fixed the bugs above (and added tests), so am re-opening this PR.

Oct 20 '22 14:10 lawcho

@lawcho I was wondering why you chose lalrpop and not treesitter?

Nov 15 '22 00:11 modulovalue

@modulovalue LALRPOP had some good reviews on reddit, so I tried that first and it stuck.

Nov 15 '22 10:11 lawcho

I'd like to present you some arguments for why you should consider tree sitter if you would consider alternatives.

tree sitter is based on a more powerful parsing theory, namely GLR parsing. GLR is able to offer substantially improved error messages because it maintains a tree of possible parses and when parsing fails, it is able to search through them for error recovery.
tree sitter is much more popular and offers bindings for a wide variety of languages. That way, others could implement their own tooling to work with the grammar.
github uses tree sitter for all the syntax highlighting that you see on github and tree sitter is partially funded by github. (I've compared the commit history for lalrpop with tree sitter and tree sitter does seem to be much more active).
and one of the more important arguments are, i think, is that generated treesitter grammars have essentially zero dependencies, other than the runtime of tree sitter itself. This would sort of future-proof the parser because lalrpop seems to have a bunch of them which could become a pain to keep up with in the future.

(now that I think about it, there are more benefits like e.g. tree-sitter offers incremental parsing, some editors such as neovim have chosen to add official support for tree sitter grammars, and probably others.)

Lalrpop is potentially more convenient, but I personally think that tree sitter would be a more stable solution that could foster more engagement from the wider programming language community.

@lawcho i'd love to hear your thoughts on this!

Nov 15 '22 15:11 modulovalue

@modulovalue The grammar is fairly simple, so porting it to tree-sitter (or most parser generators) should be easy.

As for the maintainability benefits of tree-sitter over LALRPOP, I am not convinced:

Dependencies/build complexity:
- This blog post says you need javascript to build tree-sitter parsers in Rust. The build workflow also seems quite manual.
- I'm not sure what dependencies LALRPOP has under the hood, but installation was as simple as adding two lines to the cargo.toml
Syntax: I find LALRPOP grammars easier to read than tree-sitter's syntax

These are not big issues, both tools should work.

In the short run, I'm letting this PR sit while HVM stabilizes (it just had a big rewrite).

In the long run, @VictorTaelin has more experience with the HVM codebase (and Rust projects in general), perhaps they can advise on which option will be more maintainable?

Nov 16 '22 10:11 lawcho

I was referring to these dependencies: https://crates.io/crates/lalrpop/0.19.8/dependencies most of them are prereleases and over half of them haven’t received an update in over six months.

I agree, Node is suboptimal, but V8 has major financial support from Google, so it is reliable to depend on it, and practically every company in the world uses Node and C so I have no worries about that.

so porting it to tree-sitter (or most parser generators) should be easy.

I agree completely, but porting is not the issue, it’s the versioning of grammars and keeping up with changes to it.

I see that you’d like to stick with lalrpop, which is fine, but given the engineering concerns that I’ve laid out plus the benefits of using tree sitter that you have left uncommented, I’d strongly recommend for you to reconsider your choice.

Nov 16 '22 15:11 modulovalue

I agree, Node is suboptimal, but V8 has major financial support from Google, so it is reliable to depend on it, and practically every company in the world uses Node and C so I have no worries about that.

@modulovalue I'm very much in favour of tree-sitter, especially because of the incremental parsing. But the point @lawcho made about Javascript is not about the reliability of Node but that it increases build complexity (a lot if compared with just adding a Rust lib as a dependency). I think if we separate it into another repo/crate it will just be fine, tho.

I experimented with it when I was drafting some syntax for Kind, but I didn't get to the point of using it from Rust. My main question would be if when the user runs cargo install hvm_parser it will Just Work without having to install Node or mess with tree-sitter at all.

Nov 16 '22 17:11 steinerkelvin

As far as needing JavaScript for tree-sitter, you only need JavaScript when you are modifying the parser grammar, and if we commit the generated C file to source control then people can still compile the Rust crate without JavaScript, though they will still need a C compiler to compile the Rust.

If you already need a C compiler to compile HVM, then that's not a big deal either way.

On the other side, though:

there are more benefits like e.g. tree-sitter offers incremental parsing, some editors such as neovim have chosen to add official support for tree sitter grammars, and probably others.)

tree sitter is much more popular and offers bindings for a wide variety of languages. That way, others could implement their own tooling to work with the grammar.

As far as helping other projects by using TreeSitter, there's nothing stopping somebody from making/porting whatever grammar we use to tree-sitter, especially when this grammar is so simple. 96 lines counting whitespace is tiny.

I think what's more important is that we use something that works well for us and for our needs for this project specifically.

github uses tree sitter for all the syntax highlighting that you see on github and tree sitter is partially funded by github. (I've compared the commit history for lalrpop with tree sitter and tree sitter does seem to be much more active).

While it's nice that tree-sitter has some backing and support, at the same time, this is just the parser. Tree sitter is great and designed amazingly for interactive editing experiences, but that's also not exactly what we're doing here. We just need a parser for the compiler.

is able to offer substantially improved error messages because it maintains a tree of possible parses and when parsing fails, it is able to search through them for error recovery.

I'm not sure about LALRPOP, but I know that the rust peg crate can also do this. The feature sounds like a good bonus, but not necessarily a deal-breaker either way.

This would sort of future-proof the parser because lalrpop seems to have a bunch of them which could become a pain to keep up with in the future.

Due to semver, we're really not the ones keeping up with it's depdendencies, it is. If it doesn't change it's API in a breaking way, we don't need to do any keeping up.

I was referring to these dependencies: https://crates.io/crates/lalrpop/0.19.8/dependencies most of them are prereleases and over half of them haven’t received an update in over six months.

Pre-release crates make up a huge portion of major crates foundational to tons of major Rust projects, and that's pretty normal. Frequency of updates is a good general metric, but at the same time, parsing isn't really something that needs to change a lot.

Anyway, I think whatever we go with it's not super-duper important, but I do think we should focus on our project and what works well for it.

If we look for an alternative parser that is just as convenient to use from Rust, I'd look at Pest. Here's a lines of code comparison between the two:

LALRPOP: 52k lines of code itself with 167k lines of code in 14 direct dependencies
Pest: 13k lines of code itself, with 13k lines of code in 1 required dependency and 5 optional dependencies.
Peg: 217 lines in the runtime 709 lines in the wrapper crate, and 0 other dependencies other than the macro, which isn't used at runtime.

Peg is awesome to use and my go-to parser in Rust projects. Pest is also very nice looking and has an advantage if you want a separate formal grammar definition. ( Peg is like a Rust+grammar language fusion, so it's not a standardized formal grammar in quite the same way ).

Nov 16 '22 17:11 zicklag

A quick comment on the other proposed alternatives:

@zicklag Pest and peg seem to use PEGs as their underlying parsing theory. PEGs are convenient, but only because they hide the complexity of writing grammars by implicitly disambiguating any ambiguities. (They also intermingle the lexing and parsing phases which, without careful consideration up-front, cannot be unmingled in the future without changing the language that is being accepted by the parser.)

Only depending on the least powerful parsing theory is absolutely necessary to remove parsing bottlenecks down the line. (Not removing bottlenecks for this project, but for everybody else who plans to support it). PEGs, In my opinion, should only be considered for prototypes that are not meant to interoperate with others at all.

[...] just Work without having to install Node or mess with tree-sitter at all.

@steinerkelvin It depends on how you'd decide to distribute the parser that was generated by tree-sitter. If cargo supports running custom build steps during cargo install, then no, I believe, you shouldn't have to worry about tree sitter at all. As zicklag noted, Node is only needed for generating the C file.

This project seems to prove that distributing the C file without much trouble is possible. And hiding the tree sitter dependency should therefore be possible as well.

We just need a parser for the compiler.

If we zoom out a bit, I believe, primarily you need users and funding to actually make good on the promises in your README i.e. to bring us to the parallel future of computers. By making it easier for others to interoperate with HVM (the IR side of HVM), you make it much easier for others to help you.

(Note: I'm not saying this because I'm a fan of tree sitter. I don't like it for many other technical reasons that are not relevant to this project, but, in my opinion, It appears to be the best option that we got today.)

Nov 16 '22 18:11 modulovalue

(They also intermingle the lexing and parsing phases which, without careful consideration up-front, cannot be unmingled in the future without changing the language that is being accepted by the parser.)

That's an excellent and valid point. :+1:

If we zoom out a bit, I believe, primarily you need users and funding to actually make good on the promises in your README i.e. to bring us to the parallel future of computers. By making it easier for others to interoperate with HVM (the IR side of HVM), you make it much easier for others to help you.

I will concede that as well, having a well-defined IR arguably more important than having a well defined front-end language grammar, because more programs will be generating that kind of language.

All-in-all, tree-sitter does seem like possibly the best current option, but it'd probably be best, of course, to get input from the maintainer before putting any more work into it.

Nov 16 '22 19:11 zicklag

I also agree 100% that making it easier for others to contribute to the project is of uttermost importance, which is why we've done such a big refactor to begin with. Files are now much better organized, the overall code quality increased significantly and I'm now working on the final public API and documentation in order to make it integrate smoothly with the Rust ecosystem. That said, there are some things that could still use some improvements, including the entire parser, the readback, the rulebook (including its flattener), and a bunch of small things here and there.

So, this is definitely the best time to swap the old parser by something simpler and stabler on the long term. I'm still not sure what the best approach is, but one thing that is very important is that the parser doesn't depend on complex external libs or increase the binary size significantly; one important aspect of the current parser is that the whole thing is just 2 Rust files that can't generate any long-term maintainability issue or headache, that doesn't stop working when compiled to WASM, and so on. Once I'm done with the API and documentation updates I'll come back to this PR, learn all that was discussed and we can decide together how to proceed.

Nov 17 '22 22:11 VictorTaelin

that doesn't stop working when compiled to WASM,

Ah, WASM compatibility might rule out tree-sitter actually. WASM doesn't have a stable solution for linking, so I'm pretty sure that's a non-option there.

To summarize the current comparison then:

Tree Sitter:
- Great power
- Good support
- The grammar is re-usable in other languages.
- Extra build setup:
  - Requires Node.js if you want to modify parser
  - Requires C compiler to build parser file
- Almost surely won't play nice for WASM because of the need for the linked C file.
LALROP:
- Already implemented in this PR
- Seems to work fine
- Has quite a large codebase and dependency set
Pest:
- Built on the PEG parser model which @modulovalue claims is not good for canonical parsers because of the mixing of lexing and parsing.
- Much less dependencies and smaller codebase than LALROP
Peg:
- Built on the PEG parser model which @modulovalue claims is not good for canonical parsers because of the mixing of lexing and parsing.
- Ultra teeny-tiny codebase and no dependencies
- Uses a Rust DSL that is great for easily building custom grammars, but less good for making that grammar easily understandable in other contexts.
Other options
- There are other parsers out there we could look at, but the ones already listed are some of the more popular Rust options.

Nov 17 '22 22:11 zicklag

Thanks for the summary, very helpful. So, yes, my understanding is that greatly decreasing the project's portability (to the point it wouldn't even work on WASM anymore) just to make the parser slightly more readable isn't a great tradeoff, specially considering the syntax is so simple and won't change much anyway. Peg looks like a nice option if it is so little, and if that'd make the project more readable and attractive, it could be a good investment.

Nov 17 '22 23:11 VictorTaelin

HVM HVM copied to clipboard

Replace parser, use LALRPOP parser generator

HVM
HVM copied to clipboard