
Could the parser be used in other projects, such as a formatter?

Open etiennebacher opened this issue 1 year ago • 19 comments

Hi, this project looks really nice!

I'm trying to build a fast linter/formatter for R, ideally a ruff equivalent. I've been working on flint, which uses tree-sitter under the hood (well, technically it uses ast-grep which itself relies on tree-sitter). This works quite well for a lot of linter rules but I run into some issues when I need to access the environment (for example to know whether a package is loaded or not), which is not possible with tree-sitter.

I know that Ruff uses RustPython to parse all the Python code in Rust and then they build their linter and formatter rules, so I was wondering if the same was possible in R. It seems that in this project you do all the steps from parsing to evaluating. I'm only interested in the parsing part, and I'm struggling a bit trying to make a proof-of-concept. I see all the parse_ functions in the crate but I don't really know how to use them and I didn't find any examples. Could you share some examples of how to parse the R code into tokens in Rust? (Just to clarify, I'm a beginner in Rust so I might be missing something obvious.)

The tests I've seen mostly check the output after evaluating the code (which makes sense, since this project advertises an R interpreter, not just a parser). Also, before spending more time on this, I'd simply like to know your opinion on whether the parser in this project could be used to build a linter/formatter.

Anyway, good luck with the development of this project :)

etiennebacher avatar Sep 07 '24 15:09 etiennebacher

Absolutely! I've been considering making a detour to do exactly that, so please let me know what I can do to help if anything is feeling more tedious than you'd like.

This parser should parse most R code, but since this project takes some liberty with supported syntax we might want a separate project to be a dedicated R parser. The process of using this parser is also probably needlessly complicated for your needs because it also supports parsing with localized keyword translations.

Since this is an R-like dialect, the parser will not support

  • lambda syntax (\(x) x, which in this dialect just uses fn(x) x)
  • formulas
  • symbols permit a more restrictive set of characters before they need to be wrapped in `backticks`

I'm sure there are plenty of edge cases I'm forgetting. I've considered making true R a parsing mode, but for now it is not completely covered.

All that is to say I'm happy to draft an example to get you started, but if you decide to go this route, it might be better to split off the parser into a separate crate - I'd be happy to help handle all of R's syntax.

dgkf avatar Sep 07 '24 16:09 dgkf

Thanks for the detailed answer!

All that is to say I'm happy to draft an example to get you started, but if you decide to go this route, it might be better to split off the parser into a separate crate - I'd be happy to help handle all of R's syntax.

Great! Just to clarify, I won't have time to work on the parser anyway or anything related to it in the next few weeks so this post was mostly to know whether or not the parser could exist in theory. That being said, it would be great to have this example (might also interest other people) and then I'll come back to that when I have more time.

since this project takes some liberty with supported syntax we might want a separate project to be a dedicated R parser

That's what I had in mind, but I don't know how much work it would be to extract the parser into an external crate, so I didn't want to ask that right away.

Since this is an R-like dialect, the parser will not support

Just to clarify, do you mean the current parser (that has some experimental features) or the "pure R" parser that could be created?

etiennebacher avatar Sep 07 '24 16:09 etiennebacher

Just to clarify, do you mean the current parser (that has some experimental features) or the "pure R" parser that could be created?

Just the current parser for this project. I'd be happy to help support a full R parser as a separate crate. I've been considering splitting this project into a multi-crate workspace for a while, but even then I think it would be trying to accomplish too much to be both the parser for this and real R (I really need to rebrand because this gets really ambiguous real fast).

I'll get an example going that shows how to extract just the parse information, but I think the most actionable step after that will be to kick off a new crate, possibly using this work as a starting point.

I think trying to serve both projects with the same crate is going to make life hard for both projects. At some point, this project could use a real R parser and then translate the parse information into its own representations, but to try to do both at once would demand too much of the parser in my opinion.

dgkf avatar Sep 07 '24 16:09 dgkf

Added an examples/parsing.rs that shows a few ways that we manage parsing.

The first example is the best for programmatic use. We use the macros for writing test cases.
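
For anyone picking this up later, here's a rough sketch of what the programmatic route looks like with a pest-derived parser. The grammar path and the `Rule::expr` entry point below are placeholders, not necessarily the names this crate actually uses; the real grammar lives in grammar.pest.

```rust
// Minimal sketch of driving a pest-derived parser programmatically.
// The grammar path and rule names are placeholders for illustration.
use pest::Parser;
use pest_derive::Parser;

#[derive(Parser)]
#[grammar = "grammar.pest"] // assumed path, relative to src/
struct RParser;

fn main() {
    let src = "x <- mean(1:10) + 1";
    // Parse against a top-level rule; on success we get an iterator of
    // pairs, each carrying its rule, source span, and nested children.
    match RParser::parse(Rule::expr, src) {
        Ok(pairs) => {
            for pair in pairs {
                println!("{:?}: {:?}", pair.as_rule(), pair.as_str());
                for inner in pair.into_inner() {
                    println!("  {:?}: {:?}", inner.as_rule(), inner.as_str());
                }
            }
        }
        Err(e) => eprintln!("parse error:\n{e}"),
    }
}
```

That pair tree (rule + span + children) is generally all a linter needs to start matching patterns against.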

dgkf avatar Sep 07 '24 17:09 dgkf

Awesome, thanks a lot for the super quick answers!

etiennebacher avatar Sep 07 '24 17:09 etiennebacher

Just a heads up, I split this off into a separate project over in dgkf/rfmt.

It was surprisingly easy to strip it back down to something more targeted at R syntax. It's really just the grammar.pest file and the RParser struct that pest auto-generates from it.

There's an example for using it in rfmt/examples/parsing.rs.

The project doesn't need to live here, it was just the easiest place to start. Happy to migrate it if you had a different home in mind. My only stipulation is to comply with the GPL license by using a similarly copyleft license if you plan to reuse the code. I hope that's okay with you.

dgkf avatar Sep 08 '24 23:09 dgkf

Great, thanks. I don't have another place to host it in mind :)

I'm just wondering: why the GPL requirement? I don't have anything specific against it, I'm just not too familiar with the differences between licenses. I usually stick to MIT.

etiennebacher avatar Sep 09 '24 06:09 etiennebacher

@etiennebacher Just FYI: when I was talking to the folks from the Positron team at Posit, they mentioned that they are also working on a new formatter.

sebffischer avatar Sep 09 '24 07:09 sebffischer

@sebffischer good to know, thanks! I'll stick to the dev of flint in the meantime then. Is there an issue you can refer me to or was it just some private talk?

etiennebacher avatar Sep 09 '24 08:09 etiennebacher

@sebffischer good to know, thanks! I'll stick to the dev of flint in the meantime then. Is there an issue you can refer me to or was it just some private talk?

Unfortunately, this was a private conversation after their presentation at useR! in Salzburg, so I don't know whether there are any open issues on GitHub. To be specific, I was chatting with Davis Vaughan, so maybe you can ping him on Mastodon or somewhere.

sebffischer avatar Sep 09 '24 08:09 sebffischer

@etiennebacher

I'm just wondering: why the GPL requirement? I don't have anything specific against it, I'm just not too familiar with the differences between licenses. I usually stick to MIT.

The GPL adds a protection against private forks of code. If someone were to take the code and add their own private enhancements, they'd be obligated to share the code upon request. Even though the R community is generally very willing to share, it's a good safeguard against the possibility that an actor could tweak the code for their own gains without giving back. MIT would allow this. In super simplified terms, MIT is more of a "do whatever you want", whereas GPL is like, "you're free to use it, but please play nice".

The GPL requires that any derivative works also take a compatible license, so by nature of this work being GPL, any work that reuses its code would have to keep the GPL license (at the very least for the parts that are reused, but preferably for the whole project).

R itself is GPL, so it's definitely not unfamiliar to the R ecosystem. It's a great license for these types of projects that provide broadly useful functionality that we want to keep open source.

dgkf avatar Sep 09 '24 13:09 dgkf

When I was talking to the folks from the Positron team at Posit, they mentioned that they are also working on a new formatter.

@DavisVaughan @lionel- Any advice on whether a formatter/linter in Rust is something we should jump on, or do you have something in the works already?

dgkf avatar Sep 09 '24 13:09 dgkf

@etiennebacher you may be interested in this {reflow} package experiment I put together at one point.

I explored this as an alternative architecture to lintr. Instead of treating the AST like a series of tokens or an XML doc the way lintr does, it uses AST pattern matching to emit lints when patterns are not satisfied. I found it really promising because it all happens in a single walk of the AST, it was easier to implement alternative paths, and it was able to surface meaningful contextual information and restart linting (or, theoretically, apply formatting corrections).

It made it much easier to build expressive rules, but it ended up being prohibitively slow due to the way that R's condition interrupts work - it was largely inspired by Rust's error handling, but the idea just doesn't map well onto R. This style would probably be much more effective in native Rust.

Theoretically, the condition interrupts could be refactored away into a big loop, but I didn't feel like such an implementation would provide meaningful benefit over the way lintr does things.
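
For what it's worth, the single-walk, pattern-matching style translates to Rust pretty directly. Here's a toy sketch; the AST type and the one rule are invented for illustration and aren't {reflow}'s or this project's actual code.

```rust
// Toy AST and lint types, purely illustrative.
#[derive(Debug)]
enum Ast {
    Symbol(String),
    Call { func: String, args: Vec<Ast> },
    Assign { lhs: Box<Ast>, rhs: Box<Ast> },
}

#[derive(Debug)]
struct Lint {
    rule: &'static str,
    message: String,
}

// One recursive walk of the tree; each rule is just a pattern on the
// current node, so lints are collected in a single pass.
fn lint(node: &Ast, lints: &mut Vec<Lint>) {
    if let Ast::Call { func, args } = node {
        // Example rule: prefer seq_len(n) over 1:n.
        if func == ":" && matches!(args.first(), Some(Ast::Symbol(s)) if s.as_str() == "1") {
            lints.push(Lint {
                rule: "seq_linter",
                message: "prefer seq_len(n) over 1:n".into(),
            });
        }
    }
    // Recurse so every rule sees every node exactly once.
    match node {
        Ast::Symbol(_) => {}
        Ast::Call { args, .. } => {
            for arg in args {
                lint(arg, lints);
            }
        }
        Ast::Assign { lhs, rhs } => {
            lint(lhs, lints);
            lint(rhs, lints);
        }
    }
}

fn main() {
    // Roughly `1:n`.
    let expr = Ast::Call {
        func: ":".into(),
        args: vec![Ast::Symbol("1".into()), Ast::Symbol("n".into())],
    };
    let mut lints = Vec::new();
    lint(&expr, &mut lints);
    println!("{lints:?}");
}
```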

dgkf avatar Sep 11 '24 18:09 dgkf

Oh yeah that looks really cool. I'm swamped at work right now but I hope I can get a better look at this in October

etiennebacher avatar Sep 11 '24 19:09 etiennebacher

Hi Doug,

We do plan to implement a Wadler formatter in Ark. Architecturally it's likely to be very similar to the Rome / Biome formatter (which was forked by the ruff team).

  • We will transform our tree-sitter tree to Rowan data structures (https://github.com/rust-analyzer/rowan). Eventually all analysis tasks in Ark will be implemented on top of Rowan, insulating us from the Tree-sitter API.

  • We hope to be able to reuse infrastructure from the biome project for our formatter, both for generating the Wadler IR, and to do the formatting itself. Many projects have managed to reuse common infrastructure for different languages or dialects. It would be interesting to see what could be done here eventually, in terms of supporting both R and your dialect.

  • We don't plan on writing a parser right now, but down the line it will likely make sense to write one along the lines of https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html (a stripped-down sketch of that binding-power approach follows this list). Our main goal, besides efficiency, would be better error recovery and syntax error diagnostics than what we currently have with tree-sitter.
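
To make the linked approach a bit more concrete, here is a stripped-down sketch of binding-power (Pratt) parsing over a toy token stream. The token set and precedence table are made up for illustration and are not R's real grammar or Ark's design.

```rust
// Minimal binding-power (Pratt) expression parser over a toy token
// stream, following the structure of the post linked above.
#[derive(Debug)]
enum Token {
    Num(f64),
    Ident(String),
    Op(char), // e.g. '+', '*', '^'
}

#[derive(Debug)]
enum Expr {
    Atom(Token),
    Binary(char, Box<Expr>, Box<Expr>),
}

// Left/right binding powers; a higher right power gives left
// associativity, a higher left power gives right associativity
// (as for `^` in R).
fn infix_binding_power(op: char) -> Option<(u8, u8)> {
    Some(match op {
        '+' | '-' => (1, 2),
        '*' | '/' => (3, 4),
        '^' => (6, 5), // right-associative
        _ => return None,
    })
}

type Tokens = std::iter::Peekable<std::vec::IntoIter<Token>>;

fn parse_expr(tokens: &mut Tokens, min_bp: u8) -> Expr {
    let mut lhs = match tokens.next() {
        Some(Token::Num(n)) => Expr::Atom(Token::Num(n)),
        Some(Token::Ident(name)) => Expr::Atom(Token::Ident(name)),
        other => panic!("expected an atom, got {other:?}"),
    };
    loop {
        let op = match tokens.peek() {
            Some(Token::Op(c)) => *c,
            _ => break,
        };
        let Some((l_bp, r_bp)) = infix_binding_power(op) else { break };
        if l_bp < min_bp {
            break;
        }
        tokens.next(); // consume the operator
        let rhs = parse_expr(tokens, r_bp);
        lhs = Expr::Binary(op, Box::new(lhs), Box::new(rhs));
    }
    lhs
}

fn main() {
    // 1 + 2 * x ^ 2  ==>  (+ 1 (* 2 (^ x 2)))
    let tokens = vec![
        Token::Num(1.0),
        Token::Op('+'),
        Token::Num(2.0),
        Token::Op('*'),
        Token::Ident("x".into()),
        Token::Op('^'),
        Token::Num(2.0),
    ];
    let expr = parse_expr(&mut tokens.into_iter().peekable(), 0);
    println!("{expr:?}");
}
```

Error recovery and diagnostics are exactly the parts this sketch leaves out, which is where most of the real effort would go.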

It may be worth discussing if there could be further synergies between our projects. For instance regarding the new parser to generate rowan data structures. That said, Ark is an MIT project and must remain entirely MIT so we unfortunately wouldn't be able to integrate GPL components in Ark.

lionel- avatar Sep 12 '24 17:09 lionel-

Thanks @lionel-, great background info. rowan even has an example for S-expressions that seems like it would be reasonably straightforward to adopt as the target for R.

This parser uses pest, which provides a PrattParser with diagnostics, so we can jump straight to an AST-binding rowan interface.
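
For a sense of what an AST-binding rowan interface could look like, here is a toy example that hand-builds a lossless rowan tree for `x + 1`. The kinds and the `Language` impl are invented for the sketch and are not Ark's or this crate's actual types.

```rust
// Toy illustration of targeting rowan: build a lossless green tree for
// `x + 1` by hand. Kinds and language are invented for the sketch.
use rowan::{GreenNodeBuilder, Language, SyntaxNode};

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
#[repr(u16)]
enum RKind {
    Symbol = 0,
    Whitespace,
    Plus,
    Number,
    BinaryExpr,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
struct RLang;

impl Language for RLang {
    type Kind = RKind;
    fn kind_from_raw(raw: rowan::SyntaxKind) -> RKind {
        match raw.0 {
            0 => RKind::Symbol,
            1 => RKind::Whitespace,
            2 => RKind::Plus,
            3 => RKind::Number,
            _ => RKind::BinaryExpr,
        }
    }
    fn kind_to_raw(kind: RKind) -> rowan::SyntaxKind {
        rowan::SyntaxKind(kind as u16)
    }
}

fn main() {
    let mut builder = GreenNodeBuilder::new();
    // Nodes nest; tokens carry the raw source text (whitespace included),
    // so the original source can be reproduced byte-for-byte.
    builder.start_node(RLang::kind_to_raw(RKind::BinaryExpr));
    builder.token(RLang::kind_to_raw(RKind::Symbol), "x");
    builder.token(RLang::kind_to_raw(RKind::Whitespace), " ");
    builder.token(RLang::kind_to_raw(RKind::Plus), "+");
    builder.token(RLang::kind_to_raw(RKind::Whitespace), " ");
    builder.token(RLang::kind_to_raw(RKind::Number), "1");
    builder.finish_node();

    let root = SyntaxNode::<RLang>::new_root(builder.finish());
    assert_eq!(root.text().to_string(), "x + 1"); // lossless round-trip
    println!("{:#?}", root);
}
```

The idea would be for the parser to drive those `start_node`/`token`/`finish_node` calls instead of building the tree by hand.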

The license piece is the trickiest roadblock. Do you see it as being necessarily part of Ark? I had envisioned it as being exposed to R via an extendr R package. At that point, code actions and diagnostic information are just communicated to the language server via socket/stdio. Do you foresee necessary features that require tighter coupling?

dgkf avatar Sep 12 '24 17:09 dgkf

There was a great talk on rustfmt at rustconf yesterday and I just wanted to jot down a few of their recommendations:

  • don't use tokens or the AST directly; use an intermediate that captures the info necessary for both (a drastically simplified sketch of such an intermediate follows this list)
  • reduce optional formats (not clear if they meant alternative "correct" styles or user-facing options to enable different styles. I think they meant the former)
  • be more tolerant of code that violates rules (they gave line length as an example, saying some lines just don't make sense to break up to fit within a line width).
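
As a rough illustration of the "intermediate" recommended in the first bullet above, here is a drastically simplified Wadler-style document IR and renderer. The constructors and behavior are made up for the sketch and deliberately omit indentation.

```rust
// Drastically simplified Wadler-style document IR: the formatter lowers
// syntax into this, and a renderer picks flat vs. broken layout per
// group against a line-width budget.
enum Doc {
    Text(String),
    Line,            // a space when flat, a newline when broken
    Concat(Vec<Doc>),
    Group(Box<Doc>), // try to fit on one line, otherwise break its Lines
}

fn text(s: &str) -> Doc {
    Doc::Text(s.to_string())
}

// Width of a doc if rendered entirely flat.
fn flat_width(doc: &Doc) -> usize {
    match doc {
        Doc::Text(s) => s.len(),
        Doc::Line => 1,
        Doc::Concat(parts) => parts.iter().map(flat_width).sum(),
        Doc::Group(inner) => flat_width(inner),
    }
}

fn render(doc: &Doc, max_width: usize, col: &mut usize, broken: bool, out: &mut String) {
    match doc {
        Doc::Text(s) => {
            out.push_str(s);
            *col += s.len();
        }
        Doc::Line if broken => {
            out.push('\n');
            *col = 0;
        }
        Doc::Line => {
            out.push(' ');
            *col += 1;
        }
        Doc::Concat(parts) => {
            for part in parts {
                render(part, max_width, col, broken, out);
            }
        }
        Doc::Group(inner) => {
            // Break this group only if it cannot fit on the current line.
            let fits = *col + flat_width(inner) <= max_width;
            render(inner, max_width, col, !fits, out);
        }
    }
}

fn main() {
    // mean(x, na.rm = TRUE) as one group with a single soft line break.
    let call = Doc::Group(Box::new(Doc::Concat(vec![
        text("mean("),
        text("x,"),
        Doc::Line,
        text("na.rm = TRUE"),
        text(")"),
    ])));
    for width in [80, 10] {
        let mut out = String::new();
        render(&call, width, &mut 0, false, &mut out);
        println!("width {width}:\n{out}\n");
    }
}
```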

There's a brief discussion on the difficulties of handling macros, since they get the (post-macro-expansion) Rust AST. Not exactly a problem we'd have in R, but we would have to deal with NSE code, which falls into a similarly tricky spot.

dgkf avatar Sep 12 '24 18:09 dgkf

I suppose you're both already familiar with this but I found this blog post super interesting: https://journal.stuffwithstuff.com/2015/09/08/the-hardest-program-ive-ever-written/

Your third point is addressed at the end of the post.

we would have to deal with NSE code, which falls into a similarly tricky spot.

I guess an example of this would be to avoid breaking dplyr's {{ x }} across several lines.

Rust's (and other languages') formatters look like magic to me, so gradually learning about all those details is fascinating, thanks!

etiennebacher avatar Sep 12 '24 18:09 etiennebacher

Do you see it as being necessarily part of Ark? I had envisioned it as being exposed to R via an extendr R package.

Yep, it is an essential part of Ark. Almost everything in the LSP and formatter relies on the syntax tree generated by the parser (currently tree-sitter). And there's more to come (e.g. a semantic tree with symbol tables).

we would have to deal with NSE code, which falls into a similarly tricky spot.

I guess an example of this would be to avoid breaking dplyr's {{ x }} across several lines.

Good points. But this brings us to a tricky interaction between semantic analysis and syntactic analysis. It feels like, ideally, the formatter would work on single files and pieces of code without knowledge of the surrounding environment, and would produce the same results for all syntactically equivalent code.

For semantic analysis, our first goal is to deal with NSE and make it possible for packages to declare the scoping semantics of their functions with annotations. This way we know that local() or test_that() have an argument evaluated in a local scope, which is necessary to correctly analyse code (e.g. figuring out symbol references in { x <- 1; local(x <- ""); x }).
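
As a toy illustration of how such annotations could drive symbol resolution, here is a sketch with an invented expression type and a hand-populated table of "local scope" functions; none of this reflects Ark's actual design.

```rust
use std::collections::HashSet;

// Invented expression type and resolver, purely for illustration.
enum Expr {
    Symbol(String),
    Literal,
    Assign(String, Box<Expr>),
    Call { func: String, args: Vec<Expr> },
    Block(Vec<Expr>),
}

struct Resolver {
    // Functions annotated as evaluating their arguments in a fresh
    // local scope, e.g. local() or test_that().
    local_scope_funcs: HashSet<String>,
}

impl Resolver {
    // Walk an expression, tracking which symbols are bound in `scope`,
    // and report symbols used without any visible binding.
    fn resolve(&self, expr: &Expr, scope: &mut Vec<String>, unresolved: &mut Vec<String>) {
        match expr {
            Expr::Literal => {}
            Expr::Symbol(name) => {
                if !scope.contains(name) {
                    unresolved.push(name.clone());
                }
            }
            Expr::Assign(name, value) => {
                self.resolve(value, scope, unresolved);
                scope.push(name.clone());
            }
            Expr::Call { func, args } => {
                if self.local_scope_funcs.contains(func) {
                    // Bindings created inside this call must not leak out.
                    let mut inner = scope.clone();
                    for arg in args {
                        self.resolve(arg, &mut inner, unresolved);
                    }
                } else {
                    for arg in args {
                        self.resolve(arg, scope, unresolved);
                    }
                }
            }
            Expr::Block(exprs) => {
                for e in exprs {
                    self.resolve(e, scope, unresolved);
                }
            }
        }
    }
}

fn main() {
    // A variant of the example above: { x <- 1; local(y <- ""); y }
    let program = Expr::Block(vec![
        Expr::Assign("x".into(), Box::new(Expr::Literal)),
        Expr::Call {
            func: "local".into(),
            args: vec![Expr::Assign("y".into(), Box::new(Expr::Literal))],
        },
        Expr::Symbol("y".into()),
    ]);
    let resolver = Resolver {
        local_scope_funcs: ["local", "test_that"].into_iter().map(String::from).collect(),
    };
    let (mut scope, mut unresolved) = (Vec::new(), Vec::new());
    resolver.resolve(&program, &mut scope, &mut unresolved);
    println!("unresolved: {unresolved:?}"); // ["y"]: the binding inside local() does not leak
}
```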

Thanks for the pointer to the rustfmt talk, we'll check it out!

lionel- avatar Sep 13 '24 06:09 lionel-

Closing this since air is now released. Worth noting that there's another recent R formatter written in Rust: tergo

etiennebacher avatar Feb 25 '25 12:02 etiennebacher