"The parser"
It'd be worth covering:
- [ ] where the lexer is
- [ ] how the parser works
- [ ] conventions around error recovery
- [ ] What is this `eat` function?
  - It consumes a token, and queues up the next token as lookahead
- [ ] How do you handle identifiers and keywords?
- [ ] What about constructing spans?
  - (quoting Niko from below) I have to bring this back into cache, but in general there is this `last_span` thing that stores the span of the last token, and many things follow the idiom of saving the "lo" point of the span, parsing some stuff, then extracting the "hi" point and combining them. This would be used to make a span that encompasses, for example, an entire `trait` definition (the lo point would come from the `trait` keyword, but the end point comes after having parsed a bunch of other things).
- [ ] What is this crazy pretty printer?
  - It dumps out the AST in human-readable form.
  - We have tests that will parse each file in run-pass, pretty-print it, and then check that the resulting pretty-printed file type-checks, so you have to implement this.
- [ ] Something like this: https://github.com/rust-lang-nursery/rustc-guide/issues/13#issuecomment-361848936 and https://github.com/rust-lang-nursery/rustc-guide/issues/13#issuecomment-361992949
I'd be interested in helping out with the parser section. I haven't made any contributions to the Rust parser specifically, but I've played around with syn and proc macros and also done a fair amount of parsing for other projects (toy projects, custom file formats at work, etc).
Do you know if there's anyone with experience in the parser that I'd be able to bounce questions off?
@Michael-F-Bryan you can certainly bounce questions off of me; I would say the main owner of the parser is @petrochenkov
Copying some stuff I wrote in #21 that seems relevant:
What things would someone need to know when either working on the parser or trying to use the parser?
Hmm, good question. I imagine it'd be good to look at a few PRs that modified the parser to see what they had to do. Actually, it might be worth just including a link to those PRs in the guide to help people get oriented.
rust-lang/rust#45047 is such a PR, and here is a link to some of the relevant content.
Some things immediately jump out as worth explaining:
- What is this `eat` function? (see the sketch after this list)
  - It consumes a token, and queues up the next token as lookahead
- How do you handle identifiers and keywords?
- Actually, this PR didn't have to do anything around that. I'd have to find a good example (cc @petrochenkov -- thoughts?)
- What about constructing spans? (also covered in the sketch after this list)
  - I have to bring this back into cache, but in general there is this `last_span` thing that stores the span of the last token, and many things follow the idiom of saving the "lo" point of the span, parsing some stuff, then extracting the "hi" point and combining them. This would be used to make a span that encompasses, for example, an entire `trait` definition (the lo point would come from the `trait` keyword, but the end point comes after having parsed a bunch of other things).
- What is this crazy pretty printer?
  - It dumps out the AST in human-readable form.
  - We have tests that will parse each file in run-pass, pretty-print it, and then check that the resulting pretty-printed file type-checks, so you have to implement this.
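To make the `eat` and `last_span` idioms above concrete, here's a minimal sketch of a toy parser. The names and types here are hypothetical stand-ins, not libsyntax's actual `Parser`:

```rust
// Toy stand-ins for libsyntax's token and span types.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Token { Trait, Ident, OpenBrace, CloseBrace, Eof }

#[derive(Clone, Copy, Debug)]
struct Span { lo: usize, hi: usize }

impl Span {
    // Combine the "lo" point of one span with the "hi" point of a later one.
    fn to(self, end: Span) -> Span {
        Span { lo: self.lo, hi: end.hi }
    }
}

struct Parser {
    tokens: Vec<(Token, Span)>, // ends with an Eof token
    pos: usize,
    last_span: Span, // span of the most recently consumed token
}

impl Parser {
    fn token(&self) -> Token {
        self.tokens[self.pos].0
    }

    // `eat` in miniature: if the current token matches, consume it
    // (so the next token becomes the lookahead) and return true.
    fn eat(&mut self, t: Token) -> bool {
        if self.token() == t {
            self.last_span = self.tokens[self.pos].1;
            self.pos += 1;
            true
        } else {
            false
        }
    }

    // The lo/hi idiom: save the "lo" point, parse some stuff, then
    // combine it with the "hi" point of the last token consumed.
    fn parse_trait(&mut self) -> Option<Span> {
        let lo = self.tokens[self.pos].1; // span of the `trait` keyword
        if !self.eat(Token::Trait) {
            return None;
        }
        self.eat(Token::Ident);
        self.eat(Token::OpenBrace);
        self.eat(Token::CloseBrace);
        Some(lo.to(self.last_span)) // spans the whole trait definition
    }
}
```

The real parser is much richer, but the spirit is the same: `eat` consumes a token and advances the lookahead, and spans for large items are built by combining a saved "lo" point with the span of the last token parsed.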
I'm surprised how familiar a lot of this is! Quite a lot of parsers for toy languages I've written in Rust are effectively a state machine which incrementally consumes tokens (`eat()`). Your parser keeps track of where it's at (`last_span`) and then uses that to create a `Span` so you can figure out where an AST node originally came from. The "union" operation for joining spans is then a logical extension of this for when you compose larger nodes (e.g. a function body) from a bunch of smaller ones (each statement in the function).
I'm interested to find out how libsyntax deals with errors and tries to typecheck bits of your program even if you've got a syntax error elsewhere. My understanding is you break the tokens up into token trees, allowing you to "checkpoint" your progress at known points. Then if one particular token tree fails you still try to pass the rest of the (incomplete) AST to later stages... Is that somewhat close to how rustc does things?
@Michael-F-Bryan
I'm surprised how familiar a lot of this is!
Yeah, it's your basic recursive descent parser. Nothing too fancy. =)
I'm interested to find out how libsyntax deals with errors and tries to typecheck bits of your program even if you've got a syntax error elsewhere.
This is somewhat ad-hoc, but actually one of the things I would like documented. @nrc is the one who put in a lot of that, and they may be able to comment a bit, though I think @estebank has also put in some effort here. One particular pattern I recall that @estebank has used a few times is that they "checkpoint" the state of the parser and run "tentative" parses that are then discarded -- this is used to give better error messages, basically saying something like "it looks like you were writing a `fn` here, you need to add this keyword".
In short, though, we make a 'best effort' to recover and build up some sort of tree. We wind up reporting an error but then passing this tree back up to the rest of the code, which continues as usual. I'm not sure if there are any "error nodes" in the AST (I didn't see any in a quick search), but I would expect that to appear; we certainly use that later on. E.g., the HIR (constructed from the AST, not directly from the parser) has a node `TyErr` that represents an erroneous type.
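For illustration, the "checkpoint"/"tentative parse" pattern might look something like this sketch (hypothetical names; rustc's actual implementation differs):

```rust
// A hypothetical parser whose state can be cheaply snapshotted.
#[derive(Clone)]
struct Parser {
    pos: usize,
    // ... token stream, pending diagnostics, etc.
}

impl Parser {
    // Run `f` tentatively; on failure, roll back to the checkpoint so
    // the caller can try another interpretation or emit a friendlier
    // error ("it looks like you were writing a fn here...").
    fn try_parse<T>(
        &mut self,
        f: impl FnOnce(&mut Parser) -> Result<T, String>,
    ) -> Result<T, String> {
        let checkpoint = self.clone();
        match f(self) {
            Ok(v) => Ok(v),
            Err(e) => {
                *self = checkpoint; // discard the tentative parse
                Err(e)
            }
        }
    }
}
```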
I'm moving across a conversation I was having with @mark-i-m about how much we want to explain parser basics.
Should the parsing chapter go into more detail about lexing and parsing or rely on external sources for the basics?
I was asking myself this exact question when I wrote the start of the parser chapter. Should we add a small note up the top saying we assume people know how a basic recursive descent parser works and what tokenizing/lexical analysis is? The idea being that this is a book about rustc internals, not an introduction to parsing.
There is already loads of good quality material on basic parsers on the internet, a couple paragraphs at the top of the chapter probably wouldn't be able to do it justice.
I agree that we shouldn't try to teach parsing here, but given that I don't expect most people to know basic parsing, I worry that it would discourage contributions... Perhaps we can
- Add a high level overview of the algorithm and point to a few solid resources for learning in detail
- Give some key term definitions
- Tie them all back to the code
What do you think?
Sounds like a good idea. I was thinking we could say something like this:
Rust syntax is specified by a grammar (link) which is essentially a list of rules where each rule specifies how a piece of the language is written (e.g. a crate contains multiple items, an item may be a function declaration, a function has a bunch of statements, a statement is a ...), with each rule being written in terms of other rules or terminals (the base case, typically tokens).
Generally speaking, for each grammar rule there will be one parser method. In this way we can translate a token stream into an AST by recursively calling the appropriate method.
For people who already know how parsers work it's essentially recursive descent 101, but something like that is probably complete enough to give people the general idea, yet short enough not to be off topic.
You could then tie all of this back to rustc by inspecting a sample code snippet (e.g. an if statement) and then showing what would be called when parsing it. This probably also shows how you deal with ParseSess and CodeMaps.
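As a sketch of that rule-to-method correspondence (a toy grammar and parser, not libsyntax's actual code), parsing an if expression might look like:

```rust
// Toy grammar:
//   Expr  -> If | Ident
//   If    -> "if" Expr Block
//   Block -> "{" Expr "}"
// One parser method per grammar rule.

#[derive(Debug)]
enum Expr {
    If(Box<Expr>, Box<Expr>),
    Ident(String),
}

struct Parser {
    tokens: Vec<String>,
    pos: usize,
}

impl Parser {
    fn bump(&mut self) -> String {
        let t = self.tokens[self.pos].clone();
        self.pos += 1;
        t
    }

    fn expect(&mut self, t: &str) {
        assert_eq!(self.bump(), t, "syntax error");
    }

    // Expr -> If | Ident
    fn parse_expr(&mut self) -> Expr {
        if self.tokens[self.pos] == "if" {
            self.parse_if()
        } else {
            Expr::Ident(self.bump())
        }
    }

    // If -> "if" Expr Block
    fn parse_if(&mut self) -> Expr {
        self.expect("if");
        let cond = self.parse_expr();
        let body = self.parse_block();
        Expr::If(Box::new(cond), Box::new(body))
    }

    // Block -> "{" Expr "}"
    fn parse_block(&mut self) -> Expr {
        self.expect("{");
        let e = self.parse_expr();
        self.expect("}");
        e
    }
}

fn main() {
    // "if cond { body }" as pre-lexed tokens.
    let mut p = Parser {
        tokens: ["if", "cond", "{", "body", "}"]
            .iter().map(|s| s.to_string()).collect(),
        pos: 0,
    };
    println!("{:?}", p.parse_expr());
}
```

Parsing `if cond { body }` recursively calls `parse_expr`, which dispatches to `parse_if`, which in turn calls `parse_expr` and `parse_block`; that call trace is exactly the kind of walkthrough the chapter could show.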
Copying over Niko's comment from #26 :
I agree that we shouldn't try to teach parsing here, but given that I don't expect most people to know basic parsing, I worry that it would discourage contributions... Perhaps we can
I think I agree with both of you. I don't think we want a lot of introductory material; a few links don't hurt, but not too much. But I think there's a third way, though it may take some iteration to get there: To some extent, I think you can serve both audiences by doing a kind of "walk through" of the code.
In other words, e.g. to explain tokenizing, we might point to the token data structure and give some source showing how it would be divided into tokens (we can always link to wikipedia or something too). This way, if you know what a token is, you learn about the Rust-specific parts of it. If you don't know what a token is, you can just understand it as this Rust data structure and later learn about the more general form.
Similarly I imagine we can say something like "Rust has a recursive-descent parser" (where we link to wikipedia) and then walk through how it would parse some small example, showing a few key functions (e.g., the one that parses a type). If you're not familiar with recursive descent, this will basically give you the idea, but if you are, then you'll learn about the names of key concepts in the code.
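For instance, that tokenizing walkthrough could show something like this (a simplified, hypothetical token type, not rustc's actual one):

```rust
// A simplified token, loosely modeled on the idea of a compiler token.
#[derive(Debug)]
enum Token {
    Keyword(&'static str), // e.g. `fn`, `let`, `if`
    Ident(String),         // e.g. `foo`
    Literal(String),       // e.g. `42`, kept as raw text
    OpenParen,
    CloseParen,
    OpenBrace,
    CloseBrace,
}

fn main() {
    // The source `fn foo() {}` might be divided into:
    let tokens = vec![
        Token::Keyword("fn"),
        Token::Ident("foo".to_string()),
        Token::OpenParen,
        Token::CloseParen,
        Token::OpenBrace,
        Token::CloseBrace,
    ];
    println!("{:?}", tokens);
}
```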
Triaged: not a whole lot of change. There was some discussion earlier with @chrissimpkins, @matklad and @Centril (who is taking some time off atm). Perhaps Chris can give an update?
Definitely still planned. I am digging into the parser source and just submitted my first parser-area PR this week. :)
Once I understand the source well enough I will get started on the chapter if someone else does not get to it first. The conversations that I held with Aleksey and Mazdak are in our Zulip channel for anyone interested in this information. They informed the rustc-dev-guide Overview section and there should be plenty of information in there to get a start on the main lexer/parser chapter.
@chrissimpkins This discussion right? https://rust-lang.zulipchat.com/#narrow/stream/196385-t-compiler.2Fwg-rustc-dev-guide/topic/Parser.20Overview.20and.20Guide.20updates/near/192620162
It looks like the conversation with Aleksey was an impromptu discussion on Apr 7, 2020 that occurred in a private thread. I copied the text below.
Hi Aleksey! Our WG is working on a rustc Overview document for the rustc-dev-guide and I am interested in updating the lexer documentation. Do you happen to have some time in the next week that we could use to get together for 10 mins or so to discuss the lexer?
matklad 12:33 PM Yup!
12:33 PM I actually have time right now, if that's convenient
Chris Simpkins 12:34 PM Sure, that would be great!
12:34 PM and thanks!
12:35 PM One sec and I will link in where we are with the Overview draft. This will be a high level summary that lives at the beginning of the Guide. I'd like to dive into more details on the lexer (and parser, I met with @centril last week to discuss parsing) and update the lexing/parsing chapter too
12:36 PM This is the current state of my draft of the lexer/parser information: https://github.com/rust-lang/rustc-dev-guide/pull/671
12:36 PM Sorry wrong PR. https://github.com/rust-lang/rustc-dev-guide/pull/669 (#669)
12:38 PM I'd like to understand the two levels of the lexer, a bit about the error handling that happens at the lexer stage, and where the best source entry points are for the documentation.
matklad 12:39 PM Sure!
12:39 PM So, the main confusing point I imagine is "WTF Y 2 lexers?"
12:40 PM The rustc_lexer crate is a relatively new addition
12:40 PM the lexer/mod historically was the single lexer
12:41 PM the problem with lexer/mod is that it did two things simultaneously
Chris Simpkins 12:41 PM There are comments about the second stage lexer leading to a "richer" token stream. Can you expand on what that means, relevance to the parser?
matklad 12:42 PM breaking the input stream into chunks (like, chars from 10 to 200 are a string literal)
12:42 PM turning string chunks into "lowered" tokens. Like turning a "100u32" string into a 100 number
12:43 PM This second transformation turned out to be problematic, as it erased some information from the original source
12:44 PM (like, 100 and 1_00 are indistinguishable after it)
12:44 PM And we want to preserve full-fidelity info for both IDEs and proc macros
12:45 PM So, rustc_lexer was extracted to deal solely with the task of breaking a string into tokens. This crate is now shared between rust-analyzer and rustc
12:45 PM The lexer/mod is still there, mostly to maintain the old "rich tokens" interface and avoid rewriting the whole parser
12:46 PM But, in the ideal world, we get rid of lexer/mod and make the parser work directly with the tokens produced by rustc_lexer.
Chris Simpkins 12:46 PM Got it. That history is very helpful!
12:47 PM Is there any work in the direction of deprecating that source, desire to point those who might wander across the docs to get involved in work on making a transition like that happen?
matklad 12:50 PM Kinda, but it's not really well mentored :(
12:50 PM In particular, https://github.com/rust-lang/rust/issues/63689 is part of the long-term vision
Chris Simpkins 12:50 PM OK, np. Just wanted to see if we can help to push hands in that direction through docs if it is desirable
12:52 PM As I understand the call path issue, rustc_interface calls into the parsing code and the instantiation of the Parser struct is where the lexer is instantiated and StringReader emits tokens. Is that a fair assessment/understanding?
12:52 PM of the connection between lexer / parser and the rustc executable
matklad 12:52 PM I think that has been restructured since I've last looked at that code
Chris Simpkins 12:53 PM Is there a good lexer entry point that we can use for the documentation, and are there more than one that are relevant to document?
matklad 12:54 PM for rustc_lexer, the whole interface is more-or-less this function: https://github.com/rust-lang/rust/blob/39b62533c7f9d0581a6ea9b9fc2cc51f21c3b5b0/src/librustc_lexer/src/lib.rs#L246
Chris Simpkins 12:54 PM Perfect! that is very helpful :slight_smile:
12:55 PM and my last question, anything special about error handling that you feel that we should mention in the Overview? I will work on the Guide chapter later and we can go into more detail. High level documentation needs for the lexer?
matklad 12:57 PM Not really. Maybe just that rustc_lexer tries to have a small interface (so that it's easier to share with rust-analyzer), so it doesn't depend directly on the diagnostic infra in rustc, and instead just provides diagnostics as plain-old-data, which are then emitted in lexer/mod as real diagnostics
Chris Simpkins 12:59 PM This has been incredibly helpful! I really appreciate your time. Thank you very much.
@chrissimpkins This discussion right? rust-lang.zulipchat.com/#narrow/stream/196385-t-compiler.2Fwg-rustc-dev-guide/topic/Parser.20Overview.20and.20Guide.20updates/near/192620162
correct!
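To summarize the two-lexer split matklad describes in the transcript above, here is an illustrative sketch. The types and the `lower` function are hypothetical stand-ins, not the real rustc_lexer API:

```rust
// Stage 1 (rustc_lexer's job, illustrative types): break the raw
// source into low-level tokens that preserve full fidelity -- just
// a kind and a length, with no interpretation of the text.
#[derive(Debug)]
struct RawToken {
    kind: RawTokenKind,
    len: usize, // how many bytes of source this token covers
}

#[derive(Debug)]
enum RawTokenKind {
    Ident,
    IntLiteral, // "100" and "1_00" are still distinguishable here
    Whitespace,
}

// Stage 2 (historically lexer/mod's job): "lower" raw tokens into
// the rich tokens the parser consumes, interpreting the text and
// emitting real diagnostics along the way.
#[derive(Debug)]
enum RichToken {
    Ident(String),
    Int(u128), // 100 and 1_00 both become 100 here; fidelity is lost
}

fn lower(src: &str, raw: &RawToken, start: usize) -> RichToken {
    let text = &src[start..start + raw.len];
    match raw.kind {
        RawTokenKind::Ident => RichToken::Ident(text.to_string()),
        RawTokenKind::IntLiteral => {
            let digits: String = text.chars().filter(|c| *c != '_').collect();
            RichToken::Int(digits.parse().unwrap())
        }
        RawTokenKind::Whitespace => unreachable!("skipped before lowering"),
    }
}

fn main() {
    let src = "1_00";
    let raw = RawToken { kind: RawTokenKind::IntLiteral, len: src.len() };
    println!("{:?} -> {:?}", raw, lower(src, &raw, 0));
}
```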