chumsky
chumsky copied to clipboard
Example of Python-like parsing
Having an example on how to parse Python-like languages that are aware of indentation would be interesting.
Python's parser/lexer (whichever processes identation, I've not checked) is not context-free because correctly parsing each line relies on the contextual information available in previous lines (i.e: lines 'pair up' with neighbouring lines into blocks if their indentation matches).
Chumsky (and virtually all parser combinators and parser generators) does not play well with context-sensitive grammars. It's usually necessary to hand-write a context-sensitive parser for such languages.
All is not lost, however! A neat trick you can use is to write your lexer by hand and have it translate Pythonic indentation into traditional code blocks with delimiters. Once this is done, the tokens can be fed into Chumsky and parsed in a context-free manner.
I would actually advise you use this strategy: get rid of the context-sensitive indentation as early as you possibly can in the compiler, it makes parsing (and representing the AST later on) needlessly complex.
I've actually written an example lexer that does this in the past, complete with correct handling of parentheses and newlines. You can take a look at it here. Once you've turned the text into a stream of tokens, they're pretty easy to feed to Chumsky.
It's probably possible to implement something similar as custom parser for Chumsky too, although it's obviously a bit ugly for the reasons I mentioned above. I'll look into this a little more to see if it's possible.
Thank you for the detailed answer and the bonus example! Really appreciate it.
Something I have seen before is introducing special INDENT
and OUTDENT
tokens in your lexer. I thought that was quite clever.
Something I have seen before is introducing special
INDENT
andOUTDENT
tokens in your lexer. I thought that was quite clever.
Yep, that's how I've seen a lot of lexer/parser combinations do it. Although I find it then hard to work with languages (like Python) that use newlines as a separator and as a whitespace.
@zesterer Apologies for the bump after two weeks, but I'm curious.
I'll look into this a little more to see if it's possible.
If it turns out not to be possible, could an indent-aware parsing example still be included in the /examples/
directory? That way, users would know how they might make Chumsky work with indent-aware content, even if it requires some hand-written code on their end.
I've had a think about it and although I believe that it should be possible to implement, I don't think that any implementation I could add to Chumsky would be sufficiently flexible to cover all use-cases.
In terms of a hand-written example implementation, I'm very happy to accept something like this if someone is interested in implementing it. Right now, unfortunately, I'm focusing on a number of other things so I likely won't get time to work on it until after New Year.
I've actually made some progress on this! master
now contains a parser called semantic_whitespace
that can be used to tokenise Python-style syntax into token trees, respecting semantic indentation. You can find an example of its use here.
I'd really like feedback about how this function might be improved: I'm sure that it's not general enough for some cases, but it's not yet clear to me how best to generalise it. Thoughts would be very welcome!
Hi @zesterer,
Thank you for adding the semantic_whitespace
parser. It looks promising!
I am new to Chumsky so it took me a while to figure out how to feed the nested token trees from the lexing stage to the next stage. I think I need to use Stream::from_nested
to create a flattened token stream before feeding them to the parser.
I am hoping the exsample could demonstrate that too. So I have updated the example here. Am I on the right track? or is there any better way?
Could that approach be merged back to master?
@tatsuya6502
Hi @zesterer,
Thank you for adding the
semantic_whitespace
parser. It looks promising!I am new to Chumsky so it took me a while to figure out how to feed the nested token trees from the lexing stage to the next stage. I think I need to use
Stream::from_nested
to create a flattened token stream before feeding them to the parser.I am hoping the exsample could demonstrate that too. So I have updated the example here. Am I on the right track? or is there any better way?
Hey, sorry it took so long to respond to this!
Yes, this approach works fine for now. That said, I'm a little hesitant to recommend it as a long-term solution because it's quite verbose. I'm still hopeful that there's a way to add support for parsing token trees into the zero-copy
branch, but it seems... mechanically complicated to say the least.
Could you open a PR with this change to the example? I think it's merge-worthy still, provided it's made clear that the Stream::from_nested
is conceptually distinct from the lexing stage.
Hi @zesterer,
No worries. I opened the PR #166. I simply rebased my branch to the latest master. Please let me know if you want me to change some codes.
That said, I'm a little hesitant to recommend it as a long-term solution because it's quite verbose. I'm still hopeful that there's a way to add support for parsing token trees into the
zero-copy
branch, but it seems... mechanically complicated to say the least.
I hope you will find a way. But I am already quite happy with current Chumsky too. Thank you for creating such a great library!
I'm not certain if the example is meant to cover this, but the pythonic parser does not enforce that the individual tokens be separated by whitespace, meaning ambiguous syntax like
10foo:
20bar
Gets parsed to: [(Int(10), 0..2), (Ident("foo"), 2..5), (Op(":"), 5..6), (Open(Block), 9..14), (Int(20), 9..11), (Ident("bar"), 11..14), (Close(Block), 9..14)]
I could not figure out how to enforce that while using the semantic_whitespace
parser, but that could be down to me not understanding the library well enough yet.
The pythonic
example is just a very simple example, not a full parser for Python code. The issue you're seeing is unrelated to semantic_whitespace
, it's just a product of it being a simple example rather than a fully-fledged parser.
hello, I am not sure whether here is the right place to ask about the question.
it seems that the semantic_indentation
parser will report indentation demilited code blocks with span (index_of_opening_indentation, end_of_the_opening_indentation_line)
. but for parenthesis delimited code block the span is correct.
In pythonic.rs
example, here is a parenthesis delimited code block
and here are the related items in token stream
here is a indentation code block
here are the items
I am writing a parser for a dsl which is a superset of python, and when i am building the lexer, I have seen that as early as the nested TokenTree step, the span is not correct.
obviously the span should be 14..44
I am wandering is this mean that semantic_indentation
parser has some bug? or is this by-design?
I am new to language parser and chumsky, thank you!
@TTTPOB
It's quite likely that semantic_indentation
has either a bug or is not quite able to express what you want: it's a stop-gap for chumsky's (current) lack of true context-sensitive parsing, which is required to parse Python-style indentation properly.
However, there's rapid movement on this front! The zero-copy
branch (due for merge soon) is getting a rugged, powerful system for expressive context-sensitive parsing. You can see an example of this being used to parse Python-style indentation here.
If you feel you can't wait for zero-copy
and a new release (I suspect it's probably about a month away), my best advice would be to parse indentation in a hand-written lexer and pass it on to chumsky as delimiter token, such as C-style braces.
If you feel you can't wait for zero-copy and a new release (I suspect it's probably about a month away), my best advice would be to parse indentation in a hand-written lexer and pass it on to chumsky as delimiter token, such as C-style braces.
Thank you! it looks promising, I will try to write a lexer first, and maybe switch to the new release after.
zero-copy
now supports context-sensitive parsing (see #269), enough to properly implement Python style indentation sensitivity (and much more!). There's even an example.
@zesterer this is fantastic. Thank you so much for all the hard work on this.