chumsky icon indicating copy to clipboard operation
chumsky copied to clipboard

Example of Python-like parsing

Open VinnyVicious opened this issue 2 years ago • 11 comments

Having an example on how to parse Python-like languages that are aware of indentation would be interesting.

VinnyVicious avatar Nov 03 '21 21:11 VinnyVicious

Python's parser/lexer (whichever processes identation, I've not checked) is not context-free because correctly parsing each line relies on the contextual information available in previous lines (i.e: lines 'pair up' with neighbouring lines into blocks if their indentation matches).

Chumsky (and virtually all parser combinators and parser generators) does not play well with context-sensitive grammars. It's usually necessary to hand-write a context-sensitive parser for such languages.

All is not lost, however! A neat trick you can use is to write your lexer by hand and have it translate Pythonic indentation into traditional code blocks with delimiters. Once this is done, the tokens can be fed into Chumsky and parsed in a context-free manner.

I would actually advise you use this strategy: get rid of the context-sensitive indentation as early as you possibly can in the compiler, it makes parsing (and representing the AST later on) needlessly complex.

I've actually written an example lexer that does this in the past, complete with correct handling of parentheses and newlines. You can take a look at it here. Once you've turned the text into a stream of tokens, they're pretty easy to feed to Chumsky.

It's probably possible to implement something similar as custom parser for Chumsky too, although it's obviously a bit ugly for the reasons I mentioned above. I'll look into this a little more to see if it's possible.

zesterer avatar Nov 04 '21 10:11 zesterer

Thank you for the detailed answer and the bonus example! Really appreciate it.

VinnyVicious avatar Nov 04 '21 18:11 VinnyVicious

Something I have seen before is introducing special INDENT and OUTDENT tokens in your lexer. I thought that was quite clever.

Xandaros avatar Nov 09 '21 23:11 Xandaros

Something I have seen before is introducing special INDENT and OUTDENT tokens in your lexer. I thought that was quite clever.

Yep, that's how I've seen a lot of lexer/parser combinations do it. Although I find it then hard to work with languages (like Python) that use newlines as a separator and as a whitespace.

mainrs avatar Nov 12 '21 14:11 mainrs

@zesterer Apologies for the bump after two weeks, but I'm curious.

I'll look into this a little more to see if it's possible.

If it turns out not to be possible, could an indent-aware parsing example still be included in the /examples/ directory? That way, users would know how they might make Chumsky work with indent-aware content, even if it requires some hand-written code on their end.

LikeLakers2 avatar Nov 30 '21 00:11 LikeLakers2

I've had a think about it and although I believe that it should be possible to implement, I don't think that any implementation I could add to Chumsky would be sufficiently flexible to cover all use-cases.

In terms of a hand-written example implementation, I'm very happy to accept something like this if someone is interested in implementing it. Right now, unfortunately, I'm focusing on a number of other things so I likely won't get time to work on it until after New Year.

zesterer avatar Nov 30 '21 15:11 zesterer

I've actually made some progress on this! master now contains a parser called semantic_whitespace that can be used to tokenise Python-style syntax into token trees, respecting semantic indentation. You can find an example of its use here.

I'd really like feedback about how this function might be improved: I'm sure that it's not general enough for some cases, but it's not yet clear to me how best to generalise it. Thoughts would be very welcome!

zesterer avatar Apr 14 '22 17:04 zesterer

Hi @zesterer,

Thank you for adding the semantic_whitespace parser. It looks promising!

I am new to Chumsky so it took me a while to figure out how to feed the nested token trees from the lexing stage to the next stage. I think I need to use Stream::from_nested to create a flattened token stream before feeding them to the parser.

I am hoping the exsample could demonstrate that too. So I have updated the example here. Am I on the right track? or is there any better way?

tatsuya6502 avatar May 01 '22 07:05 tatsuya6502

Could that approach be merged back to master?

VinnyVicious avatar Jul 05 '22 14:07 VinnyVicious

@tatsuya6502

Hi @zesterer,

Thank you for adding the semantic_whitespace parser. It looks promising!

I am new to Chumsky so it took me a while to figure out how to feed the nested token trees from the lexing stage to the next stage. I think I need to use Stream::from_nested to create a flattened token stream before feeding them to the parser.

I am hoping the exsample could demonstrate that too. So I have updated the example here. Am I on the right track? or is there any better way?

Hey, sorry it took so long to respond to this!

Yes, this approach works fine for now. That said, I'm a little hesitant to recommend it as a long-term solution because it's quite verbose. I'm still hopeful that there's a way to add support for parsing token trees into the zero-copy branch, but it seems... mechanically complicated to say the least.

Could you open a PR with this change to the example? I think it's merge-worthy still, provided it's made clear that the Stream::from_nested is conceptually distinct from the lexing stage.

zesterer avatar Jul 05 '22 14:07 zesterer

Hi @zesterer,

No worries. I opened the PR #166. I simply rebased my branch to the latest master. Please let me know if you want me to change some codes.

That said, I'm a little hesitant to recommend it as a long-term solution because it's quite verbose. I'm still hopeful that there's a way to add support for parsing token trees into the zero-copy branch, but it seems... mechanically complicated to say the least.

I hope you will find a way. But I am already quite happy with current Chumsky too. Thank you for creating such a great library!

tatsuya6502 avatar Jul 06 '22 17:07 tatsuya6502

I'm not certain if the example is meant to cover this, but the pythonic parser does not enforce that the individual tokens be separated by whitespace, meaning ambiguous syntax like

10foo:
    20bar

Gets parsed to: [(Int(10), 0..2), (Ident("foo"), 2..5), (Op(":"), 5..6), (Open(Block), 9..14), (Int(20), 9..11), (Ident("bar"), 11..14), (Close(Block), 9..14)] I could not figure out how to enforce that while using the semantic_whitespace parser, but that could be down to me not understanding the library well enough yet.

ehllie avatar Sep 30 '22 12:09 ehllie

The pythonic example is just a very simple example, not a full parser for Python code. The issue you're seeing is unrelated to semantic_whitespace, it's just a product of it being a simple example rather than a fully-fledged parser.

zesterer avatar Sep 30 '22 12:09 zesterer

hello, I am not sure whether here is the right place to ask about the question. it seems that the semantic_indentation parser will report indentation demilited code blocks with span (index_of_opening_indentation, end_of_the_opening_indentation_line). but for parenthesis delimited code block the span is correct. In pythonic.rs example, here is a parenthesis delimited code block image and here are the related items in token stream image

here is a indentation code block image here are the items image

I am writing a parser for a dsl which is a superset of python, and when i am building the lexer, I have seen that as early as the nested TokenTree step, the span is not correct. image

obviously the span should be 14..44

I am wandering is this mean that semantic_indentation parser has some bug? or is this by-design? I am new to language parser and chumsky, thank you!

TTTPOB avatar Feb 09 '23 16:02 TTTPOB

@TTTPOB

It's quite likely that semantic_indentation has either a bug or is not quite able to express what you want: it's a stop-gap for chumsky's (current) lack of true context-sensitive parsing, which is required to parse Python-style indentation properly.

However, there's rapid movement on this front! The zero-copy branch (due for merge soon) is getting a rugged, powerful system for expressive context-sensitive parsing. You can see an example of this being used to parse Python-style indentation here.

If you feel you can't wait for zero-copy and a new release (I suspect it's probably about a month away), my best advice would be to parse indentation in a hand-written lexer and pass it on to chumsky as delimiter token, such as C-style braces.

zesterer avatar Feb 09 '23 21:02 zesterer

If you feel you can't wait for zero-copy and a new release (I suspect it's probably about a month away), my best advice would be to parse indentation in a hand-written lexer and pass it on to chumsky as delimiter token, such as C-style braces.

Thank you! it looks promising, I will try to write a lexer first, and maybe switch to the new release after.

TTTPOB avatar Feb 10 '23 01:02 TTTPOB

zero-copy now supports context-sensitive parsing (see #269), enough to properly implement Python style indentation sensitivity (and much more!). There's even an example.

zesterer avatar Feb 20 '23 22:02 zesterer

@zesterer this is fantastic. Thank you so much for all the hard work on this.

VinnyVicious avatar Apr 18 '23 22:04 VinnyVicious