
Soft Keywords and How to Implement Them

Open isidentical opened this issue 4 years ago • 45 comments

I've been thinking for a day about how we could implement PEP 622 in case it gets accepted. While thinking, I drafted a piece of code that might seem totally unreasonable (since it changes the parser from LL(1) to semi-LL(1)), but I wanted to share it as a possible last resort. I tried to document the strategy as inline comments, so feel free to read and comment on it:

https://github.com/davidhalter/parso/compare/master...isidentical:soft-keywords-very-very-bad-demo?expand=1

isidentical avatar Jun 25 '20 18:06 isidentical

CC: @davidhalter @bgw

isidentical avatar Jun 25 '20 18:06 isidentical

@isidentical Interesting work. I'm not sure we should go this route. An alternative way to deal with "soft keywords" is using something like name: NAME | 'match'. That would of course need a step where match is converted back to a name if it's not a keyword (i.e. if the parent is not a match_stmt). What do you think about that idea?
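
Roughly, the "convert it back" step could look something like this minimal, self-contained sketch (hypothetical Leaf/Node classes, not parso's actual tree API):

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    type: str                                    # e.g. "match_stmt" or "expr_stmt"
    children: list = field(default_factory=list)

@dataclass
class Leaf:
    type: str                                    # "keyword" or "name"
    value: str
    parent: Optional[Node] = None

def demote_soft_keyword(leaf: Leaf) -> Leaf:
    # The grammar accepted 'match' as a keyword everywhere (name: NAME | 'match'),
    # so turn it back into an ordinary name unless its parent really is a match_stmt.
    if leaf.value == "match" and (leaf.parent is None or leaf.parent.type != "match_stmt"):
        leaf.type = "name"
    return leaf

m = Leaf("keyword", "match", parent=Node("expr_stmt"))
print(demote_soft_keyword(m).type)               # -> name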

However I would probably prefer a PEG parser at that point. But at the same time I haven't really thought about that either.

PS: I hope that the PEP gets rejected. IMO this is just additional syntax without major benefits. If pattern matching were at least an expression, then I would maybe see it, but this way it's just useless (you can do most things with if).

davidhalter avatar Jun 27 '20 10:06 davidhalter

What do you think about that idea?

Will look into that (and possibly draft a PoC), but at least for now, I don't think it would be the right fit.

However I would probably prefer a PEG parser at that point. But at the same time I haven't really thought about that either.

Yes, me too (if we can). As I said, this is just a workaround for that problem. Unfortunately, I can't start such an effort myself, but I can participate whenever I'm free. (If anyone is interested in the idea, and maybe a Rust/C backend.)

PS: I hope that the PEP gets rejected. IMO this is just additional syntax without major benefits. If pattern matching were at least an expression, then I would maybe see it, but this way it's just useless (you can do most things with if).

I ~partially agree. I don't really mind if it gets accepted, but not as is. IMHO it needs major changes, including very big simplifications. It took hours and hours to finally comprehend the whole document, which I believe shows it is a very complex and dense language addition, at least as a single change. It might be worth dividing the PEP into parts and gradually including/excluding some of them. Let's wait and see, no need to rush :P

isidentical avatar Jun 27 '20 11:06 isidentical

Yes, me too (if we can). As I said, this is just a workaround for that problem. Unfortunately, I can't start such an effort myself, but I can participate whenever I'm free. (If anyone is interested in the idea, and maybe a Rust/C backend.)

Why can't you start? Because it would be written in a lower level language?

It took hours and hours to finally comprehend the whole document, which I believe shows it is a very complex and dense language addition, at least as a single change

Yeah, I also didn't read the whole thing, it's simply too long and complicated. But that might just be this iteration of the draft.

davidhalter avatar Jun 27 '20 13:06 davidhalter

Why can't you start? Because it would be written in a lower level language?

No, I'm currently in the process of getting a job for this summer, and I'm not sure I can dedicate time to it, at least not on my own. If you or anybody else attempts it, I can help in my free time.

isidentical avatar Jun 27 '20 16:06 isidentical

@davidhalter would you check this out (and comment if you have anything to say)? https://bugs.python.org/issue40360

isidentical avatar Jul 09 '20 15:07 isidentical

I have thought quite a bit about this. It's similar to your plan, but not exactly the same.

We can probably use our existing LL parser and make it PEG-compatible with a small trick: backtracking in case of ambiguities (otherwise use normal LL). At the backtracking point, just use the LL algorithm again for the plan.

This is very similar to PEG and would need some work, but it's probably not even that hard to do. The biggest challenge will probably be keywords/soft keywords and backtracking.

It seems like your solution is quite similar to this, but you seem to transition ahead to make it clear which plan to take. This works of course as well.

davidhalter avatar Oct 23 '20 19:10 davidhalter

Interesting. Do you have time to come up with a draft that supports both soft keywords and multiline context managers? Obviously, we don't need an ideal, perfect parsing strategy, so we could just embed the backtracking into the LL parser. For reference, I've written about it: https://tree.science/what-the-backtracking.html

isidentical avatar Oct 23 '20 19:10 isidentical

At the moment I have other priorities, but I will definitely come up with something before 3.10 releases if you don't do the work :)

My goal for parso is just to get it running so people can use it. My priorities lie with creating a parser that is both faster and better than parso (and is probably a different project).

davidhalter avatar Oct 23 '20 19:10 davidhalter

if you don't do the work :)

:-) [inserts a november song]

My goal for parso is just to get it running so people can use it. My priorities lie with creating a parser that is both faster and better than parso (and is probably a different project).

I like hacky ways :sunglasses:

isidentical avatar Oct 23 '20 19:10 isidentical

I think your general approach is pretty good; I would probably just use a slightly different approach in the generator (if possible): I would move some of the logic there to calculate the plans beforehand. But I haven't really thought about everything, and your solution might be what we need.

davidhalter avatar Oct 23 '20 19:10 davidhalter

I started on some stuff, but calculating the rules beforehand is a bit tricky. The thing that makes backtracking harder for us is error recovery. If we could assume all input is valid, we would iterate over all possible plans (e.g. the optional parentheses of the with statement), do a dry run, and see which of the alternatives completes; as soon as one completes, we could return it. But since we also do error recovery, we need to take one token at a time, feed all the plans, get the next possible plans, and repeat this until we find a plan that takes more input than the others.

The proposed patch does this on a very small scale, but for ambiguous rules I might have an alternate patch (still working on it, no luck so far). Basically, instead of keeping transitions as a 1-to-1 mapping from token_type to a DFA, it keeps them as a 1-to-many mapping from token_type to [DFA1, DFA2, ...]. When we proceed to the parser, we check how many DFAs we have: if there is one, we just use it; if there is more than one, we repeat the process I described above.
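
A rough, self-contained sketch of that 1-to-many idea (made-up names and a toy Plan class, not the actual parso internals):

class Plan:
    # Toy stand-in for a DFA: it "accepts" a token if it is the next one in a
    # fixed expected sequence.
    def __init__(self, name, expected):
        self.name = name
        self.expected = list(expected)
        self.position = 0

    def accepts(self, token):
        if self.position < len(self.expected) and self.expected[self.position] == token:
            self.position += 1
            return True
        return False

def choose_plan(candidates, upcoming_tokens):
    if len(candidates) == 1:
        return candidates[0]                      # unambiguous: plain LL(1) behaviour
    alive = list(candidates)
    for token in upcoming_tokens:                 # dry run, one token at a time
        still_alive = [plan for plan in alive if plan.accepts(token)]
        if not still_alive:
            break                                 # keep the last surviving plans (error recovery)
        alive = still_alive
        if len(alive) == 1:
            break
    return alive[0] if alive else None            # the plan that consumed the most input wins

# token_type -> [plan1, plan2, ...] instead of token_type -> single plan
transitions = {
    "with": [Plan("with_parens", ["with", "(", "item", ")", ":"]),
             Plan("with_plain", ["with", "item", ":"])],
}
print(choose_plan(transitions["with"], ["with", "(", "item", ")", ":"]).name)   # with_parens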

isidentical avatar Nov 01 '20 10:11 isidentical

I will answer later, but maybe you can answer something about CPython's PEG grammar. I don't really understand why the tilde in the following code is necessary. I have read the PEP about it and understand that it's usually used to avoid considering alternatives. But in this case there are no alternatives, so why is it there?

listcomp:
    | '[' named_expression ~ for_if_clauses ']'

davidhalter avatar Nov 02 '20 23:11 davidhalter

I presume you found it on docs.python.org, which is the stripped version of the official grammar. That rule actually has two alternatives (one is only used for error reporting in cases like [a for a in *b]), but since it is an error-only alternative it got stripped away; see the original rule. The reason the tilde is required there is that if we already parsed named_expression, the invalid_ alternative is no longer needed, since its only purpose was to check whether the iterable is a starred item or not. So if the parser is able to recognize named_expression, it can continue forward without even considering the alternatives, and if it then fails, it will raise a normal syntax error.
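
To illustrate the effect in a toy, hand-written recursive-descent form (made-up token names and helpers, not pegen's generated code):

class Parser:
    # Tiny backtracking helper over a flat token list.
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def mark(self):
        return self.pos

    def reset(self, pos):
        self.pos = pos

    def expect(self, token):
        if self.pos < len(self.tokens) and self.tokens[self.pos] == token:
            self.pos += 1
            return True
        return False

def parse_listcomp(p):
    start = p.mark()
    if not p.expect('['):
        p.reset(start)
        return None                 # the caller may still try other alternatives
    if not p.expect('NAME'):        # stand-in for named_expression
        p.reset(start)
        return None                 # still before the '~': normal backtracking
    # The '~' point: committed. The error-only alternative is never retried,
    # so any failure below surfaces as a plain SyntaxError right away.
    if not p.expect('FOR_IF') or not p.expect(']'):
        raise SyntaxError("invalid list comprehension")
    return "listcomp"

print(parse_listcomp(Parser(['[', 'NAME', 'FOR_IF', ']'])))    # listcomp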

isidentical avatar Nov 03 '20 08:11 isidentical

Thanks! I was indeed looking at the wrong file. I was aware of the other file, but I thought the docs.python.org one was more readable, so I mostly looked at that. Thanks for the explanation.

About the proposed patch: I'm currently trying a technique in Rust that might work for us as well. I will keep you posted.

davidhalter avatar Nov 05 '20 17:11 davidhalter

About the proposed patch: I'm currently trying a technique in Rust that might work for us as well. I will keep you posted.

:eyes:

isidentical avatar Nov 05 '20 22:11 isidentical

The thing that makes backtracking harder for us is error recovery

I was reading a paper about error recovery for PEG parsers, especially the ones used by IDEs for auto-completion, and noticed that I was missing an important aspect regarding backtracking on ambiguities in my fork of parso's LL(1) parser. I believe that with something similar to Pegen's commit (~) operator, the problem I mentioned in my earlier comment can be resolved easily.

isidentical avatar Dec 10 '20 16:12 isidentical

status update: PEP 634 accepted with soft keywords

isidentical avatar Feb 18 '21 16:02 isidentical

Unfortunately, yes. :)

If we don't have a good solution for PEG/LL(*) parsing in autumn we can maybe also use error recovery. It could work well enough.

davidhalter avatar Feb 19 '21 19:02 davidhalter

Yeah, that is what I kind of used (not entirely). Do you have any opinions regarding my initial approach? I'm thinking about also introducing a ~ operator to our grammar, so that we can implement the multiline with statement.

isidentical avatar Feb 19 '21 19:02 isidentical

Give me one more month. I'm pretty far with my approach and would then show it to you.

davidhalter avatar Feb 19 '21 21:02 davidhalter

@isidentical I'm a bit confused by how the new Grammar uses lookaheads. Negative lookaheads seem to be used in some places for speedups (maybe like 2-3 times), but in all other places they are pretty much useless and might be omitted. Do you happen to know why they are there?

davidhalter avatar Feb 27 '21 15:02 davidhalter

Can you show like a concrete example?

isidentical avatar Feb 27 '21 20:02 isidentical

I realized that again I wasn't reading https://github.com/python/cpython/blob/master/Grammar/python.gram properly. I usually read https://docs.python.org/3.10/reference/grammar.html first to try to understand the grammar somewhat and then refer to the official grammar to see if I understood it right, but in this case I didn't see that most ! lookaheads have an effect on raising SyntaxError.

davidhalter avatar Feb 27 '21 21:02 davidhalter

@isidentical I'm doing my first few benchmarks with my new Rust Parser. I'm using a 2 MB file that I repeat ten times. So 20 MB of code. Generally parso is like 30 times slower than both the CPython Parser and my Rust parser:

Type                                   Time          Peak Memory Usage
Parso                                  85s           0.85 GB
Python 3.9 parser.suite                2.9s          1.35 GB
Python 3.9 -X oldparser ast.parse      12s           1.48 GB
Python 3.9 ast.parse                   12.6s         1.93 GB
Python 3.9 -X oldparser compile exec   19s           1.48 GB
Python 3.9 compile exec                19s           1.46 GB
My Rust Parser                         2.6s 2.05s    0.185 GB

Numbers are not extremely precise. Just there for a general tendency.
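
For reference, a rough way such numbers could be reproduced (not the exact script I used; the file name and repetition factor are placeholders, and each parser should really be measured in its own process):

import ast
import resource
import time

import parso

source = open("big_file.py").read() * 10     # placeholder: a ~2 MB file repeated ten times

start = time.perf_counter()
parso.parse(source)
print("parso:     ", round(time.perf_counter() - start, 1), "s")

start = time.perf_counter()
ast.parse(source)
print("ast.parse: ", round(time.perf_counter() - start, 1), "s")

# Peak RSS of this process (kilobytes on Linux, bytes on macOS).
print("peak memory:", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)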

I'm a bit confused by the memory usage. Can you explain this? Especially ast.parse seems to be pretty extreme. 20 MB of code should IMO not take up 1.93 GB of space.

Everything else seems to make sense. AST generation seems to take a lot of time though. My Rust parser still has some space for user definable stuff on the nodes, so that's essentially 250 MB that could be reduced even more if the nodes were compressed properly. I guess <200 MB would be feasible for a full syntax tree for 20 MB of code. It's especially interesting that a pure Python version (parso) of a syntax tree uses less RAM than the CPython version (however it's also extremely slow, not sure how much interning of values is going on there).

Is there a way to test the pure speed of the CPython PEG parser? I'm not interested in the AST stuff.

davidhalter avatar Mar 08 '21 00:03 davidhalter

I'm a bit confused by the memory usage. Can you explain this?

The difference seems quite big, and the only thing I can think of is the memoization cost, though it shouldn't cause that much of a difference. There is a way to ~dump memoization stats, but it is not really explicit, though feel free to try (data/xxl.py is just a 100000-line file that consists of arithmetic stuff): https://github.com/python/cpython/blob/bbba28212ce0f58096a4043f32442c6e727b74fc/Tools/peg_generator/Makefile#L66-L68

Is there a way to test the pure speed of the CPython PEG parser? I'm not interested in the AST stuff.

Since the parser's only output is the AST, I don't think there is an easy way to see that speed. The only overhead that I see between those is the serialization phase, which shouldn't take much time. Though if you want to dive really deep, you could use a C profiler to profile the _PyPegen_parse function and get the actual time.

isidentical avatar Mar 08 '21 14:03 isidentical

Thanks for your comment! After an hour or two of optimizing I got it down from 2.9s to 2.6s and from about 250 MB to 185 MB. It will probably stay there for a while. I could reduce a tree node further to get to 150 MB and then compress it to 120 MB. Not sure if that's all worth it, so I won't do it for now. A tree with ten million nodes would then use 85 MB RAM + 20 MB code + program. Pretty cool :). As a comparison: The Python standard library has ~10 MB of source code.

I'm almost at the place where I can show my concepts to you. I feel like they would be pretty nice for parso as well, but they might be a bit of work.

And I'm definitely too lazy to profile _PyPegen_parse, but good hint :).

davidhalter avatar Mar 10 '21 00:03 davidhalter

@davidhalter It's interesting that you say that you're working on a Rust parser. I've slowly been working on porting bits of LibCST to Rust for better performance (starting with https://github.com/Instagram/LibCST/pull/452).

I've currently got a 1:1 port of CPython's tokenizer.c (code isn't published yet, but I can put it up if you want to look at it) to Rust, and I'm working on adding support for Parso's f-string splitting behavior. It doesn't currently support any sort of error recovery. I don't know if that would be useful to you. It'd be cool if we could continue to share some infrastructure.

I haven't started on trying to move the parser to Rust yet (though Python 3.10 gives it some urgency), but I've been considering rust-peg for the job. IMO, parsing between Parso and LibCST is different enough (LibCST doesn't need error recovery or incremental parsing) that I think it's fine if we diverge there.

bgw avatar Mar 10 '21 00:03 bgw

It'd be cool if we could continue to share some infrastructure.

I have also been enjoying working with you. However: I'm considering releasing the software as commercial software, so that will probably have to wait for a bit. :) It will probably also be released open source and for free for small companies and individuals, but your company obviously isn't part of that group :).

davidhalter avatar Mar 10 '21 08:03 davidhalter

I'm considering releasing the software as commercial software, so that will probably have to wait for a bit.

Up to you; commercial software can pull in PSF- or MIT-licensed dependencies! I've benefited immensely from your open-source work. It's totally fine if you want to take any of my code and use it in a commercial project, and it's okay if you choose not to share changes back. I don't really see Jedi/Parso and Fixit/LibCST as competing products: LibCST makes no attempt to target as-you-type IDE use-cases.

It sounds like you already have a tokenizer, but if you're interested, here's my Rust tokenizer implementation. It pretty closely matches the behavior of Parso's tokenizer, except that it doesn't have any error recovery. https://github.com/bgw/LibCST/blob/native/tokenizer/libcst_native/src/tokenize/core/mod.rs

Best of luck on your possible commercial software endeavor. A faster and more capable version of Parso/Jedi would be a no-brainer for lots of companies.

bgw avatar Mar 14 '21 09:03 bgw

@bgw Thanks a lot for sharing. I'm really interested in comparing the two tokenizers in terms of performance. I might not do it now, but I will definitely compare it eventually.

davidhalter avatar Mar 14 '21 12:03 davidhalter

For what it's worth, here is a (~decent-looking) Python parser written in Rust that I stumbled upon a few days ago: https://github.com/ProgVal/rust-python-parser/

isidentical avatar Apr 10 '21 16:04 isidentical

@isidentical, I've seen that. It does look decent, except:

  • It's GPLv3, which would require re-licensing anything that'd want to build on top of it. (Or someone would have to ask the author of that crate if they'd be willing to re-license their code)
  • Parsing and tokenization must happen in the same pass. Given how many tokenizer hacks Python has, that might make certain edge cases hard to handle, and any design that differs from CPython's architecture could make it harder to support future syntax changes.
  • It uses the nom parser combinator library. Nom looks like a very impressive and well-maintained library, but there's some work (and room for mistakes) involved in converting the grammar from PEG format to something that will work with nom's combinators. I'm looking at rust-peg for this reason.
  • Even if we chose to use that as a starting point, it'd still be a lot of work to adapt that codebase to a lossless syntax tree like Parso and LibCST use.

If you're looking for a Python AST parser in Rust, you should also consider rustpython_parser. They do tokenization in a separate pass. They use lalrpop though, which may not be able to support the Python 3.10 match syntax or other future syntax features that depend on PEG parsing.

I started by trying to wrap RustPython's lexer for LibCST, but I wanted the Rust tokenizer to match with our Python tokenizer's behavior exactly so that it could be a drop-in replacement. There were enough differences that it ultimately ended up being easier for me to write my own implementation instead.

bgw avatar Apr 11 '21 06:04 bgw

@bgw I agree. I'm not a big fan of parser combinators for language grammars; I like proper BNF grammars much more. Nom is definitely a good library, but I think language parsing is not what it's good for. I'm still a big fan of LL parsing because of its mathematical beauty :) It also lends itself better to error recovery than LR parsing; I have never really understood why people would prefer the latter.

@isidentical I have given you access to my private project. Feel free to accept it. It seems like it took me a bit too long to write this :). I will, however, quickly summarize how I got it working and what I see as the ideal way forward for parso. This does not mean that your approach is bad at all; it just does more at runtime than it necessarily needs to. And I would suspect that the approach you chose would be a bit slower than mine. In general, however, I would argue that our solutions are pretty similar: we both have a way to define soft keywords and then backtrack. The big difference is really where we calculate the "alternative" plans.

My approach does pretty much the following:

  1. Normal LL parsing whenever possible, so all LL optimizations actually work.
  2. It handles direct left recursion by pushing something onto the stack (this is not important for parso for now, so I won't explain it here).
  3. For PEG-like rules whose alternatives start with the same "first token" (see below, where bar also starts with "hi"), we always start by matching the first alternative (in this case "hi" "baz"). If it does not work out, we just continue with the second "branch". This works by saving a backtracking point in the tokenizer and in the tree (see the sketch after the example below). It's IMO not that hard, and the position where this occurs can be precalculated.
foo: "hi" "baz" | bar "foo"
bar: "hi"+

What do you think about this approach? It is IMO interesting, because it does not need to calculate the superior_plan (that's what you called it in your patch) at runtime. It does it at "parser create time". :) It's a small optimization, but I think it's worth it. What do you think?

Please also let me know if you don't understand what I just tried to explain. It's pretty hard to explain it well and I don't feel like having done a good job. :)

davidhalter avatar Apr 11 '21 21:04 davidhalter

What do you think about this approach? It is IMO interesting, because it does not need to calculate the superior_plan (that's what you called it in your patch) at runtime. It does it at "parser create time". :) It's a small optimization, but I think it's worth it. What do you think?

I think it makes sense, especially considering that my approach was more of a brute force than a proper calculation.

isidentical avatar Apr 13 '21 18:04 isidentical

Do you want to implement it with a different approach? I wouldn't be too unhappy, since I've already implemented it in another language :)

davidhalter avatar Apr 15 '21 16:04 davidhalter

Do you want to implement it with a different approach?

Nope, feel free to go with the pre-calculation. I might try to give it a shot, but I'm pretty stacked up until next month, so if anybody is up for the implementation, that would be great!

isidentical avatar May 03 '21 21:05 isidentical

Hello. Do we have a plan for this feature needed for full Python 3.10 support?

frenzymadness avatar Oct 26 '21 06:10 frenzymadness

Currently no. I know how this would work with an LL grammar and I'm happy to help people work on it, but for now I have other priorities. For Jedi this is mostly not really an issue, since autocompletions/goto will still work. The error listing is probably broken in 3.10 unfortunately. For everyone who wants to parse 3.10 grammar and work with that, I guess you are out of luck at the moment.

davidhalter avatar Oct 26 '21 19:10 davidhalter

@davidhalter Is there a possibility to sponsor this feature?

stnley avatar Jul 02 '22 22:07 stnley

Yes. Just note that this is a few days of work (might even be weeks), so giving me $5 is not going to cut it :)

davidhalter avatar Jul 03 '22 07:07 davidhalter

Totally understandable - if you don't mind I'll reach out via email?

stnley avatar Jul 03 '22 20:07 stnley

Yes, sure.

davidhalter avatar Jul 04 '22 08:07 davidhalter

@stnley Did my email answers reach you? I'm just rechecking here.

davidhalter avatar Aug 04 '22 12:08 davidhalter