
Bring Python3 grammar up to date with current Spec grammar

Open kaby76 opened this issue 1 year ago • 19 comments

This issue tracks requests for updating the Python grammar to the latest standard grammar, i.e., the CPython grammar at https://docs.python.org/3/reference/grammar.html.

  • Missing match_stmt. https://stackoverflow.com/q/74563209/4779853. A description is here.
  • https://github.com/antlr/grammars-v4/issues/2462

Note, our current grammar at python/python3 is significantly different from the Spec. These differences should be minimized, with comments explaining how and why rules were refactored from the Spec. Without this information, we cannot determine whether the refactoring was correct or not.

kaby76 · Nov 25 '22

I do not know why, but replies to emails are not getting posted in the issue thread. I have noticed this several times before, on different repos. GitHub has a serious bug here, which can negatively affect all development.

Please do not reply via email to an issue thread. Go to the webpage, and manually enter it on the page.

https://github.com/community/community/discussions/3287 https://www.google.com/search?q=email+replies+are+not+posted+in+issue+thread+github&rlz=1C1NDCM_enUS742US742&oq=email+replies+are+not+posted+in+issue+thread+github&aqs=chrome..69i57j69i64.13320j0j4&sourceid=chrome&ie=UTF-8

Specifically, this was written by @RobEin:

Maybe you could try this: https://github.com/RobEin/ANTLR4-Python-grammar-by-PEG It has some bugs, but it basically works. Unfortunately, I don't have time to develop it further at the moment.

For sure, I will be looking over your grammar to examine the rewrites you made to get a functional grammar. But, from here, I am working out automated steps to scrape the grammar directly from the text at https://github.com/python/cpython/tree/3.11/Grammar

I have an Antlr4 grammar that reads Pegen grammars here. Using that and Trash, it's easy to get the grammar in Antlr4 syntax minus changes for semantic predicates.

kaby76 · Nov 29 '22

Please do not reply via email to an issue thread. Go to the webpage, and manually enter it on the page.

I agree. Also, it can insert unwanted excess quoting of previous messages. Honestly, I have never tried answering via email.

KvanTTT · Nov 30 '22

I did not reply by e-mail. I wrote a response from the webpage, but I deleted it a few minutes later. That was all.

Thank you for looking over my parser; I would be very happy to find the mistake. The Antlr4 grammar for Pegen is a great idea. I thought about something like that, but in the end I wrote a PEG-to-ANTLR4 translator in Java using some regex replacements. I know it's not a nice solution, but it works. I suspect the mistake is here; at least, I couldn't figure out a good translation for the following PEG rules:

attr:
    | name_or_attr '.' NAME 
name_or_attr:
    | attr
    | NAME

Equivalent ANTLR4 rules (wrong translation):

attr
    : name_or_attr '.' NAME;
name_or_attr
    : name_or_attr '.' NAME
    | NAME;

RobEin · Dec 05 '22

My scraper is progressing. I had significant changes to complete in Trash in order to make these scripts run fast. The current representation of parse trees in Antlr is probably the worst way to represent a tree for tree rewriting. I changed the representation of parse trees in Trash to not use token and char streams--which have explicit, constant start and stop indices into buffers--and instead hang tokens and off-channel text directly off the tree nodes. Thus, when a node is modified, there is no recomputation of start and stop indices for Intervals and tokens.

  • The "Doc" version of the Python grammar is posted at https://docs.python.org/3/reference/grammar.html, but the actual grammar in Pegen syntax (Python's PEG parser generator) is in a different location, https://github.com/python/cpython/blob/9d2dcbbccdf4207249665b58c8bd28430c4f7afd/Grammar/python.gram. The first challenge is just to make sure the two grammars agree. So, I wrote a "strip" script to remove the Pegen syntax that does not appear in the Doc grammar.
  • The Doc version of the grammar has some of the lookahead stripped out, but not all. It seems this is done inconsistently by the Doc or Python compiler folks when writing the Doc grammar. Whether they do it with a stripping tool or by hand, I cannot say. For example, simple_stmt is listed in the Doc version of the grammar without the positive lookahead &'return', but the lookahead is left in for other rules, like &(';' | NEWLINE) in rule del_stmt. Lookahead starts with either the !- or the &-operator. In either case, I think this lookahead is unnecessary for Antlr4, because AdaptivePredict() should choose the right alternative when offered a choice, as in simple_stmt. The only question I have is whether lookahead past the alt, such as in del_stmt, will be required in some manner. Antlr does not do backtracking.
# &e
#   Succeed if e can be parsed, without consuming any input.
# !e
#   Fail if e can be parsed, without consuming any input.
  • The ~-operator is also not needed for Antlr4, but the expressions that follow the operator must all be kept.
# ~
#   Commit to the current alternative, even if it fails to parse.
  • The "forced atom" syntax &&e is similar to the ~-operator. I do not understand the difference, but my guess is that for conversion to Antlr4 syntax the && can simply be deleted as well, keeping e.
  • Approximately 20% (46/229) of the rules are garbage "invalid_" rules (trparse -d pegen ../examples/python-11.gram | trxgrep ' //rule_/rulename/name/NAME/text()' | grep invalid_ | wc). These rules are a terrible kludge for error recovery. It would have been better if the compiler writers had just implemented a good error-recovery package. The scraper simply deletes all these rules and the references to them.
  • The Pegen meta grammar is here. It appears that version 3.11 of the meta grammar is not needed for version 3.11 of the Python grammar. I converted the grammar to Antlr4, available here. One thing I found particularly annoying is that, while the syntax accepts EBNF, the Pegen meta grammar chooses to use recursion instead of the Kleene operator, e.g., in the rules rule.
  • Stripping out the unnecessary Pegen syntax results in a stripped grammar essentially identical to the Doc version. The stripper script is shown below; it mostly calls Trash commands to parse, delete, and replace elements in the parse tree of the grammar, but it has a grep and a sed at the end, since XPath does not have a nice way to match a node together with a sibling. The scraper script contains additional changes for Antlr syntax. There is no XSLT engine yet, so the modification script contains only simple deletes, inserts, and replaces.
#
# "Setting MSYS2_ARG_CONV_EXCL so that Trash XPaths do not get mutilated."
export MSYS2_ARG_CONV_EXCL="*"
date
trparse -d pegen ../examples/python-11.gram > orig.pt
cat orig.pt \
	| trdelete '//meta/(AT | name | string | newline)' \
	| trdelete '//rule_[rulename/name/NAME/text()="invalid_kvpair" or
	rulename/name/NAME/text()="invalid_arguments" or
	rulename/name/NAME/text()="invalid_kwarg" or
	rulename/name/NAME/text()="expression_without_invalid" or
	rulename/name/NAME/text()="invalid_legacy_expression" or
	rulename/name/NAME/text()="invalid_expression" or
	rulename/name/NAME/text()="invalid_named_expression" or
	rulename/name/NAME/text()="invalid_assignment" or
	rulename/name/NAME/text()="invalid_ann_assign_target" or
	rulename/name/NAME/text()="invalid_del_stmt" or
	rulename/name/NAME/text()="invalid_block" or
	rulename/name/NAME/text()="invalid_comprehension" or
	rulename/name/NAME/text()="invalid_dict_comprehension" or
	rulename/name/NAME/text()="invalid_parameters" or
	rulename/name/NAME/text()="invalid_default" or
	rulename/name/NAME/text()="invalid_star_etc" or
	rulename/name/NAME/text()="invalid_kwds" or
	rulename/name/NAME/text()="invalid_parameters_helper" or
	rulename/name/NAME/text()="invalid_lambda_parameters" or
	rulename/name/NAME/text()="invalid_lambda_parameters_helper" or
	rulename/name/NAME/text()="invalid_lambda_star_etc" or
	rulename/name/NAME/text()="invalid_lambda_kwds" or
	rulename/name/NAME/text()="invalid_double_type_comments" or
	rulename/name/NAME/text()="invalid_with_item" or
	rulename/name/NAME/text()="invalid_for_target" or
	rulename/name/NAME/text()="invalid_group" or
	rulename/name/NAME/text()="invalid_import_from_targets" or
	rulename/name/NAME/text()="invalid_with_stmt" or
	rulename/name/NAME/text()="invalid_with_stmt_indent" or
	rulename/name/NAME/text()="invalid_try_stmt" or
	rulename/name/NAME/text()="invalid_except_stmt" or
	rulename/name/NAME/text()="invalid_finally_stmt" or
	rulename/name/NAME/text()="invalid_except_stmt_indent" or
	rulename/name/NAME/text()="invalid_except_star_stmt_indent" or
	rulename/name/NAME/text()="invalid_match_stmt" or
	rulename/name/NAME/text()="invalid_case_block" or
	rulename/name/NAME/text()="invalid_as_pattern" or
	rulename/name/NAME/text()="invalid_class_pattern" or
	rulename/name/NAME/text()="invalid_class_argument_pattern" or
	rulename/name/NAME/text()="invalid_if_stmt" or
	rulename/name/NAME/text()="invalid_elif_stmt" or
	rulename/name/NAME/text()="invalid_else_stmt" or
	rulename/name/NAME/text()="invalid_while_stmt" or
	rulename/name/NAME/text()="invalid_for_stmt" or
	rulename/name/NAME/text()="invalid_def_raw" or
	rulename/name/NAME/text()="invalid_class_def_raw" or
	rulename/name/NAME/text()="invalid_double_starred_kvpairs"]' \
	| trdelete '//action' \
	| trreplace '//attribute' ' ' \
	| trreplace '//attribute_name' ' ' \
	| trreplace '//lookahead' ' ' \
	| trdelete '//memoflag' \
	| trdelete '//forced_atom/AMPER' \
	| trdelete '//alt[items/named_item/item/atom/name/NAME[contains(text(),"invalid_")]]' \
	| trdelete '//VBAR[not(following-sibling::*)]' \
	| trtext \
	| grep -E -v '^[ \t]+[|]$' \
	| sed "s#default : '=' expression |#default : '=' expression#" \
	> stripped.gram
date
  • I haven't yet looked at the mutual left recursion in [attr, name_or_attr], but the refactoring seems correct. In general, to remove mutual left recursion, rules are unfolded into the other rules participating in the recursion. The only choices are which rules to unfold into and how to reorder the alts.
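
For illustration, one possible unfolding of that pair (a sketch; it matches the refactoring shown later in this thread):

// Pegen (mutually left-recursive):
//   attr:         name_or_attr '.' NAME
//   name_or_attr: attr | NAME
// Unfold one rule into the other, eliminate the resulting direct left
// recursion with a Kleene operator, and substitute back:
attr
    : NAME ('.' NAME)+
    ;
name_or_attr
    : NAME ('.' NAME)*
    ;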

kaby76 · Dec 16 '22

I found a few errors and fixed them. It looks like it's parsing properly now. https://github.com/RobEin/ANTLR4-Python-grammar-by-PEG

However, it is still very slow.

RobEin · Jul 31 '23

I am happy to announce that I have fixed the speed issues. There were a lot of redundancies in the grammar that came from the structure of the CPython grammar, mainly because of the actions on alternatives and the 'invalid_' rules. I eliminated these redundancies, and now parsing runs at an acceptable speed. One redundant rule example among many:

// from the official PEG grammar
if_stmt:
    | 'if' named_expression ':' block elif_stmt 
    | 'if' named_expression ':' block [else_block]
elif_stmt:
    | 'elif' named_expression ':' block elif_stmt 
    | 'elif' named_expression ':' block [else_block]


// the translated redundancy free ANTLR4 grammar
if_stmt
    : 'if' named_expression ':' block (elif_stmt | else_block?)
    ;
elif_stmt
    : 'elif' named_expression ':' block (elif_stmt | else_block?)
    ;

ANTLR4-Python-parser-by-PEG. I hope I can get it pulled into antlr/grammars-v4 soon.

RobEin · Sep 14 '23

That sounds great. Make sure to put that into https://github.com/antlr/grammars-v4/tree/master/python/python3. In particular:

  • Update each of the targets. This is important so we can run the performance tools and compare parsing correctness across targets. It might illuminate bugs in the runtimes.
  • Update the readme https://github.com/antlr/grammars-v4/tree/master/python/python3#readme with up-to-date information. Please try to follow the format used in the plsql grammar: https://github.com/antlr/grammars-v4/tree/master/sql/plsql#readme
  • If you add your own test.py program, please put it in some directory other than Python3/; otherwise it will interfere with the generated template driver for testing in the grammars-v4 repo.
  • I cleaned up a bit of the python/ grammars a few months ago, but I left some of the cleanup "to do". For example, python/python2-js should be merged into python/python2/. I'm still unclear what "tiny-python/" is all about. I'm not really happy about having a merged "python/python/" grammar, because Python3 is not really compatible with parsing Python2, but someone went ahead and made this merged grammar anyway. Not only is it a false language, but it's out of date with the latest Python3 and just adds more unnecessary maintenance.

kaby76 · Sep 14 '23

Thanks for the instructions. A C# target is in progress, but I'm afraid I won't have time for the other targets these days: C++, Dart, Go, JavaScript, TypeScript.

Instead, if I may suggest the following directory structure:

grammars-v4
   python
      python2
         python2_7_13
            target_JavaScript   (currently: grammars-v4/python/python2-js)
            target_Python       (currently: grammars-v4/python/python2)
         python2_7_18           (currently: RobEin/ANTLR4-parser-for-Python-2.7.18)
      python3
         python3_6              (currently: grammars-v4/python/python3)
         python3_8_17           (currently: RobEin/ANTLR4-parser-for-Python-3.8)
         python_PEG             (currently: RobEin/ANTLR4-Python-parser-by-PEG)
      python_other
         2_and_3_universal      (currently: grammars-v4/python/python)
         tiny_python            (currently: RobEin/tiny-python)

I recommend the directory name "python_PEG" because, according to my plans, as soon as a new Python PEG grammar comes out I would immediately translate it to ANTLR4, so there would be no broken links caused by a python_version_number directory name.

Tiny Python is really not a living language; you can easily delete it. (However, it might be useful for educational purposes, similar to Tiny C.)

RobEin · Sep 14 '23

No problems with the ports. When the PR is merged, I can pick it up from there. Generally I use ChatGPT to come up with an initial version for other targets. It usually gets the translation wrong, but close enough to complete.

The directory structure looks fine. I don't care for the current layout because it is not clear what version the grammar is trying to support or what was added. This solves that problem. And adding a version to the directory name dissuades people from adding little bits and pieces of the next version of the language.

kaby76 · Sep 14 '23

Ok, so first I update my own repository with new Python versions. How often should I update the grammars-v4 repository?

Instead of the directory name "python_PEG", is the name Python_version recommended? e.g: python3_11_5

And if a new grammar comes out, e.g. 3.12.1, should the previous (python3_11_5) directory remain or be deleted?

  • if the old directories are removed, it creates dead links all over the web
  • if the old directories remain, sooner or later many directories will accumulate in the grammars-v4/python repository

I don't know which is the better choice.

RobEin · Sep 15 '23

How often should I update the grammars-v4 repository?

That's up to you. None of us have any obligation to work on this stuff. It's FOSS, and transactional.

But, I do plan to look over your grammar and create a list of transforms you made to take the Spec grammar to the Antlr4 grammar. The goal would be to automate this process using Trash.

is the name Python_version recommended? e.g: python3_11_5

That sounds fine.

And if a new grammar comes out, e.g. 3.12.1, should the previous (python3_11_5) directory remain or be deleted?

It would be best to remove older versions. But that depends on what others do to the grammar after you create the initial version. For sure it will be a maintenance issue if we then have to apply a change across all the different versions.

kaby76 · Sep 15 '23

An automatic translation would be great. The translated ANTLR4 grammar mirrors the PEG grammar, i.e., every rule starts on the same line in both grammars.

The ANTLR4 rules where the semicolon is at the end of the line were translated automatically:

return_stmt
    : 'return' star_expressions?;
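
For reference, the corresponding Pegen rule for this case (with the attribute, binding, and action stripped) is roughly:

return_stmt:
    | 'return' [star_expressions]

so, for this rule, the automatic translation essentially just merges the alternative onto one line, rewrites [x] as x?, and appends a semicolon.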

And where the semicolon is on its own line, I made manual changes:

attr
    : NAME ('.' NAME)+
    ;
name_or_attr
    : NAME ('.' NAME)*
    ;

Except for the previous two rules, all rules whose semicolon is on its own line were translated manually, in order to remove redundancy.

I used regex replacements for the automatic translation, which completely translates the current PEG grammar except for the two rules mentioned above. However, the resulting grammar is very slow due to the redundancies. I would also like to mention that the &e and !e expressions in the PEG grammar were not translated. Except for one, I left all of them out of the translation because they do not consume tokens and, in my opinion, are only there to increase speed in the current PEG parser, whereas the ANTLR4 grammar is slowed down by many semantic predicates. The one exception:

// This is the one lookahead that was kept: "_" is the wildcard pattern, so it must not be captured as a NAME here.
pattern_capture_target
    :  {self.isnotEqualCurrentTokenText("_")}? NAME;

If you are interested, I can give you the code of the regex translator.

RobEin · Sep 15 '23

However, a parser-based transpiler would be much more reliable. With regexes, a space in the wrong place or a line break is enough for the pattern matching to stop working properly. And its maintenance is quite difficult.

RobEin · Sep 15 '23

Yes, regex rules for converting text are not very reliable. Trash works with the parse tree, using XPath rules to describe the nodes of the tree that need to be deleted/replaced/moved. Of course, it would suffer from the same problem as regex pattern matching if a rule changed so much as to be unrecognizable. But we could work around this by testing the grammar to make sure the pattern matches before actually doing the modification.

kaby76 · Sep 15 '23

I did a diff between your ".peg" file and the official CPython grammar "python.gram", and there seem to be some differences. (I wouldn't worry about it yet, as your new grammar will be better than the current python/python3 grammar.)

Here is what I did:

git clone https://github.com/python/cpython.git
git clone https://github.com/RobEin/ANTLR4-Python-parser-by-PEG.git
cd cpython
git checkout v3.11.5
cd ..
diff ANTLR4-Python-parser-by-PEG/Python3_11_5_official_grammar.peg cpython/Grammar/python.gram > out.txt

out.txt

Many of the diffs are with attributes in the grammar, e.g., file[mod_ty]: a=[statements] vs. file: [statements] ENDMARKER
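
For concreteness, the kind of rewrite involved is roughly the following (a sketch; the action body between braces is elided):

# .gram form: type attribute, named binding (a=), and action
file[mod_ty]: a=[statements] ENDMARKER { ... }
# Doc / stripped form
file: [statements] ENDMARKER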

I do have the meta grammar in an Antlr4 grammar here and here. It can parse the python.gram file. I haven't tried stripping the attributes for a good grammar comparison.

The plan is to add the pegen meta grammar as is to the grammars-v4 repo so we can automate the scraping. There was some reason why I didn't do that a while ago, likely because I didn't quite finish porting the grammar for all the fringe cases. But python.gram itself doesn't use those features of the meta syntax, so the pegen meta grammar should be usable as is.

kaby76 · Sep 16 '23

I've created a PR for adding a version of the grammar for Pegen to grammars-v4. https://github.com/antlr/grammars-v4/pull/3711

kaby76 · Sep 16 '23

My ".peg" grammar comes from here, which is already stripped from python.gram. But so far there has always been some mistake in it. For example, an invalid rule name (invalid_default) was left in the current one: default: '=' expression | invalid_default

It didn't occur to me that the automatic translation should/can be done directly from "cpython/Grammar/python.gram". It's a good idea because, as I mentioned, there has always been some error in the stripped-down grammar.

I will study the out.txt. Some of the differences may also arise from the fact that my ".peg" grammar is version 3.11.5, while the current python.gram is 3.13.0 alpha 0. For example, this file already contains a 3.13.0 feature which my parser cannot parse: cpython/Lib/typing.py. However, I also have an experimental 3.13.0 version which parses it properly.

RobEin · Sep 16 '23

I fixed several problems with Trash and extended the script to read the .gram file and output a grammar similar to the one in the documentation. The doc and the .gram files (both version 3.11.5) seem to be almost identical. One type of difference is the lookahead expressions (the &- and !-operators), which appear inconsistently in the doc. I pointed this out in an Issue long ago, but no one addressed it.

python.txt derived.txt doc.txt strip-python.txt

trparse python.gram | bash strip-python.txt | trtext > derived.txt

There are still some issues with the Trash toolkit and the stripping script, but I'm working to correct them.

kaby76 · Sep 19 '23

Your Trash transformation cleaned up the CPython grammar nicely. I also did a diff between your filtered grammar (derived.txt) and the officially filtered grammar. They are almost the same. It is indeed strange that the production CPython grammar has fewer lookahead expressions than the officially filtered grammar. (???)

At the time, I just looked over a couple of these lookahead expressions and suspected that they had no grammatical significance. I didn't study all the expressions, but rather made an experiment and removed all of them from my grammar, except for one lookahead expression. There were about 40 lookahead expressions, implemented in my ANTLR4 grammar as semantic predicates. I also tried it just to see how much the parsing speed changes. The speed improved and, strangely enough, no parsing errors occurred with any Python source code. So it seems that the lookahead expressions have no grammatical role; I think they are only there to increase parsing speed. This is an empirical conclusion, which of course may not be 100% sound. That's why I posted a question about it today on a Python forum. I hope we get a satisfactory answer.
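
For illustration, the experiment amounts to rewrites of this kind (a sketch, using the del_stmt lookahead mentioned earlier in the thread; the lookahead is simply dropped rather than translated into a semantic predicate):

// Pegen: del_stmt: 'del' del_targets &(';' | NEWLINE)
// ANTLR4 rendering with the lookahead dropped:
del_stmt
    : 'del' del_targets
    ;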

RobEin · Sep 19 '23