jinja icon indicating copy to clipboard operation
jinja copied to clipboard

Create a formal definition of Jinja's grammar, refactor the parser and compiler

Open davidism opened this issue 4 years ago • 17 comments

Every time there's an issue that requires looking at the lexer and parser, I worry due to the amount of complexity and under-documented behavior. The lexer depends on a bunch of very complex regular expressions. The parser sometimes seems to overlap in functionality. The lexer's state machine is mostly a giant set of nested for-loops in a single function.

There is documentation about the template syntax, but it's informal, it implies the grammar without actually saying what the rules are. There are tests for the lexer, but not all its branches. Even the entire test suite managed to miss something with the recent issue #1138 about stripping whitespace.

For long term maintainability, I'd feel much more confident if there was a document about the grammar to refer to, as well as test coverage for all its behavior. I'm not familiar with writing grammars, so I'd appreciate help. I'd lean towards the same grammar syntax that Python uses.

davidism avatar Apr 16 '20 15:04 davidism

PEP 617, the new PEG parser in Python 3.9, talks about some similar issues, such as overly broad rules in the parser that are then limited in the compiler.

davidism avatar Apr 28 '20 16:04 davidism

Assigned @kevin-brown since he's been working on this in a branch: https://github.com/pallets/jinja/tree/GH-1194-formal-grammar

Others are still welcome to discuss or work on this too, it's an exploratory issue right now.

davidism avatar May 23 '20 19:05 davidism

@kevin-brown Any specific reasons to go with tatsu? It seems to be only sporadically maintained and not as capable as some of the more popular python parsing solutions like lark

septatrix avatar Jun 09 '20 16:06 septatrix

We had extensive discussion about this in Discord, but I don't plan to add any external dependency for this. We'd want to modify or rewrite our lexer and parser to reflect what we learn here. We could use a library in tests to compare the AST our implementation generates to theirs, but otherwise need to keep the code in Jinja.

davidism avatar Jun 09 '20 16:06 davidism

Yes not adding an extra dependency sounds like a good idea if it can be prevented though isn't tatsu an external dependency? Or is the owner somehow related to the palletsproject?

septatrix avatar Jun 09 '20 16:06 septatrix

Kevin's just using that in tests and to help build the grammar, the branch I linked isn't a PR right now.

davidism avatar Jun 09 '20 17:06 davidism

Okay thanks for clarifying

septatrix avatar Jun 09 '20 19:06 septatrix

Have you considered using the parser from Python 3.9 (with a backport for older versions)? No idea if it's a good choice for this or not, just curious.

ThiefMaster avatar Jun 09 '20 19:06 ThiefMaster

It looks like Tatsu generates PEG parsers, which is what the new 3.9 parser is. I don't think we can directly use it though, it's specific to Python grammar, and the parser module is deprecated too. I am interested in PEG given the attention it's been getting though.

davidism avatar Jun 11 '20 16:06 davidism

Here's Python PEG definition: https://github.com/python/cpython/blob/main/Grammar/python.gram. The two things not explained in the block comment near the top are (I think I got these right): { ... } after an expression is a block of code to run with the result of the intial match, to see if it's valid; and n=e means assign the matched value to that name for use in { ... }.

davidism avatar Apr 19 '23 14:04 davidism

Some general notes on things we could work on. @awhetter and I discussed some of this at PyCascades.

  • Write down a big bullet list of all the Jinja syntax we know exists based on the docs. * name, example. Sublists for different variants.
  • A PEG is unambiguous, it tries each rule in order and takes the first match. Is the current parser ordered in all cases? I'm fairly sure it is, but if not then decide on an order.
  • Get a better understanding of what the current extension system can do. Is it possible to represent it in the grammar? Even as a maintainer I don't really understand the extension mechanism right now. Maybe there's a better or more powerful design, and one that works with type systems better.
  • We don't need to use a parser gernerator like Tatsu (which has runtime requirements) to generate the parser code from the grammar in Jinja itself, we should write our own parser still. However, we can use it to verify that the grammar we write is itself valid.
  • We may want to redesign the Node class and subclasses, they have some "magical" behavior that's not very type friendly right now.
  • There probably wont' be an attempt to keep the existing lexer, parser, AST, or compiler; it's not clear how it would be possible. How widespread are customizations at this level? What about extensions? If we can see popular existing uses, we can try to make the migration process easier. Maybe @mkrizek at Ansible could help answer that. Should also try to reach out to Salt and figure out other big projects using Jinja.
  • Start small, don't try to write all the grammar at once, don't try to parse everything at once.
  • Come up with example templates to try parsing.
  • At the end, templates should still render as they do today. Nothing about that should change (unless this uncovers existing bugs).

davidism avatar Apr 19 '23 15:04 davidism

  • A PEG is unambiguous, it tries each rule in order and takes the first match. Is the current parser ordered in all cases? I'm fairly sure it is, but if not then decide on an order.

When I did my initial attempt at this a few years ago, I determined that the parser was ordered in all cases. I suspect this hasn't changed much in the years since.

kevin-brown avatar Apr 24 '23 04:04 kevin-brown

  • Write down a big bullet list of all the Jinja syntax we know exists based on the docs. * name, example. Sublists for different variants.

At the PyCon US sprints I started working on this in a GitHub Gist. Not quite the bullet list format, but still parseable.

https://gist.github.com/kevin-brown/ec945b60024d35ed9c09ec70142af7fe

kevin-brown avatar Apr 26 '23 18:04 kevin-brown

I'm no expert in this subject but instead of tatsu, could this be done with pegen, the official peg parser generator for 3.10+? The previously linked grammar file for cpython was generated with pegen.

robcxyz avatar Apr 28 '23 06:04 robcxyz

You mean the parser was generated from the grammar file, not that the grammar file was generated. Jinja's refactor can't be done with any parser generator because I don't want to include generated code or runtime dependencies. The code needs to be debuggable and maintainable on its own. We can certainly look at the output of pegen, but the final product has to be plain Python that a user would write.

davidism avatar Apr 28 '23 08:04 davidism

There probably wont' be an attempt to keep the existing lexer, parser, AST, or compiler; it's not clear how it would be possible. How widespread are customizations at this level? What about extensions? If we can see popular existing uses, we can try to make the migration process easier. Maybe @mkrizek at Ansible could help answer that.

Regarding Ansible code base (which I can only comment on), there aren't any customizations on lexer, parser, AST or compiler level. Ansible does however make use of the Environment.lex method: https://github.com/pallets/jinja/blob/953acd65b2be5428482dcb60f5b8481b66252ac9/src/jinja2/environment.py#L621-L640

See https://github.com/ansible/ansible/blob/cdeb607b1d01af4d8dd652bf7117a442fcd9b229/lib/ansible/template/init.py#L113-L153 and https://github.com/ansible/ansible/blob/cdeb607b1d01af4d8dd652bf7117a442fcd9b229/lib/ansible/template/init.py#L175-L210


Not related but for completeness I'll note that Ansible customizes: ...
  • Environment, Context and Template classes
  • environment.concat function
  • a custom mapping class to represent available variables (instead of built-in dict).

mkrizek avatar May 10 '23 14:05 mkrizek

I'll keep that in mind. I do have a general goal of making the extension and introspection mechanisms more powerful and easier to use, so ideally it should be easier to do that sort of modification during parsing. I'm open to ideas on what that API might look like, although it's still way too early to know what it will really be.

davidism avatar May 10 '23 17:05 davidism