
Making a template engine with Lark

Open Xowap opened this issue 3 years ago • 5 comments

I'm trying to develop a small template engine with Lark. While I've contemplated making it myself with a bunch of regular expressions, I think it would make more sense to do it with Lark for maintainability.

The thing is, I want to be able to do something like this:

foo = __my.thing__

# IF my.thing
foo += 42
# ENDIF

Here the idea is __ is a bit like {{/}} in a Django/Jinja engine. Eventually, I want to interpolate __my.thing__ into the value of my.thing from the context (well, context['my']['thing']).
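A minimal sketch of that lookup step, independent of how the reference gets parsed. The helper name `resolve` and the sample context are made up for illustration; the only assumption is that the context is a nested dict keyed by the dotted parts of the reference.

```python
from functools import reduce


def resolve(reference, context):
    """Look up a dotted reference such as 'my.thing' in a nested dict."""
    # Walk one key at a time: context['my'] first, then ['thing'].
    return reduce(lambda node, key: node[key], reference.split("."), context)


context = {"my": {"thing": 42}}  # hypothetical sample context
value = resolve("my.thing", context)
```

A missing key raises `KeyError` here; a real engine would probably want a friendlier error that names the reference.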

The IF/ENDIF blocks are self-evident I think.

And beyond the __something__ and the # IF/# ENDIF blocks I don't want to parse anything, it's a blob to me. The goal is to be completely agnostic to the text surrounding my blocks.

What I'm having a hard time wrapping my head around is how to tell Lark that I'm only interested in the parts that match, and that I just want a verbatim copy of everything in between.

I've tried something like this:

file: line "\n"
    | line

line: line* "__" reference "__" line*
    | TEXT

reference: KEY
         | KEY "." reference

KEY: /[^\s\r\n\t.]+/

TEXT: /[^\n]+?/

But obviously it's a spectacular failure as it's just making one node of the tree for each letter of the text.

I'm grateful for any insight :)

Xowap avatar Sep 15 '22 23:09 Xowap

Hi @Xowap ,

The reason you're getting one letter each time is that you're using a non-greedy regexp in TEXT (+?). A non-greedy quantifier matches as little as possible, so here it only ever matches one character. You should make it greedy.
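The difference is easy to check outside Lark with Python's re module, whose syntax Lark terminals follow. This is only an illustration on a made-up sample string, not a Lark run:

```python
import re

line = "import foo"  # made-up sample input

# Non-greedy: +? stops as soon as the pattern is satisfied, i.e. after 1 char.
lazy = re.match(r"[^\n]+?", line).group()

# Greedy: + keeps consuming until the end of the line.
greedy = re.match(r"[^\n]+", line).group()
```

Here `lazy` is "i" and `greedy` is "import foo".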

See this as an example: https://github.com/lark-parser/lark/blob/master/examples/advanced/conf_lalr.py

To avoid it "taking over" everything, you might need to give it a low priority, like TEXT.-1. But that might not be necessary.

erezsh avatar Sep 16 '22 08:09 erezsh

Thank you @erezsh !

There were many issues in my original grammar. I'm now at:

file: line+

line: line_content* "\n"
    | line_content+

line_content: inline_ref
            | TEXT

inline_ref: "__" reference "__"

reference: KEY ("." KEY)*

TEXT.-1: /[^\n]+/

KEY: /([^\s\r\n\t._]|_[^\s\r\n\t._])+/

My test is on this conceptual text input:

#!/usr/bin/python

import foo
import bar
import __project.name__

__project.name__

my_thing = "__project.NAME__" 

Unfortunately I'm still getting bits like:

Token("RULE", "line"),
[
    Tree(
        Token("RULE", "line_content"),
        [Token("TEXT", "import __project.name__")],
    )
],

While I'd expect something more like:

Token("RULE", "line"),
[
    Tree(
        Token("RULE", "line_content"),
        [
            Token("TEXT", "import "),
            Tree(
                Token("RULE", "reference"),
                [
                    Token("KEY", "project"),
                    Token("KEY", "name"),
                ],
            )
        ],
    )
],

So basically it doesn't look like the priority is observed? Or something is wrong in my expression somehow?

Thank you :)

Xowap avatar Sep 19 '22 23:09 Xowap

As you say, "import" is TEXT, but TEXT doesn't stop at whitespace: it is greedy and matches to the end of the line, which includes the __project.name__ part.

If you want "import" to mean something, make a rule for it that includes that keyword.

Or make TEXT stop at whitespace. But I don't think that's the best way.

erezsh avatar Sep 20 '22 06:09 erezsh

Well I don't want it to mean anything, I just want to replace values that are between __ and __.

Typically, the end game would be to detect things like:

blip blop bloop __value__ blap

And then get something like:

  • TEXT "blip blop bloop "
  • reference "value"
  • TEXT " blap"

Then I'll just compute the reference and leave the text untouched.

Not sure if I'm explaining myself properly?

Xowap avatar Sep 20 '22 06:09 Xowap

Then maybe you should make TEXT scan until it reaches __. You can use regex lookahead syntax for that.
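One possible shape for such a pattern, checked here with plain Python re rather than inside a Lark grammar. The name `TEXT_UNTIL_DELIM` is made up, and this is just one way to write the lookahead: a negative lookahead is tested before each character, so the match stops right before the first "__".

```python
import re

# Consume any non-newline character, as long as "__" does not start here.
TEXT_UNTIL_DELIM = re.compile(r"(?:(?!__)[^\n])+")

line = "blip blop bloop __value__ blap"
leading_text = TEXT_UNTIL_DELIM.match(line).group()

# Directly on a delimiter the pattern refuses to match at all.
at_delimiter = TEXT_UNTIL_DELIM.match("__value__ blap")
```

`leading_text` is "blip blop bloop " and `at_delimiter` is `None`, which leaves the delimiter free for another terminal to pick up.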

erezsh avatar Sep 20 '22 07:09 erezsh

That helped, thanks!

For future readers, here is my grammar in the end:

file: line+

line: line_content* "\n"
    | line_content+

line_content: inline_ref
            | TEXT

inline_ref: REF_DELIM reference REF_DELIM

reference: KEY ("__" KEY)*

TEXT.-1: /([^\n_]+|__(?!_)|_(?!__))/

REF_DELIM: /___/

KEY: /([a-zA-Z-])([a-zA-Z-0-9-]|_(?!_))*/
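As a rough sanity check, the three terminal regexes above can be tried in isolation with Python's re module. This does not run Lark and says nothing about how its lexer resolves conflicts; it only shows what each pattern accepts at a given position. The sample line is made up and uses "___" as the delimiter and "__" between keys, as the grammar is written.

```python
import re

# Terminal patterns copied from the grammar above.
TEXT = re.compile(r"([^\n_]+|__(?!_)|_(?!__))")
REF_DELIM = re.compile(r"___")
KEY = re.compile(r"([a-zA-Z-])([a-zA-Z-0-9-]|_(?!_))*")

line = "import ___project__name___"  # made-up sample input

text = TEXT.match(line).group()                 # plain text before the delimiter
pos = len(text)
text_at_delim = TEXT.match(line, pos)           # TEXT cannot start on "___"
delim = REF_DELIM.match(line, pos).group()
first_key = KEY.match(line, pos + len(delim)).group()

# Single underscores stay inside a key; a double underscore ends it.
snake_key = KEY.match("my_thing__other").group()
```

The lookaheads are what keep the terminals from overlapping: TEXT never swallows the start of a "___" delimiter, and KEY stops before any "__".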

Xowap avatar Sep 21 '22 18:09 Xowap