Making a template engine with Lark
I'm trying to develop a small template engine with Lark. While I've contemplated making it myself with a bunch of regular expressions, I think it would make more sense to do it with Lark for maintainability.
The thing is, I want to be able to do something like this:
foo = __my.thing__
# IF my.thing
foo += 42
# ENDIF
Here the idea is that __ plays the role of {{ and }} in a Django/Jinja template. Eventually, I want to replace __my.thing__ with the value of my.thing from the context (that is, context['my']['thing']).
The IF/ENDIF blocks are self-explanatory, I think.
Beyond the __something__ references and the # IF/# ENDIF blocks, I don't want to parse anything: the rest is an opaque blob to me. The goal is to be completely agnostic to the text surrounding my blocks.
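The context lookup described above can be sketched as follows. This assumes the context is a plain nested dict; `resolve` is a name made up for this example and is not part of Lark.

```python
def resolve(context, reference):
    """Follow a dotted reference such as 'my.thing' through nested dicts."""
    value = context
    for key in reference.split("."):
        value = value[key]  # raises KeyError if the path does not exist
    return value


context = {"my": {"thing": 42}}
print(resolve(context, "my.thing"))  # 42
```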
What I'm having a hard time wrapping my head around is how to tell Lark that I'm only interested in the things that match, and that I just want a "copy" of everything in between.
I've tried something like this:
file: line "\n"
    | line
line: line* "__" reference "__" line*
    | TEXT
reference: KEY
         | KEY "." reference
KEY: /[^\s\r\n\t.]+/
TEXT: /[^\n]+?/
But obviously it's a spectacular failure as it's just making one node of the tree for each letter of the text.
I'm grateful for any insight :)
Hi @Xowap,
The reason you're getting one letter each time is that TEXT uses a non-greedy regexp (+?), so each match consumes only a single character. You should make it greedy.
See this as an example: https://github.com/lark-parser/lark/blob/master/examples/advanced/conf_lalr.py
To keep it from "taking over" everything, you might need to give it a low priority, like TEXT.-1. But that might not be necessary.
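The difference between the two quantifiers is easy to check with Python's `re` module, whose syntax Lark's regexp terminals follow. A small sketch (the sample string is made up):

```python
import re

sample = "import foo"

# Non-greedy: with nothing after it in the pattern, +? stops as early as it can.
non_greedy = re.match(r"[^\n]+?", sample).group()

# Greedy: + keeps going until the end of the line.
greedy = re.match(r"[^\n]+", sample).group()

print(repr(non_greedy))  # 'i'
print(repr(greedy))      # 'import foo'
```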
Thank you @erezsh !
There were many issues in my original grammar. I'm now at:
file: line+
line: line_content* "\n"
    | line_content+
line_content: inline_ref
            | TEXT
inline_ref: "__" reference "__"
reference: KEY ("." KEY)*
TEXT.-1: /[^\n]+/
KEY: /([^\s\r\n\t._]|_[^\s\r\n\t._])+/
I'm testing it on this sample input:
#!/usr/bin/python
import foo
import bar
import __project.name__
__project.name__
my_thing = "__project.NAME__"
Unfortunately I'm still getting bits like:
Token("RULE", "line"),
[
    Tree(
        Token("RULE", "line_content"),
        [Token("TEXT", "import __project.name__")],
    )
],
While I'd expect something more like:
Token("RULE", "line"),
[
    Tree(
        Token("RULE", "line_content"),
        [
            Token("TEXT", "import "),
            Tree(
                Token("RULE", "reference"),
                [
                    Token("KEY", "project"),
                    Token("KEY", "name"),
                ],
            )
        ],
    )
],
So basically it doesn't look like the priority is observed? Or something is wrong in my expression somehow?
Thank you :)
As you say, "import" is TEXT, but TEXT doesn't stop at whitespace: it is greedy and matches to the end of the line, reference included.
If you want "import" to mean something, make a rule for it that includes that keyword.
Or make TEXT stop at whitespace. But I don't think that's the best way.
Well, I don't want it to mean anything. I just want to replace the values that sit between __ and __.
Typically, the end game would be to detect things like:
blip blop bloop __value__ blap
And then get something like:
- TEXT "blip blop bloop "
- reference "value"
- TEXT " blap"
Then I'll just resolve the reference and leave the text untouched.
Not sure if I'm explaining myself properly?
Then maybe you should make TEXT scan until it reaches __. You can use regex lookahead syntax for that.
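As an illustration of the lookahead idea, here is a sketch using plain `re` rather than Lark. The `REF` and `TEXT` patterns and the `split_segments` helper are assumptions made up for this example; in particular the key pattern is deliberately simplistic. `TEXT` consumes one character at a time, but only while the next two characters are not `__`.

```python
import re

# Hypothetical patterns for illustration, not taken from the thread.
REF = r"__(?P<ref>[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*)__"
# Negative lookahead: keep consuming until a "__" is about to start.
TEXT = r"(?P<text>(?:(?!__)[^\n])+)"
SCANNER = re.compile(f"{REF}|{TEXT}")


def split_segments(line):
    """Split a line into ('TEXT', ...) and ('reference', ...) pairs."""
    segments = []
    for m in SCANNER.finditer(line):
        if m.group("ref") is not None:
            segments.append(("reference", m.group("ref")))
        else:
            segments.append(("TEXT", m.group("text")))
    return segments


print(split_segments("blip blop bloop __value__ blap"))
# [('TEXT', 'blip blop bloop '), ('reference', 'value'), ('TEXT', ' blap')]
```

Single underscores, as in `my_thing`, stay inside TEXT because the lookahead only rejects two underscores in a row. A stray `__` that does not open a valid reference is not handled by this sketch.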
That helped, thanks!
For future readers, here is the grammar I ended up with:
file: line+
line: line_content* "\n"
    | line_content+
line_content: inline_ref
            | TEXT
inline_ref: REF_DELIM reference REF_DELIM
reference: KEY ("__" KEY)*
TEXT.-1: /([^\n_]+|__(?!_)|_(?!__))/
REF_DELIM: /___/
KEY: /([a-zA-Z-])([a-zA-Z-0-9-]|_(?!_))*/
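Reading the grammar as posted, references seem to be wrapped in three underscores, with `__` separating the keys, so that one or two underscores in the surrounding text stay plain TEXT. The sketch below checks the three terminal regexps in isolation with Python's `re`. It does not go through Lark's lexer or parser, and `first_match` is a helper made up for this example.

```python
import re

# Terminal patterns copied from the grammar above.
TEXT = re.compile(r"([^\n_]+|__(?!_)|_(?!__))")
REF_DELIM = re.compile(r"___")
KEY = re.compile(r"([a-zA-Z-])([a-zA-Z-0-9-]|_(?!_))*")


def first_match(pattern, text):
    """Return what the pattern matches at the start of text, or None."""
    m = pattern.match(text)
    return m.group() if m else None


print(first_match(TEXT, "__init__"))         # __  (a double underscore is plain text)
print(first_match(TEXT, "_private"))         # _
print(first_match(TEXT, "___project"))       # None (TEXT refuses three underscores)
print(first_match(REF_DELIM, "___project"))  # ___
print(first_match(KEY, "project_name__x"))   # project_name (stops before "__")
```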