Neither the LALR nor the Earley parser generated by Lark picks the longest match for input such as "a + b + c + ... + d". Bug (?)
Describe the bug
Normally, when you write a calculator grammar, you work only with binary operations, and a chain is parenthesized internally like this (unless a transformer flattens it afterwards, which is what I expect I will end up doing): a + (b + c)
Lark seems to assume this is always what we want, even when I try to match a finite chain of more than two operands with a single rule:
from lark import Lark, Transformer
KATEX_SUBSET_GRAMMAR = r"""
start : (RAW_STR | katex_formula)*
katex_formula : DDOLR content DDOLR | DOLR content DOLR
content : addition | variable | int_const
addition : content ("+" content)+
variable : text_var | atomic_var
atomic_var : GREEK | LATIN
text_var : TEXT_CMD "{" var_name "}"
var_name : (NAME | "-")+
int_const : INT
LATIN : /[a-zA-Z]/
GREEK : /\\alpha|\\beta|\\gamma|\\delta/
RAW_STR : /[^\$]+/
TEXT_CMD : /\\text|\\textbf/
DDOLR : /\$\$/
DOLR : /\$/
%import common.INT
%import python.NAME
%import common.WS
%ignore WS
"""
katex_parser = Lark(grammar=KATEX_SUBSET_GRAMMAR, parser='earley')
TEST_ENUM = False

if not __debug__ or TEST_ENUM:
    Op = 'o'
    Data = '@'
    Var, Concat, KatexBlock, Katex, Add = range(5)
else:
    Op = 'Op'; Data = 'Data'
    Var = 'Var'; Concat = 'Concat'; Add = 'Add'
    KatexBlock = '$$'; Katex = '$'

wrapping_ops = { Katex, KatexBlock }
infix_ops = { Concat }
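class KaTeXtoJson(Transformer):
    def start(self, tree):
        if len(tree) > 1:
            return self._concat(tree)
        return tree[0]

    def _concat(self, tree):
        return [Concat, tree]

    # NOTE: this debug version is shadowed by the second variable() below.
    def variable(self, tree):
        print(tree)
        return tree

    # NOTE: this debug version is shadowed by the second atomic_var() below.
    def atomic_var(self, tree):
        print(tree)
        return tree

    def RAW_STR(self, tree):
        return tree.strip()

    def int_const(self, tree):
        # `tree` is the list of children, so the INT token is tree[0]
        # (the original `tree.data[0]` would raise AttributeError).
        return int(tree[0])

    # NOTE: shadowed by the second katex_formula() below.
    def katex_formula(self, tree):
        return ''.join(tree.data)

    def atomic_var(self, tree):
        return tree[0]

    def variable(self, tree):
        return tree[0]

    def content(self, tree):
        return tree[0]

    def katex_formula(self, tree):
        # The first child is the opening delimiter token: '$' or '$$'.
        if str(tree[0]) == '$':
            op = Katex
        else:
            op = KatexBlock
        return [op, tree[1]]

    def addition(self, tree):
        return [Add, list(tree)]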
if __name__ == '__main__':
    katex_to_json = KaTeXtoJson()
    import json
    while True:
        user_input = input("(╯‵□′)╯︵┻━┻ ... : ")
        parse_tree = katex_parser.parse(user_input)
        print("Parse tree (Before Xforming): ", parse_tree)
        # JSON test & TODO: ultimately test JSON object against
        # assignment / saving / retrieval using Neomodel's JSONProperty
        json_hopefully = katex_to_json.transform(parse_tree)
        print("Result after Xforming: ", json_hopefully)
        try:
            did_it_work = json.dumps(json_hopefully)
            print("It works! 😎 (Valid JSON) : ", did_it_work)
        except Exception as e:
            print(f"That didn't work 😭 (Invalid JSON) : {e}")
I use:
addition : content ("+" content)+
That is, the most obvious and simple way to express the above. However:
(╯‵□′)╯︵┻━┻ ... : $a + b + c + d$
Parse tree (Before Xforming): Tree(Token('RULE', 'start'), [Tree(Token('RULE', 'katex_formula'), [Token('DOLR', '$'), Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'addition'), [Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'addition'), [Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'variable'), [Tree(Token('RULE', 'atomic_var'), [Token('LATIN', 'a')])])]), Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'variable'), [Tree(Token('RULE', 'atomic_var'), [Token('LATIN', 'b')])])])])]), Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'addition'), [Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'variable'), [Tree(Token('RULE', 'atomic_var'), [Token('LATIN', 'c')])])]), Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'variable'), [Tree(Token('RULE', 'atomic_var'), [Token('LATIN', 'd')])])])])])])]), Token('DOLR', '$')])])
Result after Xforming: ['$', ['Add', [['Add', [Token('LATIN', 'a'), Token('LATIN', 'b')]], ['Add', [Token('LATIN', 'c'), Token('LATIN', 'd')]]]]]
It works! 😎 (Valid JSON) : ["$", ["Add", [["Add", ["a", "b"]], ["Add", ["c", "d"]]]]]
(╯‵□′)╯︵┻━┻ ... :
So, what I would expect to see inside the outer '$' node is:
["Add", ["a", "b", "c", "d"]], but instead Lark treats the input as (a + b) + (c + d).
I do have a user-code workaround: convert to the format I want in the addition() transformer method. It checks whether either operand is itself an addition and, if so, merges everything into a single operand list.
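The flattening step described above could look roughly like the following sketch. It is a stand-alone helper, not the actual transformer code from this report: the names `ADD` and `flatten_addition` are made up for illustration, and it assumes nodes are encoded as `[op, operands]` lists, as in the output shown earlier.

```python
# Hypothetical stand-alone helper; `ADD` mirrors the 'Add' tag used above.
ADD = 'Add'


def flatten_addition(operands):
    """Merge operands that are themselves [ADD, [...]] nodes into one list."""
    flat = []
    for operand in operands:
        is_addition = (
            isinstance(operand, list)
            and len(operand) == 2
            and operand[0] == ADD
        )
        if is_addition:
            # Splice the nested addition's operands into this level.
            flat.extend(operand[1])
        else:
            flat.append(operand)
    return [ADD, flat]


nested = [[ADD, ['a', 'b']], [ADD, ['c', 'd']]]
print(flatten_addition(nested))  # ['Add', ['a', 'b', 'c', 'd']]
```

Inside the transformer, addition() would then return flatten_addition(tree). Because a Lark Transformer visits children before their parents, any nested addition has already been flattened by the time its parent is processed, so merging a single level should be enough.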
However, a Lark-side solution would make for much cleaner code. Or am I doing something wrong, and this is NOT a bug?