
Neither the LALR nor the Earley parser generated by Lark picks out the longest match for input such as "a + b + c + ... + d". Bug (?)

Open FruitfulApproach opened this issue 1 year ago • 0 comments

Describe the bug

Normally, when you build a calculator, you work only with binary operations, and everything is parenthesized internally (unless it is transformed afterwards, as I think I will do to work around this), like so: a + (b + c)

Lark seems to assume this is what we always want. Here is what happens when I try to match a chain of more than two additions:

from lark import Lark, Transformer

KATEX_SUBSET_GRAMMAR = r"""
start : (RAW_STR | katex_formula)*
katex_formula : DDOLR content DDOLR | DOLR content DOLR
content : addition | variable | int_const
addition : content ("+" content)+
variable : text_var | atomic_var
atomic_var : GREEK | LATIN
text_var : TEXT_CMD "{" var_name "}"
var_name : (NAME | "-")+
int_const : INT
LATIN : /[a-zA-Z]/
GREEK : /\\alpha|\\beta|\\gamma|\\delta/
RAW_STR : /[^\$]+/
TEXT_CMD : /\\text|\\textbf/
DDOLR : /\$\$/
DOLR : /\$/
%import common.INT
%import python.NAME
%import common.WS
%ignore WS
"""

katex_parser = Lark(grammar=KATEX_SUBSET_GRAMMAR, parser='earley')

TEST_ENUM = False

if not __debug__ or TEST_ENUM:
    Op = 'o'
    Data = '@'
    Var, Concat, KatexBlock, Katex, Add = range(5)
else:
    Op = 'Op'; Data = 'Data'
    Var = 'Var';  Concat = 'Concat'; Add = 'Add'
    KatexBlock = '$$'; Katex = '$'
    
wrapping_ops = { Katex, KatexBlock }
infix_ops = { Concat }
        
class KaTeXtoJson(Transformer):     
    def start(self, tree):
        if len(tree) > 1:
            return self._concat(tree)
        return tree[0]
    
    def _concat(self, tree):
        return [Concat, tree]
    
    def RAW_STR(self, tree):
        return tree.strip()
    
    def int_const(self, tree):
        # Rule callbacks receive the list of children, so index it directly.
        return int(tree[0])
    
    def atomic_var(self, tree):
        return tree[0]
    
    def variable(self, tree):
        return tree[0]
    
    def content(self, tree):
        return tree[0]
    
    def katex_formula(self, tree):
        if str(tree[0]) == '$':
            op = Katex
        else:
            op = KatexBlock
        return [op, tree[1]]
    
    def addition(self, tree):
        return [Add, list(tree)]        
    
if __name__ == '__main__':    
    katex_to_json = KaTeXtoJson()
    import json
    
    while True:
        user_input = input("(╯‵□′)╯︵┻━┻ ... : ")
        parse_tree = katex_parser.parse(user_input)        
        print("Parse tree (Before Xforming): ", parse_tree)
        
        # JSON test & TODO: ultimately test JSON object against 
        # assignment / saving / retrieval using Neomodel's JSONProperty
        json_hopefully = katex_to_json.transform(parse_tree)
        
        print("Result after Xforming: ", json_hopefully)
        
        try:
            did_it_work = json.dumps(json_hopefully)            
            print("It works! 😎 (Valid JSON) : ", did_it_work)
            
        except Exception as e:
            print(f"That didn't work 😭 (Invalid JSON) : {e}")

I use:

addition : content ("+" content)+ 

That is, the most obvious and simplest way to accomplish the above. However:

(╯‵□′)╯︵┻━┻ ... : $a + b + c + d$
Parse tree (Before Xforming):  Tree(Token('RULE', 'start'), [Tree(Token('RULE', 'katex_formula'), [Token('DOLR', '$'), Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'addition'), [Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'addition'), [Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'variable'), [Tree(Token('RULE', 'atomic_var'), [Token('LATIN', 'a')])])]), Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'variable'), [Tree(Token('RULE', 'atomic_var'), [Token('LATIN', 'b')])])])])]), Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'addition'), [Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'variable'), [Tree(Token('RULE', 'atomic_var'), [Token('LATIN', 'c')])])]), Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'variable'), [Tree(Token('RULE', 'atomic_var'), [Token('LATIN', 'd')])])])])])])]), Token('DOLR', '$')])])
Result after Xforming:  ['$', ['Add', [['Add', [Token('LATIN', 'a'), Token('LATIN', 'b')]], ['Add', [Token('LATIN', 'c'), Token('LATIN', 'd')]]]]]
It works! 😎 (Valid JSON) :  ["$", ["Add", [["Add", ["a", "b"]], ["Add", ["c", "d"]]]]]
(╯‵□′)╯︵┻━┻ ... : 

So, what I would expect to see is:

["Add", ["a", "b", "c", "d"]] as the inner part of the result, but instead Lark groups the input as (a + b) + (c + d).

I do have a user-side workaround: convert to the format I want in the addition() transformer method, which checks whether either operand is itself an addition and, if so, merges everything into one list.
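A minimal sketch of that workaround, written as a plain function so it can be tried without a parser. The name `flatten_add` is hypothetical, and the `[Add, operands]` list shape is assumed to match what the addition() method above returns in debug mode.

```python
Add = "Add"  # same tag the transformer uses when __debug__ is on

def flatten_add(operands):
    """Merge nested [Add, [...]] operands into a single flat [Add, [...]]."""
    flat = []
    for item in operands:
        # An operand that is itself an addition has the shape [Add, [..]].
        if isinstance(item, list) and len(item) == 2 and item[0] == Add:
            flat.extend(item[1])
        else:
            flat.append(item)
    return [Add, flat]

nested = [[Add, ["a", "b"]], [Add, ["c", "d"]]]
print(flatten_add(nested))
```

Because a Transformer works bottom-up, inner additions are already flat by the time an outer one is processed, so a single level of merging is enough; inside the class, addition() could return `flatten_add(list(tree))`.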

However, a Lark-side solution would be much cleaner code. Or am I doing something incorrect, and this is NOT a bug?

FruitfulApproach, Sep 04 '24 11:09