lark icon indicating copy to clipboard operation
lark copied to clipboard

Nondeterministic Earley parse

Open adkabo opened this issue 4 years ago • 4 comments

Describe the bug

On Lark 0.8.5, Python 3.7.7, I'm observing a parse affected by PYTHONHASHSEED.

A clear and concise description of what the bug is, and what you expected to happen.

The parse should not depend on PYTHONHASHSEED.

To Reproduce

import lark



GRAMMAR = r"""
unicode_letter: /[^\W\d]/
unicode_digit: /\d/

letter : unicode_letter | "_"

decimal_digit : /\d/

identifier : [ "#" | "_#" ] letter ( letter | unicode_digit )*

structlit       : "{" ( declaration )* [ "..." ] "}"
declaration     : field | embedding | attribute
embedding       : aliasexpr
field           : label ":" ( label ":" )* expression attribute*
label           : identifier ("=" labelexpr)*
labelexpr       : labelname [ "?" ] | "[" aliasexpr "]"
labelname       : identifier

attribute       : "@" identifier "(" attr_tokens ")"
attr_tokens     : ( attr_token | "(" attr_tokens ")" | "[" attr_tokens "]" | "{" attr_tokens "}" )*
attr_token      : /[^\(\)\[\]\{\}\s]+/

aliasexpr  : expression | identifier "=" expression

int_lit : /\d+/

operand     : literal | operandname | "(" expression ")"
literal     : basiclit  | structlit
basiclit    : int_lit
operandname : identifier

sourcefile : ( declaration )*

primaryexpr : operand |  primaryexpr arguments


argument       : expression
arguments      : "(" [ ( argument ( "," argument )* ) [ "," ] ] ")"


expression : unaryexpr
unaryexpr  : primaryexpr

start : sourcefile

%import common.WS
%import common.CNAME

%ignore WS
"""

text = '''
x: 1
y: 2
z : {
  a : 10
  b : 20
  cd : 30
}
'''


parser = lark.Lark(GRAMMAR, parser='earley')
tree = parser.parse(text)
print(tree.pretty())
$ diff <(PYTHONHASHEED=4 python3 example.py) <(PYTHONHASHEED=5 python3 example.py)
67a68,78
>                       embedding
>                         aliasexpr
>                           expression
>                             unaryexpr
>                               primaryexpr
>                                 operand
>                                   operandname
>                                     identifier
>                                       letter
>                                         unicode_letter	c
>                     declaration
71,72d81
<                             letter
<                               unicode_letter	c

adkabo avatar May 20 '20 02:05 adkabo

Your grammar is ambiguous. You can check that by passing ambiguity='explicit' to the constructor and the parsing the text again. Although it would be nice, I don't think Lark gives any guarantees on what ambiguity is selected.

MegaIng avatar May 20 '20 10:05 MegaIng

I think it would be good for Lark to be consistent per hashseed. But it's not a high priority in my eyes.

You can "force" the parser to choose one derivation before all others by giving it a higher priority than the rest.

Or use explicit ambiguity and choose it yourself, as MegaIng suggested.

Also, if you could find a minimal example, that would increase the chances that I will fix it.

erezsh avatar May 20 '20 11:05 erezsh

The non-deterministic results create significant issues when testing and debugging an intentionally highly ambiguous grammar. The only possible cause I can see of this is iterating through Python sets. Maybe it's possible to remove the dependency of iterating through sets from the parser code? E.g. by creating a tuple alongside a set when iteration is needed.

To give you the context: I'm using Lark to parse a semi-natural language: chat requests from traders requesting prices of financial derivatives with complex parameters. The interpretation of the chat strings can be highly ambiguous and can't all be expressed in the grammar. It also depends on the real-world context such as current date and some market data. The tree of ambiguities from Earley is a great way for me to simplify the algo processing and choose the correct interpretation.

When some of my thousands of tests fail, I need to debug the logic. The order of the ambiguities when executed in debug mode is often different from running without the debugger. It also changes depending on the order of the tests. In combination with the fact that sometimes Earley does not return all ambiguities (partially fixed in #1216 - thanks! but there's also #536 open and I occasionally see some variants missing in my tests), this makes testing and debugging very complicated in some cases.

It might be hard for me to provide a minimal example. It's also a bit difficult to reproduce the non-deterministic results. Maybe I should try with PYTHONHASHSEED when I see it next time.

yurymann avatar Nov 27 '22 22:11 yurymann

@yurymann Why not always run it with a fixed PYTHONHASHSEED? That way if anything goes wrong, it's always reproducible.

erezsh avatar Nov 28 '22 02:11 erezsh