Generated standalone parsers are not always identical for the same input
When a standalone parser is generated from a fixed grammar file using lark.tools.standalone, the generated code is not always identical between runs, despite setting PYTHONHASHSEED.
Note that the syntax tree produced by the parser is fine. This is only about the generated code not being stable. People who generate a standalone parser and commit the code to their repository will see a diff just from regenerating the parser, even if the grammar file has not changed.
To Reproduce
Grammar (grammar.lark):
?start: item1* item2
item1: "a"
item2: "b"
Script to generate standalone parser:
export PYTHONHASHSEED=1
python -m lark.tools.standalone grammar.lark > file1.py
python -m lark.tools.standalone grammar.lark > file2.py
diff file1.py file2.py
Ideally, the diff should report nothing. However, the diff almost always reports some differences. It seems those differences always appear in DATA and MEMO.
lark-parser version: 0.8.5
Python version: 3.8
OS version: Mac OSX 10.15
Ah, a simpler grammar file would also reproduce this:
start: item
item: "a"
But the differences occur less often.
I cannot reproduce this when PYTHONHASHSEED is set to a fixed value. Without it (or with PYTHONHASHSEED=random, or with the -R flag), I can reproduce it.
(On Windows 10 with Python 3.7 & 3.8)
This can happen because the hashes change between runs (which PYTHONHASHSEED should solve), but also because the id() function returns different values between runs. That causes certain operations to occur in arbitrary order.
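To illustrate the id() effect in isolation, here is a minimal sketch that does not use lark at all. The Rule class is a hypothetical stand-in, not lark's actual class. In CPython, objects that do not define their own __hash__ are hashed based on their memory address, and PYTHONHASHSEED does not affect that, so the iteration order of a set of such objects may differ from run to run.

```python
class Rule:
    """Hypothetical stand-in for a grammar object (not lark's real class)."""

    def __init__(self, name):
        self.name = name


# Rule defines no __hash__, so the default address-based hash is used.
rules = {Rule("start"), Rule("item1"), Rule("item2")}

# Iteration order follows the address-based hashes and may vary between runs,
# even with PYTHONHASHSEED fixed.
address_order = [rule.name for rule in rules]

# Ordering by content instead of identity gives the same result on every run.
stable_order = sorted(rule.name for rule in rules)
```

Any output that is written in address_order can change between runs, while output written in stable_order cannot.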
I don't consider this a bug, but I will accept a PR that fixes it.
Probably the easiest fix would be sorting the culprits during serialization, but that might not be enough. Possibly some of the id() calls will have to be replaced with more deliberate indexing.
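A rough sketch of both ideas, under the assumption that the unstable parts are set-like collections and id()-keyed memo entries. The names SequentialMemo and serialize_names are made up for illustration and are not part of lark; the actual fix in lark's serializer may look quite different.

```python
import json


class SequentialMemo:
    """Hypothetical memo that hands out indices in first-seen order.

    id() is still used to recognise an object that was seen before, but the
    value that ends up in the output is a small sequential index rather than
    the memory address.
    """

    def __init__(self):
        self._index_by_id = {}
        self._objects = []  # keeps objects alive so their ids are not reused

    def index_of(self, obj):
        key = id(obj)
        if key not in self._index_by_id:
            self._index_by_id[key] = len(self._objects)
            self._objects.append(obj)
        return self._index_by_id[key]


def serialize_names(names):
    """Serialize a set-like collection of strings in sorted order."""
    return json.dumps(sorted(names))
```

The sequential indices are only reproducible if the objects are visited in a reproducible order, which is why sorting alone "might not be enough": both the traversal and the emitted keys have to be independent of memory addresses.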
Meanwhile, I recommend regenerating the standalone parser only when the grammar file has been modified (this can be tested by the file's date or by a hash of the grammar).
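One way to implement the hash-based check is sketched below. It assumes a small "stamp" file that stores the SHA-256 of the grammar from the last generation; the stamp file and the function names are hypothetical conventions, not something lark provides. The only lark-specific part is the python -m lark.tools.standalone command from the reproduction script.

```python
import hashlib
import subprocess
import sys
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of the file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def needs_regeneration(grammar_path, stamp_path):
    """True if there is no stamp yet or the grammar changed since the stamp."""
    stamp = Path(stamp_path)
    if not stamp.exists():
        return True
    return stamp.read_text().strip() != file_digest(grammar_path)


def regenerate_if_changed(grammar_path, output_path, stamp_path):
    """Run the standalone generator only when the grammar hash has changed."""
    if not needs_regeneration(grammar_path, stamp_path):
        return False
    with open(output_path, "w") as out:
        subprocess.run(
            [sys.executable, "-m", "lark.tools.standalone", str(grammar_path)],
            stdout=out,
            check=True,
        )
    # Record the hash only after the generator succeeded.
    Path(stamp_path).write_text(file_digest(grammar_path))
    return True
```

For example, calling regenerate_if_changed("grammar.lark", "parser.py", "grammar.lark.sha256") from a build step would leave parser.py untouched as long as grammar.lark is unchanged, so no spurious diff shows up in the repository.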