
Generated standalone parsers are not always identical for the same input

Open kitchoi opened this issue 4 years ago • 3 comments

When generating a standalone parser from a fixed grammar file with lark.tools.standalone, the generated code is not always identical between runs, even with PYTHONHASHSEED set.

Note that the syntax tree produced by the parser is fine. This is only about the generated code not being stable. People who generate a standalone parser and commit the code to their repository will see a diff just from regenerating the parser, even if the grammar file has not changed.

To Reproduce

Grammar grammar.lark:

?start: item1* item2

item1: "a"
item2: "b"

Script to generate standalone parser:

export PYTHONHASHSEED=1
python -m lark.tools.standalone grammar.lark > file1.py
python -m lark.tools.standalone grammar.lark > file2.py
diff file1.py file2.py

Ideally, the diff should report nothing. However, it almost always reports some differences, and those differences seem to always appear in DATA and MEMO.

lark-parser version: 0.8.5
Python version: 3.8
OS version: Mac OSX 10.15

kitchoi, May 08 '20 12:05

Ah, a simpler grammar file also reproduces this:

start: item
item: "a"

But the differences occur less often with it.

kitchoi, May 08 '20 12:05

I cannot reproduce this when PYTHONHASHSEED is set. Without it (or with PYTHONHASHSEED=random, or with the -R flag), I can reproduce it.

(On Windows 10 with Python 3.7 & 3.8)

MegaIng, May 08 '20 13:05

This can happen because the hashes change between runs (which PYTHONHASHSEED should solve), but also because the id() function returns different values. That causes certain operations to occur in arbitrary order.
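
The hash part of this can be seen without lark at all. The sketch below only illustrates the general effect and assumes nothing about lark's internals; the helper name `set_order_under_seed` is made up for this example. It starts a fresh interpreter with a given PYTHONHASHSEED and reports the order in which a small set of strings is iterated.

```python
import os
import subprocess
import sys


def set_order_under_seed(seed):
    """Run a fresh interpreter with the given PYTHONHASHSEED and return
    the iteration order of a small set of strings as a comma-joined string."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    code = "print(','.join({'item1', 'item2', 'start', 'A', 'B'}))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Same seed: the iteration order is reproducible.
    print(set_order_under_seed(1) == set_order_under_seed(1))
    # Different seeds: the order may differ (not guaranteed for every pair).
    print({set_order_under_seed(seed) for seed in range(5)})
```

Ordering that depends on id() is not covered by the seed, because id() reflects where an object happens to live in memory during that run.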

I don't consider this a bug, but I will accept a PR that fixes it.

Probably the easiest fix would be sorting the culprits during serialization, but that might not be enough. Possibly some of the id() calls will have to be replaced with more deliberate indexing.
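
Both ideas could look roughly like the following. This is a generic sketch, not lark's serialization code: `Rule`, `serialize_unstable`, and `serialize_stable` are hypothetical names that only mimic a table keyed by object identity.

```python
class Rule:
    """Stand-in for an object that ends up in the serialized output."""

    def __init__(self, name):
        self.name = name


def serialize_unstable(rules):
    # Keys come from id(), i.e. memory addresses, which can vary from
    # run to run; so can the order in which a set of rules is walked.
    return {id(rule): rule.name for rule in rules}


def serialize_stable(rules):
    # Sort by a stable attribute first, then use the position in that
    # order as the key instead of id().
    ordered = sorted(rules, key=lambda rule: rule.name)
    return {index: rule.name for index, rule in enumerate(ordered)}


if __name__ == "__main__":
    rules = {Rule("start"), Rule("item2"), Rule("item1")}
    print(serialize_stable(rules))  # {0: 'item1', 1: 'item2', 2: 'start'}
```

Sorting alone only helps if the sort key is itself stable, which is why the sketch sorts by name rather than by anything derived from hashes or addresses.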

Meanwhile, I recommend regenerating the standalone parser only if the grammar file has been modified (this can be checked by its date or hash).
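
One way to do the hash check is to store a digest of the grammar next to the generated file and skip regeneration when it matches. A minimal sketch, assuming the file names `grammar.lark` and `parser.py` and a sidecar file `parser.py.sha256`; `run_standalone`, `regenerate_if_changed`, and the `generate` callback are hypothetical, and only the `python -m lark.tools.standalone` invocation comes from the report above.

```python
import hashlib
import subprocess
import sys
from pathlib import Path


def run_standalone(grammar, output):
    """Generate the standalone parser for `grammar` and write it to `output`."""
    result = subprocess.run(
        [sys.executable, "-m", "lark.tools.standalone", str(grammar)],
        capture_output=True, text=True, check=True,
    )
    Path(output).write_text(result.stdout)


def regenerate_if_changed(grammar, output, generate=run_standalone):
    """Call `generate` only when the grammar's SHA-256 differs from the one
    recorded at the last generation. Returns True if the parser was rebuilt."""
    grammar, output = Path(grammar), Path(output)
    stamp = output.with_name(output.name + ".sha256")  # assumed sidecar file
    digest = hashlib.sha256(grammar.read_bytes()).hexdigest()
    if output.exists() and stamp.exists() and stamp.read_text().strip() == digest:
        return False  # grammar unchanged, keep the committed parser
    generate(grammar, output)
    stamp.write_text(digest + "\n")
    return True


if __name__ == "__main__":
    if Path("grammar.lark").exists():
        print(regenerate_if_changed("grammar.lark", "parser.py"))
```

A content hash is used here rather than the modification date because file dates are often reset by a fresh checkout.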

erezsh, May 08 '20 13:05