lark icon indicating copy to clipboard operation
lark copied to clipboard

Standalone module generation output varies

Open bcr opened this issue 3 years ago • 11 comments

Describe the bug

When generating output using the minimal grammar of start: (as well as more complex grammars, but this reproduces it in my environment) the output varies. On cursory examination running several times, there appear to be two variants that differ by the states that are generated.

To Reproduce

echo "start:" > test.g
python3 -m lark.tools.standalone test.g > test.py
python3 -m lark.tools.standalone test.g > test1.py
python3 -m lark.tools.standalone test.g > test2.py
python3 -m lark.tools.standalone test.g > test3.py
diff test.py test1.py
diff test.py test2.py
diff test.py test3.py

The diff output on my machine is summarized below.

--- test.py	2022-09-22 23:17:36.000000000 -0700
+++ test1.py	2022-09-22 23:17:39.000000000 -0700
@@ -3091,7 +3091,7 @@
 
 import pickle, zlib, base64
 DATA = (
-{'parser': {'lexer_conf': {'terminals': [], 'ignore': [], 'g_regex_flags': 0, 'use_bytes': False, 'lexer_type': 'contextual', '__type__': 'LexerConf'}, 'parser_conf': {'rules': [{'@': 0}], 'start': ['start'], 'parser_type': 'lalr', '__type__': 'ParserConf'}, 'parser': {'tokens': {0: 'start', 1: '$END'}, 'states': {0: {0: (0, 1), 1: (1, {'@': 0})}, 1: {}}, 'start_states': {'start': 0}, 'end_states': {'start': 1}}, '__type__': 'ParsingFrontend'}, 'rules': [{'@': 0}], 'options': {'debug': False, 'keep_all_tokens': False, 'tree_class': None, 'cache': False, 'postlex': None, 'parser': 'lalr', 'lexer': 'contextual', 'transformer': None, 'start': ['start'], 'priority': 'normal', 'ambiguity': 'auto', 'regex': False, 'propagate_positions': False, 'lexer_callbacks': {}, 'maybe_placeholders': False, 'edit_terminals': None, 'g_regex_flags': 0, 'use_bytes': False, 'import_paths': [], 'source_path': None, '_plugins': {}}, '__type__': 'Lark'}
+{'parser': {'lexer_conf': {'terminals': [], 'ignore': [], 'g_regex_flags': 0, 'use_bytes': False, 'lexer_type': 'contextual', '__type__': 'LexerConf'}, 'parser_conf': {'rules': [{'@': 0}], 'start': ['start'], 'parser_type': 'lalr', '__type__': 'ParserConf'}, 'parser': {'tokens': {0: 'start', 1: '$END'}, 'states': {0: {}, 1: {0: (0, 0), 1: (1, {'@': 0})}}, 'start_states': {'start': 1}, 'end_states': {'start': 0}}, '__type__': 'ParsingFrontend'}, 'rules': [{'@': 0}], 'options': {'debug': False, 'keep_all_tokens': False, 'tree_class': None, 'cache': False, 'postlex': None, 'parser': 'lalr', 'lexer': 'contextual', 'transformer': None, 'start': ['start'], 'priority': 'normal', 'ambiguity': 'auto', 'regex': False, 'propagate_positions': False, 'lexer_callbacks': {}, 'maybe_placeholders': False, 'edit_terminals': None, 'g_regex_flags': 0, 'use_bytes': False, 'import_paths': [], 'source_path': None, '_plugins': {}}, '__type__': 'Lark'}
 )
 MEMO = (
 {0: {'origin': {'name': Token('RULE', 'start'), '__type__': 'NonTerminal'}, 'expansion': [], 'order': 0, 'alias': None, 'options': {'keep_all_tokens': False, 'expand1': False, 'priority': None, 'template_source': None, 'empty_indices': (), '__type__': 'RuleOptions'}, '__type__': 'Rule'}}

bcr avatar Sep 23 '22 06:09 bcr

Yes, I believe this happens because Lark uses sets, which change their order randomly based on python's hash-seed.

I don't consider this a bug, per se, but I will accept a PR that fixes it. (if the fix is reasonable)

erezsh avatar Sep 23 '22 06:09 erezsh

Alternatively, you can run Lark with a constant PYTHONHASHSEED - https://docs.python.org/3.4/using/cmdline.html#envvar-PYTHONHASHSEED

erezsh avatar Sep 23 '22 07:09 erezsh

In my case, I have a script that generates the module and I check the module into git, so I tend to run the generation script periodically to make sure I didn't change the grammar. I could be an adult and only generate the module when the grammar changes of course. Thanks for the PYTHONHASHSEED suggestion though. I think I'll just get over it and do a better job on my end.

bcr avatar Sep 23 '22 07:09 bcr

I'm trying to help someone integrate Lark-js into a repo, and one of our requirements is that generated files have a CI check that verifies that the generated file is up to date. For lark-js, this is "generate the file, check if there are differences between the generated file and the checked in version". With Lark being non-deterministic, this check is impossible.

andyleap avatar Sep 29 '23 23:09 andyleap

When you tried the PYTHONHASHSEED suggestion what happened?

bcr avatar Sep 29 '23 23:09 bcr

that eliminates... some of the randomness... it seems like the remaining is due to the memoization stuff

andyleap avatar Sep 29 '23 23:09 andyleap

just as a PoC, replacing the lark-js generation stuff with this:

def generate_js_standalone(lark_inst):
    """Returns a string containing the Javascript standalone parser, for the given Lark instance

    """
    if lark_inst.options.parser != 'lalr':
        raise NotImplementedError("Lark.js only works with LALR parsers for now")

    data, memo = lark_inst.memo_serialize([TerminalDef, Rule])

    remapped_memo = [i for i in range(len(memo))]
    remapped_memo.sort(key=lambda i: json.dumps(memo[i]))

    def walk(data, f):
        data = f(data)
        if isinstance(data, list):
            return [walk(i, f) for i in data]
        elif isinstance(data, dict):
            return {k: walk(v, f) for k, v in data.items()}
        else:
            return data

    def remap(v):
        if isinstance(v, dict):
            if '@' in v:
                v['@'] = remapped_memo.index(v['@'])
        return v

    data = walk(data, remap)
    memo = {i: memo[remapped_memo[i]] for i in range(len(memo))}

    data_json = json.dumps(data, indent=2)
    memo_json = json.dumps(memo, indent=2)

    with open(__dir__ / 'lark.js') as lark_js:
        output = lark_js.read()
        output += '\nvar DATA=%s;\n' % data_json
        output += '\nvar MEMO=%s;\n' % memo_json
    return output

seems to remove the rest of the randomness, but... that's probably the wrong place to mess with the memoization stuff?

andyleap avatar Sep 30 '23 00:09 andyleap

Where exactly does the randomness come from with a fixed PYTHONHASHSEED? memoization should only be random because of dict/set ordering.

MegaIng avatar Sep 30 '23 02:09 MegaIng

There might still be randomness with a fixed PYTHONHASHSEED if we rely on id() for hashing, which might happen inadvertently since it's the default behavior for Python objects.

erezsh avatar Sep 30 '23 05:09 erezsh

Just ran into this as well, any updates? I'd be happy to open a PR if you have any pointers :)

RobertCraigie avatar Feb 04 '24 12:02 RobertCraigie

I don't know if this is still an issue or not. But we introduced an OrderedSet implementation in Lark. So a solution would be to simply use it instead of sets everywhere.

erezsh avatar Mar 29 '24 11:03 erezsh