
Lark.save does not work using the earley parser

Open wysiwyng opened this issue 4 years ago • 15 comments

When trying to call Lark.save(f) on an Earley parser, this exception is thrown: AttributeError: 'XEarley' object has no attribute '__serialize_fields__'

To reproduce:

from lark import Lark

grammar = """
    sentence: noun verb noun        -> simple
            | noun verb "like" noun -> comparative
    noun: adj? NOUN
    verb: VERB
    adj: ADJ
    NOUN: "flies" | "bananas" | "fruit"
    VERB: "like" | "flies"
    ADJ: "fruit"
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar, start='sentence', ambiguity='explicit', parser='earley')
with open('saved.pickle', 'wb') as f:
    parser.save(f)

Using lalr saves the parser as expected, but it is not an option in my use case. From the documentation and various issues (https://github.com/lark-parser/lark/issues/508#issuecomment-596068571, https://github.com/lark-parser/lark/issues/492#issuecomment-596068838) I assumed that parser saving is available regardless of parser type.

wysiwyng avatar Dec 01 '20 14:12 wysiwyng

Nope, not supported.

MegaIng avatar Dec 01 '20 14:12 MegaIng

Is this a planned feature? If not, could you please update the documentation to reflect this?

wysiwyng avatar Dec 01 '20 14:12 wysiwyng

I don't know if it is planned, @erezsh has to answer that, but I don't think so (if I remember correctly, the parser state for Earley is quite large). I also can't edit the docs.

MegaIng avatar Dec 01 '20 14:12 MegaIng

If I'm not mistaken, Lark uses its own way of serializing/deserializing Lark objects, requiring the __serialize_fields__ and __serialize_namespace__ attributes, right?

https://github.com/lark-parser/lark/blob/460a221923317299422f14d87901dee1eb5f5fda/lark/utils.py#L43-L50

ThatXliner avatar Dec 01 '20 16:12 ThatXliner

Yes

MegaIng avatar Dec 01 '20 16:12 MegaIng

Lark.save exists in order to save the parse table that the LALR parser generates, because it's expensive to compute.

Earley doesn't perform any significant precomputation. Its entire parsing process is completely dynamic. So, there is really nothing to save. Even if I make the function "work", it will not store anything useful, and loading will be just as fast as creating a new Lark instance.

erezsh avatar Dec 02 '20 10:12 erezsh

@MegaIng

I also can't edit the docs.

That's just not true. You can submit a PR, just like you often do with code.

erezsh avatar Dec 02 '20 10:12 erezsh

@erezsh Yes, that is true. But I didn't want to create a PR for a minimal notice.

MegaIng avatar Dec 02 '20 10:12 MegaIng

Probably the right thing to do here is not to change the docs, but to override XEarley.serialize to produce a nicer exception, telling the user they are doing something wrong.
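A minimal sketch of that idea (the class here is a hypothetical stand-in; lark's actual internals may differ): the Earley frontend would override serialize to fail with a clear message instead of the opaque AttributeError:

```python
class EarleyFrontend:
    """Stand-in for lark's internal XEarley class (hypothetical sketch)."""

    def serialize(self, memo=None):
        # Fail loudly and explain, rather than crash on a missing
        # __serialize_fields__ attribute as reported in this issue.
        raise NotImplementedError(
            "Lark.save() is not supported for the Earley parser; "
            "it only stores the LALR parse table. Use parser='lalr' "
            "or recreate the Lark instance instead."
        )
```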

erezsh avatar Dec 02 '20 10:12 erezsh

I would prefer if the documentation (where I learned about save in the first place) told me right away that it is not available for the Earley parser. The cache option, which as far as I can see uses this save feature under the hood, also has a short notice in the docs telling users that it's for LALR only.

wysiwyng avatar Dec 02 '20 14:12 wysiwyng

Hello, I've been confronted with a related use case for XEarley parsers on this issue. Due to the time cost, I've resorted to multiprocessing to parse the data, a lot of deep SQL queries. In practice, each SQL statement may take 2~7 seconds to parse due to the large grammar (like the one borrowed from the MySQL Workbench repo).

The problems are as follows:

  1. A Lark instance fails to get pickled and sent to other processes (exceeding the pickle recursion limit, perhaps due to some attribute like self.parser = self), so serialization is crucial.
  2. I tried to manually wrap Lark with something like the code below.
import lark

class ParserWrapper:
    def __init__(self, parser):
        self.parser: lark.Lark = parser

    def __getstate__(self):
        # Serialize via Lark's own memoized serializer instead of pickling directly
        data, m = self.parser.memo_serialize([lark.lexer.TerminalDef, lark.grammar.Rule])
        return {'data': data, 'memo': m}

    def __setstate__(self, state):
        self.parser = lark.Lark.load(state)
The workaround still fails because the XEarley parser is not supported in this way, leading to a misleading subscription error:

import lark
p = lark.Lark("""start: "a"*""")
p.save(open('/tmp/x', 'wb'))     # successful
pp = lark.Lark.load(open('/tmp/x', 'rb'))  # TypeError: 'Parser' object is not subscriptable
  3. For some large grammar file, MySQL.lark.gz, loading is also time-consuming. Creating a new Lark instance in each subprocess may not be a good solution.

So in summary, the serialization of XEarley parsers, at least for the parsed rules and terminal definitions, is a desirable and consistent feature.

Thanks for reading.

zxteloiv avatar May 26 '21 09:05 zxteloiv

@zxteloiv I don't think we considered gigantic autogenerated grammars when making these decisions xD. Yeah, we should make this work for Earley then. (Also, I don't think there is a good reason why this shouldn't be picklable. I just think your grammar is so big that the default implementation fails.)

MegaIng avatar May 26 '21 09:05 MegaIng

Oh yes! :) I also found working with the large grammar cumbersome. For now I will change my code to reuse the parser in each worker process. Really appreciate your work!

zxteloiv avatar May 26 '21 10:05 zxteloiv

@erezsh We did have a regression in #781 (i.e. saving an Earley parser now fails on loading, not on saving)

MegaIng avatar May 26 '21 10:05 MegaIng

Hi @zxteloiv,

  1. Pickle supports recursion, so I don't think that's the case. It's more likely that the grammar is just huge. You can manually set the recursion limit, and perhaps that's the easiest solution.

  2. That's a problem. We should throw a user-friendly error in this case.
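The recursion-limit suggestion in point 1 can be sketched with plain pickle, no lark involved (the Node chain here is just a stand-in for a huge grammar object graph): raising the interpreter's recursion limit before pickling lets deeply nested objects serialize.

```python
import pickle
import sys

class Node:
    """A deeply nested linked structure, standing in for a huge grammar tree."""
    def __init__(self, child=None):
        self.child = child

# Build a chain deep enough to trip the default recursion limit when pickled
deep = None
for _ in range(1500):
    deep = Node(deep)

old_limit = sys.getrecursionlimit()
sys.setrecursionlimit(10000)  # raise the limit before pickling, then restore it
try:
    blob = pickle.dumps(deep)
finally:
    sys.setrecursionlimit(old_limit)
```

Note that unpickling uses pickle's stack machine rather than Python recursion, so only the dump side needs the raised limit.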

the serialization of XEarley parsers ... is a desirable and consistent feature.

No argument there. Someone just has to implement it.

@MegaIng Not sure what that means. I don't think save/load on the Earley parser ever worked.

erezsh avatar May 26 '21 12:05 erezsh