lark
Lark.save does not work using the earley parser
When trying to call Lark.save(f) on an Earley parser, this exception is thrown:
AttributeError: 'XEarley' object has no attribute '__serialize_fields__'
To reproduce:

```python
from lark import Lark

grammar = """
sentence: noun verb noun -> simple
        | noun verb "like" noun -> comparative

noun: adj? NOUN
verb: VERB
adj: ADJ

NOUN: "flies" | "bananas" | "fruit"
VERB: "like" | "flies"
ADJ: "fruit"

%import common.WS
%ignore WS
"""

parser = Lark(grammar, start='sentence', ambiguity='explicit', parser='earley')

with open('saved.pickle', 'wb') as f:
    parser.save(f)
```
Using lalr saves the parser as expected, but that is not an option in my use case. From the documentation and various issues (https://github.com/lark-parser/lark/issues/508#issuecomment-596068571, https://github.com/lark-parser/lark/issues/492#issuecomment-596068838) I assumed that parser saving is available regardless of parser type.
Nope, not supported.
Is this a planned feature? If not, could you please update the documentation to reflect this?
I don't know if it is planned; @erezsh has to answer that, but I don't think so (if I remember correctly, the parser state for Earley is quite large). I also can't edit the docs.
If I'm not mistaken, Lark uses its own way of serializing/deserializing Lark objects, requiring the __serialize_fields__ and __serialize_namespace__ attributes, right?
https://github.com/lark-parser/lark/blob/460a221923317299422f14d87901dee1eb5f5fda/lark/utils.py#L43-L50
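The pattern behind those attributes can be sketched roughly as follows. This is an illustration only, not lark's actual implementation: the real Serialize class in lark/utils.py also handles memoization and nested objects via __serialize_namespace__, and the TerminalDef here is a stand-in with made-up fields.

```python
# Illustrative sketch of the "declare-your-fields" serialization pattern:
# subclasses list the attribute names to persist in __serialize_fields__,
# and a generic serialize()/deserialize() pair walks that list.
class Serialize:
    __serialize_fields__ = ()  # subclasses override with attribute names

    def serialize(self):
        # Collect only the declared fields into a plain dict
        return {f: getattr(self, f) for f in self.__serialize_fields__}

    @classmethod
    def deserialize(cls, data):
        # Rebuild an instance without calling __init__
        obj = cls.__new__(cls)
        for field, value in data.items():
            setattr(obj, field, value)
        return obj


class TerminalDef(Serialize):  # hypothetical example class
    __serialize_fields__ = ('name', 'pattern')

    def __init__(self, name, pattern):
        self.name = name
        self.pattern = pattern


td = TerminalDef('NOUN', '"flies" | "bananas"')
data = td.serialize()                 # {'name': 'NOUN', 'pattern': '...'}
restored = TerminalDef.deserialize(data)
```

A class that does not define __serialize_fields__ (like XEarley in the traceback above) fails with exactly the reported AttributeError when the generic machinery touches it.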
Yes.

Lark.save exists in order to save the parse table that LALR generates, because it's expensive to compute.

Earley doesn't perform any significant precomputation; its entire parsing process is completely dynamic. So there is really nothing to save. Even if I make the function "work", it will not store anything useful, and loading will be just as fast as creating a new Lark instance.
@MegaIng
> I also can't edit the docs.
That's just not true. You can submit a PR, just like you often do with code.
@erezsh Yes, that is true. But I didn't want to create a PR for a minimal notice.
Probably the right thing to do here is not to change the docs, but to override XEarley.serialize to produce a nicer exception telling the user they are doing something wrong.
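That suggestion could look something like the sketch below. The class and method names are illustrative stand-ins, not lark's actual internals:

```python
# Sketch of the suggestion above: override serialize() on the Earley
# frontend so the user gets a clear explanation instead of the opaque
# AttributeError. "EarleyFrontend" is a hypothetical name for this example.
class EarleyFrontend:
    def serialize(self, memo=None):
        raise NotImplementedError(
            "Lark.save() is only supported for parser='lalr'. "
            "The Earley parser builds no precomputed tables, so there is "
            "nothing to save; constructing a new Lark instance is just as fast."
        )
```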
I would prefer if the documentation (where I learned about save in the first place) told me right away that it is not available for the Earley parser. The cache option, which as far as I can tell uses this save feature under the hood, also has a short notice in the docs telling users that it's for LALR only.
Hello, I've been confronted with a related use of XEarley parsers on this issue. Due to the time cost, I've resorted to multiprocessing to parse the data, a lot of deep SQL queries. In practice, each SQL statement may take 2-7 seconds to parse due to the large grammar (like the one borrowed from the MySQL Workbench repo).
The problems are as follows:
- A Lark instance fails to get pickled and sent to other processes (exceeding the pickle recursion limit, perhaps due to some attribute like self.parser = self), so serialization is crucial.
- I tried to manually wrap Lark with something like the following:
```python
class ParserWrapper:
    def __init__(self, parser):
        self.parser: lark.Lark = parser

    def __getstate__(self):
        data, m = self.parser.memo_serialize([lark.lexer.TerminalDef, lark.grammar.Rule])
        return {'data': data, 'memo': m}

    def __setstate__(self, state):
        self.parser = lark.Lark.load(state)
```
The workaround still fails because the XEarley parser is not supported this way, leading to a confusing "not subscriptable" error:

```python
import lark

p = lark.Lark("""start: "a"*""")
p.save(open('/tmp/x', 'wb'))               # successful
pp = lark.Lark.load(open('/tmp/x', 'rb'))  # TypeError: 'Parser' object is not subscriptable
```
- For a large grammar file such as MySQL.lark.gz, loading is also time-consuming, so creating a new Lark instance in each subprocess may not be a good solution.
So in summary, the serialization of XEarley parsers, at least for the parsed rules and terminal definitions, is a desirable and consistent feature.
Thanks for reading.
@zxteloiv I don't think we considered gigantic autogenerated grammars when making these decisions xD. Yeah, we should make this work for Earley then. (Also, I don't think there is a good reason why this shouldn't be picklable; I just think your grammar is so big that the default implementation fails.)
Oh yes! :) I also found working with the large grammar cumbersome. For now I will change my code to reuse the parser in each worker process. Really appreciate your work!
@erezsh We did have a regression in #781 (saving an Earley parser now fails on loading, not on saving).
Hi @zxteloiv,
1. Pickle supports recursion, so I don't think that's the case. It's more likely that the grammar is just huge. You can manually set the recursion limit, and perhaps that's the easiest solution.

2. That's a problem. We should throw a user-friendly error in this case.
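The recursion-limit point can be illustrated with a plain recursive walk (using a recursive function rather than pickle itself, since the C pickler's depth handling varies across CPython versions):

```python
# Demonstration of sys.setrecursionlimit: a structure nested deeper than
# the limit raises RecursionError during a recursive traversal, and
# raising the limit is the quick fix.
import sys


def depth(node, d=0):
    # Recursively walk a nested-list chain, like a serializer
    # walking a deep object graph
    return depth(node[0], d + 1) if isinstance(node, list) else d


deep = 42
for _ in range(3000):
    deep = [deep]              # [[[... 42 ...]]], 3000 levels deep

sys.setrecursionlimit(1000)    # the usual default
try:
    depth(deep)
except RecursionError:
    print("too deep for the default limit")

sys.setrecursionlimit(10000)   # raise the limit, as suggested above
print(depth(deep))             # prints 3000
```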
> the serialization of XEarley parsers ... is a desirable and consistent feature.
No argument there. Someone just has to implement it.
@MegaIng Not sure what it means. I don't think save/load on the Earley parser ever worked.