Lalr parser raises UnexpectedToken('$END', ...) rather than UnexpectedEOF
Describe the bug
When the input is exhausted, the earley parser raises lark.exceptions.UnexpectedEOF(...), while the lalr parser raises lark.exceptions.UnexpectedToken('$END', ...).
For consistency's sake, when the unexpected token in an lalr parser is '$END', the error should be raised as UnexpectedEOF instead.
Some extra context
I am building an application that parses a stream, and I switched to the (much faster) lalr parser. A valid record may span several 'chunks' of the stream, so with earley I caught UnexpectedEOF and waited for more data. With lalr I now have to catch UnexpectedToken and drill into the error to check the token:
except lark.exceptions.UnexpectedToken as err:
    if err.token == lark.Token("$END", ""):
        logger.debug("Parser expected more data, waiting for another chunk")
    else:
        raise err
To Reproduce
import sys, lark
print(f"python: {sys.version_info}\nlark: {lark.__version__}\n\n")
grammar = 'start: "A" ~ 4'  # 4 sequential A's
try:
    lark.Lark(grammar, parser="earley").parse("AA")
except Exception as err:
    print("Earley err:", type(err), *err.args)
try:
    lark.Lark(grammar, parser="lalr").parse("AA")
except Exception as err:
    print("Lalr err:", type(err), *err.args)
Output
python: sys.version_info(major=3, minor=8, micro=5, releaselevel='final', serial=0)
lark: 0.11.1
Earley err: <class 'lark.exceptions.UnexpectedEOF'> Unexpected end-of-input. Expected one of:
* A
Lalr err: <class 'lark.exceptions.UnexpectedToken'> Unexpected token Token('$END', '') at line 1, column 2.
Expected one of:
* A
Note that this is something that might break compatibility. This is something we have in mind, and I think we also agree that it would be better for both parsers to throw the same exception. (Note that this includes the possibility of making the earley parser throw UnexpectedToken. But you are making a decent case for keeping UnexpectedEOF.)
While this is certainly a good change, it might only happen in 1.0. (Or we temporarily make UnexpectedEOF behave like an UnexpectedToken, but that seems a bit hacky.)
Yeah, this is definitely a breaking change either way, as the different exception types can change the control flow of a program. You've seen my use case, so I would prefer both parsers to raise UnexpectedEOF. That said, there are easy workarounds here until 1.0 lands.
Thanks for the great library!
Yes, consistency would make error-catching much easier.