lark icon indicating copy to clipboard operation
lark copied to clipboard

Embedded `Transformer` doesn't raise `VisitError`

Open stefanb2 opened this issue 4 months ago • 12 comments

Thanks for this useful package. I was positively surprised how fast I went from the need for a code generator with a simple DSL to a script that parses source files and starts to throw out expected code bits.

Side note: I wish the common terminals would be documented. I spent a lot of time in common.lark to figure out the hidden treasures in there....

Describe the bug

A Transformer embedded into a Lark instance with transformer= does not raise VisitError.

I'm not sure if this is intended behavior or a bug. If it is intended behavior then IMHO it should be added to VisitorError and Lark()'s transformer= parameter documentation.

IMHO embedded Transformer should raise VisitError, because if I understood the documentation correctly it is called immediately after a successful sub-parse. Therefore VisitError could be improved to provide the parse context, i.e. it would be possible to create more user friendly error messages for semantic errors. As the code works now I have to write the transformer error message that include hard-coded grammar bits.

To Reproduce

Tested with lark == 1.2.2

This stripped down excerpt from my code works as expected, i.e. exceptions raised in MyTransformer code are wrapped as VisitError:

  def __init__(self, transformer: MyTransformer):
    self._parser      = Lark(
      # NOTE: I left all options in, but I'm sure they are not relevant for the issue
      grammar             = _GRAMMAR,
      lexer               = 'basic',
      maybe_placeholders  = False,
      parser              = 'lalr',
      propagate_positions = False,
      strict              = True,
    )
    self._transformer = transformer

  def parse(self, contents: str):
    try:
      tree = self._parser.parse(contents)
      self._transformer.transform(tree)

    except UnexpectedInput:
      raise RuntimeError("SYNTAX ERROR")

    except VisitError:
      raise RuntimeError("SEMANTIC ERROR")

This never raises a VisitError, i.e. an exception is passed upwards as-is.

  def __init__(self, transformer: MyTransformer):
    self._parser      = Lark(
      # ... same as above ...
      transformer         = transformer,
    )

  def parse(self, contents: str):
    try:
      tree = self._parser.parse(contents)

    except UnexpectedInput:
      raise RuntimeError("SYNTAX ERROR")

    except VisitError:
      raise RuntimeError("SEMANTIC ERROR")

stefanb2 avatar Aug 17 '25 07:08 stefanb2

I'm glad you found it useful! I'm always in favor of improving the documentation. If you think some information should be added, you're more than welcome to submit a PR. (just from past experience, better to keep it small, and not modify big chunks of existing documentation)

Re the VisitError, if you could provide a minimal example that reproduces this behavior, with the transformer code, that would be helpful.

erezsh avatar Aug 17 '25 08:08 erezsh

Oh, sorry, I was expecting that someone familiar with the implementation would immediately understand the difference in the example code.

Here is the smallest complete reproducer I could come up with:

from lark            import Lark
from lark.exceptions import UnexpectedInput, VisitError
from lark.visitors   import Transformer

GRAMMAR = r"""
  start: INT

  %import common (INT)
"""

TEST = "100"

class MyTransformer(Transformer):
  def start(self, tree):
    assert False, "something went wrong in Transformer"

try:
  parser = Lark(
    grammar     = GRAMMAR,
    parser      = 'lalr',
    strict      = True,
  )
  transformer = MyTransformer()

  tree = parser.parse(TEST)
  transformer.transform(tree)

except UnexpectedInput as e:
  print(f"not embedded: SYNTAX ERROR - {e}")

except VisitError as e:
  print(f"not embedded: SEMANTIC ERROR - {e}")

except Exception as e:
  print(f"not embedded: UNKNOWN ERROR - {e}")

print()

try:
  parser = Lark(
    grammar     = GRAMMAR,
    parser      = 'lalr',
    strict      = True,
    transformer = MyTransformer()
  )

  parser.parse(TEST)

except UnexpectedInput as e:
  print(f"embedded: SYNTAX ERROR - {e}")

except VisitError as e:
  print(f"embedded: SEMANTIC ERROR - {e}")

except Exception as e:
  print(f"embedded: UNKNOWN ERROR - {e}")

Test output:

$ python3 reproducer.py 
not embedded: SEMANTIC ERROR - Error trying to process rule "start":

something went wrong in Transformer

embedded: UNKNOWN ERROR - something went wrong in Transformer

If the above is intended behavior then it would mean that Transformer implementation needs to be different if it is embedded or not. IMHO not the best interface choice.

stefanb2 avatar Aug 17 '25 14:08 stefanb2

Thanks, now I understand.

Yes, it's a discrepancy in the behavior. It's not ideal.

Luckily, I think it's pretty easy to fix:

class CompatibleTransformer(Transformer):
    def transform(self, tree):
        try:
            return super().transform(tree)
        except VisitError as e:
            raise e.orig_exc

class MyTransformer(CompatibleTransformer):
    ...

Give it a try! If it works well, we can add it to the recipe page.

(Or maybe even add it as a new transformer. But changing the existing one would break compatibility)

erezsh avatar Aug 17 '25 15:08 erezsh

Unfortunately your suggestion doesn't work: it hides VisitError also in the non-embedded use case.

This is exactly the opposite what this PR is about. To quote the documentation:

VisitError is raised when visitors are interrupted by an exception

which is not the case when you use Lark(..., transformer= ...)

stefanb2 avatar Aug 17 '25 19:08 stefanb2

it hides VisitError also in the non-embedded use case.

That is what it was meant to do

This is exactly the opposite what this PR is about

So far users have mostly complained about having to unwrap VisitError. Are you saying that you consider it an important feature?

erezsh avatar Aug 18 '25 08:08 erezsh

So far users have mostly complained about having to unwrap VisitError. Are you saying that you consider it an important feature?

Yes, because it allows to generate better error messages, which results in a better UX.

See my reproducer code: it can separate syntax from semantic from other errors.

It is also my understanding (please correct me if I'm wrong) that the parser calls the embedded transformer every time it completes a terminal/rule and not only once after the whole tree has been created like in the non-embedded use case. IMHO that should allow to enrich VisitError with parser context, i.e. you could generate even better error messages that include source snippets, similar to what is already now possible with syntax errors.

stefanb2 avatar Aug 18 '25 20:08 stefanb2

it can separate syntax from semantic from other errors.

I think the following code can do the same, unless I'm missing something:

try:
   p.parse(...)
except UnexpectedInput as e:
  print(f"embedded: SYNTAX ERROR - {e}")

except LarkError as e:
  print(f"embedded: UNKNOWN ERROR - {e}")

except Exception as e:
  print(f"embedded: SEMANTIC ERROR - {e}")

IMHO that should allow to enrich VisitError with parser context

Interesting thought. I'll look into it.

erezsh avatar Aug 18 '25 21:08 erezsh

Dooooh, I should have thought to use type() on the exception myself.

BUT your suggestion doesn't seem to work. I updated my reproducer to

from lark.exceptions import LarkError, UnexpectedInput, VisitError
...
try:
  parser = Lark(
    grammar     = GRAMMAR,
    parser      = 'lalr',
    strict      = True,
    # NOTE: embedded Transformer generates LarkError, not VisitError!
    transformer = MyTransformer()
  )

  parser.parse(TEST)

except UnexpectedInput as e:
  print(f"embedded: SYNTAX ERROR - ({type(e)}) {e}")

#except VisitError as e:
except LarkError as e:
  print(f"embedded: SEMANTIC ERROR - ({type(e)}) {e}")

except Exception as e:
  print(f"embedded: UNKNOWN ERROR - ({type(e)}) {e}")

and still get an assertion exception:

$ ./visit-error-reproducer.py 
not embedded: SEMANTIC ERROR - (<class 'lark.exceptions.VisitError'>) Error trying to process rule "start":

something went wrong in Transformer

embedded: UNKNOWN ERROR - (<class 'AssertionError'>) something went wrong in Transformer

stefanb2 avatar Aug 19 '25 05:08 stefanb2

Another problem with using LarkError for an embedded Transformer is that you wouldn't be able distinguish between developer error (e.g. incorrect grammar) and user error (e.g. semantic error). OK, that would be a minor issue, because it would only affect me as developer of the script, not the user running the production version of the script.

not embedded: UNKNOWN ERROR - (<class 'lark.exceptions.GrammarError'>) Rule 'INT2' used but not defined (in rule start)

embedded: SEMANTIC ERROR - (<class 'lark.exceptions.GrammarError'>) Rule 'INT2' used but not defined (in rule start)

stefanb2 avatar Aug 19 '25 05:08 stefanb2

Look at my last code again, it switches the order. LarkError = unknown error (for your code), while the generic Exception means it has to be a "semantic error", i.e. coming from the transformer, since we already caught all the exceptions Lark can legally throw.

You can break LarkError into smaller components too, like ParseError, ConfigurationError, etc. https://github.com/lark-parser/lark/blob/master/lark/exceptions.py/

erezsh avatar Aug 19 '25 06:08 erezsh

Look at my last code again, it switches the order. LarkError = unknown error (for your code), while the generic Exception means it has to be a "semantic error", i.e. coming from the transformer, since we already caught all the exceptions Lark can legally throw.

OK, I missed that.

But it doesn't really help in the real code generator, because there the parser instance creation and parsing of the source files are two separate steps. I.e. on parsing I can't separate between semantic errors and other errors, because I can only catch Exception to catch errors from Transformer.

I made the change to the code, but IMHO it is now much less clear than before.

stefanb2 avatar Aug 19 '25 07:08 stefanb2

on parsing I can't separate between semantic errors and other errors

Not sure what you mean. Lark.parse() can only throw two kinds of exceptions. LarkError or your errors. (anything else is a bug, afaik)

erezsh avatar Aug 19 '25 07:08 erezsh