Supplying a custom lexer is hard because of "assert issubclass"

Open gward opened this issue 4 years ago • 2 comments

Suggestion

My use case is a bit odd: I'm not parsing text, I'm parsing rows in a database.

Background: my employer provides network services to businesses (e.g. Internet access, dark fibre, VOIP, etc.). We sign contracts with our customers that specify the services we provide them. A contract consists of many line items, and each network service is specified by a sequence of line items.

For example, one Internet service might look like:

  • optional header line
  • one or more access lines ("Internet service at 123 Main Street")
  • one or more circuit lines (important to our network techs, and provides a unique ID for this service)
  • one optional "bandwidth" line
  • any number of optional "IP prefix" lines

You can see why a parser for a regular language is a good fit for this problem.

The lexer for this case is really simple: each line item (database row) becomes one token, where token.value is an object built from that line item. The lexer is completely trivial -- everything interesting happens in the parser.
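A rough sketch of the lexer described above, one token per database row. To keep it runnable without Lark installed, `Token` here is a minimal stand-in for `lark.Token`, and the `LineItem` row layout (`kind`, `description`) is hypothetical; the real schema is not shown in this issue.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator


@dataclass
class Token:
    """Minimal stand-in for lark.Token: a terminal name plus a payload."""
    type: str
    value: Any


@dataclass
class LineItem:
    """Hypothetical database row; the real schema is not part of this issue."""
    kind: str          # e.g. "access", "circuit", "bandwidth"
    description: str


def lex_rows(rows: Iterable[LineItem]) -> Iterator[Token]:
    """Emit exactly one token per row; token.value carries the row object."""
    for row in rows:
        # Terminal names are conventionally upper-case in Lark grammars.
        yield Token(row.kind.upper(), row)


rows = [
    LineItem("access", "Internet service at 123 Main Street"),
    LineItem("circuit", "circuit ID line"),
    LineItem("bandwidth", "bandwidth line"),
]
tokens = list(lex_rows(rows))
```

All the structure (optional header, one or more access lines, and so on) would then live in the grammar, with the token types as terminals.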

Except ... for complicated reasons, I need control over how the lexer is instantiated. I don't want Lark to do it for me; I just want to pass in a lexer object. Or, I would happily pass in a callable that returns a lexer object -- if that's how Lark wants to do it, that's fine.

But I can't, because of this code in Lark.__init__():

        lexer = self.options.lexer
        if isinstance(lexer, type):
            assert issubclass(lexer, Lexer)     # XXX Is this really important? Maybe just ensure interface compliance
        else:
            assert_config(lexer, ('standard', 'contextual', 'dynamic', 'dynamic_complete'))

lexer must either be one of those strings, or it must be a class object. And that class must be a subclass of Lark's own Lexer.

I think changing that code to

        lexer = self.options.lexer
        if callable(lexer):
            # lexer will be called like
            #   lexer(lexer_conf)
            # and it must return something compatible with lark.lexer.Lexer.
            # The easiest way to accomplish this is to pass a subclass of
            # Lexer.
            pass
        else:
            assert_config(lexer, ('standard', 'contextual', 'dynamic', 'dynamic_complete'))

would work for me, and should be compatible.
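To make the difference concrete, here is a small sketch comparing the two conditions. `Lexer` is a stand-in base class (not Lark's), `RowLexer` and its `source` argument are made up for illustration, and `old_check` / `new_check` are hypothetical helpers that mirror only the branching in the two snippets above, not Lark's full option validation.

```python
from functools import partial

BUILTIN_LEXERS = ('standard', 'contextual', 'dynamic', 'dynamic_complete')


class Lexer:
    """Stand-in for lark.lexer.Lexer, so the sketch runs without Lark."""


class RowLexer(Lexer):
    def __init__(self, lexer_conf, source="default"):
        self.lexer_conf = lexer_conf
        self.source = source


def old_check(lexer):
    """Mirrors the current condition: a Lexer subclass or a known string."""
    if isinstance(lexer, type):
        return issubclass(lexer, Lexer)
    return lexer in BUILTIN_LEXERS


def new_check(lexer):
    """Mirrors the proposed condition: any callable or a known string."""
    return callable(lexer) or lexer in BUILTIN_LEXERS


# A factory that pre-binds an argument and is later called as lexer(lexer_conf).
factory = partial(RowLexer, source="contract-42")

results = {
    "class/old": old_check(RowLexer),
    "class/new": new_check(RowLexer),
    "factory/old": old_check(factory),
    "factory/new": new_check(factory),
}
instance = factory("conf")
```

Under these assumptions the class passes both checks, while the `partial` factory passes only the proposed one; that is the gap this issue is about.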

Not tested though! Would like some feedback before I pursue this path.

Describe alternatives you've considered

Writing one custom subclass of Lexer is not a problem, and neither is passing the class object. But now I need different lexer objects depending on circumstances.

I think I can figure out something with different subclasses. But it will be ugly.

gward, Nov 30 '20 16:11

Nope, this is not safe. We assume a few things based on attributes of the Lexer class.

What you can simply do is create a class factory:

from lark.lexer import Lexer

def create_lexer(name: str, arg_1, arg_2, arg_3):
    # Or derive from a more general base class if all your lexers share common code.
    class SpecialLexer(Lexer):
        def lex(self, stream):
            # Generate the tokens based on arg_1, arg_2, arg_3 or whatever.
            yield from ()  # placeholder so the method body is valid
    # To make errors a little better (you should also assign __qualname__ to something useful).
    SpecialLexer.__name__ = name
    return SpecialLexer
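
A self-contained variant of the factory idea, for illustration only. `Lexer` below is a stand-in base class rather than Lark's, and the single `prefix` argument is hypothetical. It shows that each call yields a distinct subclass that still satisfies an `issubclass` check and carries its own captured arguments.

```python
class Lexer:
    """Stand-in for lark.lexer.Lexer, so the sketch runs without Lark."""

    def lex(self, stream):
        raise NotImplementedError


def create_lexer(name, prefix):
    class SpecialLexer(Lexer):
        def __init__(self, lexer_conf=None):
            self.lexer_conf = lexer_conf

        def lex(self, stream):
            # `prefix` is captured from the enclosing call (a closure).
            for item in stream:
                yield f"{prefix}:{item}"

    # Give each generated class a readable name for error messages.
    SpecialLexer.__name__ = name
    SpecialLexer.__qualname__ = name
    return SpecialLexer


InternetLexer = create_lexer("InternetLexer", "inet")
VoipLexer = create_lexer("VoipLexer", "voip")

inet_tokens = list(InternetLexer().lex(["access", "circuit"]))
voip_tokens = list(VoipLexer().lex(["access"]))
```

The cost is one new class per configuration, which is presumably the ugliness the original post wanted to avoid.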

MegaIng, Nov 30 '20 18:11

@gward Hi Greg, I think the change you propose makes sense.

@MegaIng Why do you think it's such a bad idea?

Another alternative: You can subclass from Lexer and override the __new__ method. But that's a little backwards.
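
A minimal sketch of what the `__new__` approach could look like, using a stand-in `Lexer` base class instead of Lark's. `RowLexer`, `PrebuiltLexer`, and the `prebuilt` attribute are hypothetical names. Whether Lark relies on the returned object actually being a `Lexer` instance is not something this sketch verifies.

```python
class Lexer:
    """Stand-in for lark.lexer.Lexer, so the sketch runs without Lark."""

    def lex(self, stream):
        raise NotImplementedError


class RowLexer:
    """Built by the application with whatever arguments it needs."""

    def __init__(self, prefix):
        self.prefix = prefix

    def lex(self, stream):
        for item in stream:
            yield f"{self.prefix}:{item}"


class PrebuiltLexer(Lexer):
    """Passes an issubclass check against Lexer but returns a ready-made object."""

    prebuilt = None  # set by the application before the parser is created

    def __new__(cls, lexer_conf):
        # Returning an object that is not an instance of cls makes Python
        # skip cls.__init__, so the prebuilt object is handed back untouched.
        return cls.prebuilt


PrebuiltLexer.prebuilt = RowLexer("inet")
lexer = PrebuiltLexer("lexer_conf is ignored here")
out = list(lexer.lex(["access"]))
```

The class attribute acts as a side channel for the instance, which is why this reads as backwards compared with passing a callable directly.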

erezsh, Dec 02 '20 10:12