lark
Supplying a custom lexer is hard because of "assert issubclass"
Suggestion
My use case is a bit odd: I'm not parsing text, I'm parsing rows in a database.
Background: my employer provides network services to businesses (e.g. Internet access, dark fibre, VOIP, etc.). We sign contracts with our customers that specify the services we provide them. A contract consists of many line items, and each network service is specified by a sequence of line items.
For example, one Internet service might look like:
- optional header line
- one or more access lines ("Internet service at 123 Main Street")
- one or more circuit lines (important to our network techs, and provides a unique ID for this service)
- one optional "bandwidth" line
- any number of optional "IP prefix" lines
You can see why a parser for a regular language is a good fit for this problem.
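To make the "regular language" point concrete, here is a minimal sketch that checks the line-item sequence above with a plain regular expression over one-letter codes. The kind names and the `is_internet_service` helper are made up for illustration; this is not how Lark would do it, just a demonstration that the structure is regular.

```python
import re

# Hypothetical one-letter codes for each kind of line item.
CODES = {
    "header": "H",
    "access": "A",
    "circuit": "C",
    "bandwidth": "B",
    "ip_prefix": "P",
}

# optional header, 1+ access, 1+ circuit, optional bandwidth, 0+ IP prefix
INTERNET_SERVICE = re.compile(r"H?A+C+B?P*")


def is_internet_service(kinds):
    """Return True if the sequence of line-item kinds forms one Internet service."""
    encoded = "".join(CODES[kind] for kind in kinds)
    return INTERNET_SERVICE.fullmatch(encoded) is not None


print(is_internet_service(["access", "circuit", "ip_prefix", "ip_prefix"]))
print(is_internet_service(["circuit", "access"]))
```

A real parser is still the better tool once several service types and useful error messages are needed, which is why I want Lark here.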
The lexer for this case is really simple: each line item (database row) becomes one token, where token.value is an object built from that line item. The lexer is completely trivial -- everything interesting happens in the parser.
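Roughly, the lexing step amounts to the sketch below. `Token` here is a local stand-in with the same two fields I rely on from Lark's token (a terminal name and a value), and the `kind` column is an assumed name for whatever identifies the line-item type in the database.

```python
from typing import Any, Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """Stand-in for Lark's Token: a terminal name plus an arbitrary value."""
    type: str
    value: Any


def lex_rows(rows: Iterable[dict]) -> Iterator[Token]:
    """Turn each database row into exactly one token."""
    for row in rows:
        # The terminal name comes from the (assumed) "kind" column;
        # the whole row travels along as the token's value.
        yield Token(row["kind"].upper(), row)


rows = [
    {"kind": "access", "text": "Internet service at 123 Main Street"},
    {"kind": "circuit", "text": "circuit ID line"},
]
for token in lex_rows(rows):
    print(token.type, token.value["text"])
```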
Except ... for complicated reasons, I need control over how the lexer is instantiated. I don't want Lark to do it for me; I just want to pass in a lexer object. Or, I would happily pass in a callable that returns a lexer object -- if that's how Lark wants to do it, that's fine.
But I can't, because of this code in Lark.__init__():

    lexer = self.options.lexer
    if isinstance(lexer, type):
        assert issubclass(lexer, Lexer)  # XXX Is this really important? Maybe just ensure interface compliance
    else:
        assert_config(lexer, ('standard', 'contextual', 'dynamic', 'dynamic_complete'))

lexer must either be one of those strings, or it must be a class object. And that class must be a subclass of Lark's own Lexer.
I think changing that code to
    lexer = self.options.lexer
    if callable(lexer):
        # lexer will be called like
        #     lexer(lexer_conf)
        # and it must return something compatible with lark.lexer.Lexer.
        # The easiest way to accomplish this is to pass a subclass of
        # Lexer.
        pass
    else:
        assert_config(lexer, ('standard', 'contextual', 'dynamic', 'dynamic_complete'))
would work for me, and should be compatible.
Not tested though! Would like some feedback before I pursue this path.
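To show the difference between the two checks, here is a small sketch using only the standard library. `RowLexer` and its `source` argument are hypothetical; with Lark it would subclass lark.lexer.Lexer. A functools.partial object is callable but is not a class, so the existing isinstance(lexer, type) branch does not accept it, while a callable() check would.

```python
from functools import partial


class RowLexer:
    """Hypothetical lexer that needs an extra constructor argument."""

    def __init__(self, lexer_conf, source):
        self.lexer_conf = lexer_conf
        self.source = source


# A callable that, given lexer_conf, returns a configured lexer object.
make_lexer = partial(RowLexer, source="contracts_db")

print(isinstance(RowLexer, type))    # a plain class passes the current check
print(isinstance(make_lexer, type))  # a partial is not a class
print(callable(make_lexer))          # but it is callable

lexer = make_lexer(lexer_conf=None)
print(type(lexer).__name__, lexer.source)
```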
Describe alternatives you've considered
Writing one custom subclass of Lexer is not a problem, and neither is passing the class object. But now I need different lexer objects, depending on circumstances.
I think I can figure out something with different subclasses. But it will be ugly.
Nope, this is not safe. We assume a few things based on attributes of the Lexer class.
What you can simply do is create a class factory:
    from lark.lexer import Lexer

    def create_lexer(name: str, arg_1, arg_2, arg_3):
        class SpecialLexer(Lexer):  # (Or another more general class, if you have common code for all lexers.)
            def lex(self, stream):
                # Generate the tokens based on arg_1, arg_2, arg_3 or whatever
                raise NotImplementedError
        # To make errors a little better (you should also assign __qualname__ to something useful)
        SpecialLexer.__name__ = name
        return SpecialLexer
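For illustration, a self-contained variant of the factory with a usage example might look like the sketch below. The `Lexer` base class is a local stand-in so the snippet runs without Lark installed (with Lark it would be lark.lexer.Lexer), `token_type` is a hypothetical parameter, and the tuples stand in for real tokens. Each call returns a distinct class, so each one passes the same kind of isinstance/issubclass checks quoted from Lark.__init__().

```python
class Lexer:
    """Stand-in for lark.lexer.Lexer, so this sketch runs without Lark."""

    def lex(self, data):
        raise NotImplementedError


def create_lexer(name, token_type):
    """Build a new Lexer subclass that closes over token_type."""

    class SpecialLexer(Lexer):
        def __init__(self, lexer_conf):
            self.lexer_conf = lexer_conf

        def lex(self, rows):
            # Emit one (type, value) pair per row; real code would yield tokens.
            for row in rows:
                yield (token_type, row)

    SpecialLexer.__name__ = name
    SpecialLexer.__qualname__ = name
    return SpecialLexer


AccessLexer = create_lexer("AccessLexer", "ACCESS")
CircuitLexer = create_lexer("CircuitLexer", "CIRCUIT")

# Two different classes, both still subclasses of Lexer.
assert isinstance(AccessLexer, type) and issubclass(AccessLexer, Lexer)
assert AccessLexer is not CircuitLexer
print(list(AccessLexer(lexer_conf=None).lex(["row 1", "row 2"])))
```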
@gward Hi Greg, I think the change you propose makes sense.
@MegaIng Why do you think it's such a bad idea?
Another alternative: You can subclass from Lexer and override the __new__ method. But that's a little backwards.
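A rough sketch of what the __new__ approach could look like, under stated assumptions: `Lexer` is a local stand-in for lark.lexer.Lexer, and `RowLexer`, `LexerHook` and the `factory` attribute are hypothetical names. It has not been tested against Lark itself, so whether Lark's internals accept the returned object is an open question.

```python
class Lexer:
    """Stand-in for lark.lexer.Lexer, so this sketch runs without Lark."""

    def lex(self, data):
        raise NotImplementedError


class RowLexer(Lexer):
    """Hypothetical lexer that needs an extra constructor argument."""

    def __init__(self, lexer_conf, source):
        self.lexer_conf = lexer_conf
        self.source = source


class LexerHook(Lexer):
    """A class whose 'instantiation' hands back whatever factory builds."""

    factory = None  # hypothetical hook: a callable taking lexer_conf

    def __new__(cls, lexer_conf):
        # The returned object is not an instance of cls, so Python
        # skips LexerHook.__init__ and returns it unchanged.
        return cls.factory(lexer_conf)


LexerHook.factory = staticmethod(lambda conf: RowLexer(conf, source="contracts_db"))

lexer = LexerHook(lexer_conf=None)
assert issubclass(LexerHook, Lexer)  # the class itself passes a subclass check
assert type(lexer) is RowLexer       # but calling it yields the configured lexer
print(lexer.source)
```

The class passed in is only a shim; the object that comes back is configured elsewhere, which is what makes this feel backwards.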