lark
Found a gotcha with the interactive parser
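from lark import Lark

parser = Lark(r"""
%ignore /[\t \f]+/ // WS
start: d|c
d: B
c: A B
A: "abc"
B: /[^\W\d]\w*/
""", parser="lalr")
input_str = 'abc'
interactive = parser.parse_interactive(input_str)
# Feed every token the lexer produces into the parser and print them.
print(interactive.exhaust_lexer())
# Print the terminals the parser can accept next.
print(interactive.accepts())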
The output is
[Token('A', 'abc')]
{'B'}
But if the input runs straight on into what should be a B token, with no whitespace in between (e.g. input_str = 'abcfdsfds'), the output turns into:
[Token('B', 'abcfdsfds')]
{'$END'}
Is there a way to recognize that we need a whitespace-separated B token and not just an immediate B?
The use case here is generating the "abc fdsfdsfds" string properly. To do this, the upstream logic needs to know not only that a B token comes next (which .accepts() reports), but also that the B token must come after whitespace.
No, there is no built-in way to deal with this; you will need to write custom logic on a case-by-case basis. hypothesmith does this by more or less inserting a WS between any tokens (or at least any identifier tokens).
You could try to write a general system that goes back one token, tests whether lexing still produces the same tokens once the new token is appended, and otherwise inserts an ignored token. But because lexing with regexes can already be an arbitrarily complex step, there is no good general solution. For just the common case of keywords and identifiers, it's a pretty easy workaround.
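A rough sketch of that "re-lex and insert a separator if needed" idea, for the keyword/identifier case only. Everything here is hypothetical: `toy_lex` is a small regex stand-in that mimics how the grammar above tokenizes (it is not Lark's lexer), and `join_tokens` is an illustrative helper name, not part of any library.

```python
import re
from typing import Callable, List, Tuple

Tok = Tuple[str, str]  # (terminal name, matched text)

_WS = re.compile(r"[\t \f]+")       # same pattern as the %ignore rule
_IDENT = re.compile(r"[^\W\d]\w*")  # same pattern as terminal B


def toy_lex(text: str) -> List[Tok]:
    """Stand-in lexer (assumption, not Lark): skip blanks, match an
    identifier, and call it A if it is exactly "abc", otherwise B."""
    tokens: List[Tok] = []
    pos = 0
    while pos < len(text):
        ws = _WS.match(text, pos)
        if ws:
            pos = ws.end()
            continue
        m = _IDENT.match(text, pos)
        if m is None:
            raise ValueError(f"cannot lex at position {pos}")
        word = m.group()
        tokens.append(("A" if word == "abc" else "B", word))
        pos = m.end()
    return tokens


def join_tokens(tokens: List[Tok],
                lex: Callable[[str], List[Tok]],
                separator: str = " ") -> str:
    """Concatenate token texts, adding `separator` only where gluing the
    next token on directly would change how the text lexes."""
    out = ""
    done: List[Tok] = []
    for tok in tokens:
        candidate = out + tok[1]
        if lex(candidate) != done + [tok]:
            # Direct concatenation merged or re-typed tokens: use the separator.
            candidate = out + separator + tok[1]
        out = candidate
        done.append(tok)
    return out


print(join_tokens([("A", "abc"), ("B", "fdsfds")], toy_lex))
```

With a real Lark grammar, the `lex` argument could probably be built on top of `Lark.lex`, converting each token to a `(type, text)` pair, but that is untested here. Re-lexing the whole prefix for every token is quadratic, and the sketch assumes the separator is always enough to keep neighbouring tokens apart, which only holds for simple cases like this one.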