Found a gotcha with the interactive parser

Open RevanthRameshkumar opened this issue 2 years ago • 3 comments

from lark import Lark

parser = Lark(r"""
%ignore /[\t \f]+/  // WS

start: d|c
d: B
c: A B
A: "abc"
B: /[^\W\d]\w*/
""", parser="lalr")

input_str = 'abc'
interactive = parser.parse_interactive(input_str)
print(interactive.exhaust_lexer())  # tokens lexed from the input so far
print(interactive.accepts())        # terminal names the parser would accept next

The output is

[Token('A', 'abc')]
{'B'}

But if you append the text of a B token directly to the input string, with no whitespace in between (e.g. input_str = 'abcfdsfds'), the output turns into:

[Token('B', 'abcfdsfds')]
{'$END'}

Is there a way to recognize that we need a whitespace-separated B token and not just a B immediately after the A?

RevanthRameshkumar • Oct 04 '23 20:10

The use case here is to generate the "abc fdsfdsfds" string properly. To do this, the upstream logic needs to know not only that a B token comes next (which .accepts() reports), but also that the B token must come after whitespace.

RevanthRameshkumar • Oct 04 '23 20:10

No, there is no built-in way to deal with this; you will need to write custom logic on a case-by-case basis. hypothesmith does this by more or less inserting a WS between any tokens (or at least any identifier tokens).
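
A minimal sketch of that "insert whitespace between identifier-like tokens" idea, in plain Python. The helper names `join_with_ws` and `_is_word_char`, and the rule "separate two tokens whenever both characters at the seam are word characters", are assumptions made for illustration; this is not hypothesmith's actual code.

```python
def _is_word_char(ch):
    # Mirrors \w in the grammar above: alphanumerics plus underscore.
    return ch.isalnum() or ch == "_"


def join_with_ws(values):
    """Concatenate token texts, adding a space where two word-like tokens would touch."""
    out = ""
    for value in values:
        if not value:
            continue  # an empty token text cannot merge with anything
        # Two tokens can only fuse into one identifier when the seam has
        # word characters on both sides, so only then is a separator added.
        if out and _is_word_char(out[-1]) and _is_word_char(value[0]):
            out += " "
        out += value
    return out


print(join_with_ws(["abc", "fdsfds"]))
```

This is deliberately conservative: it may add a space that is not strictly required, but for keyword and identifier terminals like A and B it should never let two tokens merge.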

MegaIng • Oct 04 '23 20:10

You could try to write a general system that goes back a token, tests whether lexing still produces the same tokens once the new token is appended, and otherwise inserts an ignored token. But because lexing with regexes can already be an arbitrarily complex step, there is no good general solution. For just the common case of keywords and identifiers, it's a pretty easy workaround.
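
The re-lexing check described above could look roughly like the sketch below. Everything here is an assumption for illustration: `toy_lex` is a hand-written stand-in for the grammar's lexer (identifiers are B, the exact text "abc" is A, whitespace is ignored), and `append_token` is a hypothetical helper. For simplicity it re-lexes the whole prefix rather than only the last token. With Lark, the `lex` argument could presumably wrap the parser's own lexer instead.

```python
import re

# Stand-in for the grammar's terminals: ignored whitespace and identifiers.
_TOKEN = re.compile(r"(?P<WS>[\t \f]+)|(?P<B>[^\W\d]\w*)")


def toy_lex(text):
    """Return [(type, value), ...]; raise ValueError on unlexable input."""
    tokens, pos = [], 0
    while pos < len(text):
        m = _TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"cannot lex at position {pos}")
        pos = m.end()
        if m.lastgroup == "WS":
            continue  # dropped, like %ignore in the grammar
        value = m.group()
        # The keyword "abc" wins over the identifier pattern on an exact match.
        tokens.append(("A" if value == "abc" else "B", value))
    return tokens


def append_token(prefix, expected, new, lex=toy_lex, sep=" "):
    """Append token `new` (type, value) to `prefix`, whose tokens are `expected`.

    Try direct concatenation first; if re-lexing does not yield exactly
    expected + [new], insert the ignored separator before the new token.
    """
    candidate = prefix + new[1]
    try:
        unchanged = lex(candidate) == expected + [new]
    except ValueError:
        unchanged = False
    return candidate if unchanged else prefix + sep + new[1]


print(append_token("abc", [("A", "abc")], ("B", "fdsfds")))
```

Here the direct concatenation "abcfdsfds" lexes as a single B token, which differs from the expected A followed by B, so the separator is inserted.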

MegaIng • Oct 04 '23 20:10