Extra optional symbol affects the choice between two regexps

Open ObjatieGroba opened this issue 3 years ago • 4 comments

Should an extra optional symbol affect the choice between two regexps?

Let's check an example:

import lark

parser = lark.Lark('''
?start: s
%ignore /[\\n]/+

R1: /[\\w\\+]/+
R2: /(\\w)/+

SPACE: " "

space: (SPACE+)?

s: ("$" space R1) | ("@" space R2)
''', parser='lalr')

for example in ('$x',
                '@x',
                '$ x',
                '@ y'):
    print(example)
    print(parser.parse(example).pretty())

Both $ and @ parse correctly when there is no space before R1 and R2.

The 4th example raises an exception: Unexpected token Token('R1', 'y') at line 1, column 3. Expected one of: * R2.

Is this a bug (no exception is raised when the grammar is compiled) or a feature? If it is a feature, how can I work around it?

ObjatieGroba, May 31 '22 21:05

Replacing the space with any other symbol (for example ".") leads to the same result.

ObjatieGroba, May 31 '22 21:05

My guess:

The contextual lexer feature described in the docs can only look at the previous token, can't it?

That would explain it: when $ or @ directly precedes the terminal there is only one choice, whereas after a space both regexps are possible.

ObjatieGroba, May 31 '22 21:05

This can't really be avoided. The contextual lexer doesn't quite do what you want; it's still limited by the LALR parser. If you want to do this kind of thing, either make sure that R1 and R2 don't conflict or use parser='earley'.

MegaIng, May 31 '22 21:05

Thank you, @MegaIng

This is not quite intuitive (combining these parsing cases into one pool of tokens).

How can I be sure that no two of my regexps can be expected at the same parsing point (including recursive cases)?

It would be good for lark to have a flag that warns about any overlapping regexps :(

For example, for s: ("$" R1) | ("@" "$" R2), LALR separates the regexps successfully.
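As a rough aid for the question above, here is a standard-library sketch that reports which terminal patterns fully match the same sample strings. It is only a heuristic: it cannot prove that two regexps never overlap, and it knows nothing about which terminals share an LALR state. The helper find_overlaps, the TERMINALS table (the two patterns rewritten as plain Python regexps) and the sample strings are all made up for illustration; none of this is a lark feature.

```python
import re
from itertools import combinations

# Terminal patterns from the grammar above, as plain Python regexps.
TERMINALS = {
    'R1': r'[\w\+]+',
    'R2': r'(\w)+',
}
# Hand-picked sample inputs; extend with strings typical for your grammar.
SAMPLES = ['x', 'y', 'x+y', '+']


def find_overlaps(terminals, samples):
    """Map each pair of terminal names to the samples both patterns fully match."""
    overlaps = {}
    for (name_a, pat_a), (name_b, pat_b) in combinations(terminals.items(), 2):
        shared = [s for s in samples
                  if re.fullmatch(pat_a, s) and re.fullmatch(pat_b, s)]
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


print(find_overlaps(TERMINALS, SAMPLES))
```

Any pair reported here is a candidate for the kind of clash described in this issue, but whether it actually bites depends on whether the parser can expect both terminals in the same state.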

ObjatieGroba, May 31 '22 21:05