interegular icon indicating copy to clipboard operation
interegular copied to clipboard

[Bug] found a bug with handling choices

Open RevanthRameshkumar opened this issue 2 years ago • 8 comments

fsm = interegular.parse_pattern("(not(?=\s)|not(?=\()|-)").to_fsm()
assert fsm .accepts('-') == True
assert fsm .accepts('--') == False

Both of the above fail. I expect the first to succeed and the second to fail, because the final state for the second should end at the first '-'.

In contrast:

re.match('(not(?=\s)|not(?=\()|-)', '-') 
re.match('(not(?=\s)|not(?=\()|-)', '--') 

both of the above succeed (but the span for the second one is still (0, 1)

RevanthRameshkumar avatar Oct 01 '23 04:10 RevanthRameshkumar

Oh it is because of the lookaheads, this works:

'(\not|not\(|-)'

RevanthRameshkumar avatar Oct 01 '23 04:10 RevanthRameshkumar

This library isn't really designed for direct consumption, for that greenery is a better choice.

What you are seeing is in fact that expected behavior. Having a lookahead makes the regex ->FSM transformer add extra padding at the end that is always required, since that is necessary when trying to figure out the overlap of regex.

MegaIng avatar Oct 01 '23 09:10 MegaIng

Greenery actually doesn't parse lookaheads from when I tried it last, so your library is more useful to me. Kudos on the parse and fsm class...very useful!

RevanthRameshkumar avatar Oct 01 '23 15:10 RevanthRameshkumar

Yes, grennery doesn't handle lookaheads. Those are really hard and problematic to include in a FSM and you better understand why before using them. Adding lookahead support in a way that works for the specific case of intersection checking is the reason I basically copyied over greeneries internals. If you want to do more than that, greenery is the better library to use.

MegaIng avatar Oct 01 '23 15:10 MegaIng

Is there more info on why they are hard? When I do 'not(?=\s)' your lib actually produces the exact fsm I need. Actually I don't care about intersection...the fsm generation is what is important to me!

RevanthRameshkumar avatar Oct 01 '23 15:10 RevanthRameshkumar

lookaheads that go beyond the end of the string are... weird to deal with, and especially non-fixed-length lookbehinds aren't regular. I am unsure if non-fixed-length lookaheads are regular. My library deals with those, but potentially not correctly.

What do you want to use the FSMs for?

MegaIng avatar Oct 01 '23 15:10 MegaIng

I'm using it to generate a lark grammar compatible string by picking one token at a time (where the tokens are treated as atomic at any length)

RevanthRameshkumar avatar Oct 01 '23 16:10 RevanthRameshkumar

Lark only lets you see next atomic terminals, but no sub terminal options. Combining with fsm lets you go sub terminal.

RevanthRameshkumar avatar Oct 01 '23 16:10 RevanthRameshkumar