libfsm icon indicating copy to clipboard operation
libfsm copied to clipboard

lx doesn't always return the longest match

Open jameysharp opened this issue 5 years ago • 7 comments

Given this input to lx,

"." -> $dot;
"..." -> $ellipsis;

running the generated -l dump program against the string .. (no trailing newline or other characters) outputs:

lx_next: Invalid argument
0-2:1,1-3: 

Running it against ..\n outputs:

0-3:1-2,1: lexically uncategorised: '..'

In both cases I expected to see two <DOT '.'> tokens, followed by either <EOF> or an error that the newline character is lexically uncategorised, respectively.

I'm pretty sure language specifications like this require more than one character of lookahead/backtracking. An error transition needs to walk back to and return the most recent accepting state, which may be more than one transition ago.

I think a straightforward graph analysis can tell you the upper bound on the number of characters you might need to buffer. For each accepting state in the DFA, find all states that are reachable from it without hitting another accepting state. If the resulting subgraph does not contain cycles, then its maximum path length is one less than the necessary amount of lookahead. If it does contain cycles then it can require unbounded lookahead.

If I have this right, then lx_ungetc isn't sufficient in general, but at least you can detect at compile time when it isn't enough. You could even report example strings which take an error transition more than one step after the last accepting state.

jameysharp avatar Feb 20 '19 05:02 jameysharp

After trying out #116 on all the existing lx examples, that script's output showed me that the literals.lx example also has this problem. I think these are the shortest inputs which illustrate the issue for that language:

$ echo -n '0.' | ./literals 
0-2:1,1-3: <FLOAT '0.'>
0-2:1,1-3: <EOF>

$ echo -n '0.e' | ./literals 
lx_next: Invalid argument
0-3:1,1-4: 

jameysharp avatar Feb 22 '19 04:02 jameysharp

I'm pretty sure language specifications like this require more than one character of lookahead/backtracking. An error transition needs to walk back to and return the most recent accepting state, which may be more than one transition ago.

That's correct, and it isn't something I'd considered when designing the generated code. I think it should be possible, but I don't like the possibility for the unbounded case. Hm. I'm not sure what I want to do here. I do agree that it'd be a nice feature (read: less surprising behaviour).

I'm calling this a feature request rather than a bug, because the current code generation didn't attempt to cater for this.

katef avatar Mar 09 '19 22:03 katef

What do you think of having lx identify this situation and erroring about it? Much like it does for ambiguous patterns (where they share a common intersection). In this case I think we'd strengthen that test to error if a pattern one-or-more-times shares an intersection with another pattern.

katef avatar Mar 17 '19 15:03 katef

Oh, is "one-or-more-times shares" an equivalent way of describing this? I could believe that. Interesting!

I think that always returning an error would be unfortunate. For ambiguous patterns, it's always possible to rewrite the input to lx to eliminate the ambiguity, right? I think that's never possible for longest-match problems while still matching the same tokens, so always reporting an error limits what languages lx can work with. (I'm pretty sure the one I care about right now, for the C11 preprocessor, can't be rewritten to avoid this.)

I'd rather see three code-gen options which resemble the token-buffer options.

  • One is what's implemented now, with exactly one character of lookahead, and that should report an error if more lookahead is required.
  • Another would be a fixed-size lookahead buffer, which should report an error if some path requires unbounded lookahead, but otherwise should allocate exactly the size of buffer that the language requires.
  • Finally there'd be a dynamically resizable buffer option, which would always succeed.

That said... so long as the error is only reported by the C codegen option, and GraphViz/JSON output succeeds even for languages which need more lookahead, I guess I don't personally care?

jameysharp avatar Mar 17 '19 16:03 jameysharp

I'd be very uncomfortable if .lx source were used to generate lexers which behaved differently, for different implementations.

I think I agree that the dynamic resizing buffer does make sense. Dynamic allocation is something I wanted to avoid for FSM, but lx does already do that, because tokens themselves are variable lengths.

katef avatar Mar 22 '19 01:03 katef

I'm not proposing that any lexer specification behave differently on different implementations. I'm only observing that some implementations are less capable than others, in exchange for having simpler and/or more efficient implementations.

If two lx backends are both capable of implementing a particular .lx specification, then I agree that they should both match the same language, but I think it's perfectly reasonable to have implementations that can't handle some .lx specifications, so long as a user who tries to use such a specification with such a backend gets an appropriate error instead of getting a broken lexer.

Anyway, I think this is the same situation as for token buffers, because lx doesn't support token buffers at all by default, and allows the user to request either a fixed-size or a dynamically-growing buffer if they need that extra capability.

jameysharp avatar Mar 22 '19 03:03 jameysharp

I think I want to keep the tokens which did match, when moving past them (in this case .. ). Each would be replaced in turn when a new longer candidate is found.

We still need to buffer the spelling to use for the next call, but this would tell us which was the longest match that we moved past, before finding a failure.

katef avatar Apr 04 '19 14:04 katef