jflex

EOF lookahead?

Open AngledLuffa opened this issue 4 years ago • 3 comments

Are there any plans to do lookahead that can see EOF? There's a bug in one of Stanford NLP group's tokenization tools which would be easily fixed by such an ability.

https://github.com/stanfordnlp/CoreNLP/issues/1161

The fundamental issue is that our rule to tokenize "gonna" etc looks like this:

{ASSIMILATIONS2}/[^\p{Alpha}]

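So basically it matches words like "gonna" only when the next character is not a letter, which keeps it from firing inside run-together text such as "i'm gonnaeatallmycopiesofmoxopaloutofanger". The catch is that when "gonna" is the very last thing in the input, there is no next character for the lookahead to match, so the rule doesn't apply there either.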
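To make the end-of-input behavior concrete, here is a rough analogy using `java.util.regex` rather than JFlex itself. The pattern `gonna(?=[^\p{Alpha}])` is only a stand-in for the trailing-context rule above, and the class and method names are made up for illustration. A regex lookahead behaves the same way at the end of the text: it needs an actual character to match.

```java
import java.util.regex.Pattern;

final class LookaheadDemo {
    // Stand-in for the JFlex rule: "gonna" followed by one non-alphabetic character.
    // This is java.util.regex, not JFlex, so it only illustrates the idea.
    private static final Pattern GONNA = Pattern.compile("gonna(?=[^\\p{Alpha}])");

    // True if the text contains "gonna" with a non-letter right after it.
    static boolean matchesGonna(String text) {
        return GONNA.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesGonna("i'm gonna eat"));           // true
        System.out.println(matchesGonna("i'm gonnaeatallmycopies")); // false
        System.out.println(matchesGonna("i'm gonna"));               // false: nothing follows
        System.out.println(matchesGonna("i'm gonna\n"));             // true: newline follows
    }
}
```

The third case is the one at issue here: "gonna" ends the input, so the lookahead has nothing to consume.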

Years ago I made a similar request and was told that the easiest solution would be to add an extra whitespace to the end of the text being processed, but I'm hoping something more elegant than that will be available soon. Thanks!

https://sourceforge.net/p/jflex/mailman/jflex-users/thread/CAHaU7mb%3DV98ApE80B%2BGyBUf8%2BsPg4KOenqY3O%3DXbkvONpH2utA%40mail.gmail.com/#msg32027415

AngledLuffa avatar May 27 '21 06:05 AngledLuffa

Being able to see when you're at EOF seems like a fairly fundamental lookahead capability to be missing!

manning avatar Mar 13 '22 15:03 manning

The best way to do this is still for the user to choose what extra character they want at the end.

That's also the only thing the scanning engine could do if you want EOF to be part of the DFA/regular expression. But the scanning engine has no way of knowing which character is suitable as an extra end-of-input character for the kinds of input that are expected (whitespace is certainly not going to be a good choice for some, but it may be just fine for others). We could make it configurable etc., but it seems a lot simpler to just append something to the input on the user side.
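As a minimal sketch of what "append something on the user side" could look like, the wrapper below emits one extra character after the underlying `java.io.Reader` is exhausted. `SentinelReader` and `readAll` are hypothetical names, not part of JFlex, and which sentinel character is appropriate depends on your grammar.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper (not part of JFlex): appends one sentinel character
// after the wrapped reader reaches end of input.
final class SentinelReader extends Reader {
    private final Reader in;
    private final char sentinel;
    private boolean sentinelEmitted = false;

    SentinelReader(Reader in, char sentinel) {
        this.in = in;
        this.sentinel = sentinel;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        if (sentinelEmitted) return -1;   // real EOF, after the sentinel
        int n = in.read(buf, off, len);
        if (n >= 0) return n;             // pass ordinary input through unchanged
        buf[off] = sentinel;              // underlying input ended: emit the sentinel once
        sentinelEmitted = true;
        return 1;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    // Small helper for trying the wrapper out: drains a reader into a String.
    static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }
}
```

Since a generated scanner is constructed from a `java.io.Reader`, the wrapper would be passed in place of the original reader, for example `new MyLexer(new SentinelReader(reader, '\n'))`, where `MyLexer` is a placeholder for your generated class. The lexer rules may then need to discard the token produced by the sentinel so it does not leak into the output.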

lsf37 avatar Mar 13 '22 21:03 lsf37

Just to clarify: EOF isn't an input character, and the engine matches input characters, so to change this you either need something like the above (basically a hack) or you need to change something fairly fundamental.

lsf37 avatar Mar 13 '22 21:03 lsf37

Closing this, because I think this is best handled by adding to the input, not by changing the scanning engine.

lsf37 avatar Jan 05 '23 05:01 lsf37