EOF lookahead?
Are there any plans to do lookahead that can see EOF? There's a bug in one of Stanford NLP group's tokenization tools which would be easily fixed by such an ability.
https://github.com/stanfordnlp/CoreNLP/issues/1161
The fundamental issue is that our rule to tokenize "gonna" etc looks like this:
{ASSIMILATIONS2}/[^\p{Alpha}]
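so basically it only matches words like "gonna" when they are followed by a non-alphabetic character, which keeps it from firing inside a longer run of text without whitespace, such as "i'm gonnaeatallmycopiesofmoxopaloutofanger". The side effect is that the rule also fails when "gonna" is the very last thing in the input, because there is no following character for the lookahead to match.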
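To illustrate the behaviour, here is a rough analogy using `java.util.regex` rather than JFlex itself. This is only a sketch: the class name `LookaheadDemo` and the literal `gonna` pattern are made up for the example, and JFlex's `/` trailing context is not the same mechanism as a regex lookahead. The point is just that a lookahead requiring a non-letter character cannot succeed at end of input, while one that also accepts the end of input can.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; uses java.util.regex as an analogy, not JFlex.
final class LookaheadDemo {
    // Requires an actual non-alphabetic character after "gonna".
    static final Pattern NEEDS_CHAR = Pattern.compile("gonna(?=[^\\p{Alpha}])");
    // Also accepts end of input after "gonna".
    static final Pattern CHAR_OR_END = Pattern.compile("gonna(?=[^\\p{Alpha}]|$)");

    static boolean matches(Pattern p, String text) {
        return p.matcher(text).find();
    }
}
```

With these patterns, `"i'm gonna eat"` is found by both, `"i'm gonnaeat"` by neither, and `"i'm gonna"` only by `CHAR_OR_END`.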
Years ago I made a similar request and was told that the easiest solution would be to add an extra whitespace to the end of the text being processed, but I'm hoping something more elegant than that will be available soon. Thanks!
https://sourceforge.net/p/jflex/mailman/jflex-users/thread/CAHaU7mb%3DV98ApE80B%2BGyBUf8%2BsPg4KOenqY3O%3DXbkvONpH2utA%40mail.gmail.com/#msg32027415
Being able to see when you're at EOF seems like a fairly fundamental lookahead capability to be missing!
The best way to do this is still for the user to choose what extra character they want at the end.
That's also the only thing the scanning engine could do if you want EOF to be part of the DFA/regular expression. But the scanning engine has no way of knowing which character is suitable as an extra end-of-input character for the kinds of input that are expected (whitespace is certainly not going to be a good choice for some, but it may be just fine for others). We could make it configurable etc., but it seems a lot simpler to just append something to the input on the user side.
Just to clarify: EOF isn't an input character, and the engine matches input characters, so to change this you either need something like the above (basically a hack) or you need to change something fairly fundamental.
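For anyone wanting to do this on the user side without copying the whole input into a string first, a minimal sketch might look like the following. `SentinelReader` is a hypothetical helper, not part of JFlex. It wraps any `java.io.Reader` and emits one extra character of your choosing once the underlying input is exhausted. Which character is appropriate depends on the grammar.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper (not part of JFlex): appends a single sentinel
// character after the wrapped reader reaches end of input.
final class SentinelReader extends Reader {
    private final Reader in;
    private final char sentinel;
    private boolean sentinelEmitted = false;

    SentinelReader(Reader in, char sentinel) {
        this.in = in;
        this.sentinel = sentinel;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (sentinelEmitted) {
            return -1; // real end of input, after the sentinel
        }
        int n = in.read(buf, off, len);
        if (n == -1) {
            // Underlying input is done: hand out the sentinel exactly once.
            buf[off] = sentinel;
            sentinelEmitted = true;
            return 1;
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    // Convenience for string input, e.g. SentinelReader.of("i'm gonna", ' ').
    static Reader of(String text, char sentinel) {
        return new SentinelReader(new StringReader(text), sentinel);
    }
}
```

The wrapped reader can then be passed to the generated scanner wherever a plain `Reader` would be used. The scanner's rules (or the calling code) need to discard or ignore the trailing sentinel so it does not show up as a token.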
Closing this, because I think this is best handled by adding to the input, not by changing the scanning engine.