
PTBTokenizer Unrecognizable: (U+2063, decimal: 8291)

rlyCarlson opened this issue 3 years ago

PTBTokenizer crashed on this Unicode character (U+2063, decimal: 8291), which is an invisible comma/separator, and threw this error:

Untokenizable: ⁣ (U+2063, decimal: 8291)
Exception in thread "main" java.lang.ArithmeticException: integer overflow
	at java.lang.Math.toIntExact(Math.java:1011)
	at edu.stanford.nlp.process.PTBLexer.getNext(PTBLexer.java)
	at edu.stanford.nlp.process.PTBLexer.next(PTBLexer.java)
	at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:301)
	at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:185)
	at edu.stanford.nlp.process.AbstractTokenizer.hasNext(AbstractTokenizer.java:69)
	at edu.stanford.nlp.process.PTBTokenizer.tokReader(PTBTokenizer.java:493)
	at edu.stanford.nlp.process.PTBTokenizer.tok(PTBTokenizer.java:464)
	at edu.stanford.nlp.process.PTBTokenizer.main(PTBTokenizer.java:890)

I also tried using -filter \u2063, and it threw the same error.

rlyCarlson avatar Jul 01 '22 22:07 rlyCarlson

I think these are two separate issues, and it's just a coincidence that the errors are printed this way.

  1. We could figure out something to do with the "invisible comma". Currently "untokenizable" characters should just be dropped. You can test this by tokenizing a small file with such a character; it should not crash.

  2. The documentation very clearly says that it will properly handle files longer than Integer.MAX_VALUE characters, and then instead it just crashes. Oops.

https://github.com/stanfordnlp/CoreNLP/blob/f05cb54ec0a4f3c90395771817f44a81eb549baf/src/edu/stanford/nlp/process/PTBLexer.flex#L483
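For reference, the "integer overflow" in the trace is exactly what Math.toIntExact throws once the lexer's running character offset (a long) no longer fits in an int. A minimal sketch (OverflowDemo is just an illustrative name, not CoreNLP code):

```java
public class OverflowDemo {
    // Mimics converting a long character offset to an int token offset,
    // as the generated PTBLexer does via Math.toIntExact.
    static String probe(long offset) {
        try {
            return "ok: " + Math.toIntExact(offset);
        } catch (ArithmeticException e) {
            return "ArithmeticException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // One character past the ~2G-character mark: toIntExact refuses to narrow it.
        long offset = (long) Integer.MAX_VALUE + 1;
        System.out.println(probe(offset));
        // prints: ArithmeticException: integer overflow
    }
}
```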

AngledLuffa avatar Jul 02 '22 00:07 AngledLuffa

I suggest tokenizing only files smaller than 2GB until we figure this out.
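Until then, a caller could guard on file size before handing a file to the tokenizer. A rough sketch (PreflightCheck is an illustrative name; the byte count is only a conservative proxy for the character count, since multi-byte UTF-8 files have fewer characters than bytes):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PreflightCheck {
    // The lexer tracks character offsets as ints, so anything at or beyond
    // Integer.MAX_VALUE characters risks the overflow described above.
    static boolean safeToTokenize(long sizeInBytes) {
        return sizeInBytes < Integer.MAX_VALUE;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real input file.
        Path p = Files.createTempFile("tok", ".txt");
        Files.write(p, "a small file".getBytes());
        System.out.println(safeToTokenize(Files.size(p))
                ? "ok to tokenize"
                : "split this file first");
        Files.delete(p);
    }
}
```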

AngledLuffa avatar Jul 02 '22 00:07 AngledLuffa

Just to confirm regarding the invisible separator:

java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize -file foo.txt
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize

Processing file /home/john/CoreNLP/foo.txt ... writing to /home/john/CoreNLP/foo.txt.out
Untokenizable: ⁣ (U+2063, decimal: 8291)
Annotating file /home/john/CoreNLP/foo.txt ... done [0.1 sec].

Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
TOTAL: 0.1 sec. for 3 tokens at 56.6 tokens/sec.
Pipeline setup: 0.0 sec.
Total time for StanfordCoreNLP pipeline: 0.2 sec.

cat foo.txt.out
Document: ID=foo.txt (1 sentences, 3 tokens)

Sentence #1 (3 tokens):
Unban⁣mox⁣opal

Tokens:
[Text=Unban CharacterOffsetBegin=0 CharacterOffsetEnd=5]
[Text=mox CharacterOffsetBegin=6 CharacterOffsetEnd=9]
[Text=opal CharacterOffsetBegin=10 CharacterOffsetEnd=14]

Note that Chrome turns the invisible separator into a visible space, so it's proving quite difficult to paste the example file here. You can get as many invisible separators as you need here, though
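If pasting fails, a few lines of Java can generate the same test file; the foo.txt name and the three-word content mirror the run above (MakeTestFile itself is just an illustrative name):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MakeTestFile {
    // Three words joined by U+2063 INVISIBLE SEPARATOR, as in the example above.
    static String content() {
        return "Unban\u2063mox\u2063opal";
    }

    public static void main(String[] args) throws Exception {
        Path out = Paths.get("foo.txt");
        Files.write(out, content().getBytes(StandardCharsets.UTF_8));
        System.out.println("wrote " + content().length() + " chars to " + out);
    }
}
```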

AngledLuffa avatar Jul 02 '22 00:07 AngledLuffa