PTBTokenizer Unrecognizable: (U+2063, decimal: 8291)
PTBTokenizer crashed on this Unicode character (U+2063, decimal: 8291), which is an invisible comma/separator, and threw this error:
Untokenizable: (U+2063, decimal: 8291)
Exception in thread "main" java.lang.ArithmeticException: integer overflow
	at java.lang.Math.toIntExact(Math.java:1011)
	at edu.stanford.nlp.process.PTBLexer.getNext(PTBLexer.java)
	at edu.stanford.nlp.process.PTBLexer.next(PTBLexer.java)
	at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:301)
	at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:185)
	at edu.stanford.nlp.process.AbstractTokenizer.hasNext(AbstractTokenizer.java:69)
	at edu.stanford.nlp.process.PTBTokenizer.tokReader(PTBTokenizer.java:493)
	at edu.stanford.nlp.process.PTBTokenizer.tok(PTBTokenizer.java:464)
	at edu.stanford.nlp.process.PTBTokenizer.main(PTBTokenizer.java:890)
I also tried using -filter \u2063, which threw the same error.
I think these are two separate issues & it's just a coincidence that the errors are printed this way.
-
We could figure out something better to do with the "invisible comma". Currently, "untokenizable" characters should just be dropped. You can test this by tokenizing a small file containing such a character; it should not crash.
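For context, a minimal JDK-only sketch (class name is mine, not from CoreNLP) showing what kind of character U+2063 is: it belongs to Unicode general category Cf (format), exactly the sort of invisible code point a tokenizer would normally drop rather than crash on. Note how it inflates the string length without being visible, which matches the character offsets in the output below (0–5, 6–9, 10–14):

```java
// Hypothetical check, not part of CoreNLP: inspect U+2063's Unicode properties.
public class InvisibleSeparatorCheck {
    public static void main(String[] args) {
        int cp = 0x2063;
        // Official Unicode name of the code point.
        System.out.println(Character.getName(cp));
        // General category Cf (format) maps to Character.FORMAT in Java.
        System.out.println(Character.getType(cp) == Character.FORMAT); // true
        // The separators are counted in offsets even though nothing is visible:
        String s = "Unban\u2063mox\u2063opal";
        System.out.println(s.length()); // 14 code units, only 12 visible letters
    }
}
```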
-
The documentation very clearly says that it will properly handle files longer than Integer.MAX_VALUE characters, but instead it just crashes. Oops.
https://github.com/stanfordnlp/CoreNLP/blob/f05cb54ec0a4f3c90395771817f44a81eb549baf/src/edu/stanford/nlp/process/PTBLexer.flex#L483
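A minimal JDK-only sketch of the failure mode, assuming (as the stack trace suggests) that the lexer keeps the running character offset in a long and then narrows it with Math.toIntExact. Unlike a silent cast, toIntExact throws the moment the offset passes Integer.MAX_VALUE, which is exactly the exception above:

```java
// Hypothetical reproduction of the narrowing step, not CoreNLP code.
public class OffsetOverflow {
    public static void main(String[] args) {
        // The first character offset past the 2^31 - 1 boundary in a huge file.
        long offset = (long) Integer.MAX_VALUE + 1;
        try {
            int narrowed = Math.toIntExact(offset); // throws instead of wrapping
            System.out.println(narrowed);
        } catch (ArithmeticException e) {
            System.out.println(e.getMessage()); // "integer overflow"
        }
    }
}
```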
I suggest only tokenizing files smaller than 2GB until we figure this out.
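As a stopgap guard for that workaround, one could check the byte length up front; in UTF-8 the byte count is an upper bound on the character count, so a file under Integer.MAX_VALUE bytes can never push the offset past the overflow point. This helper is my own sketch, not anything CoreNLP provides:

```java
// Hypothetical pre-flight check before handing a file to the tokenizer.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SizeGuard {
    // UTF-8 uses at least one byte per character, so
    // bytes < Integer.MAX_VALUE implies characters < Integer.MAX_VALUE.
    static boolean safeToTokenize(Path p) throws IOException {
        return Files.size(p) < Integer.MAX_VALUE;
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("tok", ".txt");
        Files.writeString(p, "a small file is always safe");
        System.out.println(safeToTokenize(p)); // true
    }
}
```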
Just to confirm regarding the invisible separator:
java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize -file foo.txt
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
Processing file /home/john/CoreNLP/foo.txt ... writing to /home/john/CoreNLP/foo.txt.out
Untokenizable: (U+2063, decimal: 8291)
Annotating file /home/john/CoreNLP/foo.txt ... done [0.1 sec].
Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
TOTAL: 0.1 sec. for 3 tokens at 56.6 tokens/sec.
Pipeline setup: 0.0 sec.
Total time for StanfordCoreNLP pipeline: 0.2 sec.
cat foo.txt.out
Document: ID=foo.txt (1 sentences, 3 tokens)
Sentence #1 (3 tokens):
Unbanmoxopal
Tokens:
[Text=Unban CharacterOffsetBegin=0 CharacterOffsetEnd=5]
[Text=mox CharacterOffsetBegin=6 CharacterOffsetEnd=9]
[Text=opal CharacterOffsetBegin=10 CharacterOffsetEnd=14]
Note that Chrome turns the invisible separator into a visible space, so it's proving quite difficult to paste the example file here. You can get as many invisible separators as you need here, though.
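Since browsers mangle the character on copy/paste, it may be easier to generate the test file programmatically. A small JDK-only sketch (file name matches the example above; the class name is mine) that writes the three words with real U+2063 code points between them:

```java
// Hypothetical helper to create the repro file without any copy/paste.
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MakeTestFile {
    public static void main(String[] args) throws Exception {
        // Two literal U+2063 INVISIBLE SEPARATOR code points between the words.
        String text = "Unban\u2063mox\u2063opal";
        Path out = Path.of("foo.txt");
        Files.writeString(out, text, StandardCharsets.UTF_8);
        // U+2063 is three bytes in UTF-8, so the file is larger than it looks.
        System.out.println(Files.size(out));
    }
}
```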