cogcomp-nlp
StatefulTokenizer weirdness
Hello.
TL;DR: StatefulTokenizer tokenizes the date "10/23/2018" as [ "10", "/", "23/2018" ], whereas IllinoisTokenizer (which seems to be deprecated) keeps it as a single token [ "10/23/2018" ].
Longer version: I'm seeing some unexpected sentence detection/tokenization of a simple (contrived) text. Here's an example (in Scala, but it can easily be translated to Java):
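import edu.illinois.cs.cogcomp.nlp.tokenizer.StatefulTokenizer
import edu.illinois.cs.cogcomp.nlp.utility.TokenizerTextAnnotationBuilder
import scala.collection.JavaConverters._ // provides .asScala on Java collections

val textEN = """One two, three-four-five 10/23/2018 at 5:20pm one? Of course not! Be well, stranger. Bye-bye!"""
val textAnnotationBuilder = new TokenizerTextAnnotationBuilder(new StatefulTokenizer(false))
val textAnnotation = textAnnotationBuilder.createTextAnnotation(textEN)
val sentences = textAnnotation.sentences().asScala.toList
val sentenceTokens = sentences.map(_.getTokens)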
Printing the sentences and sentence tokens I get:
SENTENCES:
- One two , three-four-five 10 / 23/2018 at 5:20pm one ?
- Of course not !
- Be well , stranger .
- Bye-bye !
SENTENCE TOKENS:
- One | two | , | three-four-five | 10 | / | 23/2018 | at | 5:20pm | one | ?
- Of | course | not | !
- Be | well | , | stranger | .
- Bye-bye | !
Problem 1
The problem is in the date "10/23/2018": it has no spaces in the original text, but the sentence output somehow adds spaces, giving "10 / 23/2018". Each sentence was printed with the Sentence.toString() method. If I use the Sentence.getText() method instead, the sentence is printed correctly, with no extra spaces in the date.
Problem 2
This seems related to the above. As you can see, the sentence tokens also split the date in a weird way, which seems to correspond to the extra-spaces issue described above.
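To illustrate why the two problems look like one: the toString() output above is exactly what you get by joining the reported tokens with single spaces. The sketch below does not call the library at all; it only uses the tokens printed above, and the idea that Sentence.toString() joins tokens this way is my assumption, not something I verified in the source.

val original = "One two, three-four-five 10/23/2018 at 5:20pm one?"
// Tokens reported above for the first sentence with StatefulTokenizer.
val tokens = List("One", "two", ",", "three-four-five", "10", "/", "23/2018", "at", "5:20pm", "one", "?")

// Assumption: toString() renders the tokens separated by single spaces.
val joined = tokens.mkString(" ")
println(joined)

// The tokens still cover every non-whitespace character of the original text,
// so only the token boundaries inside the date differ, not the characters.
val sameCharacters = tokens.mkString == original.replaceAll("\\s", "")
println(sameCharacters)

If that assumption holds, the extra spaces in Problem 1 are just a side effect of the date being split into three tokens in Problem 2, and getText() is unaffected because it returns the original text.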
If instead of StatefulTokenizer I use IllinoisTokenizer (now seemingly deprecated), everything works as it should. Both Sentence.toString() and Sentence.getText() output the same thing, and the sentence tokens do not have the extra-space problem.
Please advise.
Thank you, Boris