StatefulTokenizer weirdness

Open borice opened this issue 7 years ago • 0 comments

Hello.

TL;DR: StatefulTokenizer tokenizes the date "10/23/2018" as [ "10", "/", "23/2018" ], whereas IllinoisTokenizer (which seems to be deprecated) keeps it as a single token [ "10/23/2018" ].

Longer version: I'm seeing unexpected sentence detection and tokenization on a simple (contrived) text. Here's an example (in Scala, but it can easily be translated to Java):

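 import edu.illinois.cs.cogcomp.nlp.tokenizer.StatefulTokenizer
 import edu.illinois.cs.cogcomp.nlp.utilities.TokenizerTextAnnotationBuilder
 import scala.collection.JavaConverters._

 val textEN = """One two, three-four-five 10/23/2018 at 5:20pm one? Of course not! Be well, stranger. Bye-bye!"""
 val textAnnotationBuilder = new TokenizerTextAnnotationBuilder(new StatefulTokenizer(false))
 // Build the annotation from textEN (the variable defined above)
 val textAnnotation = textAnnotationBuilder.createTextAnnotation(textEN)
 val sentences = textAnnotation.sentences().asScala.toList
 val sentenceTokens = sentences.map(_.getTokens)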

Printing the sentences and sentence tokens I get:

SENTENCES:

  1. One two , three-four-five 10 / 23/2018 at 5:20pm one ?
  2. Of course not !
  3. Be well , stranger .
  4. Bye-bye !

SENTENCE TOKENS:

  1. One | two | , | three-four-five | 10 | / | 23/2018 | at | 5:20pm | one | ?
  2. Of | course | not | !
  3. Be | well | , | stranger | .
  4. Bye-bye | !

Problem 1: The date "10/23/2018" has no spaces in the original text, but the sentence output contains added spaces: "10 / 23/2018". Each sentence was printed with the Sentence.toString() method. If I use the Sentence.getText() method instead, the sentence is printed correctly, with no extra spaces in the date.

Problem 2: This seems related to Problem 1. The sentence tokens also split the date in a weird way, into "10", "/", and "23/2018", which corresponds to the extra-space issue described above.
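To illustrate how the two problems could be connected, here is a minimal, self-contained sketch that does not use cogcomp-nlp at all. It assumes (I have not checked the library source) that Sentence.toString() joins the token forms with single spaces, while Sentence.getText() returns the matching slice of the original text. The Token case class and both helper functions are hypothetical names made up for this illustration.

```scala
object RenderingSketch {
  // Hypothetical token type: surface form plus character offsets into the original text.
  final case class Token(form: String, start: Int, end: Int)

  // Assumed toString()-like rendering: token forms joined with single spaces.
  def joinedTokens(tokens: Seq[Token]): String =
    tokens.map(_.form).mkString(" ")

  // Assumed getText()-like rendering: the original text from the start of the
  // first token to the end of the last token.
  def originalSpan(text: String, tokens: Seq[Token]): String =
    text.substring(tokens.head.start, tokens.last.end)

  // A fragment of the example sentence, with the date split the way
  // StatefulTokenizer reported it.
  val text: String = "10/23/2018 at 5:20pm"
  val tokens: Seq[Token] = Seq(
    Token("10", 0, 2),
    Token("/", 2, 3),
    Token("23/2018", 3, 10),
    Token("at", 11, 13),
    Token("5:20pm", 14, 20)
  )

  def main(args: Array[String]): Unit = {
    println(joinedTokens(tokens))       // space-joined tokens
    println(originalSpan(text, tokens)) // slice of the original text
  }
}
```

If that assumption holds, the extra spaces in Problem 1 are just a side effect of the token split in Problem 2, and the tokenization of the date is the only real issue.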

If instead of StatefulTokenizer I use IllinoisTokenizer (now seemingly deprecated), everything works as it should. Both Sentence.toString() and Sentence.getText() output the same thing, and the sentence tokens do not have the extra-space problem.

Please advise.

Thank you, Boris

borice, May 23 '18 17:05