cogcomp-nlp

Tokenization: Dot in the middle of a word.

Open danyaljj opened this issue 8 years ago • 3 comments

@Slash0BZ do you think this has been affected by your change in tokenization?

[Screenshot: 2017-09-08, 9:47 am]

"E.coli" should be one token, I think.

danyaljj avatar Sep 08 '17 16:09 danyaljj

Yeah, the current tokenizer logic for this is: when it meets a dot and the next character is an alphabetic letter rather than whitespace (which would mark the end of a sentence), it checks whether that next character is uppercase.

If the next character is uppercase, the dot is marked as part of the token, which is correct in most cases (U.S., U.K., etc.). Before my fix, the logic was wrong: the dot was marked as part of the token when the next character was lowercase, which was not the original intention (see the comments at https://github.com/CogComp/cogcomp-nlp/blob/master/tokenizer/src/main/java/edu/illinois/cs/cogcomp/nlp/tokenizer/TokenizerStateMachine.java#L277-L283).

I think "E.coli" is a special case where the next character is lowercase but the dot should still be part of the token. The pre-fix tokenizer handled this one correctly "by mistake", since that was not designed behavior.

However, on second thought, I don't think we need to check capitalization when we meet an alphabetic character after a dot. How about we mark the dot as part of the token in every case where the character following the dot is alphabetic?

Slash0BZ avatar Sep 08 '17 17:09 Slash0BZ
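
To make the two behaviors described in the comment above concrete, here is a minimal sketch of the post-fix rule (keep the dot when the next character is uppercase) next to the pre-fix behavior (keep the dot when the next character is lowercase). This is not the code in TokenizerStateMachine.java: the class and method names are hypothetical, and modeling the pre-fix behavior as "lowercase only" is an assumption based on the description in this thread.

```java
public class DotRuleSketch {

    /**
     * Post-fix rule as described in the thread: the dot stays in the token
     * only when the character right after it is an uppercase letter.
     */
    public static boolean keepDotUppercaseRule(String text, int dotIndex) {
        if (dotIndex + 1 >= text.length()) {
            return false; // nothing follows the dot
        }
        char next = text.charAt(dotIndex + 1);
        return Character.isLetter(next) && Character.isUpperCase(next);
    }

    /**
     * Pre-fix behavior as described in the thread (assumed to be
     * "lowercase only"): the dot stays in the token only when the
     * character right after it is a lowercase letter.
     */
    public static boolean keepDotLowercaseRule(String text, int dotIndex) {
        if (dotIndex + 1 >= text.length()) {
            return false; // nothing follows the dot
        }
        char next = text.charAt(dotIndex + 1);
        return Character.isLetter(next) && Character.isLowerCase(next);
    }

    public static void main(String[] args) {
        String[] samples = {"U.S.", "E.coli"};
        for (String s : samples) {
            int dot = s.indexOf('.'); // first dot in the sample
            System.out.println(s
                    + " | uppercase rule keeps dot: " + keepDotUppercaseRule(s, dot)
                    + " | lowercase rule keeps dot: " + keepDotLowercaseRule(s, dot));
        }
    }
}
```

Under these assumptions the uppercase rule keeps the first dot of "U.S." but splits "E.coli", while the lowercase rule does the opposite, which matches the regression reported in this issue.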

Didn't get this last point:

How about we mark the dot as part of the token in every case where the character following the dot is alphabetic?

danyaljj avatar Sep 08 '17 18:09 danyaljj

I mean, how about we change the rule so that whenever a dot is followed by an alphabetic character, we mark the dot as part of the token?

Slash0BZ avatar Sep 08 '17 18:09 Slash0BZ
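
The proposed rule can be sketched as a toy tokenizer that splits on whitespace and treats a dot as its own token unless an alphabetic character follows it. This only illustrates the proposal in this thread and is far simpler than the real state machine; `ProposedDotRuleSketch`, `keepDot`, and `tokenize` are hypothetical names, not part of cogcomp-nlp.

```java
import java.util.ArrayList;
import java.util.List;

public class ProposedDotRuleSketch {

    /** Proposed rule: the dot stays in the token whenever a letter follows it. */
    public static boolean keepDot(String text, int dotIndex) {
        return dotIndex + 1 < text.length()
                && Character.isLetter(text.charAt(dotIndex + 1));
    }

    /** Toy tokenizer: splits on whitespace and on dots that fail keepDot. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                flush(current, tokens);
            } else if (c == '.' && !keepDot(text, i)) {
                // Dot not followed by a letter: emit it as a separate token.
                flush(current, tokens);
                tokens.add(".");
            } else {
                current.append(c);
            }
        }
        flush(current, tokens);
        return tokens;
    }

    /** Moves the pending characters, if any, into the token list. */
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("E.coli grows fast."));
        System.out.println(tokenize("It ended.Then it began."));
    }
}
```

In this sketch "E.coli" stays a single token regardless of capitalization. One consequence worth weighing is that a sentence boundary with a missing space, such as "ended.Then", is also kept as one token, because the rule no longer looks at anything except whether a letter follows the dot.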