cogcomp-nlp
Tokenization: Dot in the middle of a word.
@Slash0BZ do you think this has been affected by your change in tokenization?

"E.coli" should be one token, I think.
Yeah, the current tokenizer logic for this case is: when it meets a dot and the next character is an alphabetic letter rather than whitespace (which would mark the end of a sentence), it checks whether that next character is uppercase.
If the next character is uppercase, the dot is marked as part of the token, which is correct in most cases (U.S., U.K, etc.). Before my fix, the logic was inverted: the dot was marked as part of the token when the next character was lowercase, which was not the original intention (see the comments at https://github.com/CogComp/cogcomp-nlp/blob/master/tokenizer/src/main/java/edu/illinois/cs/cogcomp/nlp/tokenizer/TokenizerStateMachine.java#L277-L283).
I think "E.coli" is a special case where the next character is lowercase but the dot should still be part of the token. The pre-fix tokenizer handled this one correctly only "by mistake", since that was not designed behavior.
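To make the rule described above concrete, here is a minimal sketch of the post-fix check. It is not the code from `TokenizerStateMachine`; the class name `CurrentDotRuleSketch` and the helper `dotJoinsToken` are hypothetical names made up for illustration, and the sketch assumes the rule depends only on the single character right after the dot.

```java
// Hypothetical illustration of the post-fix rule; not the real tokenizer code.
final class CurrentDotRuleSketch {

    // Returns true when the dot at dotIndex would be kept inside the current token.
    // Assumed rule: the next character must be an uppercase letter.
    static boolean dotJoinsToken(String text, int dotIndex) {
        if (dotIndex + 1 >= text.length()) {
            return false; // dot is the last character, nothing follows it
        }
        char next = text.charAt(dotIndex + 1);
        return Character.isLetter(next) && Character.isUpperCase(next);
    }

    public static void main(String[] args) {
        System.out.println(dotJoinsToken("U.S.", 1));   // true: 'S' is uppercase
        System.out.println(dotJoinsToken("E.coli", 1)); // false: 'c' is lowercase
    }
}
```

Under this assumed rule the dot in "E.coli" is not kept inside the token, which matches the behavior reported in this issue.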
However, on second thought, I don't think we need to check capitalization at all when we meet an alphabetic character after a dot. How about we mark the dot part of the part at every cases when the character following a dot is alphabetic?
Didn't get this last point:
How about we mark the dot part of the part at every cases when the character following a dot is alphabetic?
I mean: how about we change the rule so that whenever the dot is followed by an alphabetic character, we mark the dot as part of the token?
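A minimal sketch of what that proposed rule could look like, for discussion only. The class name `ProposedDotRuleSketch` and the helper `dotJoinsToken` are hypothetical and do not exist in the repository; the sketch assumes the decision looks only at the character immediately after the dot.

```java
// Hypothetical illustration of the proposed rule; not the real tokenizer code.
final class ProposedDotRuleSketch {

    // Returns true when the dot at dotIndex would be kept inside the current token.
    // Proposed rule: any alphabetic character after the dot keeps the dot in the token,
    // regardless of capitalization.
    static boolean dotJoinsToken(String text, int dotIndex) {
        if (dotIndex + 1 >= text.length()) {
            return false; // dot is the last character, nothing follows it
        }
        return Character.isLetter(text.charAt(dotIndex + 1));
    }

    public static void main(String[] args) {
        System.out.println(dotJoinsToken("E.coli", 1));    // true: 'c' is a letter
        System.out.println(dotJoinsToken("U.S.", 1));      // true: 'S' is a letter
        System.out.println(dotJoinsToken("end. Next", 3)); // false: a space follows the dot
    }
}
```

With this version "E.coli" and "U.S." are both kept together, while a dot followed by whitespace still ends the token.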