ritajs-v2
Diacritical characters inconsistency with tokenize()
I quite like the (?) undocumented convention that _ (underscore) lets input text generate tokens containing spaces, but there is an inconsistency when the character immediately before the underscore is outside the [a-zA-Z] range:
RT.tokenize("a_là"); // -> ["a là"]
RT.tokenize("a_la"); // -> ["a la"]
RT.tokenize("à_la"); // -> ["à_la"]
RT.tokenize("la_bas"); // -> ["la bas"]
RT.tokenize("lá_bas"); // -> ["lá_bas"]
@KarlieZhao can you take a look at this?
sure, I'll look into it
When generating tokens from strings with '_', the tokenizer matches a pattern made up of the underscore and the single letter before and after it. In the above PR, the range of letters is expanded to include U+00C0 to U+00FF, in addition to [A-Za-z]. The tokenizer should now be able to handle most Latin letters, including letters with tildes or accent marks, and numbers.
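For illustration, here is a minimal sketch of the kind of pattern described above. It is not RiTa's actual code: the names `LETTER`, `UNDERSCORE`, and `underscoresToSpaces` are hypothetical, and including the digits 0-9 in the character class is an assumption based on the mention of numbers.

```javascript
// Hypothetical sketch, NOT the RiTa implementation.
// Character class: ASCII letters, digits (assumed), and U+00C0 to U+00FF.
const LETTER = "A-Za-z0-9\\u00C0-\\u00FF";

// Match an underscore only when a character from the class sits on both sides.
// Lookbehind/lookahead keep the neighbours out of the match, so chains such as
// "a_b_c" are handled in a single pass.
const UNDERSCORE = new RegExp(`(?<=[${LETTER}])_(?=[${LETTER}])`, "g");

function underscoresToSpaces(text) {
  return text.replace(UNDERSCORE, " ");
}

console.log(underscoresToSpaces("à_la"));
console.log(underscoresToSpaces("lá_bas"));
```

With the wider class, the accented cases from the original report are treated the same way as the plain ASCII ones. A leading or trailing underscore is left alone because one side has no matching neighbour.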
One problem I think there might be, however, is that an email address containing '_' will correspondingly be tokenized as separate words with spaces...
output = RiTa.tokenize("[email protected]"); // --> (["an email address","@","gmail",".","com"]);
good catch -- can we have the tokenizer treat an email address (or URL) as a single token?
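One possible way to do that, sketched here under stated assumptions: skip the underscore-to-space step for any whitespace-delimited chunk that looks like an email address or URL. This is not how RiTa does it; `EMAIL`, `URL_LIKE`, and `expandUnderscores` are hypothetical names, the two detection patterns are deliberately loose, and `first_last@example.com` is a made-up address.

```javascript
// Hypothetical sketch, NOT the RiTa implementation.
// Loose detection patterns (assumptions): something@something.tld, or a chunk
// starting with http://, https:// or www.
const EMAIL = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
const URL_LIKE = /^(?:https?:\/\/|www\.)\S+$/i;

// Underscore with a letter or digit (ASCII or U+00C0 to U+00FF) on both sides.
const UNDERSCORE = /(?<=[A-Za-z0-9\u00C0-\u00FF])_(?=[A-Za-z0-9\u00C0-\u00FF])/g;

function expandUnderscores(text) {
  return text
    .split(/(\s+)/) // capture group keeps the whitespace so the text rejoins unchanged
    .map((chunk) =>
      // Leave emails and URLs untouched; expand underscores everywhere else.
      EMAIL.test(chunk) || URL_LIKE.test(chunk)
        ? chunk
        : chunk.replace(UNDERSCORE, " ")
    )
    .join("");
}

console.log(expandUnderscores("write to first_last@example.com la_bas"));
```

This only covers the underscore step. The tokenizer would still need to keep '@' and '.' inside such a chunk from being split off as separate tokens.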
great -- pls sync tests and code with java