ritajs-v2

Diacritical characters inconsistency with tokenize()

Open shadoof opened this issue 1 year ago • 5 comments

I quite like the (?) undocumented convention that _ (underscore) allows input text to generate tokens containing spaces, but there is inconsistency when the string preceding the underscore contains a character outside the [a-zA-Z] range:

RT.tokenize("a_là"); // -> ["a là"]
RT.tokenize("a_la"); // -> ["a la"]
RT.tokenize("à_la"); // -> ["à_la"]
RT.tokenize("la_bas"); // -> ["la bas"]
RT.tokenize("lá_bas"); // -> ["lá_bas"]

shadoof Aug 15 '22 17:08

@KarlieZhao can you take a look at this?

dhowe Aug 16 '22 03:08

> @KarlieZhao can you take a look at this?

sure, I'll look into it

KarlieZhao Aug 16 '22 06:08

When generating tokens from strings containing '_', the tokenizer matches a pattern made up of the underscore and the single letter on either side of it. In the above PR, the range of letters is expanded to include U+00C0 to U+00FF, in addition to [A-Za-z]. The tokenizer should now be able to handle most Latin letters, including letters with tildes or accent marks, and numbers.
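For anyone following along, here is a rough sketch of the kind of pattern described above. It is not RiTa's actual source: the names `ASCII_ONLY`, `WITH_LATIN1`, and `joinUnderscores` are made up for illustration, the use of lookarounds is an assumption about how one might write it, and digits are left out. It only shows why widening the character class changes the results reported in this issue.

```javascript
// Hypothetical sketch, NOT RiTa's implementation: turn an underscore into a
// space only when it sits directly between two "letter" characters.

// Letters limited to [a-zA-Z], which reproduces the inconsistency reported above.
const ASCII_ONLY = /(?<=[a-zA-Z])_(?=[a-zA-Z])/g;

// Letters widened with U+00C0..U+00FF (Latin-1 accented letters).
// Note that this range also contains the signs U+00D7 (×) and U+00F7 (÷).
const WITH_LATIN1 = /(?<=[a-zA-Z\u00C0-\u00FF])_(?=[a-zA-Z\u00C0-\u00FF])/g;

// Lookbehind/lookahead leave the neighbouring letters unconsumed, so every
// underscore in a chain such as "a_b_c" is replaced.
function joinUnderscores(text, pattern) {
  return text.replace(pattern, " ");
}

// "\u00E0" is "à", written as an escape so the precomposed form is guaranteed.
console.log(joinUnderscores("\u00E0_la", ASCII_ONLY));  // "à_la" (unchanged)
console.log(joinUnderscores("\u00E0_la", WITH_LATIN1)); // "à la"
console.log(joinUnderscores("a_b_c", WITH_LATIN1));     // "a b c"
```

One caveat with a fixed range like this: accented letters typed as a base letter plus a combining mark (decomposed form), and letters outside Latin-1, would still fall outside the class.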

One problem, however, is that an email address containing '_' will correspondingly be tokenized as separate words with spaces...

    output = RiTa.tokenize("an_email_address@gmail.com"); // --> ["an email address", "@", "gmail", ".", "com"]

KarlieZhao Aug 19 '22 13:08

good catch -- can we have the tokenizer treat an email address (or URL) as a single token?
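A minimal sketch of one way this could work, assuming addresses are checked before the underscore rule is applied. Everything here is hypothetical (`EMAIL_OR_URL`, `UNDERSCORE`, `joinUnlessAddress`, and the sample address are illustrative, not RiTa's API), and the patterns are deliberately loose rather than spec-compliant:

```javascript
// Hypothetical sketch, NOT RiTa's tokenizer: split on whitespace, keep any
// chunk that looks like an email address or an http(s) URL intact, and apply
// the underscore-to-space rule only to the remaining chunks.

// Loose shape checks (assumption): "something@something.something" or "http(s)://...".
const EMAIL_OR_URL = /^(?:[^\s@]+@[^\s@]+\.[^\s@]+|https?:\/\/\S+)$/;

// Underscore between two letters, with letters including U+00C0..U+00FF.
const UNDERSCORE = /(?<=[a-zA-Z\u00C0-\u00FF])_(?=[a-zA-Z\u00C0-\u00FF])/g;

function joinUnlessAddress(text) {
  return text
    .split(/\s+/)
    .filter((chunk) => chunk.length > 0)
    .map((chunk) =>
      EMAIL_OR_URL.test(chunk) ? chunk : chunk.replace(UNDERSCORE, " ")
    );
}

console.log(joinUnlessAddress("mail first_last@example.com about la_bas"));
// [ 'mail', 'first_last@example.com', 'about', 'la bas' ]
```

This leaves punctuation handling out entirely, so an address followed directly by a period or comma would keep that character attached; a real tokenizer would need to deal with that separately.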

dhowe Aug 19 '22 14:08

great -- pls sync tests and code with java

dhowe Aug 25 '22 15:08