ritajs-v2 Diacritical characters inconsistency with tokenize()

Open shadoof opened this issue 1 year ago • 5 comments

I quite like the (?) undocumented convention that _ (underscore) allows input text to generate tokens containing spaces, but there is inconsistency when the string preceding the underscore contains a character outside the [a-zA-Z] range:

    RT.tokenize("a_là"); // -> ["a là"]
    RT.tokenize("a_la"); // -> ["a la"]
    RT.tokenize("à_la"); // -> ["à_la"]
    RT.tokenize("la_bas"); // -> ["la bas"]
    RT.tokenize("lá_bas"); // -> ["lá_bas"]

Aug 15 '22 17:08 shadoof

@KarlieZhao can you take a look at this?

Aug 16 '22 03:08 dhowe

> @KarlieZhao can you take a look at this?

sure, I'll look into it

Aug 16 '22 06:08 KarlieZhao

When generating tokens from strings containing '_', the tokenizer matches a pattern consisting of the underscore and the single letter before and after it. In the above PR, the range of letters is expanded to include U+00C0 through U+00FF, in addition to [A-Za-z]. The tokenizer should now be able to handle most Latin letters, including letters with tildes or accent marks, and numbers.
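To make the change concrete, here is a minimal sketch of the kind of pattern described above. It is not RiTa's actual source: `ASCII_PAIR`, `LATIN1_PAIR`, and `joinUnderscores` are hypothetical names, and the sketch only covers the letter ranges, not numbers.

```javascript
// Hypothetical illustration, not RiTa's implementation.
// ASCII-only pattern: an underscore is rewritten only between two [A-Za-z] letters.
const ASCII_PAIR = /([A-Za-z])_(?=[A-Za-z])/g;

// Expanded pattern: also accepts U+00C0 through U+00FF (à, á, ñ, ...).
// Note that this range also contains the non-letters × (U+00D7) and ÷ (U+00F7).
const LATIN1_PAIR = /([A-Za-z\u00C0-\u00FF])_(?=[A-Za-z\u00C0-\u00FF])/g;

// Replace each letter_letter underscore with a space, leaving other underscores alone.
// The lookahead keeps the following letter available for the next match (e.g. "a_b_c").
function joinUnderscores(text, pattern) {
  return text.replace(pattern, "$1 ");
}

console.log(joinUnderscores("à_la", ASCII_PAIR));    // "à_la"  (the inconsistency reported)
console.log(joinUnderscores("à_la", LATIN1_PAIR));   // "à la"
console.log(joinUnderscores("lá_bas", LATIN1_PAIR)); // "lá bas"
```

With the ASCII-only pattern, the accented letter before the underscore prevents a match, which is consistent with the outputs reported at the top of this issue.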

One problem I think there might be, however, is that an email address containing '_' will correspondingly be tokenized as separate words with spaces...

    output = RiTa.tokenize("[email protected]"); // --> (["an email address","@","gmail",".","com"]);

Aug 19 '22 13:08 KarlieZhao

good catch -- can we have the tokenizer treat an email address (or URL) as a single token?
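One possible approach, sketched here under assumptions rather than taken from the RiTa codebase: find email addresses and URLs first, copy them through unchanged, and apply the underscore rule only to the text between them. The names `PROTECTED`, `LETTER_PAIR`, and `joinUnderscoresExceptAddresses` are hypothetical, and the address patterns are deliberately loose.

```javascript
// Hypothetical illustration, not RiTa's implementation.
// Loose patterns for URLs and email addresses; real-world matching needs more care
// (for example, \S+ will also swallow punctuation that trails a URL).
const PROTECTED = /(?:https?:\/\/\S+|[\w.+-]+@[\w-]+(?:\.[\w-]+)+)/g;

// Underscore between two letters, including U+00C0 through U+00FF.
const LETTER_PAIR = /([A-Za-z\u00C0-\u00FF])_(?=[A-Za-z\u00C0-\u00FF])/g;

function joinUnderscoresExceptAddresses(text) {
  let result = "";
  let last = 0;
  for (const match of text.matchAll(PROTECTED)) {
    // Rewrite underscores only in the text before this protected span.
    result += text.slice(last, match.index).replace(LETTER_PAIR, "$1 ");
    // Copy the email address or URL through untouched.
    result += match[0];
    last = match.index + match[0].length;
  }
  // Handle whatever follows the last protected span.
  return result + text.slice(last).replace(LETTER_PAIR, "$1 ");
}

console.log(joinUnderscoresExceptAddresses("mail some_user@example.com about la_bas"));
// "mail some_user@example.com about la bas"
```

This only keeps the underscores intact; the tokenizer would still need a rule that emits each protected span as one token instead of splitting on "@" and ".".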

Aug 19 '22 14:08 dhowe

great -- pls sync tests and code with java

Aug 25 '22 15:08 dhowe
