
French words that contain a single quote get broken down

Open · joshweir opened this issue 7 years ago · 1 comment

Tokenizer::WhitespaceTokenizer.new.tokenize "et souligne l'interrelation étroite de l'imagerie avec le comportement" 
=> ["et", "souligne", "l", "'", "i", "n", "t", "e", "r", "r", "e", "l", "a", "t", "i", "o", "n", "étroite", "de", "l", "'", "i", "m", "a", "g", "e", "r", "i", "e", "avec", "le", "comportement"]

Looking at tokenizer.rb, this happens because PRE_N_POST = ['"', "'"]: the single quote is treated as a pre/post splitter, so every character after it is assumed to be a separate token. I'll look at tackling this. The only splittables that look problematic are ' and ., since both can legitimately appear within a token: the single quote inside French words, and the period inside tokens like email addresses. One approach could be to treat ' or . as a splittable only when it appears at the beginning or end of a token, not within it.
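The boundary-only approach described above could be sketched roughly like this (a hypothetical standalone tokenize helper, not the gem's actual implementation):

```ruby
# Hypothetical sketch: treat ' and . as splittable only when they sit at
# the start or end of a whitespace-delimited token, never inside it.
BOUNDARY_PUNCT = ["'", "."].freeze

def tokenize(text)
  text.split(/\s+/).flat_map do |token|
    prefix = []
    suffix = []
    # Peel punctuation off the front of the token, one char at a time.
    while token.length > 1 && BOUNDARY_PUNCT.include?(token[0])
      prefix << token[0]
      token = token[1..-1]
    end
    # Peel punctuation off the back of the token.
    while token.length > 1 && BOUNDARY_PUNCT.include?(token[-1])
      suffix.unshift(token[-1])
      token = token[0..-2]
    end
    prefix + [token] + suffix
  end
end

tokenize("et souligne l'interrelation étroite.")
# => ["et", "souligne", "l'interrelation", "étroite", "."]
```

Under this rule, the apostrophe in "l'interrelation" is internal and therefore left alone, while the sentence-final period is still split off as its own token.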

joshweir avatar Mar 31 '17 04:03 joshweir

With respect to this code in tokenizer.rb:

output << prefix.split('') unless prefix.empty?
output << stem unless stem.empty?
output << suffix.split('') unless suffix.empty?

I'm wondering about the reason for .split('') on the suffix and prefix. If the text being tokenized doesn't use perfect grammar, every character of a token that follows a splittable ends up as its own token, for example:

tokenize "test(this)"
=> ["test","(","t","h","i","s",")"]

I'm thinking that if we avoid using split('') on the suffix and prefix, we could avoid producing that run of single-character tokens after the splittables?
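The proposed change could look like the sketch below, assuming a hypothetical emit helper that stands in for the three output lines quoted above (prefix, stem, and suffix are whatever the tokenizer has already computed):

```ruby
# Hypothetical sketch of the proposed change: push prefix and suffix onto
# the output as whole tokens instead of one token per character.
def emit(prefix, stem, suffix)
  output = []
  output << prefix unless prefix.empty?   # was: output << prefix.split('')
  output << stem unless stem.empty?
  output << suffix unless suffix.empty?   # was: output << suffix.split('')
  output
end

emit("", "test", "(this)")
# => ["test", "(this)"]
```

One trade-off to note: without split(''), a multi-character affix such as "(this)" stays as a single token, so some further splitting of the affix on the splittable characters themselves would probably still be needed to separate "(" and ")" from "this".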

joshweir avatar Mar 31 '17 04:03 joshweir