French words that contain a single quote get broken down
Tokenizer::WhitespaceTokenizer.new.tokenize "et souligne l'interrelation étroite de l'imagerie avec le comportement"
=> ["et", "souligne", "l", "'", "i", "n", "t", "e", "r", "r", "e", "l", "a", "t", "i", "o", "n", "étroite", "de", "l", "'", "i", "m", "a", "g", "e", "r", "i", "e", "avec", "le", "comportement"]
Looking at tokenizer.rb, this happens because of: PRE_N_POST = ['"', "'"]. The single quote is treated as a pre/post splitter, so every character that follows it within the token is emitted as its own token. I'll look at tackling this. The only splittables that look problematic are ' and ., since both can legitimately appear inside a token: the single quote in French words and the period in tokens like email addresses. I was thinking the approach could be to treat ' or . as a splittable only when it appears at the beginning or end of a token, not within it.
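A rough sketch of that idea (the method name and the SPLITTABLES constant are my own names for illustration, not the gem's): peel splittable characters off the ends of each whitespace-separated chunk and leave interior ones alone.

SPLITTABLES = ["'", '"', '.', ',', '!', '?', '(', ')', '[', ']']

def tokenize_edges_only(text)
  text.split(/\s+/).flat_map do |chunk|
    prefix = []
    suffix = []
    # strip splittables from the front, always leaving at least one character
    prefix << chunk.slice!(0) while chunk.length > 1 && SPLITTABLES.include?(chunk[0])
    # strip splittables from the back
    suffix.unshift(chunk.slice!(-1)) while chunk.length > 1 && SPLITTABLES.include?(chunk[-1])
    prefix + [chunk] + suffix
  end
end

tokenize_edges_only("et souligne l'interrelation étroite de l'imagerie")
=> ["et", "souligne", "l'interrelation", "étroite", "de", "l'imagerie"]

With this, interior quotes and periods (l'imagerie, foo@bar.com) survive, while the same characters at a token edge still come off as their own tokens.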
With respect to this code in tokenizer.rb:
output << prefix.split('') unless prefix.empty?
output << stem unless stem.empty?
output << suffix.split('') unless suffix.empty?
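For reference, split('') explodes a string into one-character strings, which is exactly where the single-character tokens come from:

"(this)".split('')
=> ["(", "t", "h", "i", "s", ")"]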
I'm wondering about the reason for .split('') on the suffix and prefix? For example, if the text being tokenized isn't using the best grammar, every character inside the token after the splittable ends up tokenized individually, for example:
tokenize "test(this)"
=> ["test","(","t","h","i","s",")"]
I'm thinking that if we avoid using split('') on the suffix and prefix, we could avoid producing a single-character token for everything after the splittable?
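A minimal sketch of the change I mean, reusing the prefix/stem/suffix names from the snippet above (and assuming output is flattened afterwards, since the current << calls push arrays):

output << prefix unless prefix.empty?
output << stem unless stem.empty?
output << suffix unless suffix.empty?

Judging by the output above, for "test(this)" the stem is "test" and the suffix is "(this)", so this would yield ["test", "(this)"] instead of seven tokens. Whether "(this)" should then be split further into "(", "this", ")" is a separate question, which the edge-only splitting sketched earlier would handle.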