rake-rs

Tokenization of 's

Open · kornelski opened this issue 6 years ago • 1 comment

The punctuation regex includes apostrophe, so it splits "foo's" as two separate phrases. I'm seeing "s something" in keywords.

I think it could be fixed by using less smart splitting:

    text.split(|c: char| match c {
        '.' | ',' | '!' | '?' | ':' | ';' | '(' | ')' | '{' | '}' => true,
        _ => false,
    }).filter(|s| !s.is_empty()).for_each(|s| {
        let mut phrase = Vec::new();
        s.split(|c: char| !c.is_alphanumeric() && c != '\'' && c != '’')
            .filter(|s| !s.is_empty())
            .for_each(|word| {
                let word = word.trim_matches(|c: char| !c.is_alphanumeric());
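
The snippet above is cut off before the closures end, so here is a minimal self-contained sketch of the same two-level split. The function name `split_phrases` and the `Vec<Vec<String>>` return type are assumptions made for illustration; they are not taken from rake-rs.

```rust
/// Hypothetical helper (not part of rake-rs): splits `text` into phrases at
/// a fixed set of ASCII punctuation marks, then splits each phrase into words.
/// Apostrophes (' and ’) are kept inside words, so "foo's" stays one token.
fn split_phrases(text: &str) -> Vec<Vec<String>> {
    text.split(|c: char| {
            matches!(c, '.' | ',' | '!' | '?' | ':' | ';' | '(' | ')' | '{' | '}')
        })
        .filter(|s| !s.is_empty())
        .filter_map(|s| {
            let phrase: Vec<String> = s
                // Word boundary: anything that is neither alphanumeric nor an apostrophe.
                .split(|c: char| !c.is_alphanumeric() && c != '\'' && c != '’')
                // Strip quote-like apostrophes at the edges, e.g. 'Quoted' -> Quoted.
                .map(|word| word.trim_matches(|c: char| !c.is_alphanumeric()))
                .filter(|word| !word.is_empty())
                .map(String::from)
                .collect();
            // Drop phrases that contained only separators.
            if phrase.is_empty() { None } else { Some(phrase) }
        })
        .collect()
}
```

Calling `split_phrases("Foo's bar is here, isn't it?")` under these assumptions should give two phrases, with `Foo's` and `isn't` each kept as a single word instead of leaving a stray `s` or `t` token.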

kornelski, Mar 12 '19 13:03

Please note that the library should be multilingual; for example, ، and ؛ are punctuation characters in Persian. That is why \p{P} is the easier choice for multilingual support. However, as you mentioned, the apostrophe in 's should not be treated as a delimiter.
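
One way to keep the splitting language-independent while leaving apostrophes alone can be sketched with the standard library only. This is an approximation, not the \p{P} class itself: it assumes that any character which is not a letter, digit, whitespace, or apostrophe ends a phrase, so it also splits on symbols such as `$`. The names `is_phrase_delimiter` and `split_phrases_multilingual` are hypothetical.

```rust
/// Hypothetical predicate standing in for "\p{P} minus apostrophes".
/// Assumption: every character that is not alphanumeric, whitespace,
/// or an apostrophe (' or ’) is treated as a phrase delimiter.
fn is_phrase_delimiter(c: char) -> bool {
    !(c.is_alphanumeric() || c.is_whitespace() || c == '\'' || c == '’')
}

/// Splits `text` into trimmed, non-empty phrases using the predicate above.
/// `char::is_alphanumeric` is Unicode-aware, so non-ASCII punctuation such
/// as the Arabic comma (،) and semicolon (؛) acts as a delimiter as well.
fn split_phrases_multilingual(text: &str) -> Vec<String> {
    text.split(is_phrase_delimiter)
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}
```

With this sketch, `split_phrases_multilingual("foo's bar، baz؛ qux!")` should yield the phrases `foo's bar`, `baz`, and `qux`.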

yaa110, Mar 12 '19 13:03