rake-rs
Tokenization of 's
The punctuation regex includes the apostrophe, so it splits "foo's" into two separate phrases. As a result, I'm seeing "s something" in the keywords.
I think it could be fixed by using simpler splitting:
text.split(|c: char| match c {
    '.' | ',' | '!' | '?' | ':' | ';' | '(' | ')' | '{' | '}' => true,
    _ => false,
}).filter(|s| !s.is_empty()).for_each(|s| {
    let mut phrase = Vec::new();
    s.split(|c: char| !c.is_alphanumeric() && c != '\'' && c != '’').filter(|s| !s.is_empty()).for_each(|word| {
        let word = word.trim_matches(|c: char| !c.is_alphanumeric());
Please note that the library should be multilingual; for example, ، and ؛ are punctuation characters in Persian. So \p{P} is easier to use for multilingual support. However, the apostrophe in 's must not be treated as a phrase separator, as you mentioned.
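One way to reconcile the two points might look like the sketch below. It is not the library's actual code: `is_phrase_break` and `split_phrases` are hypothetical names, and the phrase-break test is only a std-only approximation of \p{P} (any character that is neither alphanumeric nor whitespace), with the ASCII and typographic apostrophes excluded.

```rust
/// Hypothetical helper: returns true for characters that end a phrase.
/// Approximates `\p{P}` with std only: anything that is neither
/// alphanumeric nor whitespace, except the two apostrophe forms.
fn is_phrase_break(c: char) -> bool {
    !c.is_alphanumeric() && !c.is_whitespace() && c != '\'' && c != '’'
}

/// Hypothetical helper: splits text into phrases, each a list of words.
fn split_phrases(text: &str) -> Vec<Vec<String>> {
    text.split(is_phrase_break)
        .map(|phrase| {
            phrase
                .split_whitespace()
                // Strip leading/trailing apostrophes (e.g. quotes around a word)
                // while keeping apostrophes inside the word, as in "foo's".
                .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_string())
                .filter(|w| !w.is_empty())
                .collect::<Vec<String>>()
        })
        .filter(|words| !words.is_empty())
        .collect()
}

fn main() {
    for phrase in split_phrases("foo's bar, baz") {
        println!("{:?}", phrase);
    }
    for phrase in split_phrases("سلام، دنیا؛ خوب") {
        println!("{:?}", phrase);
    }
}
```

This approximation is broader than \p{P}: it also breaks on symbols, hyphens, and format characters such as the zero-width non-joiner, so a real fix would more likely keep the regex's \p{P} class and exclude the apostrophes from it.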