Word stemming for multiple languages with Snowball, e.g. French, Spanish ...
What: ParadeDB currently supports English stemming only, through "en_stem". Snowball is an out-of-the-box library that supports stemming for many languages, such as French and Spanish, so I suggest this new feature: stemming for other languages with Snowball.
Why: ES supports stemming for different languages through plugins.
How: It can be implemented using snowball-rust (see also the Snowball GitHub repository and the Snowball demo).
I read the implementation of ParadeDB's [en_stem] tokenizer and found that it uses tantivy's English stemmer. Since tantivy already offers stemmers for many languages, using them might be the best choice.
https://github.com/quickwit-oss/tantivy/blob/main/src/tokenizer/stemmer.rs
pub enum Language {
    Arabic,
    Danish,
    Dutch,
    English,
    Finnish,
    French,
    German,
    Greek,
    Hungarian,
    Italian,
    Norwegian,
    Portuguese,
    Romanian,
    Russian,
    Spanish,
    Swedish,
    Tamil,
    Turkish,
}
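To make the idea concrete, here is a minimal sketch of how a tokenizer name in the style of "en_stem" could be resolved to one of the languages above. It is only an illustration: the `Language` enum below is a local mirror of the tantivy variants so the snippet compiles on its own, and `language_from_tokenizer_name` and the `<code>_stem` naming scheme (ISO 639-1 codes) are hypothetical, not existing ParadeDB or tantivy API.

```rust
/// Local mirror of the variants listed above, so this sketch compiles without
/// the tantivy crate. A real integration would use tantivy's own enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Arabic,
    Danish,
    Dutch,
    English,
    Finnish,
    French,
    German,
    Greek,
    Hungarian,
    Italian,
    Norwegian,
    Portuguese,
    Romanian,
    Russian,
    Spanish,
    Swedish,
    Tamil,
    Turkish,
}

/// Hypothetical helper: resolve a name such as "fr_stem" to a language.
/// The "<ISO 639-1 code>_stem" naming scheme is an assumption modelled on
/// the existing "en_stem"; it is not something ParadeDB defines today.
pub fn language_from_tokenizer_name(name: &str) -> Option<Language> {
    // Names without the "_stem" suffix are not stemming tokenizers.
    let code = name.strip_suffix("_stem")?;
    let language = match code {
        "ar" => Language::Arabic,
        "da" => Language::Danish,
        "nl" => Language::Dutch,
        "en" => Language::English,
        "fi" => Language::Finnish,
        "fr" => Language::French,
        "de" => Language::German,
        "el" => Language::Greek,
        "hu" => Language::Hungarian,
        "it" => Language::Italian,
        "no" => Language::Norwegian,
        "pt" => Language::Portuguese,
        "ro" => Language::Romanian,
        "ru" => Language::Russian,
        "es" => Language::Spanish,
        "sv" => Language::Swedish,
        "ta" => Language::Tamil,
        "tr" => Language::Turkish,
        // Unknown code: let the caller decide how to report the error.
        _ => return None,
    };
    Some(language)
}
```

With a mapping like this, the resolved variant could be handed to tantivy's `Stemmer::new(language)` token filter in place of the hard-coded English one; how that should be wired into ParadeDB's tokenizer configuration is left open here.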
See https://github.com/paradedb/paradedb/pull/1264#issuecomment-2212610679: there's a simpler way to solve this problem :)