
Word stemming for multiple languages with Snowball, e.g. French, Spanish ...

Open sunxk opened this issue 2 years ago • 1 comments

What: ParadeDB currently supports only English stemming, through "en_stem". Snowball is an off-the-shelf library that supports stemming for many languages, such as French and Spanish, so I suggest a new feature: stemming for other languages with Snowball.

Why: ES supports stemming for different languages through a plugin.

How: It can be implemented using snowball-rust. See also: snowball github, snowball demo.

sunxk avatar Apr 15 '24 07:04 sunxk

I read the implementation of ParadeDB's [en_stem] tokenizer and found that it uses tantivy's English stemmer. Since tantivy already offers stemmers for many languages, using tantivy's stemmers might be the best choice.

https://github.com/quickwit-oss/tantivy/blob/main/src/tokenizer/stemmer.rs

pub enum Language {
    Arabic,
    Danish,
    Dutch,
    English,
    Finnish,
    French,
    German,
    Greek,
    Hungarian,
    Italian,
    Norwegian,
    Portuguese,
    Romanian,
    Russian,
    Spanish,
    Swedish,
    Tamil,
    Turkish,
}
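
As a rough sketch of how this could be exposed, tokenizer names modeled on the existing "en_stem" could be resolved to one of the variants above. The naming scheme (`fr_stem`, `es_stem`, ...) and the helper `language_for_tokenizer` are assumptions made for illustration, not ParadeDB's actual API, and the enum is mirrored locally only so the snippet compiles without tantivy.

```rust
// Local mirror of the variants listed above, so this sketch compiles on its own.
// Real code would import `tantivy::tokenizer::Language` instead.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Arabic, Danish, Dutch, English, Finnish, French,
    German, Greek, Hungarian, Italian, Norwegian, Portuguese,
    Romanian, Russian, Spanish, Swedish, Tamil, Turkish,
}

/// Hypothetical helper: resolve a tokenizer name such as "fr_stem" to a language.
/// The "<ISO 639-1 code>_stem" convention is an assumption modeled on "en_stem".
pub fn language_for_tokenizer(name: &str) -> Option<Language> {
    // Names without the "_stem" suffix are not stemming tokenizers.
    let code = name.strip_suffix("_stem")?;
    let language = match code {
        "ar" => Language::Arabic,
        "da" => Language::Danish,
        "nl" => Language::Dutch,
        "en" => Language::English,
        "fi" => Language::Finnish,
        "fr" => Language::French,
        "de" => Language::German,
        "el" => Language::Greek,
        "hu" => Language::Hungarian,
        "it" => Language::Italian,
        "no" => Language::Norwegian,
        "pt" => Language::Portuguese,
        "ro" => Language::Romanian,
        "ru" => Language::Russian,
        "es" => Language::Spanish,
        "sv" => Language::Swedish,
        "ta" => Language::Tamil,
        "tr" => Language::Turkish,
        // Unknown code: no stemmer available for this name.
        _ => return None,
    };
    Some(language)
}
```

With tantivy as a dependency, the resolved variant would then be passed to `Stemmer::new` when building the text analyzer, in the same way the English stemmer is constructed today.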

sunxk avatar Apr 15 '24 08:04 sunxk

see: https://github.com/paradedb/paradedb/pull/1264#issuecomment-2212610679 there's a simpler way to solve this problem :)

hailelagi avatar Jul 07 '24 23:07 hailelagi