Does it support a Chinese tokenizer or custom tokenizers?

Open lingo-xp opened this issue 2 years ago • 2 comments

Describe the solution you'd like

Is there any plan to support a Chinese tokenizer? I just checked the documentation and found that it isn't supported yet. Maybe custom tokenizers could be supported as plugins, like in ES.

lingo-xp avatar Jul 07 '22 03:07 lingo-xp

We don't have any plugins at the moment, but a company has already requested tokenizers for other languages, so we can probably add Chinese too.

However, I cannot speak Chinese. Do you know a good Chinese tokenizer in Rust?

tantivy already supports jieba-rs (via https://github.com/DCjanus/cang-jie) and Lindera (via https://github.com/lindera-morphology/lindera-tantivy).

fulmicoton avatar Jul 07 '22 06:07 fulmicoton

@lingo-xp, we have just merged a Chinese tokenizer (#2008).

Does this new feature suit your needs?

fmassot avatar Sep 30 '22 09:09 fmassot

Closing. @lingo-xp, don't hesitate to reopen it.

fmassot avatar Nov 29 '22 12:11 fmassot
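
For readers landing on this thread, here is a rough illustration of why Chinese text needs a dedicated tokenizer at all: written Chinese has no spaces between words, so a whitespace split returns a whole sentence as one token. The sketch below uses only the Rust standard library and falls back to emitting each CJK ideograph as its own token. The names `is_cjk` and `tokenize` are hypothetical; this is not the tantivy or Quickwit API, and it is not claimed to match what #2008 implements. Dictionary-based segmenters such as jieba-rs or Lindera produce word-level tokens instead.

```rust
/// Returns true for code points in the main CJK Unified Ideographs block.
/// Assumption: this single block is enough for the illustration; the
/// extension blocks are deliberately ignored.
fn is_cjk(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Hypothetical fallback tokenizer: every CJK ideograph becomes its own
/// token, and runs of ASCII alphanumerics are grouped into one lowercased
/// token. Everything else (spaces, punctuation) only acts as a separator.
fn tokenize(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut word = String::new();
    for c in text.chars() {
        if c.is_ascii_alphanumeric() {
            word.push(c.to_ascii_lowercase());
            continue;
        }
        // Any other character ends the ASCII word being built.
        if !word.is_empty() {
            tokens.push(std::mem::take(&mut word));
        }
        if is_cjk(c) {
            tokens.push(c.to_string());
        }
    }
    if !word.is_empty() {
        tokens.push(word);
    }
    tokens
}

fn main() {
    let text = "Quickwit 支持中文";
    // Whitespace splitting keeps the whole Chinese phrase as a single token.
    let by_whitespace: Vec<&str> = text.split_whitespace().collect();
    println!("{:?}", by_whitespace);
    // The per-character fallback makes each ideograph searchable.
    println!("{:?}", tokenize(text));
}
```

Per-character tokens keep the index simple and need no dictionary, at the cost of precision: a query matches any document containing the same characters, so phrase queries are usually needed to recover word-level matches.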