tantiny
Custom tokenizer
I want to use Tantiny with Japanese. There are several Tantivy tokenizers for the Japanese language. I'm now considering lindera-tantivy, which supports not only Japanese but also Chinese and Korean. Is it possible to use these custom tokenizers with Tantivy via Tantiny?
Hey @morygonzalez, currently Tantiny does not support custom tokenizers. I had some ideas about how to implement it, but it's a complex issue to tackle because it requires extending behaviour at runtime, which is not easy to do with Rust (let alone its interaction with Ruby).
However, it seems that lindera is quite a useful project, and it might make sense to just add a new tokenizer type to Tantiny that uses it. This is much easier than dealing with custom tokenizers. What do you think?
@baygeldin Thank you! That's cool. I'm happy with your suggestion!!
Okay, I'll see what I can do, but probably after I deal with aggregations (or you can make a PR yourself if you want).
I see. I'll try to make a pull request, though I'm quite new to Rust, so it'll take some time.