Custom tokenizer

Open morygonzalez opened this issue 2 years ago • 4 comments

I want to use Tantiny with Japanese. There are several Tantivy tokenizers for the Japanese language. I'm currently considering lindera-tantivy, which supports not only Japanese but also Chinese and Korean. Is it possible to use these custom Tantivy tokenizers via Tantiny?

morygonzalez avatar May 19 '22 12:05 morygonzalez

Hey @morygonzalez, currently Tantiny does not support custom tokenizers. I had some ideas on how to implement it, but it's a complex issue to tackle because it requires extending behaviour at runtime, which is not easy to do with Rust (let alone its interaction with Ruby).

However, it seems that Lindera is quite a useful project, and it might make sense to just add a new tokenizer type to Tantiny that uses it. This is much easier than dealing with custom tokenizers. What do you think?

baygeldin avatar May 21 '22 12:05 baygeldin

@baygeldin Thank you! That's cool. I'm happy with your suggestion!!

morygonzalez avatar May 22 '22 00:05 morygonzalez

Okay, I'll see what I can do, but probably after I deal with aggregations (or you can make a PR yourself if you want).

baygeldin avatar May 23 '22 14:05 baygeldin

I see. I'll try to make a Pull Request, though I'm quite new to Rust, so it'll take some time.

morygonzalez avatar May 24 '22 14:05 morygonzalez