Supporting tokenizer registration
Currently, the tokenizer is hard-coded to the default (see the line below). It would be better to support configurable tokenizers for Chinese (tantivy-jieba and cang-jie), Japanese (lindera and tantivy-tokenizer-tiny-segmenter), and Korean (lindera + lindera-ko-dic-builder).
https://github.com/tantivy-search/tantivy-py/blob/4ecf7119ea2fc5b3660f38d91a37dfb9e71ece7d/src/schemabuilder.rs#L85
@fulmicoton
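To sketch the idea (not actual tantivy-py code; the helper name below is made up), the hard-coded tokenizer could be replaced by a caller-supplied name:

use tantivy::schema::{IndexRecordOption, TextFieldIndexing, TextOptions};

// Sketch only: let the caller pick the tokenizer name instead of the
// hard-coded default. `text_field_options` is a hypothetical helper.
fn text_field_options(tokenizer_name: &str, stored: bool) -> TextOptions {
    let indexing = TextFieldIndexing::default()
        .set_tokenizer(tokenizer_name) // e.g. "default", "lang_ja", "jieba"
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    let mut options = TextOptions::default().set_indexing_options(indexing);
    if stored {
        options = options.set_stored();
    }
    options
}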
Also note that tantivy-py does not come with a Japanese tokenizer. There is a good, maintained tokenizer for Tantivy called Lindera. If you know Rust, you can compile your own version of tantivy-py that includes it.
I am trying to add LinderaTokenizer to https://github.com/tantivy-search/tantivy-py/blob/master/src/schemabuilder.rs#L85, but I can't figure out where
index
    .tokenizers()
    .register("lang_ja", LinderaTokenizer::new("decompose", ""));
should go. Do you have any idea?
Anywhere as long as it happens before you index your documents.
Also make sure you declare in the schema that you want to use the tokenizer named "lang_ja" for your Japanese fields.
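For example, something along these lines (a sketch using the tantivy schema API and the Lindera constructor from the snippet above; adjust names to your code):

use lindera_tantivy::tokenizer::LinderaTokenizer;
use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};
use tantivy::Index;

// Declare a text field that uses the tokenizer named "lang_ja".
let mut schema_builder = Schema::builder();
let ja_indexing = TextFieldIndexing::default()
    .set_tokenizer("lang_ja") // must match the name passed to register() below
    .set_index_option(IndexRecordOption::WithFreqsAndPositions);
let ja_options = TextOptions::default()
    .set_indexing_options(ja_indexing)
    .set_stored();
schema_builder.add_text_field("title", ja_options);
let schema = schema_builder.build();

// Register the tokenizer under that name before indexing any documents.
let index = Index::create_in_ram(schema);
index
    .tokenizers()
    .register("lang_ja", LinderaTokenizer::new("decompose", ""));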
Note that these tokenizers typically require shipping a dictionary that is several MB in size, so they will not be shipped by default. Ideally that should live in a separate Python package, and registration of the tokenizer should be done by the user, as suggested by @acc557.
What's the progress on adding support for configurable tokenizers like tantivy-jieba? This is badly needed for indexing non-ASCII text.
I don't have time to work on this but any help is welcome.
Could you provide some directions/suggestions we could try? I am willing to work on this. Thank you~
I think a useful approach might be to add optional Cargo features to this crate that can be enabled when building it from source with Maturin, to pull in additional tokenizers. I'm not sure how best to integrate this with pip's optional-dependency support, though...
I will look at this within the next two weeks or so.
For (my own) future reference, the upstream tantivy docs for custom tokenizers are here.
I've started working on this in a branch here (currently incomplete): https://github.com/cjrh/tantivy-py/tree/custom-tokenizer-support
I think it will be possible to add support via features as suggested. We could also consider publishing builds that include support, just to make things a bit easier for users who might not have, or want, a Rust toolchain. But we'll have to be careful about a combinatorial explosion of builds; perhaps we'll limit the platforms for the "big" build, for example.
I've done a bit more work and put up my PR in draft mode: #200. I will try to add tantivy-jieba in a similar way, behind a feature flag, in the next batch of work I get around to.
The user will have to build the tantivy-py wheel with the additional build-args="--features=lindera" setting. (The tests demonstrate this.)
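On the Rust side, the gist of the feature gating is roughly this (a simplified sketch, not the exact PR code; the inner index field name is an assumption):

use pyo3::prelude::*;
#[cfg(feature = "lindera")]
use lindera_tantivy::tokenizer::LinderaTokenizer;

#[pymethods]
impl Index {
    /// Only available when the wheel was built with `--features=lindera`.
    #[cfg(feature = "lindera")]
    fn register_lindera_tokenizer(&self) {
        // Registers the tokenizer under the name "lang_ja"; fields declared
        // with tokenizer_name="lang_ja" in the schema will use it.
        self.index
            .tokenizers()
            .register("lang_ja", LinderaTokenizer::new("decompose", ""));
    }
}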
I've added a small Python test that shows the "user API" for enabling Lindera. We could decide that if the build is a Lindera build, it should not be necessary to manually register the Lindera tokenizer as is done below:
from tantivy import Document, Index, SchemaBuilder

def test_basic():
    sb = SchemaBuilder()
    sb.add_text_field("title", stored=True, tokenizer_name="lang_ja")
    schema = sb.build()
    index = Index(schema)
    index.register_lindera_tokenizer()
    writer = index.writer(50_000_000)
    doc = Document()
    doc.add_text("title", "成田国際空港")
    writer.add_document(doc)
    writer.commit()
    index.reload()
What is the user's expectation about whether something like register_lindera_tokenizer() should need to be called?
Also, there are things that seem like settings in the configuration of the tokenizer itself (what's "mode"?). ~And finally, the examples in the README at https://github.com/lindera-morphology/lindera-tantivy show use of TextOptions, which means we probably need support for that in tantivy-py?~ (already done)