Optional Lindera tokenizer support (was: Custom tokenizer support)
See #25
This is an early draft for discussion purposes; it's not ready to be used. The primary change is that we incorporate Lindera tokenizer support (via lindera-tantivy) under a Cargo feature flag, which means tantivy-py must be built with that feature enabled. The included tests have an example of this, and it is possible to run the tests both with and without the Lindera support. We need to decide how we want this to be set up.
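For concreteness, here is a rough sketch of what the compile-time gating looks like. The feature name `lindera` and the module layout are assumptions, not necessarily what this branch uses:

```rust
// Sketch only: everything Lindera-related sits behind a Cargo feature, so a
// default build of tantivy-py compiles without pulling in the Lindera crates.
// This is also why the tests can run both with and without the support.

#[cfg(feature = "lindera")]
mod lindera_support {
    // The lindera-tantivy imports and the Python-facing registration code
    // would live here, compiled only with `--features lindera`.
}

#[cfg(not(feature = "lindera"))]
mod lindera_support {
    // Empty stub: the rest of the crate compiles unchanged without the
    // feature, and the registration function simply isn't exported.
}
```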
I've added all the Lindera features, and also an interface for specifying configurable options at registration time. The code is still rough; for example, quite a few things need renaming. I'd also love to figure out how to spell the function signatures correctly, e.g. for the register_lindera_tokenizer() function; one possible spelling is sketched below.
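This is only a sketch of what I have in mind: the `Index` stand-in, the option names, and their defaults are placeholders, not the identifiers actually used in this branch.

```rust
use pyo3::prelude::*;
use tantivy::tokenizer::TokenizerManager;

/// Minimal stand-in for tantivy-py's real Index wrapper, just to show the
/// shape of the method.
#[pyclass]
struct Index {
    tokenizers: TokenizerManager,
}

#[pymethods]
impl Index {
    /// Hypothetical signature: a tokenizer name plus a couple of Lindera
    /// options exposed as optional keyword arguments.
    /// (In the real code this method would sit behind the feature gate.)
    #[pyo3(signature = (name, dictionary = None, mode = None))]
    fn register_lindera_tokenizer(
        &self,
        name: &str,
        dictionary: Option<&str>,
        mode: Option<&str>,
    ) -> PyResult<()> {
        // A real implementation would build a lindera-tantivy tokenizer from
        // `dictionary` and `mode` (falling back to sensible defaults) and
        // register it under `name`:
        //     self.tokenizers.register(name, lindera_tokenizer);
        let _ = (&self.tokenizers, name, dictionary, mode);
        Ok(())
    }
}
```

From Python this would then read roughly as `index.register_lindera_tokenizer("lang_ja", dictionary="ipadic", mode="normal")`.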
It would be much nicer to provide this as an additional package that can be installed separately. Rust's lack of a stable ABI prevents that, unless we wrap the Lindera code in a C wrapper and call that instead. I'm in two minds about proceeding with this PR as-is, which is why I haven't moved forward with it. Still thinking about it.
Just thinking out loud: we could consider adding some kind of new ExternalTokenizer type that encapsulates another tokenizer exposed through a C interface. That external tokenizer, living behind the C interface, could then be packaged as a wheel and pip-installed. Lots of handwaving, but basically: add a soapy C layer around tokenizer registration and usage.
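To make the handwaving slightly more concrete, a very rough sketch of what that C boundary could look like. Everything here (struct names, the token representation, the exact callbacks) is invented for illustration; a real design would need more thought around allocation, token attributes, and versioning.

```rust
use std::os::raw::{c_char, c_void};

/// A token as it crosses the C boundary: byte offsets into the original
/// text plus a position, so no Rust-specific types are shared.
#[repr(C)]
pub struct CToken {
    pub offset_from: usize,
    pub offset_to: usize,
    pub position: usize,
}

/// The vtable an external tokenizer wheel would export.
#[repr(C)]
pub struct CTokenizer {
    /// Opaque state owned by the external package.
    pub state: *mut c_void,
    /// Tokenizes `text` (UTF-8, `text_len` bytes), writing at most
    /// `capacity` tokens into `out`; returns the number produced.
    pub tokenize: unsafe extern "C" fn(
        state: *mut c_void,
        text: *const c_char,
        text_len: usize,
        out: *mut CToken,
        capacity: usize,
    ) -> usize,
    /// Releases `state` when the wrapper on the tantivy-py side is dropped.
    pub drop: unsafe extern "C" fn(state: *mut c_void),
}
```

On the tantivy-py side, the hypothetical ExternalTokenizer would own one of these structs and implement tantivy's Tokenizer trait by calling through the function pointers.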
Must be something in the water, because this was published just yesterday. It looks at how many projects deal with the plugin problem: https://www.arroyo.dev/blog/rust-plugin-systems
OK, so the reason I haven't moved forward with this is that I want to explore different solutions that would allow a user to pip-install additional tokenizers and register them at runtime. It might be possible to create a C ABI interface for tokenizers that would allow this to work; a sketch of the loading side is below.
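None of this exists yet, but the runtime half of that idea might look something like the following, using the libloading crate and an assumed exported symbol name:

```rust
use std::os::raw::c_void;

use libloading::{Library, Symbol};

/// Loads the shared library shipped inside a pip-installed tokenizer wheel
/// and asks it for an opaque handle to its C-ABI tokenizer vtable (e.g. the
/// CTokenizer struct sketched earlier in this thread).
unsafe fn load_external_tokenizer(path: &str) -> Result<*mut c_void, libloading::Error> {
    let lib = Library::new(path)?;
    let create: Symbol<unsafe extern "C" fn() -> *mut c_void> =
        lib.get(b"tantivy_py_create_tokenizer")?;
    let handle = create();
    // Keep the library loaded so the function pointers behind `handle` stay
    // valid; a real implementation would store `lib` next to the registered
    // tokenizer rather than leaking it like this.
    std::mem::forget(lib);
    Ok(handle)
}
```

tantivy-py would then wrap that handle in something implementing tantivy's Tokenizer trait and register it under a user-chosen name, which is the runtime equivalent of what the feature-flagged Lindera code does at compile time.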