charabia
Adds support for the use of external dictionaries for segmenters backed by lindera
Pull Request
Related issue
Fixes #322
What does this PR do?
- This PR adds the features korean-segmentation-external and japanese-segmentation-external. They let the user decouple the download of the Japanese and Korean dictionaries from the compilation process and point to already downloaded, lindera-compatible dictionaries through the MEILISEARCH_JAPANESE_EXTERNAL_DICTIONARY and MEILISEARCH_KOREAN_EXTERNAL_DICTIONARY env vars.
This PR is not finished. Since we can't control which dictionary the user will use, we can't be sure about the segmentation output, so activating these features disables the segmentation tests for now. Another thing worth mentioning: for lindera to use an external dictionary, you need to generate it through the lindera CLI. The process isn't exactly obvious and needs to be documented. It's described here
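To make the env var configuration concrete, here is a minimal sketch of how a dictionary path could be resolved and validated. It is not the code from this PR: the helper names (validate_dictionary_path, external_dictionary_path) and the DictPathError type are hypothetical, the sketch assumes the generated dictionary is a directory on disk, and it reads the variable at run time, which may differ from what the PR actually does.

```rust
use std::env;
use std::ffi::OsString;
use std::path::PathBuf;

// Env var names as given in the PR description.
pub const JAPANESE_DICT_VAR: &str = "MEILISEARCH_JAPANESE_EXTERNAL_DICTIONARY";
pub const KOREAN_DICT_VAR: &str = "MEILISEARCH_KOREAN_EXTERNAL_DICTIONARY";

/// Hypothetical error type for this sketch (not part of charabia or lindera).
#[derive(Debug, PartialEq)]
pub enum DictPathError {
    /// The variable is unset or empty.
    Unset(String),
    /// The variable is set, but does not point to an existing directory.
    NotADirectory(PathBuf),
}

/// Hypothetical helper: validates a raw value taken from `var`.
/// Assumption: an external dictionary is a directory on disk.
pub fn validate_dictionary_path(
    var: &str,
    value: Option<OsString>,
) -> Result<PathBuf, DictPathError> {
    // Treat an empty value the same as an unset variable.
    let raw = value
        .filter(|v| !v.is_empty())
        .ok_or_else(|| DictPathError::Unset(var.to_string()))?;
    let path = PathBuf::from(raw);
    if path.is_dir() {
        Ok(path)
    } else {
        Err(DictPathError::NotADirectory(path))
    }
}

/// Hypothetical helper: reads `var` from the process environment
/// and validates it with `validate_dictionary_path`.
pub fn external_dictionary_path(var: &str) -> Result<PathBuf, DictPathError> {
    validate_dictionary_path(var, env::var_os(var))
}
```

Calling external_dictionary_path(JAPANESE_DICT_VAR) returns the configured directory or an error naming what is wrong. Failing early with the variable name in the error keeps a misconfigured path from surfacing later as an opaque segmentation failure.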
Thanks for working on this!
If writing documentation is currently holding this back, I'd help out with that.
Hey @PedroTurik, Let me know when the work is done and when the CI passes. I'll review your PR!
Hey @slatian
Yes, you are correct. Where do you think I should document the process to use this feature?
Also, thanks @ManyTheFish. I will fix the CI error and open the PR for review, as you said.
Where do you think I should document the process to use this feature?
As a user of the crate, I'd expect to find this alongside the rest of the documentation on docs.rs, so I'd document it in a module.
Depending on the length, either a documentation module or a section beneath the "Build features" section in the main module.
But @ManyTheFish will probably have an opinion on that too.
Hi @PedroTurik, could you tell me what you're looking for with this feature? Do you want to use Charabia as a library, or do you want to use it through Meilisearch?
If you want to use Meilisearch, I suggest making the feature activatable at runtime, because otherwise it will not be easy to integrate.