
Adds support for the use of external dictionaries for segmenters backed by lindera

Open PedroTurik opened this issue 10 months ago • 5 comments

Pull Request

Related issue

Fixes #322

What does this PR do?

  • This PR adds the features korean-segmentation-external and japanese-segmentation-external, which let the user decouple the download of the Japanese and Korean dictionaries from the compilation process and point to already downloaded, lindera-compatible dictionaries through the MEILISEARCH_JAPANESE_EXTERNAL_DICTIONARY and MEILISEARCH_KOREAN_EXTERNAL_DICTIONARY environment variables.
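
As a rough illustration of the environment-variable configuration described above, the path lookup could be sketched as follows. This is not the code from the PR: the helper names `validate_dictionary_path` and `dictionary_path_from_env` and the `DictPathError` type are hypothetical, and the check that the path is a directory assumes that a built lindera dictionary is a directory on disk. Only the two variable names come from the PR description.

```rust
use std::env;
use std::fmt;
use std::path::{Path, PathBuf};

/// Variable names as given in the PR description.
pub const JAPANESE_DICT_VAR: &str = "MEILISEARCH_JAPANESE_EXTERNAL_DICTIONARY";
pub const KOREAN_DICT_VAR: &str = "MEILISEARCH_KOREAN_EXTERNAL_DICTIONARY";

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum DictPathError {
    /// The variable is missing, empty, or only whitespace.
    Unset,
    /// The variable is set but does not point to an existing directory.
    NotADirectory(PathBuf),
}

impl fmt::Display for DictPathError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DictPathError::Unset => write!(f, "dictionary path is not set"),
            DictPathError::NotADirectory(p) => {
                write!(f, "{} is not a directory", p.display())
            }
        }
    }
}

/// Hypothetical helper: validates a raw value taken from the environment.
/// Assumption: an external dictionary is a directory, so anything else is rejected.
pub fn validate_dictionary_path(raw: Option<&str>) -> Result<PathBuf, DictPathError> {
    let raw = raw
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .ok_or(DictPathError::Unset)?;
    let path = Path::new(raw);
    if path.is_dir() {
        Ok(path.to_path_buf())
    } else {
        Err(DictPathError::NotADirectory(path.to_path_buf()))
    }
}

/// Hypothetical helper: reads the named variable from the process environment
/// and validates it. Non-UTF-8 values are treated as unset in this sketch.
pub fn dictionary_path_from_env(var: &str) -> Result<PathBuf, DictPathError> {
    validate_dictionary_path(env::var(var).ok().as_deref())
}
```

Calling `dictionary_path_from_env(JAPANESE_DICT_VAR)` would then yield either a usable path or an error that says why the external dictionary cannot be used. Keeping the validation separate from the environment lookup makes it testable without touching the process environment.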

This PR is not finished. Since we can't control which dictionary the user will use, we can't be sure about the segmentation output, so activating these features disables the segmentation tests for now. Another thing worth mentioning is that for lindera to use an external dictionary, you need to generate it through the lindera CLI. The process isn't exactly obvious and needs to be documented. It's described here

PedroTurik avatar Jan 24 '25 17:01 PedroTurik

Thanks for working on this!

If writing documentation is currently holding this back, I'd help out with that.

slatian avatar Feb 04 '25 22:02 slatian

Hey @PedroTurik, let me know when the work is done and the CI passes. I'll review your PR!

ManyTheFish avatar Feb 05 '25 11:02 ManyTheFish

Hey @slatian

Yes, you are correct. Where do you think I should document the process to use this feature?

Also, thanks @ManyTheFish. I will fix the CI error and open the PR for review, as you said.

PedroTurik avatar Feb 06 '25 19:02 PedroTurik

Where do you think I should document the process to use this feature?

As a user of the crate, I'd expect this to be documented alongside the rest of the documentation on docs.rs, so I'd document it in a module.

Depending on the length, either a documentation module or a section beneath the "Build features" section in the main module.

But @ManyTheFish will probably have an opinion on that too.

slatian avatar Feb 06 '25 20:02 slatian

Hi @PedroTurik, could you tell me what you're looking for with this feature? Do you want to use Charabia as a library, or do you want to use Meilisearch with this?

If you want to use it with Meilisearch, I suggest activating the feature at runtime, because otherwise it will not be easy to integrate.
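
One possible reading of "activating the feature at runtime" is that the choice between the built-in and an external dictionary is made when the program starts, rather than through a Cargo feature at build time. A minimal sketch under that assumption is shown here. The `DictionarySource` enum and both functions are hypothetical and are not part of Charabia or Meilisearch. Only the variable name comes from the PR description.

```rust
use std::env;
use std::path::PathBuf;

/// Hypothetical: where the segmenter's dictionary comes from.
#[derive(Debug, PartialEq)]
pub enum DictionarySource {
    /// Use whatever dictionary was bundled at build time.
    Embedded,
    /// Load a user-provided dictionary from this path when the program runs.
    External(PathBuf),
}

/// Decides the source from an optional runtime value.
/// A missing, empty, or whitespace-only value falls back to the embedded dictionary.
pub fn choose_source(runtime_value: Option<String>) -> DictionarySource {
    match runtime_value {
        Some(v) if !v.trim().is_empty() => {
            DictionarySource::External(PathBuf::from(v.trim()))
        }
        _ => DictionarySource::Embedded,
    }
}

/// Reads the variable named in the PR from the process environment at runtime.
pub fn japanese_dictionary_source() -> DictionarySource {
    choose_source(env::var("MEILISEARCH_JAPANESE_EXTERNAL_DICTIONARY").ok())
}
```

With this shape, a single binary can serve both cases: it uses the external dictionary when the variable is set and behaves as before when it is not.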

ManyTheFish avatar Feb 10 '25 12:02 ManyTheFish