
Clarify the status of Kuromoji dictionaries

Open dweiss opened this issue 2 months ago • 1 comments

Description

While refactoring the Gradle code and data generation tasks, I stumbled across the fact that we currently have two different tasks that generate the same set of output files: compileMecab and compileNaist. They use different inputs but write to the same output files.
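To illustrate the kind of overlap described here, below is a minimal sketch of a check that flags output files claimed by more than one task. This is not Lucene's actual build code: the file paths, the `compileOther` task, and the `find_overlapping_outputs` helper are hypothetical placeholders; only the names compileMecab and compileNaist come from the build.

```python
from collections import defaultdict


def find_overlapping_outputs(task_outputs):
    """Return {output_path: sorted task names} for paths declared by 2+ tasks."""
    producers = defaultdict(set)
    for task, outputs in task_outputs.items():
        for path in outputs:
            producers[path].add(task)
    return {
        path: sorted(tasks)
        for path, tasks in producers.items()
        if len(tasks) > 1
    }


# Hypothetical declarations: the paths below are placeholders, not the
# real Kuromoji dictionary file names.
task_outputs = {
    "compileMecab": ["dict/system.dat", "dict/costs.dat"],
    "compileNaist": ["dict/system.dat", "dict/costs.dat"],
    "compileOther": ["other/output.dat"],
}

overlaps = find_overlapping_outputs(task_outputs)
for path, tasks in sorted(overlaps.items()):
    print(f"{path}: written by {', '.join(tasks)}")
```

With two tasks writing the same files, whichever one runs last determines the contents, which is why the overlap matters for a build that tracks task outputs.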

There is also this patch, which seems to be stalled or abandoned: https://github.com/apache/lucene/pull/12517/files

I don't have any experience with Kuromoji... is there any reason to keep both inputs? Should it be configurable at runtime somehow?

At the moment, to get the NAIST dictionary, you need to generate it by hand and recompile Lucene.

dweiss avatar Oct 26 '25 19:10 dweiss

I think your description is correct. There are size implications with some of these dictionaries as well; they can be enormous.

rmuir avatar Oct 26 '25 21:10 rmuir