LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
TL;DR
The current PR attempts to remediate https://issues.apache.org/jira/browse/LUCENE-4056 (Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary)
Description
The current PR builds upon @johtani's past PR: https://github.com/apache/lucene-solr/pull/935 and attempts to bring the code into a mergeable state, or at least to re-ignite the conversation about building a UniDic dictionary.
Before opening the current PR, I verified @johtani's added behavior from the aforementioned PR (plus some changes of my own) by successfully building the dictionaries outlined below, and posted my findings in this comment: https://github.com/apache/lucene-solr/pull/935#issuecomment-1685887305
Philosophy of changes
I tried to pick up @johtani's added behavior as-is, while keeping my own additional changes to a minimum.
To be honest, I do not really like how `DictionaryBuilder.DictionaryFormat format` is being passed everywhere, as it creates small decision trees of the form "if IPADIC, do this; else, do that" (a sketch of this shape follows below). If this PR gets merged, I would be happy to refactor to make the code more modular (where possible), perhaps with a better separation of concerns. But I defer to the maintainers of the Lucene library on this one. 🙇🏼‍♀️
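To illustrate what I mean (the method and the column indices here are hypothetical, not the actual builder code; only the `DictionaryFormat` enum is real), the format value gets threaded into logic shaped like this:

```java
import org.apache.lucene.analysis.ja.util.DictionaryBuilder.DictionaryFormat;

public class FormatBranchingExample {
  // Hypothetical method and column indices: this mirrors the shape of the
  // per-format branching, not the real Lucene builder code.
  static String baseForm(DictionaryFormat format, String[] csvFields) {
    if (format == DictionaryFormat.IPADIC) {
      return csvFields[10]; // one fixed CSV column for IPADIC...
    }
    return csvFields[11]; // ...and a different column layout for UniDic
  }
}
```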
The commits are fairly atomic, and I hope this makes the reviewing experience easier.
Built Dictionaries
- unidic-cwj-202302_full (NINJAL)
- unidic-cwj-3.1.1-full (NINJAL) (See Caveat below 👇🏼 )
- unidic-mecab-2.1.2_src (NINJAL)
- mecab-ipadic-2.7.0-20070801
Caveat
RE: Building the unidic-cwj-3.1.1-full:
- I had to increase the length check of the baseForm from 16 to 35.
- I had to stop throwing an exception for multiple entries of LeftID, which, to be honest, I do not understand the full ramifications of. While debugging, I printed some of the values for the same left IDs instead of throwing; this is how they look (a minimal sketch of this change follows the log output below):
```
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=14172: existing=[動詞-一般,五段-バ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-バ行,連用形-融合]
Multiple entries found for leftID=1121: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=1121: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
```
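For reference, a minimal sketch of the change described above, assuming a hypothetical `posDataByLeftId` map and `record` method (the actual builder code differs):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: the class, map, and method names are
// illustrative, not the actual Lucene builder code.
public class LeftIdConflictLogger {
  private final Map<Integer, String[]> posDataByLeftId = new HashMap<>();

  void record(int leftId, String[] fullPOSData) {
    String[] existing = posDataByLeftId.get(leftId);
    if (existing == null) {
      posDataByLeftId.put(leftId, fullPOSData);
    } else if (!Arrays.equals(existing, fullPOSData)) {
      // Log the conflict instead of throwing, which produced the output above
      System.err.println("Multiple entries found for leftID=" + leftId
          + ": existing=" + Arrays.toString(existing)
          + ", new fullPOSData=" + Arrays.toString(fullPOSData));
    }
  }
}
```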
Some inflection form values differ for the same leftID; also, why do we have duplicated IDs? Thus, I would value the input of subject-matter experts here 🙇🏼‍♀️
Building a dictionary
Gradle command
My build command leverages the new Gradle setup and the DictionaryBuilder JavaDoc comment about how to do it.
I added a default run task from Gradle's application plugin in lucene/analysis/kuromoji/build.gradle, which allows building a dictionary as follows. The command should be executed from the root lucene directory, where the gradlew file is.
For example, the following is my command for building the unidic-cwj-3.1.1-full dictionary without NFKD normalization:
```
./gradlew -p lucene/analysis/kuromoji run --args='unidic "/Users/azagniotov/Downloads/unidic-cwj-3.1.1-full" "/Users/azagniotov/Downloads/unidic-cwj-3.1.1-full/build" "UTF-8" false'
```
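For reviewers who prefer to run the builder directly: with this PR's changes, the Gradle command above should be roughly equivalent to invoking the builder's entry point with the same positional arguments (the format argument is new in this PR; the wrapper class here is just for illustration):

```java
import org.apache.lucene.analysis.ja.util.DictionaryBuilder;

public class BuildUniDic {
  public static void main(String[] args) throws Exception {
    DictionaryBuilder.main(new String[] {
        "unidic", // dictionary format (this PR adds this argument)
        "/Users/azagniotov/Downloads/unidic-cwj-3.1.1-full", // input dir
        "/Users/azagniotov/Downloads/unidic-cwj-3.1.1-full/build", // output dir
        "UTF-8", // source encoding
        "false" // do not apply NFKD normalization
    });
  }
}
```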
Unit testing
Unfortunately, the current PR does not include unit tests because the built dictionary files are very large, e.g., the built unidic-cwj-202302_full connection costs .dat file is around ~700 MB, while the unidic-cwj-3.1.1-full connection costs .dat file is around ~450 MB. Thus, the following unit tests were added and run locally on my machine to verify that the built dictionaries can be used at runtime.
I did see that there is a main/gradle/generation/kuromoji.gradle that downloads dictionaries and compiles them, ~~but for now, I resorted to not adding changes there~~ (https://github.com/apache/lucene/pull/12517/commits/bc9d2e3bb38d096ad193ce41e46c644b4c5673f9).
I also think a dictionary should be decoupled (this is related to the existing discussion in https://github.com/apache/lucene/issues/9860).
The built dictionaries were tested using the following Japanese strings (no particular reason, I just picked these four strings):
"にじさんじ""ちいかわ""桃太郎電鉄""聖川真斗"
The dictionary metadata files were placed (one dictionary at a time) under lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/dict, and a few unit test cases were added.
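The assertions in the sections below ran against an analyzer along these lines (a sketch modeled on Lucene's TestJapaneseTokenizer, meant to live inside a BaseTokenStreamTestCase subclass; the exact fixture in the PR may differ):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.tests.analysis.BaseTokenStreamTestCase;

public class TestUniDicTokenization extends BaseTokenStreamTestCase {
  // discardPunctuation=true gives the "NoPunct" behavior asserted below;
  // a null user dictionary means only the bundled system dictionary is used.
  private final Analyzer analyzerNoPunct = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer tokenizer = new JapaneseTokenizer(
          newAttributeFactory(), null, true, JapaneseTokenizer.Mode.SEARCH);
      return new TokenStreamComponents(tokenizer, tokenizer);
    }
  };
}
```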
Existing default dictionary already included in Lucene
```java
assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"に", "じさ", "ん", "じ"}, new int[] {1, 1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖川", "真", "斗"}, new int[] {1, 1, 1});
```
Built unidic-cwj-202302_full
I needed to increase memory before running the tests, as the ConnectionCosts.dat is ~700 MB; the change is in main/gradle/testing/defaults-tests.gradle#L50:
```diff
- value: { -> propertyOrEnvOrDefault("tests.jvmargs", "TEST_JVM_ARGS", isCIBuild ? "" : "-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1") },
+ value: { -> propertyOrEnvOrDefault("tests.jvmargs", "TEST_JVM_ARGS", isCIBuild ? "" : "-XX:MaxDirectMemorySize=2g -XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1") },
```
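As far as I understand, -XX:MaxDirectMemorySize (rather than -Xmx) is the relevant knob here because the connection costs end up in an off-heap direct buffer. A self-contained illustration of that pattern (the path is a placeholder, and this is not the actual Lucene loading code):

```java
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class DirectBufferDemo {
  public static void main(String[] args) throws Exception {
    Path dat = Path.of("ConnectionCosts.dat"); // placeholder path
    int size = (int) Files.size(dat); // assumes the file fits in one buffer (< 2 GB)
    // Off-heap allocation: counts against -XX:MaxDirectMemorySize, not -Xmx,
    // so a ~700 MB file can exceed the default direct-memory cap.
    ByteBuffer costs = ByteBuffer.allocateDirect(size);
    try (ReadableByteChannel ch = Files.newByteChannel(dat)) {
      while (costs.hasRemaining() && ch.read(costs) >= 0) {
        // keep reading until the buffer is full or EOF
      }
    }
    costs.flip();
    System.out.println("Loaded " + costs.remaining() + " bytes off-heap");
  }
}
```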
```java
assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さん", "じ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "かわ"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃", "桃太郎", "太郎", "電鉄"}, new int[] {1, 0, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真", "斗"}, new int[] {1, 1, 1, 1});
```
Built unidic-cwj-3.1.1-full
```java
assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さ", "ん", "じ"}, new int[] {1, 1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "かわ"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真斗"}, new int[] {1, 1, 1});
```
Built unidic-mecab-2.1.2_src
```java
assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さん", "じ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃", "桃太郎", "太郎", "電鉄"}, new int[] {1, 0, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真", "斗"}, new int[] {1, 1, 1, 1});
```
Built mecab-ipadic-2.7.0-20070801
```java
assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"に", "じさ", "ん", "じ"}, new int[] {1, 1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖川", "真", "斗"}, new int[] {1, 1, 1});
```
I am far from knowing much about Kuromoji and its dictionaries, but this sounds like a great change (being able to load a new (UniDic) dictionary format), and the PR still cleanly applies. Are there any concerns with this? @rmuir had mentioned some difficult assert on the Jira issue long ago -- is that resolved / worked around?
Hi, please give me some time to review.
Please don't add the application plugin. Instead just add a plain java runner task. The result of the project is a library jar, so please don't change this as it could have effects on the resulting maven pom.
@uschindler thank you for also taking a look, in addition to others. Understood. Please allow me a few days to export a commit that would add a plain java runner task instead of the application plugin. Thank you
> Please don't add the application plugin. Instead just add a plain java runner task. The result of the project is a library jar, so please don't change this as it could have effects on the resulting maven pom.
Hi @uschindler, I ended up simply reverting the 8d52f66 commit. The current gradle/generation/kuromoji.gradle already contains def recompileDictionary(...), which is used by all the compileXXXX(...) tasks and can serve as an example. Thus, the task and the application plugin that I added were not necessary after all.
Hi @uschindler, I wanted to ping you to see if you have any thoughts on this.
I can't tell you anything about the internals touched here, so I can't review it from the language specific point of view.
The Gradle changes look fine.
One thing! Have you checked that a simple recompile of the original mecab (gradlew regenerate) produces the same files (working copy clean afterwards)?
@uschindler yes: I have rebased from the latest main branch and ran ./gradlew clean regenerate. The following is the (partial) output:
```
...
> Task :lucene:analysis:nori:compileMecabKo
Download https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz

> Task :lucene:analysis:kuromoji:compileMecab
Download https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz
Automaton regenerated from dictionary: mecab-ipadic-2.7.0-20070801

> Task :lucene:analysis:nori:compileMecabKo
Automaton regenerated from dictionary: mecab-ko-dic-2.1.1-20180720

BUILD SUCCESSFUL in 39s
127 actionable tasks: 121 executed, 6 up-to-date
```
I also confirmed that the dictionary *.dat files were regenerated under lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja, based on the files' modification dates.
@mocobeta What do you think with regards to next steps with this PR? Is there anything I can add?
@azagniotov Sorry, I've not been available for a while. Let me take a look; I will try to find time next week...
Thank you, @mocobeta 🙇🏼
Hi, sorry for my late reply. I quickly checked the built dictionary size. The latest Unidic is fairly (to me, insanely) large - its total size is 1.6G. https://clrd.ninjal.ac.jp/unidic/back_number.html#unidic_cwj
The built kuromoji jar with unidic-cwj-3.1.1-full eventually becomes 442M. Besides the size, I think we should consider performance. I'm worried that there can be a significant impact on analysis/indexing speed. Do you have any benchmark result on that?
> The built kuromoji jar with unidic-cwj-3.1.1-full eventually becomes 442M. Besides the size, I think we should consider performance. I'm worried that there can be a significant impact on analysis/indexing speed. Do you have any benchmark result on that?
luceneutil has a runAnalyzerPerf.py benchmark, and the nightly script runs it and posts the results here, including Kuromoji. We could maybe test tokens/sec throughput using that benchy for this PR running on UniDic? I'm not sure how we would conclude whether it's fast enough since current Kuromoji (main) can't build UniDic. Maybe we could defer performance testing / optimizing on UniDic to a later issue...
@mocobeta thank you. I have not done any benchmarks, thus I cannot comment on potential performance implications. One thing that is probably certain is that a larger dictionary will require more allocated memory. Btw, have you had a chance to evaluate the correctness of the tokenization?
@mikemccand this sounds interesting. It seems like this is a well-trodden path. How easy/hard would it be to point runAnalyzerPerf.py at the current PR branch?