LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
TL;DR
The current PR attempts to remediate https://issues.apache.org/jira/browse/LUCENE-4056 (Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary)
Description
The current PR builds upon @johtani's past PR: https://github.com/apache/lucene-solr/pull/935 and attempts to bring the code into a mergeable state, or at least to re-ignite the conversation about building a UniDic dictionary.
Before opening the current PR, I verified @johtani's added behavior from the aforementioned PR (plus some changes of my own) by successfully building the dictionaries outlined below, and posted my findings in this comment: https://github.com/apache/lucene-solr/pull/935#issuecomment-1685887305
Philosophy of changes
I tried to pick up @johtani's added behavior as-is, while keeping my own additional changes to a minimum.
To be honest, I do not really like how `DictionaryBuilder.DictionaryFormat format` is being passed everywhere, as it creates small decision trees of the form "if IPADIC, do this; else, do that" (a sketch of this shape follows below). If this PR gets merged, I would be happy to refactor to make the code more modular (where possible), perhaps with a better separation of concerns. But I defer to the maintainers of the Lucene library on this one. 🙇🏼‍♀️
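To illustrate what I mean (the method and the column indices here are hypothetical, not the actual builder code; only the `DictionaryFormat` enum is real), the format value gets threaded into logic shaped like this:

```java
import org.apache.lucene.analysis.ja.util.DictionaryBuilder.DictionaryFormat;

public class FormatBranchingExample {
  // Hypothetical method and column indices: this mirrors the shape of the
  // per-format branching, not the real Lucene builder code.
  static String baseForm(DictionaryFormat format, String[] csvFields) {
    if (format == DictionaryFormat.IPADIC) {
      return csvFields[10]; // one fixed CSV column for IPADIC...
    }
    return csvFields[11]; // ...and a different column layout for UniDic
  }
}
```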
The commits are fairly atomic, and I hope this makes the reviewing experience easier.
Built Dictionaries
- unidic-cwj-202302_full (NINJAL)
- unidic-cwj-3.1.1-full (NINJAL) (See Caveat below 👇🏼 )
- unidic-mecab-2.1.2_src (NINJAL)
- mecab-ipadic-2.7.0-20070801
Caveat
RE: Building the unidic-cwj-3.1.1-full:
- I had to increase the length check of the baseForm from 16 to 35.
- I had to stop throwing an exception for multiple entries of LeftID, which, to be honest, I do not understand the full ramifications of. While debugging, I printed some of the values for the same left IDs instead of throwing; this is how they look (a minimal sketch of this change follows the log output below):
```
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=14172: existing=[動詞-一般,五段-バ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-バ行,連用形-融合]
Multiple entries found for leftID=1121: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=1121: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
```
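For reference, a minimal sketch of the change described above, assuming a hypothetical `posDataByLeftId` map and `record` method (the actual builder code differs):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: the class, map, and method names are
// illustrative, not the actual Lucene builder code.
public class LeftIdConflictLogger {
  private final Map<Integer, String[]> posDataByLeftId = new HashMap<>();

  void record(int leftId, String[] fullPOSData) {
    String[] existing = posDataByLeftId.get(leftId);
    if (existing == null) {
      posDataByLeftId.put(leftId, fullPOSData);
    } else if (!Arrays.equals(existing, fullPOSData)) {
      // Log the conflict instead of throwing, which produced the output above
      System.err.println("Multiple entries found for leftID=" + leftId
          + ": existing=" + Arrays.toString(existing)
          + ", new fullPOSData=" + Arrays.toString(fullPOSData));
    }
  }
}
```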
Some inflection form values differ for the same leftID; also, why do we have duplicated IDs? Thus, I would value the input of subject-matter experts here 🙇🏼‍♀️
Building a dictionary
Gradle command
My build command leverages the new Gradle setup and the DictionaryBuilder JavaDoc comment about how to do it.
I added a default run task from Gradle's application plugin in lucene/analysis/kuromoji/build.gradle, which allows building a dictionary as follows. The command should be executed from the root lucene directory, where the gradlew file is.
For example, the following is my command for building the unidic-cwj-3.1.1-full dictionary without NFKD normalization:
```
./gradlew -p lucene/analysis/kuromoji run --args='unidic "/Users/azagniotov/Downloads/unidic-cwj-3.1.1-full" "/Users/azagniotov/Downloads/unidic-cwj-3.1.1-full/build" "UTF-8" false'
```
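For reviewers who prefer to run the builder directly: with this PR's changes, the Gradle command above should be roughly equivalent to invoking the builder's entry point with the same positional arguments (the format argument is new in this PR; the wrapper class here is just for illustration):

```java
import org.apache.lucene.analysis.ja.util.DictionaryBuilder;

public class BuildUniDic {
  public static void main(String[] args) throws Exception {
    DictionaryBuilder.main(new String[] {
        "unidic", // dictionary format (this PR adds this argument)
        "/Users/azagniotov/Downloads/unidic-cwj-3.1.1-full", // input dir
        "/Users/azagniotov/Downloads/unidic-cwj-3.1.1-full/build", // output dir
        "UTF-8", // source encoding
        "false" // do not apply NFKD normalization
    });
  }
}
```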
Unit testing
Unfortunately, the current PR does not include unit tests because the built dictionary files are very large, e.g., the built unidic-cwj-202302_full connection costs .dat file is around ~700 MB, while the unidic-cwj-3.1.1-full connection costs .dat file is around ~450 MB. Thus, the following unit tests were added and run locally on my machine to verify that the built dictionaries can be used at runtime.
I did see that there is a main/gradle/generation/kuromoji.gradle that downloads dictionaries and compiles them, ~~but for now, I resorted to not adding changes there~~ (https://github.com/apache/lucene/pull/12517/commits/bc9d2e3bb38d096ad193ce41e46c644b4c5673f9).
I also think a dictionary should be decoupled (this is related to the existing discussion in https://github.com/apache/lucene/issues/9860).
The built dictionaries were tested using the following Japanese strings (no particular reason, I just picked these four strings):
"にじさんじ""ちいかわ""桃太郎電鉄""聖川真斗"
The dictionary metadata files were placed (one dictionary at a time) under lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/dict, and a few unit test cases were added.
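The assertions in the sections below ran against an analyzer along these lines (a sketch modeled on Lucene's TestJapaneseTokenizer, meant to live inside a BaseTokenStreamTestCase subclass; the exact fixture in the PR may differ):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.tests.analysis.BaseTokenStreamTestCase;

public class TestUniDicTokenization extends BaseTokenStreamTestCase {
  // discardPunctuation=true gives the "NoPunct" behavior asserted below;
  // a null user dictionary means only the bundled system dictionary is used.
  private final Analyzer analyzerNoPunct = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer tokenizer = new JapaneseTokenizer(
          newAttributeFactory(), null, true, JapaneseTokenizer.Mode.SEARCH);
      return new TokenStreamComponents(tokenizer, tokenizer);
    }
  };
}
```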
Existing default dictionary already included in Lucene
```java
assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"に", "じさ", "ん", "じ"}, new int[] {1, 1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖川", "真", "斗"}, new int[] {1, 1, 1});
```
Built unidic-cwj-202302_full
I needed to increase memory before running the tests, as the ConnectionCosts.dat is ~700 MB; the change is in main/gradle/testing/defaults-tests.gradle#L50:
```diff
- value: { -> propertyOrEnvOrDefault("tests.jvmargs", "TEST_JVM_ARGS", isCIBuild ? "" : "-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1") },
+ value: { -> propertyOrEnvOrDefault("tests.jvmargs", "TEST_JVM_ARGS", isCIBuild ? "" : "-XX:MaxDirectMemorySize=2g -XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1") },
```
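As far as I understand, -XX:MaxDirectMemorySize (rather than -Xmx) is the relevant knob here because the connection costs end up in an off-heap direct buffer. A self-contained illustration of that pattern (the path is a placeholder, and this is not the actual Lucene loading code):

```java
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class DirectBufferDemo {
  public static void main(String[] args) throws Exception {
    Path dat = Path.of("ConnectionCosts.dat"); // placeholder path
    int size = (int) Files.size(dat); // assumes the file fits in one buffer (< 2 GB)
    // Off-heap allocation: counts against -XX:MaxDirectMemorySize, not -Xmx,
    // so a ~700 MB file can exceed the default direct-memory cap.
    ByteBuffer costs = ByteBuffer.allocateDirect(size);
    try (ReadableByteChannel ch = Files.newByteChannel(dat)) {
      while (costs.hasRemaining() && ch.read(costs) >= 0) {
        // keep reading until the buffer is full or EOF
      }
    }
    costs.flip();
    System.out.println("Loaded " + costs.remaining() + " bytes off-heap");
  }
}
```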
```java
assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さん", "じ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "かわ"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃", "桃太郎", "太郎", "電鉄"}, new int[] {1, 0, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真", "斗"}, new int[] {1, 1, 1, 1});
```
Built unidic-cwj-3.1.1-full
```java
assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さ", "ん", "じ"}, new int[] {1, 1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "かわ"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真斗"}, new int[] {1, 1, 1});
```
Built unidic-mecab-2.1.2_src
```java
assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さん", "じ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃", "桃太郎", "太郎", "電鉄"}, new int[] {1, 0, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真", "斗"}, new int[] {1, 1, 1, 1});
```
Built mecab-ipadic-2.7.0-20070801
```java
assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"に", "じさ", "ん", "じ"}, new int[] {1, 1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖川", "真", "斗"}, new int[] {1, 1, 1});
```
I am far from knowing much about Kuromoji and its dictionaries, but this sounds like a great change (being able to load a new (UniDic) dictionary format), and the PR still cleanly applies. Are there any concerns with this? @rmuir had mentioned some difficult assert on the Jira issue long ago -- is that resolved / worked around?
Hi, please give me some time to review.
Please don't add the application plugin. Instead just add a plain java runner task. The result of the project is a library jar, so please don't change this as it could have effects on the resulting maven pom.
@uschindler thank you for also taking a look, in addition to others. Understood. Please allow me a few days to export a commit that would add a plain java runner task instead of the application plugin. Thank you
> Please don't add the application plugin. Instead just add a plain java runner task. The result of the project is a library jar, so please don't change this as it could have effects on the resulting maven pom.
Hi @uschindler, I ended up simply reverting the 8d52f66 commit. The current gradle/generation/kuromoji.gradle already contains def recompileDictionary(...), which is used by all the compileXXXX(...) tasks and can serve as an example. Thus, the task and the application plugin that I added were not necessary after all.
Hi @uschindler, I wanted to ping you to see if you have any thoughts on this.
I can't tell you anything about the internals touched here, so I can't review it from the language specific point of view.
The Gradle changes look fine.
One thing! Have you checked that a simple recompile of the original mecab (gradlew regenerate) produces the same files (working copy clean afterwards)?
@uschindler yes: I have rebased from the latest main branch and ran ./gradlew clean regenerate. The following is the (partial) output:
```
...
> Task :lucene:analysis:nori:compileMecabKo
Download https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz

> Task :lucene:analysis:kuromoji:compileMecab
Download https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz
Automaton regenerated from dictionary: mecab-ipadic-2.7.0-20070801

> Task :lucene:analysis:nori:compileMecabKo
Automaton regenerated from dictionary: mecab-ko-dic-2.1.1-20180720

BUILD SUCCESSFUL in 39s
127 actionable tasks: 121 executed, 6 up-to-date
```
I also confirmed that the dictionary *.dat files were regenerated under lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja, based on the files' modification dates.
@mocobeta What do you think with regards to next steps with this PR? Is there anything I can add?
@azagniotov Sorry, I've not been available for a while. Let me take a look; I will try to find time next week...
Thank you, @mocobeta 🙇🏼
Hi, sorry for my late reply. I quickly checked the built dictionary size. The latest Unidic is fairly (to me, insanely) large - its total size is 1.6G. https://clrd.ninjal.ac.jp/unidic/back_number.html#unidic_cwj
The built kuromoji jar with unidic-cwj-3.1.1-full eventually becomes 442M. Besides the size, I think we should consider performance. I'm worried that there can be a significant impact on analysis/indexing speed. Do you have any benchmark result on that?
> The built kuromoji jar with unidic-cwj-3.1.1-full eventually becomes 442M. Besides the size, I think we should consider performance. I'm worried that there can be a significant impact on analysis/indexing speed. Do you have any benchmark result on that?
luceneutil has a runAnalyzerPerf.py benchmark, and the nightly script runs it and posts the results here, including Kuromoji. We could maybe test tokens/sec throughput using that benchy for this PR running on UniDic? I'm not sure how we would conclude whether it's fast enough since current Kuromoji (main) can't build UniDic. Maybe we could defer performance testing / optimizing on UniDic to a later issue...
@mocobeta thank you. I have not done any benchmarks, thus I cannot comment on potential performance implications. One thing that is probably certain is that a larger dictionary will require more allocated memory. Btw, have you had a chance to evaluate the correctness of the tokenization?
@mikemccand this sounds interesting. It seems like this is a well-trodden path. How easy/hard would it be to point runAnalyzerPerf.py at the current PR branch?