kuromoji icon indicating copy to clipboard operation
kuromoji copied to clipboard

Optimization opportunity in the fst usage.

Open fulmicoton opened this issue 6 years ago • 2 comments

Please take this report with a big pinch of salt : I am not even a kuromoji user and I did not profile the code thoroughly.

In ViterbiBuilder, kuromoji uses an fst to search for all possible prefix of a given string that are within a dictionary (encoded as the fst). The successive call to lookup however, restart from the rootnode of the fst. It would be advisable to get all of the prefix in a single browse of the fst.

The headroom is valuable, but not massive. Around 15% of the time is spent in Fst.lookup. One can hope to cut this bit in half.

fulmicoton avatar Sep 01 '19 04:09 fulmicoton

After further investigation...

The largest optimization opportunity is probably isKanjiOnly. (40% is spent in this function and it is possible to bring it down to 0%).

The useless call to substring may also be wasting CPU Time (> 5%). It is hard to measure accurately however as they have hidden impact as well. (it does copy the original string since Java 7.0)

fulmicoton avatar Sep 02 '19 00:09 fulmicoton

@cmoen Might be of your interest. I do not know the current Lucene's implementation on it, but it is doable by e.g., having a separate function in FST.java that returns the indices of all prefixes which exist in the dictionary.

akkikiki avatar Oct 22 '19 09:10 akkikiki