kuromoji
kuromoji copied to clipboard
Optimization opportunity in the fst usage.
Please take this report with a big pinch of salt : I am not even a kuromoji user and I did not profile the code thoroughly.
In ViterbiBuilder, kuromoji uses an fst to search for all possible prefix of a given string that are within a dictionary (encoded as the fst). The successive call to lookup however, restart from the rootnode of the fst. It would be advisable to get all of the prefix in a single browse of the fst.
The headroom is valuable, but not massive. Around 15% of the time is spent in Fst.lookup. One can hope to cut this bit in half.
After further investigation...
The largest optimization opportunity is probably isKanjiOnly. (40% is spent in this function and it is possible to bring it down to 0%).
The useless call to substring may also be wasting CPU Time (> 5%). It is hard to measure accurately however as they have hidden impact as well. (it does copy the original string since Java 7.0)
@cmoen Might be of your interest. I do not know the current Lucene's implementation on it, but it is doable by e.g., having a separate function in FST.java that returns the indices of all prefixes which exist in the dictionary.