Alexander Nadeau
Thanks for the response. I'll polish up the deconjugator changes and make a pull request before looking at kuromoji. Making it coexist with the old one is a good idea,...
Added it to a branch. It's ready for me to start testing it. https://github.com/wareya/Spark-Reader/tree/kuromoji  Since it still has the same text splitter underlying it (trying to piece segments together...
Kuromoji is only invoked after split() has already done its job, inside splitSegment(), so manual splits still work. I'd never add anything if it meant manual splits wouldn't work. Kuromoji's...
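To illustrate the ordering described here, this is a minimal sketch of a two-stage pipeline in which the tokenizer only ever sees one manually split segment at a time. Only the names split() and splitSegment() come from this thread; the class name, the signatures, the offset-based manual split list, and the stand-in tokenizer are all assumptions made for the example, not Spark Reader's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: only the method names split() and splitSegment()
// are taken from the discussion. Everything else is illustrative.
class SegmentPipeline {
    // Stage 1: cut the text at the user's manual split offsets.
    // Offsets are assumed to be sorted ascending; out-of-range ones are ignored.
    static List<String> split(String text, List<Integer> manualSplits) {
        List<String> segments = new ArrayList<>();
        int start = 0;
        for (int pos : manualSplits) {
            if (pos > start && pos < text.length()) {
                segments.add(text.substring(start, pos));
                start = pos;
            }
        }
        segments.add(text.substring(start));
        return segments;
    }

    // Stage 2: tokenize inside a single segment. The tokenizer never sees
    // text on both sides of a manual split, so it cannot merge across one.
    static List<String> splitSegment(String segment,
                                     Function<String, List<String>> tokenizer) {
        return tokenizer.apply(segment);
    }

    // Full pipeline: manual splits first, tokenizer second.
    static List<String> parse(String text, List<Integer> manualSplits,
                              Function<String, List<String>> tokenizer) {
        List<String> words = new ArrayList<>();
        for (String segment : split(text, manualSplits)) {
            words.addAll(splitSegment(segment, tokenizer));
        }
        return words;
    }

    public static void main(String[] args) {
        // Stand-in tokenizer that returns each segment unchanged.
        // A real one would call into kuromoji here.
        Function<String, List<String>> identity = s -> Arrays.asList(s);
        System.out.println(parse("沙夜の目", Arrays.asList(2), identity));
    }
}
```

Because the tokenizer is passed in as a function, the old splitter and a kuromoji-backed one could both be plugged into the same second stage.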
The kuromoji branch is starting to feel mature. I use heuristics to coerce the word splitter to avoid making certain known bad splits, ones that aren't handled well by a...
>For the current blacklist system: if a word is blacklisted, can it currently be removed from the blacklist through the UI? If it's never matched, I'm not sure if it...
If a name is refusing to be parsed as a whole, either heuristics are enabled and there's a problematic heuristic, or heuristics are disabled and there's a bug in the...
This is happening because 沙 and 夜の目 are both valid words, and that's how kuromoji is segmenting them. This is still very unintuitive. We need a way to tell kuromoji...
Fixed the issue where manual splits wouldn't let it work. Improving the initial segmentation comes down to either blacklisting 沙 or 夜の目, or me forking kuromoji like I plan on...
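As a rough picture of why blacklisting either word changes the result, here is a toy lowest-total-cost segmenter. It is not kuromoji's algorithm (kuromoji also weighs connection costs between neighboring words), and it is not how Spark Reader's blacklist is implemented. The dictionary entries, including 沙夜 as a single word, and all of the cost values are made up for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy sketch: picks the segmentation with the lowest sum of word costs.
class ToySegmenter {
    // Made-up dictionary; lower cost means "more preferred".
    static Map<String, Integer> toyDictionary() {
        Map<String, Integer> dict = new HashMap<>();
        dict.put("沙", 5);
        dict.put("夜の目", 5);
        dict.put("沙夜", 4);   // hypothetical entry for the whole name
        dict.put("の", 3);
        dict.put("目", 4);
        return dict;
    }

    static List<String> segment(String text, Map<String, Integer> dict,
                                Set<String> blacklist) {
        int n = text.length();
        int[] best = new int[n + 1];  // best[i]: lowest cost covering text[0, i)
        int[] prev = new int[n + 1];  // prev[i]: start of the last word in that cover
        Arrays.fill(best, Integer.MAX_VALUE);
        Arrays.fill(prev, -1);
        best[0] = 0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] == Integer.MAX_VALUE) continue;
                String word = text.substring(start, end);
                Integer cost = dict.get(word);
                // Blacklisted words are treated as if they were not in the dictionary.
                if (cost == null || blacklist.contains(word)) continue;
                if (best[start] + cost < best[end]) {
                    best[end] = best[start] + cost;
                    prev[end] = start;
                }
            }
        }
        // If no sequence of known words covers the text, leave it unsplit.
        if (best[n] == Integer.MAX_VALUE) return Collections.singletonList(text);
        LinkedList<String> words = new LinkedList<>();
        for (int end = n; end > 0; end = prev[end]) {
            words.addFirst(text.substring(prev[end], end));
        }
        return words;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = toyDictionary();
        System.out.println(segment("沙夜の目", dict, Collections.<String>emptySet()));
        System.out.println(segment("沙夜の目", dict, Collections.singleton("夜の目")));
    }
}
```

With these invented costs, the unrestricted run splits the text as 沙 + 夜の目, and blacklisting either of those two words makes that path unavailable, so the segmenter falls back to 沙夜 + の + 目.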
Kuromoji has a lexeme surface form database that looks like this:
[...]
小野原,4790,4790,9770,名詞,固有名詞,人名,姓,*,*,[...]
おのぶ,4789,4789,10366,名詞,固有名詞,人名,名,*,*,[...]
お信,4789,4789,10366,名詞,固有名詞,人名,名,*,*,[...]
お延,4789,4789,10366,名詞,固有名詞,人名,名,*,*,[...]
尾登,4790,4790,9770,名詞,固有名詞,人名,姓,*,*,[...]
[...]
The third number in the value list is some kind of weight...
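For poking at entries like these, a small parser sketch may help. It assumes the mecab-ipadic column layout (surface form, left context ID, right context ID, word cost, then part-of-speech features), under which the third number would be a cost where lower values are preferred. The class and field names are hypothetical, and the naive comma split does not handle quoted fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for one dictionary row. Column meanings are assumed
// to follow the mecab-ipadic CSV layout.
class LexemeEntry {
    final String surface;
    final int leftId;
    final int rightId;
    final int cost;               // the "third number" in the value list
    final List<String> features;  // part-of-speech tags and any later columns

    LexemeEntry(String surface, int leftId, int rightId, int cost,
                List<String> features) {
        this.surface = surface;
        this.leftId = leftId;
        this.rightId = rightId;
        this.cost = cost;
        this.features = features;
    }

    // Naive parse: splits on every comma, so quoted fields are not supported.
    static LexemeEntry parse(String line) {
        String[] f = line.split(",", -1);
        if (f.length < 4) {
            throw new IllegalArgumentException("expected at least 4 columns: " + line);
        }
        List<String> features =
                new ArrayList<>(Arrays.asList(f).subList(4, f.length));
        return new LexemeEntry(f[0], Integer.parseInt(f[1]),
                Integer.parseInt(f[2]), Integer.parseInt(f[3]), features);
    }

    public static void main(String[] args) {
        // Sample row from above, with the elided trailing columns left out.
        LexemeEntry e = parse("小野原,4790,4790,9770,名詞,固有名詞,人名,姓,*,*");
        System.out.println(e.surface + " cost=" + e.cost + " pos=" + e.features);
    }
}
```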
It turns out kuromoji supports custom dictionaries; it's just not documented properly. It can load them from an InputStream while initializing the tokenizer. They have to be in the same...