kuromoji
Wrong segmentation when a token contains square brackets?
It looks like the segmenter does not work properly when the input contains square brackets, e.g.:
[ 名詞,サ変接続,*,*,*,*,*,*,*
滧 名詞,一般,*,*,*,*,*,*,*
の 助詞,連体化,*,*,*,*,の,ノ,ノ
]。 名詞,サ変接続,*,*,*,*,*,*,*
or
「 記号,括弧開,*,*,*,*,「,「,「
国宝 名詞,一般,*,*,*,*,国宝,コクホウ,コクホー
五 名詞,数,*,*,*,*,五,ゴ,ゴ
城 名詞,一般,*,*,*,*,城,シロ,シロ
」[ 名詞,サ変接続,*,*,*,*,*,*,*
I agree that it might be more useful to split `]。` into `]` and `。`. However, this is how the dictionary assets we are using have been designed. Perhaps it makes sense to change some of this. I have some ideas I'd like to try out...
Just jumping in to say that this output highlights a problem with mecab-ipadic: symbols such as the `[` and `]` here are treated as 名詞・サ変接続.
See the fix for this problem here: https://github.com/taku910/mecab/pull/37
Thanks. We could also do this using a user-defined unk definition...
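Until the dictionary or the unknown-word definition changes, one possible stopgap is to split symbol-only tokens after tokenization. The following is a minimal sketch, not part of kuromoji: it assumes the tokens are available as `(surface, features)` pairs, the helper names `is_symbol` and `split_symbol_tokens` are hypothetical, and "symbol" is taken to mean any character in a Unicode punctuation or symbol category. It only fixes the segmentation; each piece keeps the feature string of the token it came from, so the 名詞,サ変接続 tag is not corrected.

```python
import unicodedata


def is_symbol(ch):
    # Unicode general categories starting with "P" (punctuation)
    # or "S" (symbol) are treated as symbols. This is an assumption
    # made for illustration, not the dictionary's own definition.
    return unicodedata.category(ch)[0] in ("P", "S")


def split_symbol_tokens(tokens):
    """Split each multi-character, symbol-only token into one token per character.

    `tokens` is a list of (surface, features) pairs. Split pieces reuse
    the original feature string unchanged.
    """
    result = []
    for surface, features in tokens:
        if len(surface) > 1 and all(is_symbol(ch) for ch in surface):
            result.extend((ch, features) for ch in surface)
        else:
            result.append((surface, features))
    return result


# Tokens copied from the first example in this thread.
tokens = [
    ("[", "名詞,サ変接続,*,*,*,*,*,*,*"),
    ("滧", "名詞,一般,*,*,*,*,*,*,*"),
    ("の", "助詞,連体化,*,*,*,*,の,ノ,ノ"),
    ("]。", "名詞,サ変接続,*,*,*,*,*,*,*"),
]

for surface, features in split_symbol_tokens(tokens):
    print(surface, features)
```

With this input, `]。` comes back as two tokens, `]` and `。`, while the other tokens pass through untouched. Handling this in the dictionary or the unknown-word definition is still the cleaner fix, since it would also correct the part-of-speech tag.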