
Segmentation wrong when token contains square brackets?

Open reckart opened this issue 8 years ago • 3 comments

Looks like the segmenter does not work properly if there are square brackets, e.g.:

[   名詞,サ変接続,*,*,*,*,*,*,*
滧 名詞,一般,*,*,*,*,*,*,*
の 助詞,連体化,*,*,*,*,の,ノ,ノ
]。    名詞,サ変接続,*,*,*,*,*,*,*

or

「 記号,括弧開,*,*,*,*,「,「,「
国宝  名詞,一般,*,*,*,*,国宝,コクホウ,コクホー
五 名詞,数,*,*,*,*,五,ゴ,ゴ
城 名詞,一般,*,*,*,*,城,シロ,シロ
」[    名詞,サ変接続,*,*,*,*,*,*,*

reckart avatar Jul 31 '16 18:07 reckart

I agree that it might be more useful to split ]。 into ] and 。, but this is actually how the dictionary assets we are using have been designed. Perhaps it would make sense to change some of this. I have some ideas I'd like to try out...

cmoen avatar Aug 05 '16 06:08 cmoen

Just jumping in to say that this output highlights a problem with Mecab-ipadic: symbols such as the [ and ] here are treated as 名詞・サ変接続.

See the fix for this problem here: https://github.com/taku910/mecab/pull/37
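
Until the dictionary itself changes, the two symptoms in this thread (symbol runs such as ]。 kept as one token, and symbols labeled 名詞・サ変接続) could be patched up after tokenization. The following is only a minimal sketch of that idea, not the change in the linked pull request and not anything Kuromoji does itself. It assumes tokens are already available as (surface, features) pairs; the helper name `split_symbol_tokens` and the replacement label `記号,一般` are hypothetical choices made for illustration.

```python
import unicodedata

# Assumption: a generic symbol label used for relabeled tokens (illustrative only).
SYMBOL_FEATURES = "記号,一般,*,*,*,*,*,*,*"
# The label that the output in this thread shows for bracket tokens.
MISLABELED_PREFIX = "名詞,サ変接続"


def is_punctuation(ch):
    """True if the character is in a Unicode punctuation category (P*)."""
    return unicodedata.category(ch).startswith("P")


def split_symbol_tokens(tokens):
    """Split all-punctuation tokens labeled 名詞,サ変接続 into single characters.

    `tokens` is a list of (surface, features) pairs. All other tokens pass
    through unchanged, including symbols that already carry a 記号 label.
    """
    result = []
    for surface, features in tokens:
        mislabeled = features.startswith(MISLABELED_PREFIX)
        if surface and mislabeled and all(is_punctuation(ch) for ch in surface):
            # One token per symbol character, each with the symbol label.
            result.extend((ch, SYMBOL_FEATURES) for ch in surface)
        else:
            result.append((surface, features))
    return result


# The first example from the issue, written out as (surface, features) pairs.
tokens = [
    ("[", "名詞,サ変接続,*,*,*,*,*,*,*"),
    ("滧", "名詞,一般,*,*,*,*,*,*,*"),
    ("の", "助詞,連体化,*,*,*,*,の,ノ,ノ"),
    ("]。", "名詞,サ変接続,*,*,*,*,*,*,*"),
]

fixed = split_symbol_tokens(tokens)
for surface, features in fixed:
    print(surface, features, sep="\t")
```

With this input, ]。 comes out as two separate tokens and both brackets carry the symbol label, while 滧 and の are untouched. A post-processing step like this does not change the segmentation lattice or its costs, so it is a workaround rather than a dictionary fix.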

mharn avatar Mar 22 '18 01:03 mharn

Thanks. We could also do this using a user-defined unk definition...

cmoen avatar Mar 22 '18 03:03 cmoen