open_jtalk
fix setting pause symbol for non-kana symbol
Maybe this is more of a problem with the dictionary...
`njd_set_pronunciation` sets `read`, `pron` and other features for symbols with a mora size of 0. Specifically, non-kana symbols are set as `読点` (the Japanese comma, used as a pause symbol).
In the following example, `~` is incorrectly parsed as `名詞` (noun) by MeCab with naist-jdic, whereas it should be `助詞` (particle).
1933年~1937年
1933 名詞,数,*,*,*,*,*
年 名詞,接尾,助数詞,*,*,*,年,ネン,ネン,1/2,C3
~ 名詞,サ変接続,*,*,*,*,*
1937 名詞,数,*,*,*,*,*
年 名詞,接尾,助数詞,*,*,*,年,ネン,ネン,1/2,C3
Since its mora size is 0, its `read` and `pron` are set to `、` and its `pos` is set to `記号` (symbol). Consequently, its features become the following, which is inconsistent: `pos` is `記号`, but the POS group is still `サ変接続`.
~,記号,サ変接続,*,*,*,*,~,、,、,0,0,*,0
So I think `pos_group`, `ctype` and `cform` should also be modified, so that its features become:
~,記号,読点,*,*,*,*,~,、,、,0,0,*,0
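The proposed change can be illustrated with a small Python sketch. It is not the actual C implementation in `njd_set_pronunciation`: the field names in `FIELDS` and the helper `fix_pause_symbol` are hypothetical, and the field order is only assumed from the 14-column feature lines quoted above. The sketch takes the feature line currently produced and applies the extra reset of the POS group, `ctype` and `cform`.

```python
# Assumed field order, inferred from the 14-column feature lines above.
FIELDS = [
    "string", "pos", "pos_group1", "pos_group2", "pos_group3",
    "ctype", "cform", "orig", "read", "pron",
    "acc", "mora_size", "chain_rule", "chain_flag",
]
TOUTEN = "、"  # pause symbol assigned to zero-mora, non-kana symbols


def parse(line):
    """Split a comma-separated feature line into a field dict."""
    return dict(zip(FIELDS, line.split(",")))


def fmt(node):
    """Join a field dict back into a comma-separated feature line."""
    return ",".join(node[name] for name in FIELDS)


def fix_pause_symbol(node):
    """Hypothetical helper: make a pause node's remaining POS fields consistent."""
    out = dict(node)
    # Only touch nodes that were already rewritten as a pause symbol.
    if out["pos"] == "記号" and out["read"] == TOUTEN and out["pron"] == TOUTEN:
        out.update(
            pos_group1="読点",
            pos_group2="*",
            pos_group3="*",
            ctype="*",
            cform="*",
        )
    return out


current = "~,記号,サ変接続,*,*,*,*,~,、,、,0,0,*,0"
print(fmt(fix_pause_symbol(parse(current))))
```

Nodes that were not rewritten as a pause symbol pass through unchanged, so the extra step only affects the zero-mora symbols discussed here.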