open_jtalk

fix setting pause symbol for non-kana symbol

Open sophiefy opened this issue 1 year ago • 0 comments

Maybe this is more of a problem with the dictionary...

njd_set_pronunciation sets read, pron and other features for symbols whose mora size is 0. Specifically, non-kana symbols are set as 読点 (comma, i.e. a pause).

In the following example, ~ is incorrectly parsed as 名詞 (noun) by MeCab with naist-jdic, whereas it should be 助詞 (particle).

1933年~1937年
1933	名詞,数,*,*,*,*,*
年	名詞,接尾,助数詞,*,*,*,年,ネン,ネン,1/2,C3
~	名詞,サ変接続,*,*,*,*,*
1937	名詞,数,*,*,*,*,*
年	名詞,接尾,助数詞,*,*,*,年,ネン,ネン,1/2,C3

Since its mora size is 0, its read and pron are set to 、 and its pos is set to 記号 (symbol). Consequently, its features become the following, which is inconsistent: pos is 記号 but pos_group is still サ変接続.

~,記号,サ変接続,*,*,*,*,~,、,、,0,0,*,0

So I think pos_group, ctype and cform should also be modified, so that its features become:

~,記号,読点,*,*,*,*,~,、,、,0,0,*,0
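
The proposed change can be sketched in Python as a post-processing step on a feature string. This is only an illustration, not Open JTalk code: the helper name normalize_pause_symbol and the FIELDS list are hypothetical, and the field order is an assumption inferred from the 14-column feature strings shown above.

```python
# Assumed field order for the 14-column feature strings in this issue.
# This list is an assumption for illustration, not taken from Open JTalk.
FIELDS = [
    "string", "pos", "pos_group1", "pos_group2", "pos_group3",
    "ctype", "cform", "orig", "read", "pron",
    "acc", "mora_size", "chain_rule", "chain_flag",
]

PAUSE = "、"


def normalize_pause_symbol(feature: str) -> str:
    """Hypothetical helper: make a symbol that was turned into a pause
    carry consistent sub-features (pos_group1 = 読点, the rest reset to *).

    Note: a plain split on "," cannot handle a surface form that is
    itself a comma; that case is out of scope for this sketch.
    """
    values = feature.split(",")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    node = dict(zip(FIELDS, values))

    # A node counts as a pause if it is a 記号 with zero moras
    # whose pronunciation was set to the pause symbol.
    is_pause = (
        node["pos"] == "記号"
        and node["pron"] == PAUSE
        and node["mora_size"] == "0"
    )
    if is_pause:
        node["pos_group1"] = "読点"
        for name in ("pos_group2", "pos_group3", "ctype", "cform"):
            node[name] = "*"

    return ",".join(node[name] for name in FIELDS)


before = "~,記号,サ変接続,*,*,*,*,~,、,、,0,0,*,0"
print(normalize_pause_symbol(before))
```

Nodes that are not pauses pass through unchanged, so the sketch only touches the case described in this issue.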

sophiefy, Sep 18 '23 11:09