sudachi.rs

"İ" does not behave the same as the Java version of Sudachi.

Open katsutan opened this issue 2 years ago • 2 comments

I have noticed that SudachiPy and Sudachi (Java) behave differently: "İstanbul" is not recognized as a single token in SudachiPy, so I am reporting it.

$ echo "İstanbul" | sudachipy -a
İ       名詞,普通名詞,一般,*,*,*        I       I       アイ    0       []
        補助記号,一般,*,*,*,*   ̇       ̇               -1      []      (OOV)
stanbul 名詞,普通名詞,一般,*,*,*        stanbul stanbul         -1      []      (OOV)
EOS

$ echo "İstanbul" | sudachi -a
İstanbul        名詞,固有名詞,一般,*,*,*        Istanbul        Istanbul        Istanbul        0       [15600]
EOS

Apparently, the character normalization step produces different input text in each implementation.

$ echo "İstanbul" | sudachipy -d
=== Inupt dump:
i(U+0307)stanbul

$ echo "İstanbul" | sudachi -d
=== Input dump:
istanbul

It seems that "İ" (U+0130) is converted to "i" (U+0069) followed by "◌̇" (U+0307) in Python, and to "i" (U+0069) alone in Java. This may be due to differences in how each programming language specifies lowercasing.
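The Python side of this can be checked with the standard library alone. The snippet below is a minimal sketch, not SudachiPy's actual normalization code: it only shows what `str.lower()` does to "İ". The helper `code_points` is a hypothetical name introduced here for illustration.

```python
import unicodedata


def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings (hypothetical helper)."""
    return ["U+{:04X}".format(ord(ch)) for ch in text]


capital_i_dot = "\u0130"  # "İ", LATIN CAPITAL LETTER I WITH DOT ABOVE

# Python's str.lower() maps the one code point to two:
# "i" followed by a combining dot above.
lowered = capital_i_dot.lower()

print(code_points(capital_i_dot))
print(code_points(lowered))
print([unicodedata.name(ch) for ch in lowered])
```

If the tokenizer lowercases its input this way, "İstanbul" turns into "i", U+0307, "stanbul", which matches the SudachiPy input dump above.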

katsutan avatar Jul 15 '21 07:07 katsutan

From the Unicode point of view, Python is correct here, Java is not.

eiennohito avatar Nov 11 '21 06:11 eiennohito

İstanbul as a word is fixed in SudachiPy 0.6.1+, İ itself is not. Moving the issue to Sudachi.rs repo.

eiennohito avatar Jan 21 '22 11:01 eiennohito