sudachi.rs
"İ" does not behave the same as the Java version of Sudachi.
I noticed that SudachiPy and Sudachi (Java) behave differently: "İstanbul" is not recognized as a single token in SudachiPy, so I am reporting it.
$ echo "İstanbul" | sudachipy -a
İ 名詞,普通名詞,一般,*,*,* I I アイ 0 []
補助記号,一般,*,*,*,* ̇ ̇ -1 [] (OOV)
stanbul 名詞,普通名詞,一般,*,*,* stanbul stanbul -1 [] (OOV)
EOS
$ echo "İstanbul" | sudachi -a
İstanbul 名詞,固有名詞,一般,*,*,* Istanbul Istanbul Istanbul 0 [15600]
EOS
Apparently, the character normalization step produces different input text in each implementation.
$ echo "İstanbul" | sudachipy -d
=== Inupt dump:
i(U+0307)stanbul
$ echo "İstanbul" | sudachi -d
=== Input dump:
istanbul
It seems that "İ" (U+0130) is converted to "i" (U+0069) followed by "◌̇" (U+0307) in Python, but to "i" (U+0069) alone in Java.
This may be due to differences in how each programming language specifies lowercasing.
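The Python side of this can be checked without Sudachi at all. The snippet below is only a sketch: it assumes the normalization step amounts to lowercasing followed by NFKC, which is my guess and not something confirmed from the SudachiPy source, and `dump_codepoints` is a hypothetical helper written for this example.

```python
import unicodedata


def dump_codepoints(text: str) -> str:
    """Render each code point of text as U+XXXX (hypothetical helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


capital_i_dot = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE

# str.lower() applies the full Unicode case mapping, so the result is
# two code points: "i" followed by COMBINING DOT ABOVE.
lowered = capital_i_dot.lower()
print(dump_codepoints(lowered))  # U+0069 U+0307

# Assumption: NFKC is applied after lowercasing. There is no precomposed
# lowercase "i with dot above", so the combining mark survives.
normalized = unicodedata.normalize("NFKC", lowered)
print(dump_codepoints(normalized))  # U+0069 U+0307
```

For reference, Unicode defines two lowercase mappings for U+0130: the simple one-to-one mapping in UnicodeData.txt gives U+0069, while the full mapping in SpecialCasing.txt gives U+0069 U+0307. That would be consistent with the two dumps above if one implementation lowercases per code point and the other uses the full mapping, but I have not verified that in either code base.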
From the Unicode point of view, Python is correct here, Java is not.
İstanbul as a word is fixed in SudachiPy 0.6.1+, but İ on its own is not. Moving the issue to the Sudachi.rs repo.