go-unidecode
Transliteration error
Thank you very much for the wonderful library, I am glad it is here!
I did come across one transliteration issue: I believe してください should be shitekudasai instead of shitekutasai. I tried a number of Hiragana-Romaji converters that use different methods, and all of them chose "da" instead of "ta" for the fourth syllable.
Thanks for your report. I'll fix it later.
@mozillazg I'm curious to see how you do it. Looking at x030.go, I see that 0x3060 is "da", so I don't understand how it becomes "ta" to begin with.
@zfLQ2qx2 I can't reproduce the issue:
$ go version
go version go1.13.4 darwin/amd64
$ go run unidecode/main.go "してください" | grep shitekudasai
shitekudasai
Let me know if anything was missed.
@mozillazg Looks like the difference is that I'm normalizing the string to fully decomposed form using golang.org/x/text/transform and calling transform.Chain(norm.NFD) prior to transliterating with go-unidecode.
Before Hex: e38197e381a6e3818fe381a0e38195e38184
U+3057 'し' starts at byte position 0
U+3066 'て' starts at byte position 3
U+304F 'く' starts at byte position 6
U+3060 'だ' starts at byte position 9
U+3055 'さ' starts at byte position 12
U+3044 'い' starts at byte position 15
After Hex: e38197e381a6e3818fe3819fe38299e38195e38184
U+3057 'し' starts at byte position 0
U+3066 'て' starts at byte position 3
U+304F 'く' starts at byte position 6
U+305F 'た' starts at byte position 9
U+3099 '゙' starts at byte position 12
U+3055 'さ' starts at byte position 15
U+3044 'い' starts at byte position 18
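For anyone who wants to produce a dump like the one above, here is a minimal sketch using only the standard library. The helper name describeRunes is made up for illustration, and the two strings are written with escapes so that the composed and decomposed forms cannot be confused by an editor that normalizes text on save.

```go
package main

import "fmt"

// describeRunes (hypothetical helper) returns one line per code point in s:
// the code point, the character, and the byte offset where its UTF-8
// encoding starts.
func describeRunes(s string) []string {
	var lines []string
	// Ranging over a string yields (byte offset, rune) pairs.
	for i, r := range s {
		lines = append(lines, fmt.Sprintf("%#U starts at byte position %d", r, i))
	}
	return lines
}

func main() {
	// The same text with a precomposed U+3060 ...
	composed := "\u3057\u3066\u304f\u3060\u3055\u3044"
	// ... and with U+305F followed by the combining mark U+3099.
	decomposed := "\u3057\u3066\u304f\u305f\u3099\u3055\u3044"

	for _, s := range []string{composed, decomposed} {
		fmt.Printf("Hex: %x\n", s) // %x on a string prints its bytes in hex
		for _, line := range describeRunes(s) {
			fmt.Println(line)
		}
	}
}
```

Comparing the two listings side by side makes the extra code point at byte position 12 easy to spot.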
So it looks like the normalization process changes 0x3060 into 0x305F plus 0x3099 (the "combining katakana-hiragana voiced sound mark"), which get transliterated to "ta" and "" respectively. OK, so now I understand where "ta" is coming from, and it looks like the workaround is to normalize to the fully composed form (NFC) instead of the decomposed form (NFD).
I chose the fully decomposed form because I was trying to match the output of a nodejs function, but honestly several of the test cases for that are kind of dubious, so I think using the fully composed form and then updating the test cases to match is the way to go.
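In real code the composed form would come from norm.NFC in golang.org/x/text/unicode/norm (for example norm.NFC.String(s)). To show what recomposition does to this particular input without pulling in that dependency, here is a rough sketch. The composeVoiced helper and its small lookup table are made up for illustration and cover only a handful of hiragana, so this is not a substitute for proper NFC normalization.

```go
package main

import "fmt"

// U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK.
const voicedMark = '\u3099'

// voicedForm maps a few unvoiced hiragana to their precomposed voiced forms.
// Illustrative subset only; real NFC handles far more than this.
var voicedForm = map[rune]rune{
	'\u304b': '\u304c', // ka -> ga
	'\u3055': '\u3056', // sa -> za
	'\u305f': '\u3060', // ta -> da
	'\u306f': '\u3070', // ha -> ba
}

// composeVoiced (hypothetical helper) folds "base kana + U+3099" into the
// precomposed code point when the base is in voicedForm, and leaves every
// other code point untouched.
func composeVoiced(s string) string {
	out := make([]rune, 0, len(s))
	for _, r := range s {
		if r == voicedMark && len(out) > 0 {
			if v, ok := voicedForm[out[len(out)-1]]; ok {
				out[len(out)-1] = v // replace the base with its voiced form
				continue            // and drop the combining mark
			}
		}
		out = append(out, r)
	}
	return string(out)
}

func main() {
	decomposed := "\u3057\u3066\u304f\u305f\u3099\u3055\u3044"
	composed := composeVoiced(decomposed)
	fmt.Printf("decomposed: %x\n", decomposed)
	fmt.Printf("composed:   %x\n", composed)
}
```

With the string recomposed like this, go-unidecode sees 0x3060 again, which x030.go maps to "da".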
Apologies for having bothered you with this, but it was interesting to work out.