go-unidecode

Transliteration error

Open zfLQ2qx2 opened this issue 4 years ago • 4 comments

Thank you very much for the wonderful library, I am glad it is here!

I did come across one transliteration issue: I believe してください should be shitekudasai instead of shitekutasai. I tried a number of Hiragana-Romaji converters that use different methods, and all of them chose "da" instead of "ta" for the fourth syllable.

zfLQ2qx2, Mar 02 '20 21:03

Thanks for your report. I'll fix it later.

mozillazg, Mar 04 '20 01:03

@mozillazg I'm curious to see how you do it. Looking at x030.go, I see that 0x3060 is "da", so I don't understand how it becomes "ta" to begin with.

zfLQ2qx2, Mar 06 '20 15:03

@zfLQ2qx2 I can't reproduce the issue:

$ go version
go version go1.13.4 darwin/amd64

$ go run unidecode/main.go "してください" | grep shitekudasai
shitekudasai

Let me know if anything was missed.

mozillazg, Mar 07 '20 02:03

@mozillazg Looks like the difference is that I'm normalizing the string to fully decomposed form using golang.org/x/text/transform and calling transform.Chain(norm.NFD) prior to transliterating with go-unidecode.

Before
Hex: e38197e381a6e3818fe381a0e38195e38184
U+3057 'し' starts at byte position 0
U+3066 'て' starts at byte position 3
U+304F 'く' starts at byte position 6
U+3060 'だ' starts at byte position 9
U+3055 'さ' starts at byte position 12
U+3044 'い' starts at byte position 15

After
Hex: e38197e381a6e3818fe3819fe38299e38195e38184
U+3057 'し' starts at byte position 0
U+3066 'て' starts at byte position 3
U+304F 'く' starts at byte position 6
U+305F 'た' starts at byte position 9
U+3099 '゙' starts at byte position 12
U+3055 'さ' starts at byte position 15
U+3044 'い' starts at byte position 18
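For anyone who wants to inspect their own strings the same way, here is a minimal sketch that prints a dump in the style of the ones above using only the standard library. The helper name `describe` is made up for illustration, and this is not necessarily the code that produced the dumps in this comment.

```go
package main

import "fmt"

// describe (hypothetical helper) lists each code point in s together
// with the byte offset at which it starts.
func describe(s string) []string {
	var lines []string
	// Ranging over a string yields byte offsets and decoded runes.
	for i, r := range s {
		lines = append(lines, fmt.Sprintf("%U %q starts at byte position %d", r, r, i))
	}
	return lines
}

func main() {
	composed := "\u3060"         // だ as a single code point
	decomposed := "\u305F\u3099" // た followed by the combining voiced sound mark

	for _, s := range []string{composed, decomposed} {
		fmt.Printf("Hex: %x\n", s) // raw UTF-8 bytes in hex
		for _, line := range describe(s) {
			fmt.Println(line)
		}
	}
}
```

Comparing the two dumps makes it easy to see whether a string reached the transliterator in composed or decomposed form.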

So it looks like the normalization process changes 0x3060 into 0x305F plus 0x3099 (the "combining katakana-hiragana voiced sound mark"), and those two code points are then transliterated to "ta" and "" respectively. Now I understand where "ta" is coming from, and it looks like the workaround is to normalize to the fully composed form (NFC) instead of the decomposed form.
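To illustrate the mechanism, here is a toy sketch of a transliterator that looks up each code point on its own. The table `toyTable` and the function `transliterate` are hypothetical stand-ins that contain only the entries mentioned in this thread; they are not go-unidecode's actual data or API.

```go
package main

import (
	"fmt"
	"strings"
)

// toyTable is a hypothetical per-code-point table, NOT go-unidecode's
// real data. It holds only the code points discussed in this issue.
var toyTable = map[rune]string{
	0x3057: "shi",
	0x3066: "te",
	0x304F: "ku",
	0x305F: "ta",
	0x3060: "da",
	0x3055: "sa",
	0x3044: "i",
	0x3099: "", // combining voiced sound mark: no output of its own
}

// transliterate maps every code point independently, so a base kana
// and a following combining mark are never merged.
func transliterate(s string) string {
	var b strings.Builder
	for _, r := range s {
		b.WriteString(toyTable[r])
	}
	return b.String()
}

func main() {
	composed := "\u3057\u3066\u304F\u3060\u3055\u3044"         // contains U+3060
	decomposed := "\u3057\u3066\u304F\u305F\u3099\u3055\u3044" // contains U+305F U+3099

	fmt.Println(transliterate(composed))   // shitekudasai
	fmt.Println(transliterate(decomposed)) // shitekutasai
}
```

If the input has to pass through NFD for other reasons, recomposing it with `norm.NFC` from golang.org/x/text/unicode/norm before transliterating should turn U+305F U+3099 back into U+3060.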

I chose the fully decomposed form because I was trying to match the output of a nodejs function, but honestly there are several test cases for that which are kind of dubious, so I think using the fully composed form and then updating the test cases to match is the way to go.

Apologies for having bothered you with this, but it was interesting to work out.

zfLQ2qx2, Mar 09 '20 00:03