jschardet
Unicode character problem
Every message that uses the character ç next to another Unicode character comes out with strange characters.
Using encoding: UTF-8
çã is shown as згo
çõ is shown as уш
This can only be reproduced when the message is sent from IRC to Discord and the IRC side is not using UTF-8.
https://github.com/reactiflux/discord-irc/issues/399
The specific issue seems to be that çã and çõ, in the windows-1252 encoding, are being detected as windows-1251 and IBM855, respectively, and so are interpreted as зг and уш. The context this problem came up in was attempting to convert IRC messages from various encodings into UTF-8, so they can be bridged to Discord.
Example strings:
- eu não gosto de diferenciação (in the windows-1252 encoding), erroneously detected as windows-1251 and accordingly interpreted as eu nгo gosto de diferenciaзгo (notice "não" → "nгo" and "ção" → "згo")
- informações (in the windows-1252 encoding), erroneously detected as IBM855 and accordingly interpreted as informaушes
- ça me fait rire (in the windows-1252 encoding), correctly detected as windows-1252
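The two misreadings above can be reproduced by hand, without jschardet, to confirm that the garbled output is exactly what the wrong code page would give. This is a minimal sketch: `decodeWindows1251` and `decodeIbm855Subset` are hypothetical helpers written for this illustration (the IBM855 table only covers the two bytes needed here), and Node's `latin1` encoding is used as a stand-in for windows-1252, which is valid for ç, ã and õ because both encodings assign them the same bytes.

```javascript
// In windows-1251, bytes 0xC0..0xFF are the Cyrillic letters U+0410..U+044F, in order.
function decodeWindows1251(bytes) {
  return Array.from(bytes, (b) => {
    if (b < 0x80) return String.fromCharCode(b); // ASCII is unchanged
    if (b >= 0xc0) return String.fromCharCode(0x0410 + (b - 0xc0));
    return '\ufffd'; // 0x80..0xBF is not needed for this illustration
  }).join('');
}

// Partial IBM855 table (assumption: only the two bytes used below are mapped).
const ibm855Subset = { 0xe7: 'у', 0xf5: 'ш' };

function decodeIbm855Subset(bytes) {
  return Array.from(bytes, (b) => {
    if (b < 0x80) return String.fromCharCode(b);
    return ibm855Subset[b] || '\ufffd';
  }).join('');
}

// ç = 0xE7, ã = 0xE3, õ = 0xF5 in both latin1 and windows-1252.
const asCp1251 = decodeWindows1251(Buffer.from('diferenciação', 'latin1'));
const asIbm855 = decodeIbm855Subset(Buffer.from('informações', 'latin1'));

console.log(asCp1251); // diferenciaзгo
console.log(asIbm855); // informaушes
```

Since the bytes are valid in all three code pages, the detector has only letter statistics to go on, which is why short messages are the ones most likely to be misclassified.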
Since this is likely to be due to conflicting possible encodings, it might be hard to come up with code that distinguishes these situations? The sample languages above are Portuguese and French.
(This issue actually crops up in https://github.com/Throne3d/node-irc, which https://github.com/reactiflux/discord-irc depends on. It uses "jschardet": "^1.6.0" in its dependencies, currently resolved to 1.6.0 in version 0.9.0.)
This affects Finnish as well, in the same context the commenter above explains.
A few examples that don't bug out:
- ei välttämättä
- vin-vin sitsyeissön
- lämpötilakin nousee vaan vaikka iv puhisee
- testää
And a few more that do:
- niin kai sitä vois -> niin kai sitä vois
- meniskö sittenkin seiskaan vasta -> meniskรถ sittenkin seiskaan vasta
- mä en ota riskiä että tää selkä pahenee -> mה en ota riskiה ettה tהה selkה pahenee
- testätestäätest->ätestäätest->äätest
I am also coming from https://github.com/reactiflux/discord-irc/issues/399 and would like to add the test string kyllä ("yes" in Finnish), which turns into kyllä. All the users on my instance are using UTF-8; I am sure of this because I have disabled support for every encoding other than UTF-8 in my clients (WeeChat: I have unloaded the charset plugin; ZNC: I have selected "send and parse UTF-8 only everywhere").
EDIT: fixed https://github.com/reactiflux/discord-irc/issues/399#issuecomment-393054180
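For the UTF-8-only setup described above, one possible guard is to check whether the incoming bytes are already well-formed UTF-8 before asking a detector to guess. The sketch below is an assumption for illustration, not necessarily what the linked fix does; `tryDecodeUtf8` is a hypothetical helper name, and it relies only on the standard `TextDecoder` with `fatal: true`, which throws on malformed input.

```javascript
// Returns the decoded string if `bytes` is well-formed UTF-8, otherwise null.
function tryDecodeUtf8(bytes) {
  try {
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch (err) {
    return null; // malformed UTF-8: the caller can fall back to charset detection
  }
}

const utf8Bytes = Buffer.from('kyllä', 'utf8');     // ä is the two bytes 0xC3 0xA4
const cp1252Bytes = Buffer.from('kyllä', 'latin1'); // ä is the single byte 0xE4

console.log(tryDecodeUtf8(utf8Bytes));   // kyllä
console.log(tryDecodeUtf8(cp1252Bytes)); // null
```

A message that passes this check never reaches the statistical detector, so a short UTF-8 string like kyllä cannot be misread as some other code page.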