jschardet

Unicode character problem

Open SombraRO opened this issue 6 years ago • 3 comments

Every message that has the character ç next to another non-ASCII character comes out with the wrong characters.

Using encoding: UTF-8

çã shows up as зг (as in ção → згo); çõ shows up as уш

This can only be reproduced when the message is sent from IRC to Discord; the IRC side cannot be UTF-8.

https://github.com/reactiflux/discord-irc/issues/399

SombraRO, May 12 '18 15:05

The specific issue seems to be that çã and çõ, in the windows-1252 encoding, are being detected as windows-1251 and IBM855, respectively, and so are interpreted as зг and уш. The context this problem came up in was attempting to convert IRC messages from various encodings into UTF-8, so they can be bridged to Discord.

Example strings:

  • eu não gosto de diferenciação (in the windows-1252 encoding), erroneously detected as windows-1251 and accordingly interpreted as eu nгo gosto de diferenciaзгo (notice "não" → "nгo" and "ção" → "згo")
  • informações (in the windows-1252 encoding), erroneously detected as IBM855 and accordingly interpreted as informaушes
  • ça me fait rire (in the windows-1252 encoding), correctly detected as windows-1252
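
The misreadings listed above can be reproduced without any detector: encode the text as windows-1252, then decode the same bytes with the wrongly guessed encoding. Below is a minimal sketch in Python using only the standard codecs. `cp1252`, `cp1251` and `cp855` are Python's names for windows-1252, windows-1251 and IBM855; the helper name `misdecode` is made up for illustration and is not part of jschardet.

```python
def misdecode(text: str, actual: str, guessed: str) -> str:
    """Encode text with the encoding the sender used, then decode the
    resulting bytes with the encoding a detector guessed instead."""
    return text.encode(actual).decode(guessed)


# (message as typed, encoding the detector reportedly picked)
samples = [
    ("eu não gosto de diferenciação", "cp1251"),  # windows-1251
    ("informações", "cp855"),                     # IBM855
]

for text, guessed in samples:
    print(f"{text!r} read as {guessed}: {misdecode(text, 'cp1252', guessed)!r}")
```

In windows-1252, ç is byte 0xE7 and ã is byte 0xE3; in windows-1251 those same bytes are з and г. Both readings are valid for the same byte string, so a detector has only letter statistics to go on, and a short message gives it very little.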

Since this is likely due to several encodings being plausible for the same bytes, it might be hard to come up with code that distinguishes these situations. The sample languages above are Portuguese and French.

(This issue actually crops up in https://github.com/Throne3d/node-irc, which https://github.com/reactiflux/discord-irc depends on. It uses "jschardet": "^1.6.0" in its dependencies, currently resolved to 1.6.0 in version 0.9.0.)

Throne3d, May 12 '18 16:05

This affects Finnish as well, in the same context the commenter above explains.

A few examples that don't bug out:

  • ei välttämättä
  • vin-vin sitsyeissön
  • lämpötilakin nousee vaan vaikka iv puhisee
  • testää

And a few more that do:

  • niin kai sitä vois -> niin kai sitä vois
  • meniskö sittenkin seiskaan vasta -> meniskรถ sittenkin seiskaan vasta
  • mä en ota riskiä että tää selkä pahenee -> mה en ota riskiה ettה tהה selkה pahenee
  • testä testä
  • ätest -> ätest
  • äätest -> äätest
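
The thread does not say which encodings were guessed for the Finnish lines, but two of the outputs above are consistent with specific mix-ups: UTF-8 bytes read as windows-874 (Thai), and single-byte windows-1252 bytes read as windows-1255 (Hebrew). The sketch below checks that assumption with Python's standard codecs. The encoding pairs are inferred from the garbled output, not reported by jschardet, and the variable names are illustrative.

```python
# Assumption: the sender used UTF-8 and the bytes were read as windows-874.
# "ö" is the two bytes C3 B6 in UTF-8, which are two Thai letters in cp874.
thai = "meniskö sittenkin seiskaan vasta".encode("utf-8").decode("cp874")
print(thai)

# Assumption: the sender used windows-1252 and the bytes were read as
# windows-1255. "ä" is the single byte E4, which is the Hebrew letter he.
hebrew = "mä en ota riskiä että tää selkä pahenee".encode("cp1252").decode("cp1255")
print(hebrew)
```

If these assumptions hold, the two reports involve different source encodings (one UTF-8, one single-byte), which would mean the misdetection is not limited to legacy encodings.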

redfellow, May 23 '18 10:05

I am also coming from https://github.com/reactiflux/discord-irc/issues/399 and would like to add the test string kyllä ("yes" in Finnish), which turns into kyllä. All the users on my instance are using UTF-8. I am sure of this because I have disabled support for every encoding other than UTF-8 in my clients (WeeChat: I have unloaded the charset plugin; ZNC: I have selected "send and parse UTF-8 only" everywhere).
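
When most senders are known to use UTF-8, one common way to sidestep a wrong guess is to try a strict UTF-8 decode first and only fall back to something else when the bytes are not valid UTF-8. The sketch below illustrates that idea in Python. It is not what jschardet, node-irc or discord-irc do; the function name `decode_irc_line` and the windows-1252 fallback are assumptions chosen for the example.

```python
def decode_irc_line(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode raw bytes as UTF-8 if they are valid UTF-8, otherwise use a
    fixed fallback encoding (hypothetical default: windows-1252)."""
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # A few bytes are undefined in cp1252, so replace rather than raise.
        return raw.decode(fallback, errors="replace")


print(decode_irc_line("kyllä".encode("utf-8")))         # valid UTF-8 path
print(decode_irc_line("informações".encode("cp1252")))  # fallback path
```

This works because non-ASCII text in a single-byte encoding is rarely valid UTF-8 by accident. It does not help tell one single-byte encoding from another, which is the harder case described earlier in the thread.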

EDIT: fixed https://github.com/reactiflux/discord-irc/issues/399#issuecomment-393054180

Mikaela, May 30 '18 07:05