node-smpp icon indicating copy to clipboard operation
node-smpp copied to clipboard

[WIP] Automatically encode and decode Cyrillic, Hebrew, and KS C 5601

Open scop opened this issue 4 years ago • 6 comments

Did not include these in auto detection, I have a hunch they'd cause false positives / unwanted results.

The test samples are "hello" translations from Google Translate.

scop avatar Apr 04 '20 09:04 scop

Coverage Status

Coverage increased (+0.4%) to 75.837% when pulling 5b33c061ebe3eb9ceda83adfc135d65ebebb2d88 on scop:more-auto-encodings into a530f607f00d6ea53084536bccb2eb45351acdf2 on farhadi:master.

coveralls avatar Apr 04 '20 09:04 coveralls

Hi @scop thanks for the PR

I've reviewed encoding/decoding for Cyrillic and Hebrew and they look good (I cannot find a ksc5601 charmap to review but as long as it's using iconv...)

Would you mind review my code comment and discuss why you think matching will cause false positives / unwanted results? maybe a solution with a regex such as GSM?

juliangut avatar Apr 04 '20 23:04 juliangut

Regarding the auto detection, as said, it's mostly a hunch, but let me elaborate a bit.

First, I guess e.g. cyrillic and hebrew have some overlaps which can make detection produce unwanted (if not false) results between them and also ISO-8859-1, and I have no idea whatsoever about KS C 5601.

Second, I think detection ending up using one of these encodings should in the end not matter much at all. What counts much more in my experience is whether the message can be eventually -- when sending it to its final destination -- encoded in GSM, or if fallback to UCS-2 is needed. These intermediate encodings aren't that interesting.

Third, introducing an uncommon encoding like these could trigger issues elsewhere -- messages that were previously detected as (the widely supported) UCS-2 may now start to be encoded as one of these for little gain, but possibility in non-support in other systems where the messages are relayed to.

Because of the above, amplified by the fact that I'm not personally interested in the detection mode at all currently and thus don't want to spend time around that can of worms ;), I decided to play it safe and exclude these from the detection. If someone wants to change that, it can be done later, but I suggest not delaying getting this in for that.

scop avatar Apr 05 '20 12:04 scop

Re KS C 5601, based on info at https://en.wikipedia.org/wiki/KS_X_1001 I ended up generating the test sample's expected bytes encoding with

echo -n '여보세요' | iconv -f utf-8 -t euc-kr | hd

I'm not sure if euc-kr is always a good substitute for it, but for this sample it works and I guess that's as far as code here should be concerned, the rest can be left as an assumption that iconv-lite's ksc5601 does the desired thing.

If you want to play with it more, maybe this would be useful? https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSC5601.TXT

scop avatar Apr 05 '20 12:04 scop

Anything more I can do here?

scop avatar Oct 04 '20 06:10 scop

I'm going to hold this till we figure out encoding detection

juliangut avatar Oct 25 '20 10:10 juliangut