Restcomm-Connect

Encoding issue related to UTF-16 conversion

Open tomngo opened this issue 7 years ago • 8 comments

Summary

If a message containing a high-order character is sent to a Restcomm number, the payload delivered to the Restcomm client is encoded with extra null characters.

Steps

Good case:

  1. SMS to 14153014887 this text: What's available today? (note straight single quote U+0027).
  2. See that the Restcomm client receives a payload like this (note the Body): SmsSid=SMde237ddbef704a89b2c2e77b5d019377&AccountSid=AC8d6b8fa45600cf7665e7a5d9d07cfcbd&From=18016969866&To=14153014887&Body=What%27s+available+today%3F

Bad case:

  1. SMS to 14153014887 this text: What’s available today? (note curly single quote U+2019).
  2. See that the Restcomm client receives a payload like this (note the Body): SmsSid=SMf8dd33d6666a4113a9f51ddd62543b60&AccountSid=AC8d6b8fa45600cf7665e7a5d9d07cfcbd&From=18016969866&To=14153014887&Body=%00W%00h%00a%00t+%19%00s%00+%00a%00v%00a%00i%00l%00a%00b%00l%00e%00+%00t%00o%00d%00a%00y%00%3F

Reproducibility and Age

100%. We observed this same thing on 27 Jan 2017. We didn't pay more attention to it back then because we didn't have a customer deal that might be affected by it.

Theory

This is strong circumstantial evidence that, somewhere along the path from source to Restcomm client (and therefore possibly outside of Restcomm):

  • One component is making a binary decision whether to encode as UTF-16BE instead of a mostly-single-byte encoding such as UTF-8.
  • A later component is assuming that its input is in the latter encoding.

Here is why. In this discussion, I'll pretend that we know the latter encoding is UTF-8.

  • The character W (U+0057) is encoded in UTF-8 as 57 but in UTF-16BE as 00 57.
  • The character ’ (U+2019) is encoded in UTF-8 as E2 80 99 but in UTF-16BE as 20 19.
  • The character Null (U+0000) is encoded in UTF-8 as 00.
  • The character Space (U+0020) is encoded in UTF-8 as 20.
  • The character End of Medium (U+0019) is encoded in UTF-8 as 19.
  • A component expecting UTF-8 but receiving 00 57 would interpret Null then W, which would be percent-plus encoded as %00W (W is an unreserved character, so it is sent literally rather than as %57).
  • A component expecting UTF-8 but receiving 20 19 would interpret Space then End of Medium, which would be percent-plus encoded as +%19.
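The theory above can be checked with a short sketch. It assumes nothing about Restcomm internals; it only encodes the two test messages and percent-plus encodes the resulting bytes with Python's standard library. The helper name `form_encode_bytes` is made up for illustration.

```python
from urllib.parse import quote_plus


def form_encode_bytes(raw: bytes) -> str:
    """Percent-plus encode raw bytes, as a form encoder would."""
    return quote_plus(raw)


good = "What's available today?"      # straight single quote, U+0027
bad = "What\u2019s available today?"  # curly single quote, U+2019

# Good case: the text is encoded as UTF-8 before percent-plus encoding.
good_body = form_encode_bytes(good.encode("utf-8"))

# Bad case (hypothesis): the text is encoded as UTF-16BE, and a later
# component percent-plus encodes those bytes one at a time.
bad_body = form_encode_bytes(bad.encode("utf-16-be"))

print(good_body)
print(bad_body)
```

Under these assumptions both strings come out identical to the Body values reported in the Steps section, which is consistent with the theory but does not show which component is responsible.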

tomngo, Oct 31 '17 17:10

Hi Tom,

What happens when you use basic string functions to normalize messages to UTF-8 and replace non-alphanumeric characters with spaces?

Ivelin

ivelin, Oct 31 '17 20:10

Ivelin, good question. We could recognize this situation and conditionally fix the encoding on our end. We will do that if the problem can't be fixed upstream from us. We strongly prefer that the encoding be fixed upstream, for a couple of reasons (the recognizer would have to be 100% reliable, and we don't want components in the system to co-adapt to each other's special behaviors).

tomngo, Oct 31 '17 20:10

Here are a couple more observations.

Q. Can we distinguish a UTF-16BE percent-plus encoded stream from a UTF-8 percent-plus encoded stream with 100% reliability? A. Yes, if the stream starts with the BOM (U+FEFF). But these streams don't. I think that means we could use really good heuristics that are right 99.9% of the time, but I don't think we can guarantee 100%.
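For illustration, here is a minimal sketch of the kind of heuristic meant above. The function name `decode_body`, the threshold, and the "mostly zero high bytes" rule are all assumptions made for this example; none of it is Restcomm behavior.

```python
from urllib.parse import unquote_to_bytes


def decode_body(body: str, threshold: float = 0.5) -> str:
    """Decode a percent-plus encoded Body, guessing UTF-16BE vs. UTF-8.

    Heuristic (an assumption, not a guarantee): if the byte count is even
    and at least `threshold` of the even-indexed bytes are 0x00, treat the
    payload as UTF-16BE. Otherwise treat it as UTF-8.
    """
    raw = unquote_to_bytes(body.replace("+", " "))
    if len(raw) >= 2 and len(raw) % 2 == 0:
        high_bytes = raw[0::2]  # would be the high byte of each UTF-16BE unit
        if high_bytes.count(0) / len(high_bytes) >= threshold:
            return raw.decode("utf-16-be")
    return raw.decode("utf-8")
```

This recovers the curly-quote message from the bad payload, but it is easy to construct inputs that defeat it. For example, a UTF-16BE message written entirely in Cyrillic has no zero high bytes, so it would be routed to the UTF-8 branch.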

Q. What happens if we make a mistake, e.g., try to read a UTF-16BE stream as UTF-8? A. Certain characters will cause the UTF-8 decoder to fail. For instance, characters in the U+00C0 to U+00FF range, all of which are legal and many of which are very common, such as à and é, will usually cause a UTF-8 decoding error. In UTF-16BE, those characters have byte streams like 00 C0 through 00 FF. A UTF-8 decoder will see U+0000 followed by a byte that is either never valid in UTF-8 (C0, C1, and F5 through FF) or a lead byte (C2 through F4) that must be followed by continuation bytes in the range 80 through BF. In a UTF-16BE stream the next byte is the high byte of the following character, which is typically 00, so decoding fails.
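The failure mode is easy to reproduce. The sketch below encodes a few single characters as UTF-16BE and then tries to read the bytes as UTF-8; the helper name `utf8_view` is made up for this example.

```python
def utf8_view(text: str) -> str:
    """Encode text as UTF-16BE, then try to read those bytes as UTF-8."""
    raw = text.encode("utf-16-be")
    try:
        return repr(raw.decode("utf-8"))
    except UnicodeDecodeError as err:
        return f"decode error: {err.reason}"


# W and the curly quote decode "successfully" into the wrong characters;
# the accented letters cannot be decoded at all.
for sample in ["W", "\u2019", "\u00c0", "\u00e9"]:
    print(f"U+{ord(sample):04X}", utf8_view(sample))
```

Note that the first two samples do not raise an error at all, which is why the payload in this issue arrives silently corrupted instead of being rejected.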

tomngo, Oct 31 '17 22:10

@tomngo that makes sense. We should try to apply this normalization at Restcomm level. @deruelle WDYT?

ivelin, Nov 01 '17 17:11

I understand now that Restcomm is probably supplying UCS-2, which is a subset of UTF-16BE.
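As a small illustration of the difference: UCS-2 can only represent characters in the Basic Multilingual Plane with one 16-bit unit each, while UTF-16BE uses a surrogate pair for anything above U+FFFF. The sample characters below are chosen for this example only.

```python
bmp_char = "\u2019"         # curly quote, inside the Basic Multilingual Plane
astral_char = "\U0001F600"  # an emoji, outside the BMP

bmp_bytes = bmp_char.encode("utf-16-be")
astral_bytes = astral_char.encode("utf-16-be")

print(bmp_bytes.hex(" "))     # 20 19: one 16-bit unit, same as UCS-2
print(astral_bytes.hex(" "))  # d8 3d de 00: a surrogate pair, not valid UCS-2
```

For the messages in this issue the two encodings produce the same bytes, since every character involved is in the BMP.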

tomngo, Nov 01 '17 20:11

@deruelle: Any news?

tomngo, Nov 10 '17 21:11

Heads up @ivelin @deruelle : Something similar to this issue still exists. It's downstream from Restcomm, but I predict that it will affect many Telestax partners other than Lumin. I believe it's not precisely the UCS-2/UTF-8 mismatch that I described above. Here's an example.

  • Lumin sent this internal diagnostic message via SMS: Issue: Abstract pleasantry.hello-again has no variant with args [] (This happens to be a diagnostic message, but the brackets are not uncommon [])
  • Restcomm logged it correctly (SID = SM5eea3a9d79244ac6bb2a5a2606dc63f1)
  • That account ([email protected], SID = AC11338a793e5113bb4adb9871e667a8ce) is tied to Hook Mobile
  • It arrived at Scott Barstow's handset with the last characters garbled; see screenshot

For reference:

  • [ is U+005B
  • ] is U+005D
  • Ä is U+00C4
  • Ñ is U+00D1
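One possible explanation, which this thread does not confirm: in the GSM 03.38 default 7-bit alphabet used for SMS, code 0x5B is Ä and code 0x5D is Ñ, while the square brackets live in an extension table. If some hop passed the ASCII codes for [ and ] through as GSM 7-bit codes, the handset would show Ä and Ñ. The sketch below uses a hand-copied, partial excerpt of that alphabet; the function name and the pass-through of every other character are simplifications made for this example.

```python
# Partial excerpt of the GSM 03.38 default alphabet (assumption: this is the
# table involved). Only codes that differ from ASCII near the brackets are listed.
GSM_DEFAULT_EXCERPT = {0x5B: "Ä", 0x5C: "Ö", 0x5D: "Ñ", 0x5E: "Ü"}


def misread_as_gsm7(text: str) -> str:
    """Show how text looks if its ASCII codes are treated as GSM 7-bit codes.

    Characters outside the excerpt are passed through unchanged, which is
    a simplification; the real alphabet differs from ASCII in other places too.
    """
    return "".join(GSM_DEFAULT_EXCERPT.get(ord(ch), ch) for ch in text)


print(misread_as_gsm7("has no variant with args []"))
```

If the screenshot shows the message ending in ÄÑ, that would match this hypothesis; if not, the cause is something else.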

tomngo, Jul 27 '18 15:07

I have filed a new ticket that describes today's behavior, which is somewhat different. It's as if someone put in a fix for specific characters but not a comprehensive fix. See #2994.

tomngo, Nov 09 '18 18:11