Restcomm-Connect
Encoding issue related to UTF-16 conversion
Summary
If a message containing a high-order character is posted to a Restcomm number, the payload sent to a Restcomm client is coded with extra null characters.
Steps
Good case:
- SMS to 14153014887 this text: What's available today? (note the straight single quote, U+0027).
- See that the Restcomm client receives a payload like this (note the Body):
SmsSid=SMde237ddbef704a89b2c2e77b5d019377&AccountSid=AC8d6b8fa45600cf7665e7a5d9d07cfcbd&From=18016969866&To=14153014887&Body=What%27s+available+today%3F
Bad case:
- SMS to 14153014887 this text: What’s available today? (note the curly single quote, U+2019).
- See that the Restcomm client receives a payload like this (note the Body):
SmsSid=SMf8dd33d6666a4113a9f51ddd62543b60&AccountSid=AC8d6b8fa45600cf7665e7a5d9d07cfcbd&From=18016969866&To=14153014887&Body=%00W%00h%00a%00t+%19%00s%00+%00a%00v%00a%00i%00l%00a%00b%00l%00e%00+%00t%00o%00d%00a%00y%00%3F
Reproducibility and Age
100%. We observed this same thing on 27 Jan 2017. We didn't pay more attention to it back then because we didn't have a customer deal that might be affected by it.
Theory
This is strong circumstantial evidence that, somewhere along the path from source to Restcomm client (and therefore possibly outside of Restcomm):
- One component is making a binary decision whether to encode as UTF-16BE instead of a mostly-single-byte encoding such as UTF-8
- A later component is assuming that its input is in the latter encoding.
Here is why. In this discussion, I'll pretend that we know the latter encoding is UTF-8.
- The character W (U+0057) is encoded in UTF-8 as 57 but in UTF-16BE as 00 57.
- The character ’ (U+2019) is encoded in UTF-8 as E2 80 99 but in UTF-16BE as 20 19.
- The character Null (U+0000) is encoded in UTF-8 as 00.
- The character Space (U+0020) is encoded in UTF-8 as 20.
- The character End of Medium (U+0019) is encoded in UTF-8 as 19.
- A component expecting UTF-8 but receiving 00 57 would interpret Null then W, which would be percent-plus encoded as %00W (equivalent to %00%57, since W needs no escaping).
- A component expecting UTF-8 but receiving 20 19 would interpret Space then End of Medium, which would be percent-plus encoded as +%19.
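The reasoning above can be checked with a short Python sketch. It assumes that "percent-plus encoding" behaves like `urllib.parse.quote_plus` and that the faulty component escapes the raw UTF-16BE bytes one at a time; the function names are hypothetical and say nothing about how Restcomm or any upstream component is actually implemented.

```python
from urllib.parse import quote_plus

GOOD_TEXT = "What's available today?"      # straight quote, U+0027
BAD_TEXT = "What\u2019s available today?"  # curly quote, U+2019


def body_if_utf8(text: str) -> str:
    """Percent-plus encode the UTF-8 bytes of the text (the expected path)."""
    return quote_plus(text.encode("utf-8"))


def body_if_utf16be_mistaken(text: str) -> str:
    """Percent-plus encode the raw UTF-16BE bytes, byte by byte, as a
    component that wrongly assumes a mostly-single-byte encoding would."""
    return quote_plus(text.encode("utf-16-be"))


print(body_if_utf8(GOOD_TEXT))
print(body_if_utf16be_mistaken(BAD_TEXT))
```

Under these assumptions, the two printed values match the Body fields of the good and bad payloads in the Steps section. That is consistent with the theory, but it does not identify which component performs the UTF-16BE encoding.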
Hi Tom,
What happens when you use basic string functions to normalize messages to UTF-8 and replace non-alphanumeric characters with blank spaces?
Ivelin
Ivelin, good question. We could recognize this situation and conditionally fix the encoding on our end. We will do that if the problem can't be fixed upstream from us. We strongly prefer that the encoding be fixed upstream, for a couple of reasons (the recognizer would have to be 100% reliable, and we don't want components in the system to co-adapt to each other's special behaviors).
Here are a couple more observations.
Q. Can we distinguish a UTF-16BE percent-plus encoded stream from a UTF-8 percent-plus encoded stream with 100% reliability?
A. Yes, if the stream starts with the BOM (U+FEFF). But these streams don't. I think that means we could use really good heuristics that are right 99.9% of the time, but I don't think we can guarantee 100%.
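To make the idea of a heuristic concrete, here is a minimal Python sketch of one possible recognizer. It is not something anyone in this thread has implemented. The function names and the 0.5 threshold are hypothetical, and the rule (even length, mostly 00 high-order bytes) is only one of many that could be chosen.

```python
from urllib.parse import unquote_to_bytes


def body_to_bytes(body: str) -> bytes:
    # In percent-plus encoding a literal '+' is escaped as %2B, so every
    # remaining '+' stands for a space.
    return unquote_to_bytes(body.replace("+", " "))


def looks_like_utf16be(raw: bytes, threshold: float = 0.5) -> bool:
    """Heuristic only: guess UTF-16BE when the length is even and at least
    `threshold` of the high-order (even-index) bytes are 00."""
    if len(raw) < 2 or len(raw) % 2 != 0:
        return False
    high_bytes = raw[0::2]
    return high_bytes.count(0) / len(high_bytes) >= threshold


def decode_body(body: str) -> str:
    """Decode a Body value, falling back to UTF-8 unless the heuristic fires."""
    raw = body_to_bytes(body)
    encoding = "utf-16-be" if looks_like_utf16be(raw) else "utf-8"
    return raw.decode(encoding)
```

This recovers the text from both payloads in the Steps section, but it also shows why 100% is out of reach: a UTF-16BE message made up mostly of non-Latin characters has few 00 high-order bytes, so this particular rule would misclassify it.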
Q. What happens if we make a mistake, e.g., try to read a UTF-16BE stream as UTF-8?
A. Certain characters will cause the UTF-8 decoder to fail. For instance, anything in the U+00C0 to U+00FF range will cause a UTF-8 decoding error; all of these are legal characters, and some, such as à and é, are very common. In UTF-16BE, those characters have byte streams like 00 C0 through 00 FF. A UTF-8 decoder will see U+0000 followed by a byte that cannot form a valid sequence: C0, C1, and F5 through FF are never valid in UTF-8, and C2 through F4 are lead bytes that must be followed by continuation bytes in the range 80 to BF, which the high-order byte of the next UTF-16BE character (typically 00) is not.
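The failure mode is easy to demonstrate with Python's built-in codecs. This is a small sketch; `decodes_as_utf8` is a hypothetical helper name, and the sample characters are chosen only for illustration.

```python
def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if the bytes form valid UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Encode each sample as UTF-16BE, then try to read the bytes as UTF-8.
for ch in ("W", "\u00c0", "\u00e9"):
    raw = ch.encode("utf-16-be")
    hex_bytes = " ".join(f"{b:02X}" for b in raw)
    print(f"U+{ord(ch):04X}", hex_bytes, decodes_as_utf8(raw))
```

The ASCII character decodes without error (as Null followed by W), which is why the bad case in the Steps section produces a garbled payload without raising an error, while the two accented characters are rejected by the decoder.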
@tomngo that makes sense. We should try to apply this normalization at Restcomm level. @deruelle WDYT?
I understand now that Restcomm is probably supplying UCS-2, which is a subset of UTF-16BE.
@deruelle: Any news?
Heads up @ivelin @deruelle : Something similar to this issue still exists. It's downstream from Restcomm, but I predict that it will affect many Telestax partners other than Lumin. I believe it's not precisely the UCS-2/UTF-8 mismatch that I described above. Here's an example.
- Lumin sent this internal diagnostic message via SMS: Issue: Abstract pleasantry.hello-again has no variant with args [] (This happens to be a diagnostic message, but the [] brackets are not uncommon.)
- Restcomm logged it correctly (SID = SM5eea3a9d79244ac6bb2a5a2606dc63f1)
- That account ([email protected], SID = AC11338a793e5113bb4adb9871e667a8ce) is tied to Hook Mobile
- It arrived at Scott Barstow's handset with the last characters garbled; see screenshot
For reference:
- [ is U+005B
- ] is U+005D
- Ä is U+00C4
- Ñ is U+00D1
I have filed a new ticket that describes today's behavior, which is somewhat different. It's as if someone put in a fix for specific characters but not a comprehensive fix. See #2994.