Restcomm-Connect
Double-byte encoded incoming SMS message gets corrupted
Summary
Certain higher-order characters sent from a user's handset to a Restcomm-connected bot are corrupted at the Restcomm level. Not all higher-order characters exhibit this problem.
Related Tickets
There are many tickets related to double-byte messages.
- #2607 opened on Oct 31, 2017 by tomngo : Encoding issue related to UTF-16 conversion
- #2368 opened on Jul 18, 2017 by scottbarstow updated on Aug 2, 2017 : RVD Send SMS does not support Unicode characters
- #1903 opened on Mar 3, 2017 by scottbarstow updated on Jul 19, 2017 : Issue with encoding of Unicode characters
Scope of Impact
Every Restcomm-connected bot that can accept arbitrary natural-language input will be affected. Obviously non-US users will be more affected than US users.
There is no reliable workaround. As discussed in #2607, the recipient can reliably distinguish between different encodings only if a BOM (U+FEFF) is present. Otherwise, only heuristics are possible, and in many cases the information is simply not recoverable even if the sequence of decoding errors is known.
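To illustrate why a BOM matters, here is a minimal Python sketch. It is not Restcomm code: the helper name `sniff_utf16_bom` and the sample payload are assumptions made for illustration only. It shows that a UTF-16 payload with a BOM identifies its byte order, while the same payload without one decodes "successfully" under either byte order, so the receiver cannot tell which result is right.

```python
import codecs


def sniff_utf16_bom(data: bytes):
    """Hypothetical helper: return the UTF-16 byte order named by a BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return None


text = "José"
with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
without_bom = text.encode("utf-16-be")

print(sniff_utf16_bom(with_bom))     # byte order is known
print(sniff_utf16_bom(without_bom))  # None: the receiver has to guess

# Without a BOM, both byte orders decode without raising an error,
# but only one of them yields the original text.
print(repr(without_bom.decode("utf-16-be")))
print(repr(without_bom.decode("utf-16-le")))
```

Because the wrong guess raises no error here, a receiver can only fall back on heuristics such as checking whether the decoded text looks plausible.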
Isolated to Restcomm
I've changed every variable outside of Restcomm, and the behavior is identical:
- The same thing happens when the message is sent from my handset (on T-Mobile) through a Restcomm instance, whether that Restcomm instance is tied to Teli ([email protected]), or to Hook ([email protected]).
- The same thing happens when the message is sent from my Google Voice line through a Restcomm instance, whether that Restcomm instance is tied to Teli ([email protected]), or to Hook ([email protected]).
- A message carrying an identical string arrives intact if sent from my handset to my Google Voice line without going through Restcomm, or vice versa.
- The corruption is visible in the Restcomm logs, i.e., before reaching our platform.
Affected Characters
Here are some characters that are affected:
- é (U+00E9): Latin Small Letter E with Acute
- ñ (U+00F1): Latin Small Letter N with Tilde
- [ (U+005B): Left Square Bracket
- ] (U+005D): Right Square Bracket
- @ (U+0040): Commercial At
- 😀 (U+1F600): Grinning Face
Here are some characters that are not affected:
- e (U+0065): Latin Small Letter E
- n (U+006E): Latin Small Letter N
- ‘ (U+2018): Left Single Quotation Mark
- ’ (U+2019): Right Single Quotation Mark
- “ (U+201C): Left Double Quotation Mark
- ” (U+201D): Right Double Quotation Mark
Strangely, some characters that are not affected are higher order than some that are affected.
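The observation can be checked with a short Python sketch that tabulates the characters listed above. The `describe` helper is hypothetical and exists only for this illustration. It prints each character's code point and its encoded length in UTF-8 and UTF-16, and confirms that the affected set is not simply "everything above some code point".

```python
affected = ["é", "ñ", "[", "]", "@", "😀"]
unaffected = ["e", "n", "‘", "’", "“", "”"]


def describe(ch: str):
    """Hypothetical helper: (code point, UTF-8 byte count, UTF-16 byte count)."""
    return (
        f"U+{ord(ch):04X}",
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
    )


for label, chars in (("affected", affected), ("not affected", unaffected)):
    for ch in chars:
        print(label, ch, *describe(ch))

# The highest unaffected code point is above the lowest affected one,
# so code point order alone does not predict corruption.
highest_unaffected = max(ord(ch) for ch in unaffected)
lowest_affected = min(ord(ch) for ch in affected)
print(highest_unaffected > lowest_affected)
```

Neither the code point nor the UTF-8 or UTF-16 byte length separates the two lists cleanly, which suggests the corruption depends on something other than character width alone.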
Examples
My name is José Peña.
- Restcomm via Hook: SmsSid SMa117bca5a48843ada30f545c8964134a (from T-Mobile) and SM8ab768363a8d44178fc8ff7a642d24be (from Google Voice)
- Restcomm via Teli: SmsSid SMd933d6eff7a946c4adef1932db99debf (from T-Mobile) and SM02389664fedb4aa3afff6a79d28aa7d1 (from Google Voice)