Double-byte encoded incoming SMS message gets corrupted

Open tomngo opened this issue 6 years ago • 0 comments

Summary

Certain higher-order characters in SMS messages sent from a user's handset to a Restcomm-connected bot are corrupted at the Restcomm level. Not all higher-order characters exhibit this problem.

Related Tickets

There are many tickets related to double-byte messages.

#2607 opened on Oct 31, 2017 by tomngo : Encoding issue related to UTF-16 conversion
#2368 opened on Jul 18, 2017 by scottbarstow updated on Aug 2, 2017 : RVD Send SMS does not support Unicode characters
#1903 opened on Mar 3, 2017 by scottbarstow updated on Jul 19, 2017 : Issue with encoding of Unicode characters

Scope of Impact

Every Restcomm-connected bot that accepts arbitrary natural-language input is affected. Non-US users will be affected more than US users.

There is no reliable workaround. As discussed in #2607, the recipient can reliably distinguish between different encodings only if a BOM (U+FEFF) is present. Otherwise, only heuristics are possible, and in many cases the information is simply not recoverable even if the sequence of decoding errors is known.
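To illustrate why a BOM matters, here is a minimal Python sketch. It is not Restcomm code, and the helper name `detect_utf16_byte_order` is hypothetical. It assumes the payload is UTF-16 and only checks for the two UTF-16 byte order marks. Without a BOM, the same bytes decode without error under either byte order, so the receiver cannot tell from the bytes alone which reading is correct.

```python
import codecs


def detect_utf16_byte_order(payload: bytes):
    """Return the codec implied by a leading UTF-16 BOM, or None if absent.

    Hypothetical helper for illustration; it only considers UTF-16.
    """
    if payload.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    if payload.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    return None


text = "Jos\u00e9 Pe\u00f1a"  # "José Peña", the name from the example message

with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
without_bom = text.encode("utf-16-be")

print(detect_utf16_byte_order(with_bom))     # BOM present: byte order is known
print(detect_utf16_byte_order(without_bom))  # no BOM: byte order is unknown

# Reading the BOM-less big-endian bytes as little-endian raises no error;
# it silently produces different characters of the same length.
misread = without_bom.decode("utf-16-le")
print(misread == text, len(misread) == len(text))
```

The last line shows the core difficulty: a wrong guess does not fail, it just yields different text.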

Isolated to Restcomm

I've changed every variable outside of Restcomm, and the behavior is identical:

The same thing happens when the message is sent from my handset (on T-Mobile) through a Restcomm instance, whether that Restcomm instance is tied to Teli ([email protected]), or to Hook ([email protected]).
The same thing happens when the message is sent from my Google Voice line through a Restcomm instance, whether that Restcomm instance is tied to Teli ([email protected]), or to Hook ([email protected]).
A message carrying an identical string arrives intact if sent from my handset to my Google Voice line without going through Restcomm, or vice versa.
The corruption is visible in the Restcomm logs, i.e., before reaching our platform.

Affected Characters

Here are some characters that are affected:

é (U+00E9): Latin Small Letter E with Acute
ñ (U+00F1): Latin Small Letter N with Tilde
[ (U+005B): Left Square Bracket
] (U+005D): Right Square Bracket
@ (U+0040): Commercial At
😀 (U+1F600): Grinning Face

Here are some characters that are not affected:

e (U+0065): Latin Small Letter E
n (U+006E): Latin Small Letter N
‘ (U+2018): Left Single Quotation Mark
’ (U+2019): Right Single Quotation Mark
“ (U+201C): Left Double Quotation Mark
” (U+201D): Right Double Quotation Mark

Strangely, some characters that are not affected have higher code points than some that are affected. For example, ” (U+201D) is not affected, while @ (U+0040) is.
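The observation above can be checked with a short Python sketch that prints each listed character's code point and the number of UTF-16 code units it needs. The two lists are copied from this issue; the helper name `describe` is hypothetical. The sketch only shows that code point order does not separate the two groups. It does not explain the cause of the corruption.

```python
# Characters copied from the lists in this issue, written as escapes.
affected = ["\u00e9", "\u00f1", "[", "]", "@", "\U0001F600"]
unaffected = ["e", "n", "\u2018", "\u2019", "\u201c", "\u201d"]


def describe(ch: str) -> str:
    """Format a character's code point and its UTF-16 code unit count."""
    units = len(ch.encode("utf-16-be")) // 2  # 2 bytes per code unit
    suffix = "s" if units > 1 else ""
    return f"U+{ord(ch):04X} ({units} UTF-16 code unit{suffix})"


for label, chars in (("affected", affected), ("not affected", unaffected)):
    for ch in chars:
        print(f"{label:>12}: {ch} {describe(ch)}")

# The lowest affected code point is below the highest unaffected one,
# so a simple "code point above N" rule cannot describe the affected set.
lowest_affected = min(ord(ch) for ch in affected)
highest_unaffected = max(ord(ch) for ch in unaffected)
print(lowest_affected < highest_unaffected)
```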

Examples

My name is José Peña.

Restcomm via Hook: SmsSid SMa117bca5a48843ada30f545c8964134a (from T-Mobile) and SM8ab768363a8d44178fc8ff7a642d24be (from Google Voice)
Restcomm via Teli: SmsSid SMd933d6eff7a946c4adef1932db99debf (from T-Mobile) and SM02389664fedb4aa3afff6a79d28aa7d1 (from Google Voice)

Nov 08 '18 02:11 tomngo
