cbor-js String encoding breaks in several edge cases

Open ryancdotorg opened this issue 4 years ago • 2 comments

The root cause seems to be that the string encoder assumes any character at or above \ud800 is the first half of a well-formed UTF-16 surrogate pair. That assumption fails in the following cases:

Code points U+E000 to U+FFFF
Unpaired surrogates

JavaScript strings are not necessarily well-formed UTF-16. The encoder needs to check whether each character in the range \ud800 to \udbff (a high surrogate) is followed by a character in the range \udc00 to \udfff (a low surrogate), and encode U+FFFD if it is not. A character in the range \udc00 to \udfff that does not follow a high surrogate should also be encoded as U+FFFD.
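The handling described above could look roughly like the following. This is a minimal sketch, not the actual cbor-js code or a proposed patch; the function name `utf8Bytes` is made up for illustration.

```javascript
// Hypothetical helper (not part of cbor-js): encode a JavaScript string as
// UTF-8 bytes, substituting U+FFFD for any unpaired surrogate.
function utf8Bytes(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);
    if (cp >= 0xd800 && cp <= 0xdbff) {
      // High surrogate: only valid when a low surrogate follows it.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (next - 0xdc00);
        i++; // consume the low surrogate
      } else {
        cp = 0xfffd; // unpaired high surrogate
      }
    } else if (cp >= 0xdc00 && cp <= 0xdfff) {
      cp = 0xfffd; // low surrogate without a preceding high surrogate
    }
    // Everything else, including U+E000 to U+FFFF, is encoded as-is.
    if (cp < 0x80) {
      out.push(cp);
    } else if (cp < 0x800) {
      out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      out.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return out;
}
```

The key difference from the failing behaviour is that the four-byte branch is only reached after an explicit check of both halves of the pair, so characters such as \uff08 fall through to the three-byte branch.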

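For example, CBOR.encode("\uff08\u9999\u6e2f\uff09") gives 6bf3928699e6b8aff3929080 rather than the expected 6cefbc88e9a699e6b8afefbc89. The leading f3928699 is what you get when \uff08 is treated as a high surrogate and \u9999 is consumed as its low half, which is also why the length byte is 6b (11 bytes) instead of 6c (12 bytes).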
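The expected bytes can be reproduced independently with the platform TextEncoder, which also replaces unpaired surrogates with U+FFFD. The sketch below is only a cross-check, assuming an environment with a global TextEncoder (modern browsers, recent Node.js); `encodeShortTextString` and `toHex` are illustrative names, and only CBOR text strings shorter than 24 bytes are handled.

```javascript
// Hypothetical cross-check, not cbor-js code: build a CBOR text string
// (major type 3) for payloads under 24 bytes, using TextEncoder for UTF-8.
function encodeShortTextString(str) {
  const utf8 = new TextEncoder().encode(str);
  if (utf8.length > 23) {
    throw new RangeError("sketch only handles UTF-8 payloads under 24 bytes");
  }
  const out = new Uint8Array(1 + utf8.length);
  out[0] = 0x60 | utf8.length; // major type 3, length in the low 5 bits
  out.set(utf8, 1);
  return out;
}

// Format bytes as a lowercase hex string.
const toHex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");

console.log(toHex(encodeShortTextString("\uff08\u9999\u6e2f\uff09")));
// 6cefbc88e9a699e6b8afefbc89
```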

Oct 24 '19 02:10 ryancdotorg

This is resolved at https://github.com/aaronhuggins/cbor-redux.

Sep 03 '20 21:09 aaronhuggins

This is resolved in CBOR-es, a JavaScript ES module that uses TextEncoder/TextDecoder: https://github.com/code4fukui/CBOR-es/blob/master/CBOR.js

Dec 12 '21 04:12 taisukef
