rapidjson icon indicating copy to clipboard operation
rapidjson copied to clipboard

Deserialization fails on invalid unicode code point

Open cmanallen opened this issue 1 year ago • 0 comments

Version python-rapidjson==1.14. To reproduce: import rapidjson; rapidjson.loads('"\ud83c"') Error message: UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83c' in position 1: surrogates not allowed

\ud83c is not a valid unicode code point. Currently deserialization fails. This is uncommon behavior compared to other JSON parsers which deserialize it as an ASCII literal.

Consider the default Python JSON parser which returns the following given a valid and invalid unicode code point.

>>> json.loads('"\u266a"')
'♪'
>>> json.loads('"\ud83c"')
'\ud83c'

As opposed to rapidjson which returns:

>>> rapidjson.loads('"\u266a"')
'♪'
>>> rapidjson.loads('"\ud83c"')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83c' in position 1: surrogates not allowed

cmanallen avatar Jan 29 '24 19:01 cmanallen