rapidjson
rapidjson copied to clipboard
Deserialization fails on invalid unicode code point
Version python-rapidjson==1.14
.
To reproduce: import rapidjson; rapidjson.loads('"\ud83c"')
Error message: UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83c' in position 1: surrogates not allowed
\ud83c
is not a valid unicode code point. Currently deserialization fails. This is uncommon behavior compared to other JSON parsers which deserialize it as an ASCII literal.
Consider the default Python JSON parser which returns the following given a valid and invalid unicode code point.
>>> json.loads('"\u266a"')
'♪'
>>> json.loads('"\ud83c"')
'\ud83c'
As opposed to rapidjson which returns:
>>> rapidjson.loads('"\u266a"')
'♪'
>>> rapidjson.loads('"\ud83c"')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83c' in position 1: surrogates not allowed