segfault on surrogate unicode
Python 3.6.1 (default, Mar 23 2017, 02:34:11)
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> import rapidjson
>>> rapidjson.__version__
'0.2.4'
>>> foo = {"foo": "\ud80d"}
>>> json.dumps(foo)
'{"foo": "\\ud80d"}'
>>> rapidjson.dumps(foo)
Segmentation fault (core dumped)
loads also fails:
>>> foo = '{"foo": "\\ud80d"}'
>>> json.loads(foo)
{'foo': '\ud80d'}
>>> rapidjson.loads(foo)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: Parse error at offset 9: The surrogate pair in string is invalid.
The segfault isn't good, but the loads failure is expected behavior in RapidJSON: https://github.com/Tencent/rapidjson/blob/b45c5408d18e491d8d4c158cbbff5a805089500c/include/rapidjson/reader.h#L1021-L1028 https://github.com/Tencent/rapidjson/blob/52682115fb223518474271023202c904a34865b4/include/rapidjson/error/en.h#L52
Interesting:
>>> print("\ud80d".encode('utf-8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud80d' in position 0: surrogates not allowed
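That UnicodeEncodeError is the strict UTF-8 codec at work: lone surrogates (U+D800 to U+DFFF) have no valid UTF-8 encoding. CPython's 'surrogatepass' error handler will force them through if you really need the bytes; a sketch purely for illustration, not something rapidjson uses:

```python
# Lone surrogates are not valid UTF-8, so strict encoding rejects them;
# 'surrogatepass' is a CPython error handler that emits the surrogate's
# byte pattern anyway (a workaround for illustration only).
s = "\ud80d"

try:
    s.encode("utf-8")
    raised = False
except UnicodeEncodeError:
    raised = True

assert raised  # strict UTF-8 refuses the lone surrogate

# 'surrogatepass' writes the three-byte pattern for U+D80D...
data = s.encode("utf-8", "surrogatepass")
assert data == b"\xed\xa0\x8d"
# ...and can decode it back to the same lone surrogate
assert data.decode("utf-8", "surrogatepass") == s
```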
Any idea of what should be the correct behaviour here?
What you have in the commit seems fine to me. If we want different behavior, seems like we'd have to customize RapidJSON, no?
Any idea of what should be the correct behaviour here?
it should fallback to ascii string....
As for print, try it with a file (or a network stream); it works fine:
>>> foo = {"foo": "my cool string with \ud80d in the middle of it"}
>>> with open("somefile.json", "w") as openjson:
... print(json.dumps(foo), file=openjson)
...
>>> with open("somefile.json", "r") as openjson:
... print(openjson.read(), end="")
...
{"foo": "my cool string with \ud80d in the middle of it"}
>>>
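The round trip above works because dumps with the default ensure_ascii=True escapes the surrogate into plain ASCII before anything touches the UTF-8 file; a minimal sketch:

```python
import json

d = {"foo": "my cool string with \ud80d in the middle of it"}
text = json.dumps(d)  # default ensure_ascii=True -> pure ASCII output

# the lone surrogate leaves dumps as a six-character \ud80d escape...
assert "\\ud80d" in text
# ...and loads turns the escape back into the original surrogate
assert json.loads(text) == d
```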
I'd be curious to know why RapidJSON opted not to handle those codepoints. Does the JSON standard say anything about it?
Also yajl does something debatable:
>>> import yajl
>>> a={"foo": "\ud80d"}
>>> yajl.dumps(a)
'{"foo":"\\ud80d"}'
>>> yajl.loads(yajl.dumps(a))
{'foo': '?'}
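yajl's '?' can be reproduced in plain Python by encoding with errors='replace', which substitutes '?' for anything the codec cannot encode; a sketch of that sanitizing step, not anything yajl or rapidjson exposes:

```python
# errors='replace' on *encoding* swaps each unencodable character
# (here, the lone surrogate) for a plain ASCII '?'
s = "my \ud80d string"
cleaned = s.encode("utf-8", "replace").decode("utf-8")
assert cleaned == "my ? string"
```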
So it seems like the actual coredump is fixed? Does this issue need to be open or can it be closed?
Yes, the crash has been cured. I left this open for the "fallback to ASCII" desired behavior. I'm surely not going to tweak RJ for that, though.
I'd be curious to know why RapidJSON opted not to handle those codepoints. Does the JSON standard say anything about it?
ECMA-404 says:
Note that the JSON grammar permits code points for which Unicode does not currently provide character assignments.
So it's valid JSON with single surrogate characters in it. The processor has three options in my view:
- a) produce strings with invalid characters
- b) reject valid JSON because of the invalid characters
- c) produce strings with the replacement character U+FFFD
Option (a) might be helpful when doing low-level autopsies on JSON files but passing invalid characters along inside your application is a big problem, e.g. see https://github.com/encode/django-rest-framework/issues/7026 for how this would cause "error 500" eventually during database access with the Django REST Framework until recently.
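For applications that prefer option (c), lone surrogates can be scrubbed on the Python side before the string ever reaches a JSON encoder; a minimal sketch (the `sanitize` helper is hypothetical, not part of python-rapidjson):

```python
import re

# Match any code point in the surrogate range U+D800..U+DFFF; in a
# Python 3 str these can only be lone surrogates, never valid pairs.
_SURROGATES = re.compile(r"[\ud800-\udfff]")

def sanitize(s: str) -> str:
    """Replace every lone surrogate with U+FFFD (option (c))."""
    return _SURROGATES.sub("\ufffd", s)

assert sanitize("ok \ud80d here") == "ok \ufffd here"
assert sanitize("plain ascii") == "plain ascii"  # clean text untouched
```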
I gave a talk about this problem just a few days ago, the slides are up at https://github.com/hartwork/slides_python_json_emoji_crash_story/releases if you're interested.
Any idea of what should be the correct behaviour here?
Please keep detecting and rejecting single surrogates as invalid characters.
Besides improvements in speed, it's a core feature of python-rapidjson over stdlib json.
I left this open for the "fallback to ASCII" desired behavior. I'm surely not going to tweak RJ for that, though.
I don't see anything at https://github.com/python-rapidjson/python-rapidjson/issues/81#issuecomment-333259732 that I would consider "falling back to ASCII". I see escaping during JSON encoding and unescaping during JSON decoding. Could you elaborate?
Thanks. Does this mean that in your opinion this issue should be considered as done?
I don't see anything [there] that I would consider "falling back to ASCII".
Uhm, this tricked me... in that comment, Adam answered my "request for opinion" about the correct behaviour:
Any idea of what should be the correct behaviour here?
it should fallback to ascii string....
I think that means the lone surrogate would basically be ignored, "left as is", not escaped and unescaped... or am I missing the point?
Thanks. Does this mean that in your opinion this issue should be considered as done?
I believe so, yes: I was able to reproduce the segfault with 0.2.4 but neither with 0.2.5 nor with 0.9.1 since they come with your fix 6f123234e9cbc5bc3abb564a715a1f5f8c7ae5a8.
PS: For {"foo": chr(0xd80d)} the dumps function has two options:
- A) escape U+D80D and write the literal {"foo": "\ud80d"} to the file
- B) insist that invalid characters are not to be written, and raise UnicodeEncodeError as-is:
# python -c 'import rapidjson; rapidjson.dumps({"foo": chr(0xd80d)})'
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud80d' in position 0: surrogates not allowed
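For comparison, the stdlib json module takes option (A): with the default ensure_ascii=True it emits the \ud80d escape instead of raising:

```python
import json

# The stdlib escapes the lone surrogate rather than refusing it,
# so the serialized output is pure ASCII.
out = json.dumps({"foo": chr(0xd80d)})
assert out == '{"foo": "\\ud80d"}'
```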