segfault on surrogate unicode
Python 3.6.1 (default, Mar 23 2017, 02:34:11)
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> import rapidjson
>>> rapidjson.__version__
'0.2.4'
>>> foo = {"foo": "\ud80d"}
>>> json.dumps(foo)
'{"foo": "\\ud80d"}'
>>> rapidjson.dumps(foo)
Segmentation fault (core dumped)
loads also fails:
>>> foo = '{"foo": "\\ud80d"}'
>>> json.loads(foo)
{'foo': '\ud80d'}
>>> rapidjson.loads(foo)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: Parse error at offset 9: The surrogate pair in string is invalid.
The segfault isn't good, but the loads failure is expected behavior in RapidJSON: https://github.com/Tencent/rapidjson/blob/b45c5408d18e491d8d4c158cbbff5a805089500c/include/rapidjson/reader.h#L1021-L1028 https://github.com/Tencent/rapidjson/blob/52682115fb223518474271023202c904a34865b4/include/rapidjson/error/en.h#L52
Interesting:
>>> print("\ud80d".encode('utf-8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud80d' in position 0: surrogates not allowed
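That UnicodeEncodeError is the strict UTF-8 codec at work: lone surrogates (U+D800 to U+DFFF) have no valid UTF-8 encoding. CPython's 'surrogatepass' error handler will force them through if you really need the bytes; a sketch purely for illustration, not something rapidjson uses:

```python
# Lone surrogates are not valid UTF-8, so strict encoding rejects them;
# 'surrogatepass' is a CPython error handler that emits the surrogate's
# byte pattern anyway (a workaround for illustration only).
s = "\ud80d"

try:
    s.encode("utf-8")
    raised = False
except UnicodeEncodeError:
    raised = True

assert raised  # strict UTF-8 refuses the lone surrogate

# 'surrogatepass' writes the three-byte pattern for U+D80D...
data = s.encode("utf-8", "surrogatepass")
assert data == b"\xed\xa0\x8d"
# ...and can decode it back to the same lone surrogate
assert data.decode("utf-8", "surrogatepass") == s
```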
Any idea of what should be the correct behaviour here?
What you have in the commit seems fine to me. If we want different behavior, seems like we'd have to customize RapidJSON, no?
Any idea of what should be the correct behaviour here?
it should fallback to ascii string....
As for print, try it with a file (or a network stream); it works fine:
>>> foo = {"foo": "my cool string with \ud80d in the middle of it"}
>>> with open("somefile.json", "w") as openjson:
... print(json.dumps(foo), file=openjson)
...
>>> with open("somefile.json", "r") as openjson:
... print(openjson.read(), end="")
...
{"foo": "my cool string with \ud80d in the middle of it"}
>>>
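The round trip above works because dumps with the default ensure_ascii=True escapes the surrogate into plain ASCII before anything touches the UTF-8 file; a minimal sketch:

```python
import json

d = {"foo": "my cool string with \ud80d in the middle of it"}
text = json.dumps(d)  # default ensure_ascii=True -> pure ASCII output

# the lone surrogate leaves dumps as a six-character \ud80d escape...
assert "\\ud80d" in text
# ...and loads turns the escape back into the original surrogate
assert json.loads(text) == d
```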
I'd be curious to know why RapidJSON opted not to handle those codepoints. Does the JSON standard say anything about it?
Also yajl does something debatable:
>>> import yajl
>>> a={"foo": "\ud80d"}
>>> yajl.dumps(a)
'{"foo":"\\ud80d"}'
>>> yajl.loads(yajl.dumps(a))
{'foo': '?'}
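yajl's '?' can be reproduced in plain Python by encoding with errors='replace', which substitutes '?' for anything the codec cannot encode; a sketch of that sanitizing step, not anything yajl or rapidjson exposes:

```python
# errors='replace' on *encoding* swaps each unencodable character
# (here, the lone surrogate) for a plain ASCII '?'
s = "my \ud80d string"
cleaned = s.encode("utf-8", "replace").decode("utf-8")
assert cleaned == "my ? string"
```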
So it seems like the actual coredump is fixed? Does this issue need to be open or can it be closed?
Yes, the crash has been cured. I left this open for the "fallback to ASCII" desired behavior. I'm surely not going to tweak RJ for that, though.
I'd be curious to know why RapidJSON opted not to handle those codepoints. Does the JSON standard say anything about it?
ECMA-404 says:
Note that the JSON grammar permits code points for which Unicode does not currently provide character assignments.
So it's valid JSON with single surrogate characters in it. The processor has three options in my view:
- a) produce strings with invalid characters
- b) reject valid JSON because of the invalid characters
- c) produce strings with the replacement character U+FFFD
Option (a) might be helpful when doing low-level autopsies on JSON files but passing invalid characters along inside your application is a big problem, e.g. see https://github.com/encode/django-rest-framework/issues/7026 for how this would cause "error 500" eventually during database access with the Django REST Framework until recently.
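For applications that prefer option (c), lone surrogates can be scrubbed on the Python side before the string ever reaches a JSON encoder; a minimal sketch (the `sanitize` helper is hypothetical, not part of python-rapidjson):

```python
import re

# Match any code point in the surrogate range U+D800..U+DFFF; in a
# Python 3 str these can only be lone surrogates, never valid pairs.
_SURROGATES = re.compile(r"[\ud800-\udfff]")

def sanitize(s: str) -> str:
    """Replace every lone surrogate with U+FFFD (option (c))."""
    return _SURROGATES.sub("\ufffd", s)

assert sanitize("ok \ud80d here") == "ok \ufffd here"
assert sanitize("plain ascii") == "plain ascii"  # clean text untouched
```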
I gave a talk about this problem just a few days ago, the slides are up at https://github.com/hartwork/slides_python_json_emoji_crash_story/releases if you're interested.
Any idea of what should be the correct behaviour here?
Please keep detecting and rejecting single surrogates as invalid characters.
Besides improvements in speed, it's a core feature of python-rapidjson over stdlib json.
I left this open for the "fallback to ASCII" desired behavior. I'm surely not going to tweak RJ for that, though.
I don't see anything at https://github.com/python-rapidjson/python-rapidjson/issues/81#issuecomment-333259732 that I would consider "falling back to ASCII". I see escaping during JSON encoding and unescaping during JSON decoding. Could you elaborate?
Thanks. Does this mean that in your opinion this issue should be considered as done?
I don't see anything [there] that I would consider "falling back to ASCII".
Uhm, this tricked me... in that comment, Adam answered my "request for opinion" about the correct behaviour:
Any idea of what should be the correct behaviour here?
it should fallback to ascii string....
I think that means the lone surrogate would basically be ignored, "left as is", not escaped and unescaped... or am I missing the point?
Thanks. Does this mean that in your opinion this issue should be considered as done?
I believe so, yes: I was able to reproduce the segfault with 0.2.4 but neither with 0.2.5 nor with 0.9.1 since they come with your fix 6f123234e9cbc5bc3abb564a715a1f5f8c7ae5a8.
PS: For {"foo": chr(0xd80d)} the dumps function has two options:
- A) escape U+D80D and write the literal {"foo": "\ud80d"} to the file
- B) insist that invalid characters are not to be written, and raise UnicodeEncodeError as-is:
# python -c 'import rapidjson; rapidjson.dumps({"foo": chr(0xd80d)})'
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud80d' in position 0: surrogates not allowed
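For comparison, the stdlib json module takes option (A): with the default ensure_ascii=True it emits the \ud80d escape instead of raising:

```python
import json

# The stdlib escapes the lone surrogate rather than refusing it,
# so the serialized output is pure ASCII.
out = json.dumps({"foo": chr(0xd80d)})
assert out == '{"foo": "\\ud80d"}'
```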