ultrajson
UltraJSON accepts invalid JSON
What did you do?
Tried to load this invalid JSON: {"a": -}
What did you expect to happen?
I expected some exception to be raised
What actually happened?
I got {'a': 0}
What versions are you using?
- OS: Arch Linux
- Python: Python 3.10.5
- UltraJSON: ujson 5.3.0
Please include code that reproduces the issue.
The best reproductions are self-contained scripts with minimal dependencies.
Python 3.10.5 (main, Jun 6 2022, 18:49:26) [GCC 12.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> import ujson
>>> json.loads('{"a": -}')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 7 (char 6)
>>> ujson.loads('{"a": -}')
{'a': 0}
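For completeness, here is the session above as a self-contained script. It guards the ujson import so the stdlib part runs even where ujson isn't installed:

```python
import json

# A bare minus sign is not a valid JSON number.
doc = '{"a": -}'

# The stdlib parser rejects it, as expected.
try:
    json.loads(doc)
    print("json: accepted (unexpected)")
except json.JSONDecodeError as exc:
    print(f"json: rejected ({exc})")

# ujson 5.3.0 accepts it and parses the minus sign as 0.
try:
    import ujson
    print("ujson:", ujson.loads(doc))  # {'a': 0} on 5.3.0
except ImportError:
    print("ujson not installed")
```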
According to JSON's RFC (https://datatracker.ietf.org/doc/html/rfc7159#section-6), a standalone minus sign isn't a valid number. The built-in Python json module rejects this input and raises an exception, but ujson doesn't raise anything and parses the minus sign as 0. I think it's a bug.
Yeah, this has been known internally for a while. The negative sign triggers the number-detection code, but that code never verifies that it has seen at least one digit. The same kind of behaviour also exists with ujson.loads('-.1') or '-.1e1', by the way.
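For reference, the stdlib decoder rejects every one of these minus-without-digit forms, so they are easy to check with a stdlib-only snippet:

```python
import json

# A minus sign must be followed by at least one digit per RFC 7159's
# number grammar; the stdlib decoder rejects all of these.
for doc in ['-', '-.1', '-.1e1', '{"a": -}']:
    try:
        json.loads(doc)
        print(f"{doc!r}: accepted")
    except json.JSONDecodeError:
        print(f"{doc!r}: rejected")
```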
Note that, while all of these are obviously invalid JSON, the spec allows parsers to accept non-JSON inputs (in section 9). So it's not a violation of the spec.
Since this doesn't affect the parsing of valid JSON, can't be abused for nefarious purposes as far as I can tell, and would require a special check that would hardly ever trigger and would therefore usually only hurt the performance of the number-parsing code, I think this is a quirk we can live with. Although I'm curious how an UNLIKELY(...) check would perform.
Briefly tested this:
--- a/lib/ultrajsondec.c
+++ b/lib/ultrajsondec.c
@@ -114,6 +114,10 @@ static FASTCALL_ATTR JSOBJ FASTCALL_MSVC decode_numeric (struct DecoderState *ds
{
goto DECODE_INF;
}
+ if (UNLIKELY(*(offset) < '0' || *(offset) > '9'))
+ {
+ return SetError(ds, -1, "Unexpected character found when decoding number");
+ }
maxIntValue = -(JSUINT64) LLONG_MIN;
overflowLimit = maxIntValue / 10LL;
}
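The logic of that patch can be illustrated with a toy Python sketch (not ujson's actual code): after consuming an optional sign, a strict number parser must see at least one digit before it starts accumulating a value.

```python
def parse_int(s: str) -> int:
    """Toy strict integer parser illustrating the missing-digit check."""
    i = 0
    negative = s.startswith('-')
    if negative:
        i += 1
    # This is the check the patch adds: a sign followed by no digit is
    # an error, not the number 0.
    if i >= len(s) or not s[i].isdigit():
        raise ValueError("Unexpected character found when decoding number")
    value = 0
    while i < len(s) and s[i].isdigit():
        value = value * 10 + (ord(s[i]) - ord('0'))
        i += 1
    return -value if negative else value
```

Without that check, an input of just '-' would fall through with value still at 0, which is exactly how ujson ends up returning {'a': 0}.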
When parsing an array of ten thousand -1s, probably one of the worst cases since the additional check has to run on every third character, this increases the execution time by about 1.6 % for me, from 186 to 189 μs (python3 -m timeit -s 'import ujson; o = "[" + "-1,"*9999 + "-1]"; l = ujson.loads' 'l(o)').
I think I'd still prefer to leave it as it is, since, whilst the cost of fixing it is small, the cost of not fixing it is zero as far as I can tell. But it's a tiny thing either way; I'm not fussy if someone else feels more strongly about it.