msgspec icon indicating copy to clipboard operation
msgspec copied to clipboard

Incorrect decoding of 65-bit integer

Open jdmsu4 opened this issue 1 year ago • 1 comments

Description

There seems to be a case where 65 bit integers are improperly decoded. I discovered this in a large dynamically generated json file at seemingly random times using the msgspec lib until I tracked down and isolated an example (below). Both larger and smaller encoded integers do not seem to have any issue. I expected the value of 19933688932870350000 to be consistently encoded and decoded as such, but instead received the number 1482881526185800828 as the output. As I was trying to figure out what was going on, I tried using the standard library json.loads() and got the expected value out. When examining the binary representation of the number, I noticed the incorrect value was equivalent to the correct one, besides the leading 4 bits:

Decimal Binary value Bits
Input 19933688932870350000 10001010010010100010000001110010000110100010010110011010001111100 65
Output (incorrect) 1482881526185800828 1010010010100010000001110010000110100010010110011010001111100 61

Here is a minimally reproducible example:

import json
import msgspec

# incorrect value
encoded_json = msgspec.json.encode({"test_value": 19933688932870350000})
decoded_json =  msgspec.json.decode(encoded_json)
print(decoded_json)

# correct value
comp_encoded_json = msgspec.json.encode({"test_value": 19933688932870350000})
comp_decoded_json = json.loads(comp_encoded_json)
print(comp_decoded_json)

jdmsu4 avatar Jun 07 '24 23:06 jdmsu4

I suppose that this piece of code does not detect the overflow correctly: https://github.com/jcrist/msgspec/blob/2c37da090b2c5fcb2ecca9ae00274c67fabb85cf/msgspec/_core.c#L11902-L11909

Apparently the problem appears with the numbers starting from (1 << 64) + (10 ** 18) + 1, for which the above code will have mantissa = ONE_E18 + 1:

>>> test_value = (1 << 64) + (10 ** 18)
>>> msgspec.json.decode(json.dumps(test_value)), test_value
(19446744073709551616, 19446744073709551616)

>>> test_value = (1 << 64) + (10 ** 18) + 1
>>> msgspec.json.decode(json.dumps(test_value)), test_value
(1000000000000000001, 19446744073709551617)

and continues up to 19999999999999999999 (then at 20000000000000000000 the *integer_start != '1' part triggers, and the code starts behaving correctly again):

>>> test_value = 19999999999999999999
>>> msgspec.json.decode(json.dumps(test_value)), test_value
(1553255926290448383, 19999999999999999999)

>>> test_value = 19999999999999999999 + 1
>>> msgspec.json.decode(json.dumps(test_value)), test_value
(20000000000000000000, 20000000000000000000)

sigprof avatar Aug 06 '24 15:08 sigprof