jiter JSON parsing fails on "lone leading surrogate in hex escape" while normal json.loads doesn't

Open lindycoder opened this issue 1 year ago • 4 comments

Hello,

In our migration to pydantic 2, we found a JSON document that pydantic 1 was able to load and pydantic 2 can't, failing with the error:

Invalid JSON: lone leading surrogate in hex escape at line...

Here's a simple way of reproducing:

import json

from pydantic_core import from_json

# Raw bytes literal: the \uXXXX sequences stay literal JSON escapes
data = rb'{"test": "text\udce2\udc80\udc99text"}'

print(json.loads(data))
print(from_json(data))

The first print, using Python's json, works:

{'test': 'text\udce2\udc80\udc99text'}

The second one, using pydantic_core (used by pydantic 2), raises:

Traceback (most recent call last):
  File "check.py", line 7, in <module>
    print(from_json(data))
          ^^^^^^^^^^^^^^^
ValueError: lone leading surrogate in hex escape at line 1 column 20

Here are the versions:

Python 3.12.2
pydantic 2.8.2
pydantic-core 2.20.1

Thank you!

Jul 05 '24 18:07 lindycoder

Moving this to jiter.

We need to check what serde-json does.

Jul 05 '24 18:07 samuelcolvin

Serde fails with the same error message:

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=97bd7df54428c3e668c287b59565cd67

Part of the problem is that a Python str is allowed to contain invalid Unicode sequences (see e.g. PEP 383 and the 'surrogateescape' handler) so that it can carry (encoded) arbitrary byte payloads. Encoding these strings to UTF-8 (and any other UTF-8 operation on them) will fail.

Rust String data, on the other hand, strictly requires valid UTF-8.
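
For illustration, here is a minimal stdlib-only sketch of the PEP 383 behaviour described above. The escapes in the original reproduction (\udce2\udc80\udc99) have the shape that the 'surrogateescape' handler produces for the bytes E2 80 99, which are the UTF-8 encoding of U+2019; whether the reporter's document was actually produced this way is an assumption. The variable names are made up for the example.

```python
# Bytes that are not valid ASCII: the UTF-8 encoding of U+2019 between two words.
raw = b"text\xe2\x80\x99text"

# PEP 383: each undecodable byte 0xNN is mapped to the lone surrogate U+DCNN.
smuggled = raw.decode("ascii", errors="surrogateescape")

# Strict UTF-8 encoding rejects lone surrogates, which is what a Rust String
# would require the data to survive.
try:
    smuggled.encode("utf-8")
    strict_utf8_ok = True
except UnicodeEncodeError:
    strict_utf8_ok = False

# The same handler reverses the mapping and restores the original bytes.
restored = smuggled.encode("utf-8", errors="surrogateescape")
```

Here smuggled is a perfectly legal Python str, but it has no strict UTF-8 representation, so it cannot be handed to Rust as a String without some lossy or escaping step.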

Sep 23 '24 08:09 davidhewitt

Hi, I’m encountering the same issue. Just wanted to follow up and check if there are any updates or progress on this? Thanks for your efforts!

import json

from pydantic import TypeAdapter

adapter = TypeAdapter(str)


adapter.validate_json('"\\u266a"')
# '♪'

adapter.validate_json('"\\ud83c"')
# ValidationError: 1 validation error for str
#   Invalid JSON: unexpected end of hex escape at line 1 column 8 [type=json_invalid, input_value='"\\ud83c"', input_type=str]
#     For further information visit https://errors.pydantic.dev/2.11/v/json_invalid

# Temporary workaround
adapter.validate_python(json.loads('"\\ud83c"'))
# '\ud83c'
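
# A possible extension of the workaround, as a rough sketch only. It assumes
# the lone surrogates are either surrogate-escaped UTF-8 bytes (which can be
# restored) or junk (which is replaced with '?'). repair_surrogates is a
# hypothetical helper, not part of pydantic or jiter; it only uses the stdlib.
# The repaired object could then be passed to validate_python as above.

```python
import json


def repair_surrogates(value):
    """Recursively replace strings that cannot be encoded as strict UTF-8."""
    if isinstance(value, str):
        try:
            value.encode("utf-8")
            return value  # already valid, nothing to do
        except UnicodeEncodeError:
            pass
        try:
            # Undo PEP 383 escaping: U+DCNN goes back to byte 0xNN, then the
            # resulting bytes are decoded as real UTF-8.
            return value.encode("utf-8", errors="surrogateescape").decode("utf-8")
        except UnicodeError:
            # Not recoverable: lone surrogates become '?'.
            return value.encode("utf-8", errors="replace").decode("utf-8")
    if isinstance(value, list):
        return [repair_surrogates(item) for item in value]
    if isinstance(value, dict):
        return {repair_surrogates(k): repair_surrogates(v) for k, v in value.items()}
    return value


parsed = json.loads(r'{"test": "text\udce2\udc80\udc99text", "note": ["\ud83c"]}')
repaired = repair_surrogates(parsed)
```

# Note that this changes the data: the result no longer round-trips to the
# original JSON text, so it is only suitable when the surrogates are unwanted.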

Apr 18 '25 06:04 unights

+1 Any fixes for this?

Aug 21 '25 11:08 aflah02
