orjson
random crashes after upgrade to 3.9.12
This is from system dmesg output:
[Fri Jan 19 10:41:06 2024] python3[3421008]: segfault at 7fe28bd24000 ip 00007fe296824bde sp 00007ffdd5db46f8 error 4 in orjson.cpython-312-x86_64-linux-gnu.so[7fe2967fe000+2f000]
[Fri Jan 19 10:41:06 2024] Code: 66 66 66 2e 0f 1f 84 00 00 00 00 00 4c 01 c0 4c 01 c6 49 f7 d0 4c 01 c2 4c 89 10 4c 01 c8 48 ff c6 48 85 d2 0f 84 dd 02 00 00 <c5> fe 6f 1e c5 fe 7f 18 c5 e5 74 e0 c5 e5 74 e9 c5 d5 eb e4 c5 e5
Not sure if other people encounter similar issues.
I'm also running into this issue with 3.9.12. Fix was to revert to 3.9.10.
It segfaults at random points in my test suite. It seems to be related to the NUMPY opt.
tests/... Fatal Python error: Segmentation fault
Current thread 0x00007ebb93c28b80 (most recent call first):
...
File "/opt/venv/lib/python3.12/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
File "/opt/venv/lib/python3.12/site-packages/pluggy/_callers.py", line 77 in _multicall
File "/opt/venv/lib/python3.12/site-packages/pluggy/_manager.py", line 115 in _hookexec
File "/opt/venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 493 in __call__
File "/opt/venv/lib/python3.12/site-packages/_pytest/python.py", line 1792 in runtest
File "/opt/venv/lib/python3.12/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
File "/opt/venv/lib/python3.12/site-packages/pluggy/_callers.py", line 77 in _multicall
File "/opt/venv/lib/python3.12/site-packages/pluggy/_manager.py", line 115 in _hookexec
File "/opt/venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 493 in __call__
File "/opt/venv/lib/python3.12/site-packages/_pytest/runner.py", line 262 in <lambda>
File "/opt/venv/lib/python3.12/site-packages/_pytest/runner.py", line 341 in from_call
File "/opt/venv/lib/python3.12/site-packages/_pytest/runner.py", line 261 in call_runtest_hook
File "/opt/venv/lib/python3.12/site-packages/_pytest/runner.py", line 222 in call_and_report
File "/opt/venv/lib/python3.12/site-packages/_pytest/runner.py", line 133 in runtestprotocol
File "/opt/venv/lib/python3.12/site-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
File "/opt/venv/lib/python3.12/site-packages/pluggy/_callers.py", line 77 in _multicall
File "/opt/venv/lib/python3.12/site-packages/pluggy/_manager.py", line 115 in _hookexec
File "/opt/venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 493 in __call__
File "/opt/venv/lib/python3.12/site-packages/_pytest/main.py", line 350 in pytest_runtestloop
File "/opt/venv/lib/python3.12/site-packages/pluggy/_callers.py", line 77 in _multicall
File "/opt/venv/lib/python3.12/site-packages/pluggy/_manager.py", line 115 in _hookexec
File "/opt/venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 493 in __call__
File "/opt/venv/lib/python3.12/site-packages/_pytest/main.py", line 325 in _main
File "/opt/venv/lib/python3.12/site-packages/_pytest/main.py", line 271 in wrap_session
File "/opt/venv/lib/python3.12/site-packages/_pytest/main.py", line 318 in pytest_cmdline_main
File "/opt/venv/lib/python3.12/site-packages/pluggy/_callers.py", line 77 in _multicall
File "/opt/venv/lib/python3.12/site-packages/pluggy/_manager.py", line 115 in _hookexec
File "/opt/venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 493 in __call__
File "/opt/venv/lib/python3.12/site-packages/_pytest/config/__init__.py", line 169 in main
File "/opt/venv/lib/python3.12/site-packages/_pytest/config/__init__.py", line 192 in console_main
File "/opt/venv/lib/python3.12/site-packages/pytest/__main__.py", line 5 in <module>
File "<frozen runpy>", line 88 in _run_code
File "<frozen runpy>", line 198 in _run_module_as_main
Extension modules: markupsafe._speedups, confluent_kafka.cimpl, recordclass._dataobject, recordclass._litelist, recordclass._litetuple, charset_normalizer.md, guppy.sets.setsc, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, guppy.heapy.heapyc (total: 21)
I have also noticed the crashes, and reverting back to 3.9.10 indeed fixed the issue. I tested 3.9.11 and 3.9.12, and both had similar behavior. I face the issue randomly when running this kind of code:
for file in files:  # ~200 files
    # each file contains a highly nested dict of just dicts and strings (no numpy, no numbers, no dates)
    data = yaml.load(file)
    buffer = orjson.dumps(data, option=orjson.OPT_INDENT_2 | orjson.OPT_SORT_KEYS)
Sometimes it crashes, sometimes not.
Just chiming in here to say that I'm also seeing these issues, on both Mac and Linux. My Mac produced a crash report, and this seems to be the relevant section.
Termination Reason: Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process: Python [64584]
VM Region Info: 0x106c00000 is not in any region. Bytes after previous region: 1 Bytes before following region: 114688
REGION TYPE START - END [ VSIZE] PRT/MAX SHRMOD REGION DETAIL
MALLOC_TINY 106b00000-106c00000 [ 1024K] rw-/rwx SM=PRV
---> GAP OF 0x1c000 BYTES
__TEXT 106c1c000-106d2c000 [ 1088K] r-x/rwx SM=COW ...311-darwin.so
Kernel Triage:
VM - (arg = 0x3) mach_vm_allocate_kernel failed within call to vm_map_enter
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libsystem_kernel.dylib 0x18da4b11c __pthread_kill + 8
1 libsystem_pthread.dylib 0x18da82cc0 pthread_kill + 288
2 libsystem_c.dylib 0x18d95b57c raise + 32
3 Python 0x1052c9a00 faulthandler_fatal_error + 448
4 libsystem_platform.dylib 0x18dab1a24 _sigtramp + 56
5 orjson.cpython-311-darwin.so 0x106499bb8 0x106490000 + 39864
6 orjson.cpython-311-darwin.so 0x1064990cc 0x106490000 + 37068
7 orjson.cpython-311-darwin.so 0x10649ce40 0x106490000 + 52800
8 orjson.cpython-311-darwin.so 0x10649eb40 0x106490000 + 60224
9 orjson.cpython-311-darwin.so 0x10649fe60 dumps + 612
10 Python 0x10525eb84 _PyEval_EvalFrameDefault + 46716
11 Python 0x105262070 _PyEval_Vector + 116
12 Python 0x10525f7c4 _PyEval_EvalFrameDefault + 49852
13 Python 0x105262070 _PyEval_Vector + 116
14 Python 0x10525f7c4 _PyEval_EvalFrameDefault + 49852
15 Python 0x1052528c4 PyEval_EvalCode + 168
16 Python 0x1052a93f0 run_eval_code_obj + 84
17 Python 0x1052a9354 run_mod + 112
18 Python 0x1052ab790 PyRun_StringFlags + 112
19 Python 0x1052ab6d8 PyRun_SimpleStringFlags + 64
20 Python 0x1052c46d4 pymain_run_command + 144
21 Python 0x1052c41a8 Py_RunMain + 228
22 Python 0x1052c54c0 Py_BytesMain + 40
23 dyld 0x18d709058 start + 2224
Hi, I'm also experiencing random segfaults. I've tried to find a minimal reproducible example, but the crash looks nondeterministic.
I've managed to detect it with Valgrind though:
==1180649== Invalid read of size 32
==1180649== at 0x6AFFA8B: ??? (in /home/david/code/.venv311/lib/python3.11/site-packages/orjson/orjson.cpython-311-x86_64-linux-gnu.so)
==1180649== by 0x6AFDFB2: ??? (in /home/david/code/.venv311/lib/python3.11/site-packages/orjson/orjson.cpython-311-x86_64-linux-gnu.so)
==1180649== by 0x1FFEFFEFAF: ???
==1180649== Address 0x84d6f3ff0 is in a rw- anonymous segment
==1180649==
Also experiencing this on an orjson.dumps inside a fastapi application. https://github.com/tiangolo/fastapi/blob/92feb735317996ef81763da370efa92c61a6d925/fastapi/responses.py#L46
As best I can tell this is the commit that introduced issues? Code that used to be disabled by default was enabled. https://github.com/ijl/orjson/commit/a40f58b8519a83ed7b3079a953ec6c96c16c015c
I hit a similar issue and found this thread after several days of searching:
[ 7452.698006] python[63069]: segfault at 7f51b335f000 ip 00007f524b6218ae sp 00007fffd26adfb8 error 4 in orjson.cpython-38-x86_64-linux-gnu.so[7f524b5fa000+30000] likely on CPU 4 (core 0, socket 0)
[ 7452.720881] Code: 66 66 66 2e 0f 1f 84 00 00 00 00 00 4c 01 c0 4c 01 c6 49 f7 d0 4c 01 c2 4c 89 10 4c 01 c8 48 ff c6 48 85 d2 0f 84 dd 02 00 00 <c5> fe 6f 1e c5 fe 7f 18 c5 e5 74 e0 c5 e5 74 e9 c5 d5 eb e4 c5 e5
This also happens with 3.9.11.
Same here, guys:
uvicorn[3257]: segfault at 7f8063274000 ip 00007f80a45e3abe sp 00007ffd23351e78 error 4 in orjson.cpython-310-x86_64-linux-gnu.so[7f80a45bd000+2f000]
We have the same random segfaults after upgrading orjson from 3.9.10 to 3.9.12.
python -VV:
Python 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110]
This is a bit of a stab in the dark, but from commit 520525860b2ecbbf82e7dc154cc0f8e5cdfa61f7:
https://github.com/ijl/orjson/blob/4eb4f005a6f1b71609051770612a055b584b73d2/src/serialize/writer/simd.rs#L97-L98
We know from the termination of the previous while loop that nb < STRIDE, and this load is not aligned, so what’s to stop it from overreading the end of the source allocation?
This theory is consistent with all of the reported segfault addresses being at the beginning of a page.
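For anyone who wants to check the page-boundary observation, here is a small sketch. The addresses are copied from the dmesg lines and the macOS crash report earlier in this thread; the 4 KiB page size is an assumption about the reporters' machines, and `page_offset` is just an illustrative helper name.

```python
# Faulting addresses copied from the reports above (dmesg "segfault at ..."
# lines and the macOS "VM Region Info" line).
FAULT_ADDRESSES = {
    "dmesg, cpython-312": 0x7FE28BD24000,
    "dmesg, cpython-38": 0x7F51B335F000,
    "dmesg, uvicorn cpython-310": 0x7F8063274000,
    "macOS crash report": 0x106C00000,
}

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def page_offset(addr: int, page_size: int = PAGE_SIZE) -> int:
    """Return how far addr lies past the start of its page."""
    return addr % page_size


for label, addr in FAULT_ADDRESSES.items():
    print(f"{label}: {addr:#x} -> offset {page_offset(addr)} into its page")
```

Every offset comes out as zero, which is what you would expect if a wide load starts in a mapped page and runs into an unmapped one. The macOS address is also a multiple of 16384, should that machine use 16 KiB pages.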
A test case that doesn’t segfault but makes Valgrind angry:
$ valgrind python -c 'import orjson; orjson.dumps((b"\n" + b"x" * 4046).decode())'
==50092== Memcheck, a memory error detector
==50092== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==50092== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==50092== Command: python -c import\ orjson;\ orjson.dumps((b"\\n"\ +\ b"x"\ *\ 4046).decode())
==50092==
==50092== Invalid read of size 16
==50092== at 0x12DAA988: orjson::serialize::writer::simd::format_escaped_str_impl_128 (simd.rs:0)
==50092== by 0x12DA85C9: format_escaped_str<&mut orjson::serialize::writer::byteswriter::BytesWriter> (json.rs:578)
==50092== by 0x12DA85C9: serialize_str<&mut orjson::serialize::writer::byteswriter::BytesWriter, orjson::serialize::writer::formatter::CompactFormatter> (json.rs:165)
==50092== by 0x12DA85C9: <orjson::serialize::per_type::unicode::StrSerializer as serde::ser::Serialize>::serialize (unicode.rs:29)
==50092== by 0x12DACB7A: to_writer<&mut orjson::serialize::writer::byteswriter::BytesWriter, orjson::serialize::serializer::PyObjectSerializer> (json.rs:605)
==50092== by 0x12DACB7A: serialize (serializer.rs:25)
==50092== by 0x12DACB7A: dumps (lib.rs:354)
==50092== by 0x49BC251: cfunction_vectorcall_FASTCALL_KEYWORDS (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x4AB22C2: PyObject_Vectorcall (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x494D8DC: _PyEval_EvalFrameDefault (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x4B8256B: _PyEval_Vector.constprop.0 (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x4B82709: PyEval_EvalCode (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x4BAD42F: run_mod (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x4BCE6FC: PyRun_SimpleStringFlags (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x4BCE9B4: Py_RunMain (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x4FB50CD: (below main) (in /nix/store/7jiqcrg061xi5clniy7z5pvkc4jiaqav-glibc-2.38-27/lib/libc.so.6)
==50092== Address 0x13e203a1 is 4,081 bytes inside a block of size 4,096 alloc'd
==50092== at 0x484276B: malloc (in /nix/store/1iai1iry6zw0fn4b2rnb93yx4vgpd9bi-valgrind-3.22.0/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==50092== by 0x4981DBF: _PyObject_Malloc (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x49EF4DB: PyUnicode_New.part.0 (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x49B0DDF: unicode_decode_utf8 (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x4A981F1: method_vectorcall_FASTCALL_KEYWORDS (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x4AB22C2: PyObject_Vectorcall (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x494D8DC: _PyEval_EvalFrameDefault (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x4B8256B: _PyEval_Vector.constprop.0 (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x4B82709: PyEval_EvalCode (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x4BAD42F: run_mod (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x4BCE6FC: PyRun_SimpleStringFlags (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092== by 0x4BCE9B4: Py_RunMain (in /nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/libpython3.11.so.1.0)
==50092==
==50092==
==50092== HEAP SUMMARY:
==50092== in use at exit: 620,813 bytes in 215 blocks
==50092== total heap usage: 6,016 allocs, 5,801 frees, 10,140,991 bytes allocated
==50092==
==50092== LEAK SUMMARY:
==50092== definitely lost: 0 bytes in 0 blocks
==50092== indirectly lost: 0 bytes in 0 blocks
==50092== possibly lost: 0 bytes in 0 blocks
==50092== still reachable: 620,813 bytes in 215 blocks
==50092== suppressed: 0 bytes in 0 blocks
==50092== Rerun with --leak-check=full to see details of leaked memory
==50092==
==50092== For lists of detected and suppressed errors, rerun with: -s
==50092== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
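To spell out what the Valgrind report above is saying, here is the arithmetic as a short sketch. The three constants are taken straight from the report ("Invalid read of size 16", "4,081 bytes inside a block of size 4,096"); `bytes_past_end` is a hypothetical helper name. Presumably the 4,047-character string was chosen so that the str object fills the 4,096-byte block exactly, but that part is an inference, not something the trace states.

```python
BLOCK_SIZE = 4096   # "a block of size 4,096 alloc'd"
READ_OFFSET = 4081  # "4,081 bytes inside"
READ_SIZE = 16      # "Invalid read of size 16"


def bytes_past_end(offset: int, size: int, block_size: int) -> int:
    """Number of bytes of a read that fall beyond the end of the block."""
    return max(0, offset + size - block_size)


overread = bytes_past_end(READ_OFFSET, READ_SIZE, BLOCK_SIZE)
print(overread)  # prints 1
```

So the 16-byte load starts inside the allocation and its last byte lands one byte past the end, which is enough for Valgrind to flag it even though nothing crashes.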
Just an update: the latest version seems to have fixed the segfault issue. At least, I have observed no segfaults since I upgraded to 3.9.13 two days ago.
I suspect 3.9.13 reduced the probability of the issue since 58a8bd3e31aa3b5fd3d962fb5b03479fa0014ee9 decreased the maximum overread from 31 bytes to 15 bytes, but it’s not eliminated. The Valgrind trace I posted above is from 3.9.13.
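The 31 and 15 figures follow from the load width. Below is a simplified model, not the actual simd.rs logic: it assumes the main loop consumes whole strides and the leftover tail is then read with one full-width unaligned load starting at the tail. The function names are made up for illustration.

```python
def tail_overread(length: int, stride: int) -> int:
    """Bytes read past the end of the data when the final partial chunk
    is fetched with a single full-width load (simplified model)."""
    remaining = length % stride
    return 0 if remaining == 0 else stride - remaining


def max_overread(stride: int) -> int:
    """Worst case over a range of input lengths."""
    return max(tail_overread(n, stride) for n in range(1, 4 * stride))


print(max_overread(32))  # 32-byte loads
print(max_overread(16))  # 16-byte loads
```

Under this model the worst case is always stride minus one, so halving the load width roughly halves the window in which the read can run into an unmapped page, but never closes it.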
Yep, I agree with you. I hope your pull request gets merged soon so that we no longer have the buffer overread issue.
I see that in 528220fb0d18bbf0212de7f0ce5c7aec209bc6e7 you’ve added a check for whether the pointer crosses a page boundary and reinstated the buffer overread if it doesn’t. But a buffer overread is undefined behavior whether or not a page boundary is crossed. Valgrind still flags the same error with my above test case in 3.9.14.
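To illustrate the distinction, here is a rough sketch using the numbers from the Valgrind trace earlier in this thread. It is not the check from that commit; `load_crosses_page` and `load_overreads` are hypothetical helpers, the 4 KiB page size is an assumption, and the block start is derived from the reported address and offset.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages
STRIDE = 16       # width of the load Valgrind flagged

READ_ADDR = 0x13E203A1          # "Address 0x13e203a1 ..."
BLOCK_START = READ_ADDR - 4081  # "... is 4,081 bytes inside a block ..."
BLOCK_LEN = 4096                # "... of size 4,096 alloc'd"


def load_crosses_page(addr: int, size: int = STRIDE,
                      page_size: int = PAGE_SIZE) -> bool:
    """True if the first and last byte of the load sit on different pages."""
    return addr // page_size != (addr + size - 1) // page_size


def load_overreads(addr: int, size: int, buf_start: int, buf_len: int) -> bool:
    """True if the load extends beyond the end of the buffer."""
    return addr + size > buf_start + buf_len


print(load_crosses_page(READ_ADDR))
print(load_overreads(READ_ADDR, STRIDE, BLOCK_START, BLOCK_LEN))
```

For the traced load, the first check says no page is crossed while the second says the read still leaves the allocation. A page-crossing guard therefore lets exactly this kind of load through.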
Undefined behavior will cause problems eventually, even if the symptom isn’t as obvious as a segfault, and it might seem like it’s working until there’s a clever new compiler optimization relying on an incorrect invariant inferred from the contract that the program has broken. We need to avoid all UB, not just paper over its observed symptoms.
Moreover, it can’t possibly be saving significant time here, given this code is only there for handling the end of the buffer.
Please fully remove the buffer overread.
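One common way to avoid the overread entirely is to copy the leftover tail into a stride-sized scratch buffer and run the same fixed-width scan on that. The sketch below models the idea in plain Python rather than SIMD; it is not the code from the linked pull request, and all names in it are illustrative.

```python
STRIDE = 16


def needs_escape(byte: int) -> bool:
    """JSON string bytes that must be escaped: controls, '"' and '\\'."""
    return byte < 0x20 or byte in (0x22, 0x5C)


def scan_chunk(chunk: bytes) -> list[int]:
    """Stand-in for one fixed-width SIMD compare over exactly STRIDE bytes."""
    assert len(chunk) == STRIDE
    return [i for i, b in enumerate(chunk) if needs_escape(b)]


def find_escapes(data: bytes) -> list[int]:
    """Positions needing escapes, never touching bytes outside data."""
    positions = []
    full = len(data) - len(data) % STRIDE
    for start in range(0, full, STRIDE):
        hits = scan_chunk(data[start:start + STRIDE])
        positions += [start + i for i in hits]
    tail = data[full:]
    if tail:
        # Copy the tail into a zero-padded scratch buffer instead of
        # loading STRIDE bytes from the source; ignore hits in the padding.
        padded = tail + bytes(STRIDE - len(tail))
        positions += [full + i for i in scan_chunk(padded) if i < len(tail)]
    return positions


print(find_escapes(b"\n" + b"x" * 4046))
```

The extra copy only happens once per string, on at most STRIDE - 1 bytes, so the cost should be small compared with the main loop.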
- #457
I am still seeing crashes with 3.9.14, although less frequently. It might not be exactly the same issue mentioned upthread (my stack trace didn't make sense and I can't repro on demand yet).