
SIGSEGV in `datetime.timedelta` (possibly from datetime's C `delta_new`)

Open Jacoblightning opened this issue 8 months ago • 37 comments

Crash report

What happened?

I was testing some AI code with ollama and stumbled across a really weird crash. The fact that it happens during an IndexError and a specific function has to be there leads me to believe that this is a CPython bug and not an ollama bug.

from ollama import chat

# Has to be here to segfault???
def colorSwitch(color): 
    print(color, end="", flush=True)


stream = chat(
    model="llama3.2", # I think it works with any but this is what I used.
    messages=[{"role": "user", "content": ""}],
    options={"seed":0}, # Does not need this but I figured it would be helpful
    stream=True,
)
# Any iteration works. I just simplified it down to this.
part = next(iter(stream))['message']['content']
temp = part.split("</think>", 1)

# Crash Here
temp[1]

PythonCore.zip

CPython versions tested on:

3.13

Operating systems tested on:

Linux

Output from running 'python -VV' on the command line:

Python 3.13.2 (main, Feb 5 2025, 08:05:21) [GCC 14.2.1 20250128]

Linked PRs

  • gh-132599
  • gh-132665
  • gh-133111
  • gh-136152
  • gh-136321

Jacoblightning avatar Apr 11 '25 17:04 Jacoblightning

I was testing some AI code with ollama and stumbled across a really weird crash.

What's the crash? Please provide the traceback (as shown on the terminal, if possible), not just the core dump.

picnixz avatar Apr 11 '25 17:04 picnixz

Sorry. Here is the python traceback. If you wanted a gdb or other traceback, let me know.

  File "crasher1.py", line 19, in <module>
    temp[1]
    ~~~~^^^
IndexError: list index out of range
Segmentation fault (core dumped)

Jacoblightning avatar Apr 11 '25 17:04 Jacoblightning

If you wanted a gdb or other traceback, let me know

Yes, if possible, so that we know exactly where the crash happens. You can also try python -X faulthandler crasher.py, though I'm not sure it will tell us much more.

Also, it might be that ollama is using the Python C API behind the scenes (AFAIU, ollama is written in Go and C but there are Python bindings, which may be where the issue arises).
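For scripts that cannot easily be relaunched with -X faulthandler, the same handler can be switched on from inside the program. The snippet below is a minimal sketch; the helper name enable_crash_tracebacks is made up for illustration, while faulthandler.enable() and faulthandler.is_enabled() are the standard-library calls.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=sys.stderr):
    """Hypothetical helper: dump Python tracebacks on fatal signals such as SIGSEGV."""
    if not faulthandler.is_enabled():
        # all_threads=True prints the stack of every thread, not only the crashing one.
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Typical use: call enable_crash_tracebacks() once, at the top of the crashing script.
```

Like the -X option, this only reports the Python-level stack. A C-level backtrace still needs gdb, lldb, or a sanitizer build.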

picnixz avatar Apr 11 '25 17:04 picnixz

Ah. It appears that it is an ollama issue after all. I'll raise it over there.

Fatal Python error: Segmentation fault

Current thread 0x00007d584222bbc0 (most recent call first):
  Garbage-collecting
  File "/home/jacoblightning3/PycharmProjects/AiFight/.venv/lib/python3.13/site-packages/httpx/_client.py", line 158 in close
  File "/home/jacoblightning3/PycharmProjects/AiFight/.venv/lib/python3.13/site-packages/httpx/_models.py", line 972 in close
  File "/home/jacoblightning3/PycharmProjects/AiFight/.venv/lib/python3.13/site-packages/httpx/_client.py", line 877 in stream
  File "/usr/lib/python3.13/contextlib.py", line 162 in __exit__
  File "/home/jacoblightning3/PycharmProjects/AiFight/.venv/lib/python3.13/site-packages/ollama/_client.py", line 163 in inner
Segmentation fault (core dumped)

Jacoblightning avatar Apr 11 '25 17:04 Jacoblightning

@picnixz So, I just checked and it appears that ollama-python is pure Python. (Of course, I realized this after I made the issue.) Would that bring the issue back here? They don't appear to be using ctypes, etc.

Jacoblightning avatar Apr 11 '25 17:04 Jacoblightning

That does look like it could be a CPython issue, though ollama has a few dependencies that include compiled code. I tried to reproduce it (3.13.1, macOS, latest ollama) but I got httpx.ConnectError: [Errno 61] Connection refused instead (does ollama require a local server or something? probably not something I'm interested in setting up).

Two useful ways forward could be:

  • Get the full C stack trace in gdb or lldb or a similar tool and explore what's happening when we hit the segfault. For example, maybe that will tell you what type it's looking at when the crash happens.
  • Reduce the reproducer to something simpler. For example, you can start by removing more and more parts of ollama that aren't relevant to the crash and then see if it still reproduces.

JelleZijlstra avatar Apr 11 '25 17:04 JelleZijlstra

I tried to reproduce it (3.13.1, macOS, latest ollama) but I got httpx.ConnectError: [Errno 61] Connection refused instead (does ollama require a local server or something? probably not something I'm interested in setting up).

Yes. The ollama package requires ollama to be installed and running a local server.

Jacoblightning avatar Apr 11 '25 18:04 Jacoblightning

Running python with the debug build I just compiled produces:

Modules/_datetimemodule.c:2745:13: runtime error: member access within null pointer of type 'struct datetime_state'
AddressSanitizer:DEADLYSIGNAL
=================================================================
==9445==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000010 (pc 0x7c8cc6b93737 bp 0x7ffe5c2d11f0 sp 0x7ffe5c2d1040 T0)
==9445==The signal is caused by a READ memory access.
==9445==Hint: address points to the zero page.
    #0 0x7c8cc6b93737 in delta_new Modules/_datetimemodule.c:2745
    #1 0x5790c86108b1 in type_call Objects/typeobject.c:1987
    #2 0x5790c8410b28 in _PyObject_MakeTpCall Objects/call.c:242
    #3 0x5790c8410fc7 in _PyObject_VectorcallTstate Include/internal/pycore_call.h:166
    #4 0x5790c8411018 in PyObject_Vectorcall Objects/call.c:327
    #5 0x5790c883c457 in _PyEval_EvalFrameDefault Python/generated_cases.c.h:1502
    #6 0x5790c8472fda in _PyEval_EvalFrame Include/internal/pycore_ceval.h:119
    #7 0x5790c8473ce6 in gen_send_ex2 Objects/genobject.c:229
    #8 0x5790c84786a3 in gen_send_ex Objects/genobject.c:270
    #9 0x5790c847adb5 in _gen_throw Objects/genobject.c:543
    #10 0x5790c847b17c in gen_throw Objects/genobject.c:580
    #11 0x5790c883ebd2 in _PyEval_EvalFrameDefault Python/generated_cases.c.h:1640
    #12 0x5790c8883c6b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:119
    #13 0x5790c8883f2f in _PyEval_Vector Python/ceval.c:1812
    #14 0x5790c84105c3 in _PyFunction_Vectorcall Objects/call.c:413
    #15 0x5790c8419a0e in _PyObject_VectorcallTstate Include/internal/pycore_call.h:168
    #16 0x5790c841b924 in method_vectorcall Objects/classobject.c:62
    #17 0x5790c8410ed7 in _PyObject_VectorcallTstate Include/internal/pycore_call.h:168
    #18 0x5790c8411018 in PyObject_Vectorcall Objects/call.c:327
    #19 0x5790c888107e in _PyEval_EvalFrameDefault Python/generated_cases.c.h:6205
    #20 0x5790c8472fda in _PyEval_EvalFrame Include/internal/pycore_ceval.h:119
    #21 0x5790c8473ce6 in gen_send_ex2 Objects/genobject.c:229
    #22 0x5790c84786a3 in gen_send_ex Objects/genobject.c:270
    #23 0x5790c8479898 in gen_close Objects/genobject.c:392
    #24 0x5790c8479dcd in _PyGen_Finalize Objects/genobject.c:106
    #25 0x5790c894557a in finalize_garbage Python/gc.c:980
    #26 0x5790c8947b80 in gc_collect_main Python/gc.c:1408
    #27 0x5790c8949657 in _PyGC_CollectNoFail Python/gc.c:1657
    #28 0x5790c89fc3c1 in finalize_modules Python/pylifecycle.c:1757
    #29 0x5790c8a08ab6 in _Py_Finalize Python/pylifecycle.c:2125
    #30 0x5790c8a08fbc in Py_FinalizeEx Python/pylifecycle.c:2252
    #31 0x5790c8ab41fa in Py_RunMain Modules/main.c:778
    #32 0x5790c8ab440a in pymain_main Modules/main.c:806
    #33 0x5790c8ab4787 in Py_BytesMain Modules/main.c:830
    #34 0x5790c813ab41 in main Programs/python.c:15
    #35 0x7c8cc7c35487  (/usr/lib/libc.so.6+0x27487) (BuildId: 0b707b217b15b106c25fe51df3724b25848310c0)
    #36 0x7c8cc7c3554b in __libc_start_main (/usr/lib/libc.so.6+0x2754b) (BuildId: 0b707b217b15b106c25fe51df3724b25848310c0)
    #37 0x5790c813aa64 in _start (/home/jacoblightning3/Documents/python/3.13debug/bin/python3.13+0x11a5a64) (BuildId: c2486f8b3c246fe979c4f6575406636585088b8a)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV Modules/_datetimemodule.c:2745 in delta_new
==9445==ABORTING

Jacoblightning avatar Apr 11 '25 18:04 Jacoblightning

On current 3.13 tip that line in datetime is https://github.com/python/cpython/blob/089c43f1601c88a2744fa960fd958ed05f741ba7/Modules/_datetimemodule.c#L2745 which doesn't have an obvious bug. This sort of crash could be the result of memory corruption earlier in the program.

JelleZijlstra avatar Apr 11 '25 18:04 JelleZijlstra

So, using GDB, I figured out that st on that line is a null pointer and the macro attempts to dereference it. st is created in the function _get_current_state, which GDB says is returning NULL from line 168. The weird thing is that GDB won't let me step into or break on lines 166 or 167.

So, assuming that GDB is right on where _get_current_state is returning, get_module_state must have been called. https://github.com/python/cpython/blob/089c43f1601c88a2744fa960fd958ed05f741ba7/Modules/_datetimemodule.c#L102-L108

The other weird thing is that the assert on line 106 is not failing as it should (I have assertions turned on in my build). So I have to assume that GDB is wrong about where _get_current_state is returning, and that it must be returning from either line 157 or line 163.

Since neither of those return paths in _get_current_state set *p_mod, wouldn't this be a bug since current_mod is not checked in delta_new and it is just assumed that st != NULL?

(I could be totally wrong on all this. :)

Jacoblightning avatar Apr 11 '25 20:04 Jacoblightning

Just realized that line numbers are different. Let me fix that

Jacoblightning avatar Apr 11 '25 20:04 Jacoblightning

Why _get_current_state returns NULL in the first place in this situation, I have no idea. (Actually, I do: both get_current_module and possibly PyImport_ImportModule("_datetime") failed. But I don't know why that happens.)

Jacoblightning avatar Apr 11 '25 20:04 Jacoblightning

@JelleZijlstra New Minimal Reproducible example:

import httpx

client = httpx.Client(
    # Any URL works
    base_url="https://duckduckgo.com"
)


def req():
    with client.stream("GET", "/") as r:
        yield

# Cannot be inlined into the iter. If inlined, segfault does not occur
stream = req()

# Any iteration works. I just simplified it down to this.
next(iter(stream)) 

Also, I found the line where python crashes:

https://github.com/encode/httpx/blob/9e8ab40369bd3ec2cc8bff37ab79bf5769c8b00f/httpx/_client.py#L158
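In the ASan trace earlier in the thread, the generator is finalized from finalize_modules, that is, during interpreter shutdown. A possible user-side workaround, sketched below on the assumption that the crash depends on that late finalization, is to close the generator explicitly so its cleanup runs while the interpreter is still fully alive. The generator here is a stand-in that does not use httpx, and cleanup_log is a name introduced only for illustration.

```python
from contextlib import closing

cleanup_log = []  # illustrative only: records when the cleanup code runs


def req():
    """Stand-in for the reproducer's generator; no httpx involved."""
    try:
        yield "chunk"
    finally:
        # In the real reproducer, this is where client.stream() would exit.
        cleanup_log.append("closed")


# closing() calls stream.close() when the block ends, which runs the
# generator's finally clause immediately instead of at interpreter exit.
with closing(req()) as stream:
    first = next(stream)
```

This does not fix the underlying bug. It only moves the cleanup to a point where module state is known to be intact.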

Jacoblightning avatar Apr 11 '25 21:04 Jacoblightning

Ok, so I think it's an issue with datetime (and/or maybe with iterators?). Thank you very much for your investigation! I'll try to see if I can find something tomorrow or on Sunday

picnixz avatar Apr 11 '25 21:04 picnixz

Oh. Whoops. Didn't see that you changed the name

Jacoblightning avatar Apr 11 '25 22:04 Jacoblightning

Has this been tested with _pydatetime to verify the issue is definitely in the C implementation (_datetimemodule.c)?
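One way to run such a check is to block the C accelerator before datetime is imported, by putting None into sys.modules under the name _datetime. The sketch below does this in a child interpreter so the current process is left untouched, and then confirms that timedelta really comes from the pure-Python code. The helper name uses_pure_python_datetime and the CHECK string are made up for illustration.

```python
import subprocess
import sys

# Code run in a fresh interpreter so the parent's imports are unaffected.
CHECK = """
import sys
sys.modules['_datetime'] = None       # block the C accelerator
sys.modules.pop('datetime', None)     # in case something imported it already
import datetime, inspect
# Pure-Python methods are plain functions; C methods are method descriptors.
print(inspect.isfunction(datetime.timedelta.total_seconds))
"""


def uses_pure_python_datetime():
    """Hypothetical helper: True if the child fell back to pure-Python datetime."""
    result = subprocess.run(
        [sys.executable, "-c", CHECK],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == "True"
```

Adding the same sys.modules line at the top of a reproducer script, before any other import, would exercise the crash path without the C module.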

StanFromIreland avatar Apr 12 '25 08:04 StanFromIreland

There seems to be a subtle issue as well:

Python 3.14.0a7+ (heads/main:d4e2cdc15bd, Apr 12 2025, 10:58:46) [GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.modules['_datetime'] = None
>>> import httpx
...
... client = httpx.Client(
...     # Any URL works
...     base_url="https://duckduckgo.com"
... )
...
...
... def req():
...     with client.stream("GET", "/") as r:
...         yield
...
... # Cannot be inlined into the iter. If inlined, segfault does not occur
... stream = req()
...
... # Any iteration works. I just simplified it down to this.
... next(iter(stream))
...
>>>
>>> ^D
Exception ignored while closing generator <generator object req at 0x7f5fdcffd150>:
Traceback (most recent call last):
  File "<python-input-3>", line 10, in req
  File "/$HOME/lib/python/cpython/Lib/contextlib.py", line 162, in __exit__
  File "/$HOME/Applications/python3.12/local/lib/python3.12/site-packages/httpx/_client.py", line 877, in stream
  File "/$HOME/Applications/python3.12/local/lib/python3.12/site-packages/httpx/_models.py", line 971, in close
  File "/$HOME/lib/python/cpython/Lib/contextlib.py", line 305, in helper
TypeError: 'NoneType' object is not callable

Note that the exception ignored while closing the generator is only raised when exiting the interpreter, and the interpreter does not SIGSEGV. I'm using my 3.12 system-wide installation for the httpx package, but other than that it shouldn't change anything. What's surprising is that I cannot reproduce the crash itself with the latest main! I can reproduce the above issue, but not the SIGSEGV.

picnixz avatar Apr 12 '25 09:04 picnixz

@Jacoblightning Can you try building the latest released version please?

picnixz avatar Apr 12 '25 09:04 picnixz

Still crashing for me on Python 3.14.0a7+ (heads/main:891465fc7a6, Apr 12 2025, 08:18:07) [GCC 14.2.1 20250207], in both debug and release builds. I believe it's the same stack trace, just with different line numbers.

Jacoblightning avatar Apr 12 '25 12:04 Jacoblightning

What OS are you on?

StanFromIreland avatar Apr 12 '25 12:04 StanFromIreland

Oh btw, mine is openSUSE 15.5 and I was using gcc 7.5. So it could also be a GCC issue (or me not knowing how to check...)

picnixz avatar Apr 12 '25 12:04 picnixz

What OS are you on?

Arch Linux

Jacoblightning avatar Apr 12 '25 12:04 Jacoblightning

I can check on my Windows VM too, but I can't do that until noon.

Jacoblightning avatar Apr 12 '25 12:04 Jacoblightning

@ZeroIntensity Since you're on AL, can you check if the error also persists on your side? TiA

picnixz avatar Apr 12 '25 13:04 picnixz

I can reproduce this using a fresh build off main, but I don't think this is an issue with datetime. The crash seems to be happening in the eval loop, and anything that happens in datetime is probably just a side-effect of memory corruption.

My theory is that this has to do with reference counting problems on stackrefs + generator locals.

ZeroIntensity avatar Apr 12 '25 13:04 ZeroIntensity

@ZeroIntensity Can you also try with the Python implementation? If it also crashes, then it would be a side effect; otherwise datetime may be the culprit. I can test in a few hours on Linux.

StanFromIreland avatar Apr 12 '25 13:04 StanFromIreland

The crash doesn't occur with the Python implementation enabled, but Valgrind still explodes with errors. I'm pretty sure _datetime just acts as a trigger for the segfault.

ZeroIntensity avatar Apr 12 '25 13:04 ZeroIntensity

If httpx tries to import datetime lazily at BoundSyncStream.close(), an ImportError occurs even on 3.11:

https://github.com/python/cpython/blob/281fc338fdf57ef119e213bf1b2c772261c359c1/Lib/importlib/_bootstrap.py#L1246-L1248

The same error can happen in the current _datetimemodule.c even before module_clear() is invoked, so we fail to get a valid, live pointer to the module state, as you already discussed above:

https://github.com/python/cpython/blob/281fc338fdf57ef119e213bf1b2c772261c359c1/Modules/_datetimemodule.c#L172-L179

Triggered by d82a7ba041321e7b58a5a9bbc394670be6ceeb7c. I'll also check as much as I can.

cc @ericsnowcurrently

neonene avatar Apr 13 '25 19:04 neonene

I've posted my questions, setting _datetime aside. I'm not sure yet what needs to be fixed.

https://discuss.python.org/t/strange-side-effect-of-the-generator-when-finally-clause-is-contained/88353

neonene avatar Apr 14 '25 23:04 neonene

The generator's differing behaviors are still surprising to me:

def gen():
    try:
        print(1)
        yield 2
    finally:
        print(3)
  • print(next(it := gen()))
1
2
3
  • print(next(gen()))
1
3
2
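The difference above appears to come from when the generator object is destroyed. This is a sketch that relies on CPython's reference counting, so other implementations may behave differently. With next(gen()), the temporary generator loses its last reference as soon as next() returns, so the finally clause runs before the outer call receives the value. With a named generator, the finally clause is delayed until the name goes away. The run helper and its events list are introduced only to record the order without printing.

```python
def run(keep_reference):
    """Return the order of events for the two call styles."""
    events = []

    def gen():
        try:
            events.append(1)        # runs on the first next()
            yield 2
        finally:
            events.append(3)        # runs when the generator is closed

    if keep_reference:
        it = gen()
        events.append(next(it))     # generator is still referenced by `it`
        del it                      # dropping the last reference closes it
    else:
        # The temporary generator is finalized right after next() returns,
        # before events.append() is called with the yielded value.
        events.append(next(gen()))
    return events
```

On CPython, run(True) records the events in the order 1, 2, 3 and run(False) in the order 1, 3, 2, matching the two outputs shown above. In the original reproducer the named generator survives until interpreter shutdown, which is when its cleanup code ends up running.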

neonene avatar Apr 21 '25 23:04 neonene