Bug report

I found a bug that seems to be code corruption.

While working on an example project with ~70 threads, I occasionally (once every hour or so) get the following exception from various locks:

Exception in thread Sequence 2:
Traceback (most recent call last):
  File "/Volumes/RAMDisk/installed-nogil-main/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/threading.py", line 1039, in _bootstrap_inner
    self.run()
    ~~~~~~~~^^
  ...
  File "/Volumes/RAMDisk/installed-nogil-main/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/threading.py", line 656, in wait
    with self._cond:
         ^^^^^^^^^^
  File "/Volumes/RAMDisk/installed-nogil-main/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/threading.py", line 304, in __enter__
    return self._lock.__enter__()
           ~~~~~~~~~~~~~~~~~~~~^^
TypeError: descriptor '__exit__' for '_thread.RLock' objects doesn't apply to a '_thread.lock' object

or

Exception in thread Clock:
Traceback (most recent call last):
  File "/Volumes/RAMDisk/installed-nogil-main/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/threading.py", line 1039, in _bootstrap_inner
    self.run()
    ~~~~~~~~^^
  File "europython.py", line 113, in run
    for message in input:
  File "/Volumes/RAMDisk/nogil-ep-temp/lib/python3.14/site-packages/mido/ports.py", line 243, in __iter__
    yield self.receive()
          ~~~~~~~~~~~~^^
  File "/Volumes/RAMDisk/nogil-ep-temp/lib/python3.14/site-packages/mido/ports.py", line 215, in receive
    with self._lock:
         ^^^^^^^^^^
TypeError: descriptor '__exit__' for '_thread.lock' objects doesn't apply to a '_thread.RLock' object

This looks like a bug related to locks, but it isn't. It's not even related to descriptors; descriptors just happen to refuse to run invalid code, which is why they are the ones reporting it.
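For context, the error message itself comes from an ordinary type check: a method descriptor looked up on one type raises TypeError when it is called with an instance of an unrelated type. A minimal sketch, assuming list.append and a dict as stand-ins chosen purely for illustration (they have nothing to do with the failing lock code):

```python
# Illustrative only: list.append and dict stand in for the lock types
# that appear in the tracebacks above.
def call_with_wrong_receiver():
    """Call a method descriptor with an instance of an unrelated type."""
    try:
        # list.append expects a list as its first argument, not a dict.
        list.append({}, 1)
    except TypeError as exc:
        return str(exc)
    return None


message = call_with_wrong_receiver()
print(message)
```

In normal code this check never fires, because the descriptor and the receiver come from the same attribute lookup. Seeing it fire suggests the lookup returned a method that belongs to a different type.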

This issue is also externally reported in https://github.com/PyWavelets/pywt/issues/758 with the same Lock descriptor error message I've seen, and I can reproduce the failure locally, albeit with a different exception:

TypeError: descriptor 'sort' for 'numpy.ndarray' objects doesn't apply to a 'ThreadPoolExecutor' object

To reproduce this with cpython main, do the following:

make a venv with a free-threaded build of Python
install Cython from main with pip install -e .
install Numpy from main with pip install . --no-build-isolation (important: no -e in this case)
install pywt from main with pip install -e . --no-build-isolation (important: you DO need -e in this case)
run pytest in a loop (or with autoclave) like this: PYTHON_GIL=0 pytest pywt/tests/test_concurrent.py

You will need to run this for quite a while before you hit a failure.
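If no looping tool is at hand, the "run in a loop" step can be sketched in Python. This is a hedged helper, not part of the original setup: run_until_failure and its parameters are hypothetical names, and it simply reruns a command until the exit status is non-zero.

```python
import os
import subprocess
import sys


def run_until_failure(cmd, max_runs, extra_env=None):
    """Run cmd repeatedly; return the 1-based index of the first failing run.

    Returns None if all max_runs runs exit with status 0.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd, env=env)
        if result.returncode != 0:
            return attempt
    return None


# Quick self-check with a command that always succeeds.
print(run_until_failure([sys.executable, "-c", "pass"], max_runs=2))
```

For the pywt case, cmd would be [sys.executable, "-m", "pytest", "pywt/tests/test_concurrent.py"] with extra_env={"PYTHON_GIL": "0"}, assuming pytest is installed in the venv and the current directory is the pywt checkout.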

By doing this, I managed to find this particular failure case:

self = <concurrent.futures.thread.ThreadPoolExecutor object at 0x225be124750>, fn = functools.partial(<function dwtn at 0x225be6d3b40>, wavelet='haar')
args = (array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., ...., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]]),)
kwargs = {}, f = <Future at 0x225bea44310 state=finished returned dict>, w = <concurrent.futures.thread._WorkItem object at 0x225bc875bf0>

    def submit(self, fn, /, *args, **kwargs):
        with self._shutdown_lock, _global_shutdown_lock:
            if self._broken:
                raise BrokenThreadPool(self._broken)

            if self._shutdown:
                raise RuntimeError('cannot schedule new futures after shutdown')
            if _shutdown:
                raise RuntimeError('cannot schedule new futures after '
                                   'interpreter shutdown')

            f = _base.Future()
            w = _WorkItem(f, fn, args, kwargs)

            self._work_queue.put(w)
>           self._adjust_thread_count()
E           TypeError: Future.set_result() missing 1 required positional argument: 'result'

args       = (array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., ...., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]]),)
f          = <Future at 0x225bea44310 state=finished returned dict>
fn         = functools.partial(<function dwtn at 0x225be6d3b40>, wavelet='haar')
kwargs     = {}
self       = <concurrent.futures.thread.ThreadPoolExecutor object at 0x225be124750>
w          = <concurrent.futures.thread._WorkItem object at 0x225bc875bf0>

../installed-nogil-main/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/concurrent/futures/thread.py:179: TypeError
======================================================================================================== short test summary info ========================================================================================================
FAILED pywt/tests/test_concurrent.py::test_concurrent_dwt - TypeError: Future.set_result() missing 1 required positional argument: 'result'
====================================================================================================== 1 failed, 3 passed in 0.44s ======================================================================================================
-- 863 runs, 862 passes, 1 failure, 734486 msec

Observe how Python wants to call self._adjust_thread_count() (with no arguments) but ends up calling f.set_result(), which causes an exception due to no arguments being passed.

Tested on macOS Sonoma on M1 Max with Python 3.14.0a0 experimental free-threading build (heads/main:7a807c3efaa, Jul 2 2024, 11:58:38).

AFAICT the problem only occurs with the GIL actually disabled.
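Since a free-threaded build can still run with the GIL enabled, it may help to confirm the actual state before attempting a repro. A small sketch, assuming sys._is_gil_enabled() is available (it exists on 3.13+; the helper name gil_status is hypothetical and falls back to "enabled" on older versions):

```python
import sys
import sysconfig


def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is on."""
    # Py_GIL_DISABLED is set in the build configuration of free-threaded builds.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() only exists on 3.13+; older versions always have a GIL.
    is_gil_enabled = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = is_gil_enabled() if is_gil_enabled is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}


status = gil_status()
print(status)
```

For the failures described here, one would expect free_threaded_build to be true and gil_enabled to be false.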

Jul 04 '24 16:07 ambv

By the way, I hit this last week with 3.13.0b2, so this is likely a problem on the 3.13 branch as well and not something new in 3.14.

Jul 04 '24 16:07 ngoldbaum

Another interesting failure from running the pywt tests that shows this is beyond descriptors (despite the exception):

../installed-nogil-main/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/concurrent/futures/_base.py:322: in __init__
    self._condition = threading.Condition()
        self       = <[AttributeError("'Future' object has no attribute '_condition'") raised in repr()] Future object at 0x5ddd8796110>
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[AttributeError("'Condition' object has no attribute '_waiters'") raised in repr()] Condition object at 0x5ddd8795790>
lock = <unlocked _thread.RLock object owner=0 count=0 at 0x5ddd881a300>

    def __init__(self, lock=None):
        if lock is None:
            lock = RLock()
        self._lock = lock
        # Export the lock's acquire() and release() methods
        self.acquire = lock.acquire
>       self.release = lock.release
E       TypeError: descriptor 'result_type' for 'numpy._core._multiarray_umath._array_converter' objects doesn't apply to a '_thread.RLock' object

Look at the failing self reprs. In both cases we're talking about attributes set in __init__ and never unset/deleted.

Jul 04 '24 18:07 ambv

I have tried to reproduce the above issue with pywt, and I'm routinely seeing errors like:

TypeError: fft() got an unexpected keyword argument 'axes'

Has the NumPy argument parsing been made thread-safe yet?

Jul 04 '24 19:07 colesbury

No, that PR is still in-flight: https://github.com/numpy/numpy/pull/26780

Jul 04 '24 19:07 ngoldbaum

I should probably mention, I'm testing all this with CFLAGS="-g0 -O3" and ./configure --enable-optimizations --with-lto --disable-gil.

Jul 04 '24 19:07 ambv

Minimal repro so far

This is somewhat frustrating to reproduce without any larger dependencies or codebases. The smallest thing I managed to get to crash is this:

# repro4.py

from concurrent import futures
import os
import random

def calculate(arr):
    return sum(arr) / len(arr)

print(os.environ["PYTHONHASHSEED"], end=" ", flush=True)

for _ in range(100):
    with futures.ThreadPoolExecutor(max_workers=os.cpu_count()) as ex:
        arrs = [[random.random() for _ in range(100)] for _ in range(50)]
        results = list(ex.map(calculate, arrs))

Run it with

for seed in (seq -f%1.0f 1000000 1100000)
    PYTHONHASHSEED=$seed python repro4.py
end

with the fish shell. On the bash shell (seq -f%1.0f 1000000 1099999 | while read s; do PYTHONHASHSEED=$s python3.14 repro4.py; done) it reproduces much less often.

The hash seed is a red herring but, believe me, if I remove that bit, I can no longer reproduce it.

The result of running this for a while is a similar unexpected code execution, one example being:

  File "/Users/ambv/Documents/Python/europython/repro4.py", line 13, in <module>
    results = list(ex.map(calculate, arrs))
  File "/Volumes/RAMDisk/installed-nogil-main/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/concurrent/futures/_base.py", line 611, in result_iterator
    yield _result_or_cancel(fs.pop())
          ~~~~~~~~~~~~~~~~~^^^^^^^^^^
  File "/Volumes/RAMDisk/installed-nogil-main/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/concurrent/futures/_base.py", line 309, in _result_or_cancel
    return fut.result(timeout)
           ~~~~~~~~~~^^^^^^^^^
  File "/Volumes/RAMDisk/installed-nogil-main/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/threading.py", line 526, in release
    if n < 1:
       ^^^^^
TypeError: '<' not supported between instances of 'NoneType' and 'int'

Here the call to concurrent.futures.Future.result(timeout) (with timeout=None) unexpectedly ends up executing threading.Semaphore.release(n=1), which never expects n to be None. It's very weird. I managed to run pdb when this situation happened, and running fut.result(None) there returned the expected outcome. So this must be some momentary corruption.
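The error text is consistent with that reading. As a hedged check (the helper name release_with_none is hypothetical), passing None to Semaphore.release directly, the way a misrouted fut.result(None) call would, produces the same kind of TypeError:

```python
import threading


def release_with_none():
    """Call Semaphore.release with None in place of the usual positive int."""
    sem = threading.Semaphore()
    try:
        # None mimics timeout=None arriving as the n argument.
        sem.release(None)
    except TypeError as exc:
        return str(exc)
    return None


message = release_with_none()
print(message)
```

This only shows that the arguments of one call were applied to the body of another method; it says nothing about why the wrong method was picked.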

Jul 04 '24 20:07 ambv

Thanks, I'm able to reproduce it intermittently now. So far only on my macOS arm64 laptop, not on x86-64 Linux. Reducing MCACHE_SIZE_EXP in pycore_typeobject.h seems to make it happen more quickly -- I changed it to 6 locally.

Jul 04 '24 21:07 colesbury

The bug is in our seq lock implementation:

https://github.com/python/cpython/blob/cb688bab08559079d0ee9ffd841dd6eb11116181/Python/lock.c#L550-L560

https://github.com/python/cpython/blob/cb688bab08559079d0ee9ffd841dd6eb11116181/Objects/typeobject.c#L5390-L5402

The memory ordering on the _PySeqLock_EndRead isn't sufficient. We need an explicit fence (i.e., atomic_thread_fence). The "acquire" prevents subsequent loads from being reordered before it, but in _PySeqLock_EndRead we want to prevent earlier loads from being reordered after the _PySeqLock_EndRead().

The bug occurs on arm64, but not x86-64, because x86-64 enforces ordering of loads relative to each other.
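To make the reordering concrete, here is a deterministic, single-threaded model in Python. It is only a sketch: the real code is C and the actual cache layout differs, all names (CacheEntry, read_in_order, read_with_late_load, overwrite) are hypothetical, and the "reordering" is scripted by hand rather than produced by hardware. It shows why a load that drifts past the end-read check can pair an old key with a new value without the reader noticing.

```python
# Deterministic model of a seqlock-protected cache entry.
# Not real concurrency: the writer is invoked at a fixed point by hand.
class CacheEntry:
    def __init__(self, name, value):
        self.sequence = 0  # even: stable, odd: a write is in progress
        self.name = name
        self.value = value


def write(entry, name, value):
    entry.sequence += 1  # begin write (sequence becomes odd)
    entry.name = name
    entry.value = value
    entry.sequence += 1  # end write (sequence becomes even again)


def read_in_order(entry, concurrent_write):
    """All data loads complete before the end-read check."""
    seq = entry.sequence  # begin read
    name = entry.name
    value = entry.value
    concurrent_write()  # writer runs before the end-read check
    if entry.sequence != seq:  # end read: the change is detected
        return None
    return name, value


def read_with_late_load(entry, concurrent_write):
    """Models the value load being reordered after the end-read check."""
    seq = entry.sequence  # begin read
    name = entry.name
    still_valid = entry.sequence == seq  # end-read check completes first
    concurrent_write()
    value = entry.value  # load that drifted past the check
    return (name, value) if still_valid else None


def overwrite(entry):
    """Return a callable that replaces the entry, standing in for a writer thread."""
    return lambda: write(entry, "release", "Semaphore.release")


a = CacheEntry("result", "Future.result")
b = CacheEntry("result", "Future.result")
print(read_in_order(a, overwrite(a)))
print(read_with_late_load(b, overwrite(b)))
```

In the first reader the sequence check catches the concurrent write and the read is rejected. In the second the check passes and the reader returns the old name paired with the new value. A fence placed before the sequence check in the end-read step is what rules out the second ordering.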

Jul 04 '24 23:07 colesbury

With the PRs merged, is there anything left to do here?

Jul 10 '24 03:07 itamaro

I'm closing the issue.

Jul 10 '24 13:07 vstinner

TypeError: descriptor 'some_method' for 'A' objects doesn't apply to a 'B' object
