cython Add 'with_python' (i.e. with_gil) to prange and parallel

This is mostly targetted at freethreading builds, but should run on normal builds (with a lot of deadlock avoidance).

It's designed to be a more efficient alternative to

for i in prange(...):
  with gil:
    ...

Dec 13 '24 21:12 da-woods

I've renamed with_gil in this context to with_python and tried to clarify the docs here.

Jan 08 '25 21:01 da-woods

I've renamed with_gil in this context to with_python and tried to clarify the docs here.

That sounds better. What do you think about something like with_thread_state_saved?

Jan 08 '25 21:01 lysnikolaou

I've renamed with_gil in this context to with_python and tried to clarify the docs here.

That sounds better. What do you think about something like with_thread_state_saved?

Probably too technical. I assume most users would rather not know about thread state at all. (Although I know the existing with gil/with nogil are similarly technical).

I prefer with_python because it really just describes what you can do.

Jan 08 '25 21:01 da-woods

So, I looked over this and read through the documentation snippets, but I still feel unsure about the goal and effects of this change. I gathered that the new with_python=True option assures a valid thread state for the OpenMP threads but does not itself control the GIL state, is that correct? Why can't all OpenMP threads have Python thread states right away, without a special option?

Could you maybe extend the PR description with some kind of problem description and motivation?

May 31 '25 11:05 scoder

It's pretty close to writing

for i in prange(N, nogil=True):
    with gil:
        ... # code

It's slightly more efficient (just by not using the PyGILState API as much) but not as much as I hoped when I first started looking at it - the big limitation is that Cython does have to release and re-acquire the GIL each time you go round the loop to avoid deadlock (or somehow eliminate last_private from the loop, which is is what requires the barrier).

Why can't all OpenMP threads have Python thread states right away, without a special option?

Mainly because it's a bit of a pitfall with the non-freethreading interpreter. It means that people that accidentally omit nogil=True will not get any parallelism even if they aren't using the GIL. So at least for me it seems better to make it explicit

May 31 '25 11:05 da-woods

It's pretty close to writing
for i in prange(N, nogil=True):
    with gil:
        ... # code
It's slightly more efficient

So … can't we make the idiom above more efficient then? Is it only for in prange() directly followed by with gil or also a with gil appearing somewhere later in the (otherwise nogil) loop? The first should be easy to detect.

people that accidentally omit nogil=True will not get any parallelism even if they aren't using the GIL

But isn't that expected? I think the documentation states pretty clearly that parallelism requires releasing the GIL. Wasn't it always the case that you get sequential "threads" if you run a prange loop without releasing the GIL?

May 31 '25 13:05 scoder

I'm possibly not explaining very well

for i in prange(N, with_python=True):
    ... # code

is transformed to something pretty close to

for i in prange(N, nogil=True):
    with gil:
        ... # code

(but a little faster)

people that accidentally omit nogil=True will not get any parallelism even if they aren't using the GIL

But isn't that expected? I think the documentation states pretty clearly that parallelism requires releasing the GIL. Wasn't it always the case that you get sequential "threads" if you run a prange loop without releasing the GIL?

Right now you get a compile-time error if you try to write a prange loop without releasing the GIL.

But yes - in principle we could just try to speed up prange(N, nogil=True): with gil: instead and just document that as the idiomatic approach. Given that it doesn't seem possible to safely go round the loop without releasing the GIL the benefit of this PR is fairly minor anyway

May 31 '25 13:05 da-woods

Right now you get a compile-time error if you try to write a prange loop without releasing the GIL.

Ah, so, we basically push people into releasing the GIL even if they directly need it afterwards, is that it? Ok, that use case doesn't seem hugely important to me – if you want parallelism, you probably want it for parallel code, not sequential code. But I can see that it can be useful to do some kind of Python level initialisation inside of the thread first. We also have parallel sections for stuff like that, but well, it's still a use case. And I think the obvious idiom for it is for in prange: with gil:. If there's a way to optimise such code, why not. I don't see why we should bother users with yet another option.

Do you have an example of real world code where this is used?

May 31 '25 13:05 scoder

BTW, we don't have any benchmarks for our parallel code, neither micro nor macro. Since you're working on the freethreading and parallel code, maybe you could come up with at least some microbenchmarks for this kind of mixture between Python and nogil code? I'm not sure if they'll run well in CI (could be that we have a single CPU core there), but we'll see. At least sequential locking and critical sections might still be candidates for measuring their overhead.

May 31 '25 14:05 scoder

Ah, so, we basically push people into releasing the GIL even if they directly need it afterwards, is that it? Ok, that use case doesn't seem hugely important to me – if you want parallelism, you probably want it for parallel code, not sequential code

The use case is that in freethreaded Python this code is parallel.

This is mainly targeted at people using freethreaded Python who want to use prange loops without manually messing around with the GIL. (On non-freethreaded Python the intention is that it works, but clearly the GIL means it won't work well).

The main limitation is that it isn't currently much better than writing for i in prange(N, nogil=True): with gil: ..., so it's mainly a better expression of intent than better C code.

I subsequently had a thought about how to avoid needing to release the GIL on each loop (at least in some cases) so maybe I should look at that first - it's easier to justify something new if it genuinely does lead to better performance

May 31 '25 17:05 da-woods

So I think I've been able to drop the "release the GIL on every go round", which should be an improvement.

A quick timing test:

# cython: freethreading_compatible=True

# distutils: extra_compile_args=-fopenmp
# distutils: extra_link_args=-fopenmp

from cython.parallel import prange

def f(i):
    return i

def test_prange_with_python(int N):
    cdef int i = 0

    out = [None]*N

    for i in prange(N, with_python=True):
        out[i] = f(i)
    
def test_old_way(int N):
    cdef int i = 0

    out = [None]*N

    for i in prange(N, nogil=True):
        with gil:
            out[i] = f(i)

def do_timing(N):
    from datetime import datetime
    t0 = datetime.now()
    test_old_way(N)
    t1 = datetime.now()
    test_prange_with_python(N)
    t2 = datetime.now()

    print(f"old way:", t1-t0)
    print(f"new way:", t2-t1)

On freethreading Python 3.13 N=1000000 the with_python version is consistently about 10% faster (i.e. noticeable but not huge):

old way: 0:00:00.249138
new way: 0:00:00.221372

On GIL Python 3.13 N=1000000 the with_python version is dramatically faster

old way: 0:00:28.616594
new way: 0:00:00.073102

(i.e. when the GIL is contended then constantly swapping it is absolutely awful).

So the upshot is, it's a mild improvement for the freethreading builds (for which it's intended) and a dramatic improvement for the GIL build (where it isn't intended)

May 31 '25 19:05 da-woods

freethreading_compatible=True

Isn't this a good indicator for whether we should warn about the GIL being released or not when we enter a prange loop? Even with your speed improvement, it still matters in non-FT builds, right?

Jun 01 '25 03:06 scoder

Isn't this a good indicator for whether we should warn about the GIL being released or not when we enter a prange loop?

I think what you're suggesting is:

Remove the explicit with_gil directive,
If the user omits nogil=True then don't error - let them use the GIL.
Warn if they don't release the GIL and don't mark it as freethreading_compatible

?

If so, that seems reasonable to me.

Even with your speed improvement, it still matters in non-FT builds, right?

Yes - on non-FT builds it's running sequentially, just not fighting over the GIL quite so often.

Jun 01 '25 07:06 da-woods

Remove the explicit with_gil directive,

If the user omits nogil=True then don't error - let them use the GIL.

Warn if they don't release the GIL and don't mark it as freethreading_compatible

Yes, that's what I was thinking.

Jun 02 '25 06:06 scoder