Add 'with_python' (i.e. with_gil) to prange and parallel
This is mostly targetted at freethreading builds, but should run on normal builds (with a lot of deadlock avoidance).
It's designed to be a more efficient alternative to
for i in prange(...):
with gil:
...
I've renamed with_gil in this context to with_python and tried to clarify the docs here.
I've renamed
with_gilin this context towith_pythonand tried to clarify the docs here.
That sounds better. What do you think about something like with_thread_state_saved?
I've renamed
with_gilin this context towith_pythonand tried to clarify the docs here.That sounds better. What do you think about something like
with_thread_state_saved?
Probably too technical. I assume most users would rather not know about thread state at all. (Although I know the existing with gil/with nogil are similarly technical).
I prefer with_python because it really just describes what you can do.
So, I looked over this and read through the documentation snippets, but I still feel unsure about the goal and effects of this change. I gathered that the new with_python=True option assures a valid thread state for the OpenMP threads but does not itself control the GIL state, is that correct? Why can't all OpenMP threads have Python thread states right away, without a special option?
Could you maybe extend the PR description with some kind of problem description and motivation?
It's pretty close to writing
for i in prange(N, nogil=True):
with gil:
... # code
It's slightly more efficient (just by not using the PyGILState API as much) but not as much as I hoped when I first started looking at it - the big limitation is that Cython does have to release and re-acquire the GIL each time you go round the loop to avoid deadlock (or somehow eliminate last_private from the loop, which is is what requires the barrier).
Why can't all OpenMP threads have Python thread states right away, without a special option?
Mainly because it's a bit of a pitfall with the non-freethreading interpreter. It means that people that accidentally omit nogil=True will not get any parallelism even if they aren't using the GIL. So at least for me it seems better to make it explicit
It's pretty close to writing
for i in prange(N, nogil=True): with gil: ... # codeIt's slightly more efficient
So … can't we make the idiom above more efficient then? Is it only for in prange() directly followed by with gil or also a with gil appearing somewhere later in the (otherwise nogil) loop? The first should be easy to detect.
people that accidentally omit
nogil=Truewill not get any parallelism even if they aren't using the GIL
But isn't that expected? I think the documentation states pretty clearly that parallelism requires releasing the GIL. Wasn't it always the case that you get sequential "threads" if you run a prange loop without releasing the GIL?
I'm possibly not explaining very well
for i in prange(N, with_python=True):
... # code
is transformed to something pretty close to
for i in prange(N, nogil=True):
with gil:
... # code
(but a little faster)
people that accidentally omit
nogil=Truewill not get any parallelism even if they aren't using the GILBut isn't that expected? I think the documentation states pretty clearly that parallelism requires releasing the GIL. Wasn't it always the case that you get sequential "threads" if you run a
prangeloop without releasing the GIL?
Right now you get a compile-time error if you try to write a prange loop without releasing the GIL.
But yes - in principle we could just try to speed up prange(N, nogil=True): with gil: instead and just document that as the idiomatic approach. Given that it doesn't seem possible to safely go round the loop without releasing the GIL the benefit of this PR is fairly minor anyway
Right now you get a compile-time error if you try to write a
prangeloop without releasing the GIL.
Ah, so, we basically push people into releasing the GIL even if they directly need it afterwards, is that it? Ok, that use case doesn't seem hugely important to me – if you want parallelism, you probably want it for parallel code, not sequential code. But I can see that it can be useful to do some kind of Python level initialisation inside of the thread first. We also have parallel sections for stuff like that, but well, it's still a use case. And I think the obvious idiom for it is for in prange: with gil:. If there's a way to optimise such code, why not. I don't see why we should bother users with yet another option.
Do you have an example of real world code where this is used?
BTW, we don't have any benchmarks for our parallel code, neither micro nor macro. Since you're working on the freethreading and parallel code, maybe you could come up with at least some microbenchmarks for this kind of mixture between Python and nogil code? I'm not sure if they'll run well in CI (could be that we have a single CPU core there), but we'll see. At least sequential locking and critical sections might still be candidates for measuring their overhead.
Ah, so, we basically push people into releasing the GIL even if they directly need it afterwards, is that it? Ok, that use case doesn't seem hugely important to me – if you want parallelism, you probably want it for parallel code, not sequential code
The use case is that in freethreaded Python this code is parallel.
This is mainly targeted at people using freethreaded Python who want to use prange loops without manually messing around with the GIL. (On non-freethreaded Python the intention is that it works, but clearly the GIL means it won't work well).
The main limitation is that it isn't currently much better than writing for i in prange(N, nogil=True): with gil: ..., so it's mainly a better expression of intent than better C code.
I subsequently had a thought about how to avoid needing to release the GIL on each loop (at least in some cases) so maybe I should look at that first - it's easier to justify something new if it genuinely does lead to better performance
So I think I've been able to drop the "release the GIL on every go round", which should be an improvement.
A quick timing test:
# cython: freethreading_compatible=True
# distutils: extra_compile_args=-fopenmp
# distutils: extra_link_args=-fopenmp
from cython.parallel import prange
def f(i):
return i
def test_prange_with_python(int N):
cdef int i = 0
out = [None]*N
for i in prange(N, with_python=True):
out[i] = f(i)
def test_old_way(int N):
cdef int i = 0
out = [None]*N
for i in prange(N, nogil=True):
with gil:
out[i] = f(i)
def do_timing(N):
from datetime import datetime
t0 = datetime.now()
test_old_way(N)
t1 = datetime.now()
test_prange_with_python(N)
t2 = datetime.now()
print(f"old way:", t1-t0)
print(f"new way:", t2-t1)
On freethreading Python 3.13 N=1000000 the with_python version is consistently about 10% faster (i.e. noticeable but not huge):
old way: 0:00:00.249138
new way: 0:00:00.221372
On GIL Python 3.13 N=1000000 the with_python version is dramatically faster
old way: 0:00:28.616594
new way: 0:00:00.073102
(i.e. when the GIL is contended then constantly swapping it is absolutely awful).
So the upshot is, it's a mild improvement for the freethreading builds (for which it's intended) and a dramatic improvement for the GIL build (where it isn't intended)
freethreading_compatible=True
Isn't this a good indicator for whether we should warn about the GIL being released or not when we enter a prange loop? Even with your speed improvement, it still matters in non-FT builds, right?
Isn't this a good indicator for whether we should warn about the GIL being released or not when we enter a prange loop?
I think what you're suggesting is:
- Remove the explicit
with_gildirective, - If the user omits
nogil=Truethen don't error - let them use the GIL. - Warn if they don't release the GIL and don't mark it as
freethreading_compatible
?
If so, that seems reasonable to me.
Even with your speed improvement, it still matters in non-FT builds, right?
Yes - on non-FT builds it's running sequentially, just not fighting over the GIL quite so often.
- Remove the explicit
with_gildirective,- If the user omits
nogil=Truethen don't error - let them use the GIL.- Warn if they don't release the GIL and don't mark it as
freethreading_compatible
Yes, that's what I was thinking.