gh-141594: A free-threaded JIT
This PR gives the JIT free-threading support. It is only on for single-threaded code, and turns off automatically on multi-threaded code. All JIT features are turned on, including the optimizer. All tests pass on my system, including TSAN as of https://github.com/python/cpython/pull/141595/commits/527aac16fda441fb689c2782ef2ff3af3fb4d860 usng the FT suppression file, except for the usual spurious race conditions already in existing CPython.
Benchmark results are good. Geomean of 2% faster on FT+JIT vs just FT. So the JIT provides a 2% geomean speedup on FT for pyperformance! Compare https://github.com/facebookexperimental/free-threading-benchmarking/tree/main/results/bm-20251116-3.15.0a1+-763ebea-JIT,NOGIL against https://github.com/facebookexperimental/free-threading-benchmarking/tree/main/results/bm-20251115-3.15.0a1+-ed73c90-NOGIL. For convenience, I downloaded the json files and pyperf compared them here https://gist.github.com/Fidget-Spinner/d79f2f050147fc355a300bf9d413a75e
Design: Creation of >1 threads cause global invalidation of all executors and disables JIT.
In the future, we can do lock removal too. The performance figures from the lock removal are promising (no lock removal vs lock removal). All with FT+PGO+LTO=thin+TC+JIT + pyperf system tune:
Mean +- std dev: [nogil-float-lock] 69.9 ms +- 1.5 ms -> [nogil-float-nolock] 54.9 ms +- 0.8 ms: 1.27x faster
Mean +- std dev: [nogil-nbody-lock] 139 ms +- 1 ms -> [nogil-nbody-nolock] 115 ms +- 3 ms: 1.21x faster
Mean +- std dev: [nogil-richards-lock] 37.3 ms +- 0.4 ms -> [nogil-richards-nolock] 35.8 ms +- 0.3 ms: 1.04x faster
Mean +- std dev: [nogil-deltablue-lock] 3.50 ms +- 0.15 ms -> [nogil-deltablue-nolock] 3.36 ms +- 0.14 ms: 1.04x faster
This is on a platform where locking/atomics are somewhat slow (i7-12700h). I removed the lock removal code for this PR to reduce the diff.
- Issue: gh-141594
@sergey-miryanov in the future could you please bundle up your reviews and send them in one review instead of multiple reviews?
Sorry, will do next time!
This PR has some serious issues:
- Moving all the JIT state to the thread does not make things thread safe. It might do the opposite by introducing race conditions.
- There is a possible use after free bug, when an executor created by one thread is run in another
- It leaks an unbounded amount of memory, as it doesn't collect executors.
- It waste memory allocating cold executors per thread
Executors are not "effectively immortal"; cold executors do get freed.
There are some useful improvements here, such as locking when inserting/removing executors from the linked list, but those should be in separate small PRs.
Finally, we really need to clean up after adding the tracing JIT frontend before making any intrusive changes like this.
- It leaks an unbounded amount of memory, as it doesn't collect executors.
This isn't true? The GC is now responsible for collecting executors.
Executors are not "effectively immortal"; cold executors do get freed.
Let me rephrase, they are effectively GC-only, not refcounted. As they always form reference cycles. So there's no point refcounting them. Just leave it to the GC.
Finally, we really need to clean up after adding the tracing JIT frontend before making any intrusive changes like this.
Yeap that I agree on.
@markshannon I reduced the diff to the minimal changeset just to get everything working. This is just +50 lines of code roughly now!