cpython icon indicating copy to clipboard operation
cpython copied to clipboard

gh-141594: A free-threaded JIT

Open Fidget-Spinner opened this issue 1 month ago • 5 comments

This PR gives the JIT free-threading support. It is only on for single-threaded code, and turns off automatically on multi-threaded code. All JIT features are turned on, including the optimizer. All tests pass on my system, including TSAN as of https://github.com/python/cpython/pull/141595/commits/527aac16fda441fb689c2782ef2ff3af3fb4d860 usng the FT suppression file, except for the usual spurious race conditions already in existing CPython.

Benchmark results are good. Geomean of 2% faster on FT+JIT vs just FT. So the JIT provides a 2% geomean speedup on FT for pyperformance! Compare https://github.com/facebookexperimental/free-threading-benchmarking/tree/main/results/bm-20251116-3.15.0a1+-763ebea-JIT,NOGIL against https://github.com/facebookexperimental/free-threading-benchmarking/tree/main/results/bm-20251115-3.15.0a1+-ed73c90-NOGIL. For convenience, I downloaded the json files and pyperf compared them here https://gist.github.com/Fidget-Spinner/d79f2f050147fc355a300bf9d413a75e

Design: Creation of >1 threads cause global invalidation of all executors and disables JIT.

In the future, we can do lock removal too. The performance figures from the lock removal are promising (no lock removal vs lock removal). All with FT+PGO+LTO=thin+TC+JIT + pyperf system tune:

Mean +- std dev: [nogil-float-lock] 69.9 ms +- 1.5 ms -> [nogil-float-nolock] 54.9 ms +- 0.8 ms: 1.27x faster
Mean +- std dev: [nogil-nbody-lock] 139 ms +- 1 ms -> [nogil-nbody-nolock] 115 ms +- 3 ms: 1.21x faster
Mean +- std dev: [nogil-richards-lock] 37.3 ms +- 0.4 ms -> [nogil-richards-nolock] 35.8 ms +- 0.3 ms: 1.04x faster 
Mean +- std dev: [nogil-deltablue-lock] 3.50 ms +- 0.15 ms -> [nogil-deltablue-nolock] 3.36 ms +- 0.14 ms: 1.04x faster

This is on a platform where locking/atomics are somewhat slow (i7-12700h). I removed the lock removal code for this PR to reduce the diff.

  • Issue: gh-141594

Fidget-Spinner avatar Nov 15 '25 16:11 Fidget-Spinner

@sergey-miryanov in the future could you please bundle up your reviews and send them in one review instead of multiple reviews?

Fidget-Spinner avatar Nov 15 '25 20:11 Fidget-Spinner

Sorry, will do next time!

sergey-miryanov avatar Nov 15 '25 20:11 sergey-miryanov

This PR has some serious issues:

  • Moving all the JIT state to the thread does not make things thread safe. It might do the opposite by introducing race conditions.
  • There is a possible use after free bug, when an executor created by one thread is run in another
  • It leaks an unbounded amount of memory, as it doesn't collect executors.
  • It waste memory allocating cold executors per thread

Executors are not "effectively immortal"; cold executors do get freed.

There are some useful improvements here, such as locking when inserting/removing executors from the linked list, but those should be in separate small PRs.

Finally, we really need to clean up after adding the tracing JIT frontend before making any intrusive changes like this.

markshannon avatar Nov 17 '25 11:11 markshannon

  • It leaks an unbounded amount of memory, as it doesn't collect executors.

This isn't true? The GC is now responsible for collecting executors.

Executors are not "effectively immortal"; cold executors do get freed.

Let me rephrase, they are effectively GC-only, not refcounted. As they always form reference cycles. So there's no point refcounting them. Just leave it to the GC.

Finally, we really need to clean up after adding the tracing JIT frontend before making any intrusive changes like this.

Yeap that I agree on.

Fidget-Spinner avatar Nov 17 '25 11:11 Fidget-Spinner

@markshannon I reduced the diff to the minimal changeset just to get everything working. This is just +50 lines of code roughly now!

Fidget-Spinner avatar Nov 17 '25 12:11 Fidget-Spinner