Reduce contention on free-threaded build
This uses thread-local storage to avoid a scaling bottleneck caused by a global cache in CPython internals. Unfortunately it's pretty slow: ~10% of the total runtime of my benchmark is spent in the type_lookup function. I'm going to look at whether it can be made faster...
Also refactors some Rust code so it doesn't block other threads from doing work if the interpreter requests a GC pass.
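For readers unfamiliar with the pattern, here is a minimal Python sketch of the general idea of replacing a shared cache with a per-thread one. It is not the code from this PR: the names `_local`, `_thread_cache`, and `cached_lookup` are hypothetical, and the actual change may be structured quite differently.

```python
import threading

# Hypothetical per-thread storage; each thread sees its own attributes on _local.
_local = threading.local()


def _thread_cache():
    """Return the calling thread's private cache, creating it on first use."""
    cache = getattr(_local, "cache", None)
    if cache is None:
        cache = _local.cache = {}
    return cache


def cached_lookup(key, compute):
    """Look up `key` in the per-thread cache, calling `compute(key)` on a miss.

    No lock is taken because no other thread can reach this dict.
    """
    cache = _thread_cache()
    try:
        return cache[key]
    except KeyError:
        value = cache[key] = compute(key)
        return value
```

The trade-off is that each thread repeats the work of filling its own cache, in exchange for never contending with other threads on a shared structure.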
I created a bug in the cpython issue tracker: https://github.com/python/cpython/issues/132380
Based on my benchmark runs, my proposed PR for 3.13 removes the need for the type_lookup() function.
I implemented an alternative fix for CPython, which we can also backport to Python 3.13. It adds a specialized and faster type lookup function if using a non-interned name:
https://github.com/python/cpython/pull/132652
FWIW the latest upstream patch for this is https://github.com/python/cpython/pull/133669. I think @nascheme is still planning to have a fix merged before 3.14rc1.
How much of this PR do you think is still useful after that patch lands in CPython?
Avoiding holding the GIL in the native library helps a little.
I haven't actually been able to find a way to beat multiprocessing using libcst. There's still some contention due to GC pauses.