feat: release the GIL very minimally
This PR is a retry of #35 and #475.
~Instead of trying to release the GIL everywhere, which appears to break things, I am trying to only release for I/O critical operations.~
~To make further progress on better threaded performance, I think we'll need to separate out the cfitsio calls from the python C API calls so we can release the GIL for larger sections of code. My guess is that a bunch of time is being spent on initialization where the code goes back and forth.~
The approach here is to release the GIL for the largest blocks of code practically possible. There are some spots where this is hard because of the way Python data structures are used to hold data as it is pulled from fitsio.
With this approach, I am seeing decent threaded performance once you do enough work in each thread. Threaded performance for reads is better than writes on my machine.
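For anyone who wants to check the "enough work in each thread" effect without building this branch, here is a rough sketch of the kind of timing comparison I mean. It does not use fitsio at all: `work` is a hypothetical stand-in that hashes a buffer with `hashlib`, which only releases the GIL for inputs larger than 2047 bytes, so the smallest task size stays under the GIL. The function names, task sizes, and thread count are all assumptions for illustration, not numbers from my tests.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor


def work(nbytes):
    """Hypothetical stand-in for one FITS read: hash a zero-filled buffer."""
    # hashlib releases the GIL only for inputs larger than 2047 bytes.
    return hashlib.sha256(b"\0" * nbytes).hexdigest()


def time_serial(sizes):
    """Run every task in the calling thread; return (seconds, results)."""
    start = time.perf_counter()
    results = [work(n) for n in sizes]
    return time.perf_counter() - start, results


def time_threaded(sizes, n_threads=4):
    """Run the same tasks on a thread pool; return (seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(work, sizes))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    # Small tasks never release the GIL; large ones release it for most
    # of their runtime, which is where threads can start to help.
    for nbytes in (1_000, 1_000_000, 20_000_000):
        sizes = [nbytes] * 8
        t_serial, _ = time_serial(sizes)
        t_threaded, _ = time_threaded(sizes)
        print(f"{nbytes:>10} bytes/task: "
              f"serial {t_serial:.4f}s, threaded {t_threaded:.4f}s")
```

The actual timings depend on the machine and core count, so treat the output as a shape to look for rather than a result.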
@fommil Can you try the code from this PR in your application? It turns out that releasing the GIL doesn't appear to help that much in actual testing. In many cases it hurts because the cost of reacquiring the GIL exceeds the cost of the operation being done by the FITS library.
I'm happy to test it out but I'll need some help with that, I'm not strong in Python. I'm using the system-installed python3-fitsio on Debian 13, how can I override that temporarily to use this branch?
Hmmmm. I am not sure TBH since you'll need a compiler to build the package. :/