
Test limited API

Open marcelm opened this issue 1 year ago • 8 comments

After watching https://ep2024.europython.eu/session/cython-and-the-limited-api/, I wanted to test how well switching to the stable ABI would work.

It would be great to be able to build a single wheel for each platform and not have to update when a new Python version gets released.

There are two issues. The first is that using the limited API makes dnaio slower: reading FASTQ files takes 50% more time (I did not benchmark anything else). It was actually about 10 times slower when I used the limited API for Python 3.7. (This PR uses what is available in Python 3.12.) Maybe this will get better in newer versions.
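
For reference, opting into the limited API at the C level boils down to defining Py_LIMITED_API before including Python.h. A minimal sketch (dnaio itself goes through Cython, so this only illustrates the mechanism; the function name is made up):

```c
/* Illustrative only: the hex value restricts us to the 3.12 stable ABI. */
#define Py_LIMITED_API 0x030C0000
#include <Python.h>

/* With Py_LIMITED_API set, fast macros such as PyUnicode_GET_LENGTH are not
 * available; only stable-ABI functions like PyUnicode_GetLength can be used. */
static PyObject *
record_name_length(PyObject *Py_UNUSED(module), PyObject *record_name)
{
    Py_ssize_t length = PyUnicode_GetLength(record_name);
    if (length < 0) {
        return NULL;  /* not a str, or another error occurred */
    }
    return PyLong_FromSsize_t(length);
}
```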

The second issue is that some non-ABI3 symbols are still being used in our code, in particular PyUnicode_New.

abi3audit (https://github.com/pypa/abi3audit) outputs this:

$ abi3audit -v --assume-minimum-abi3 3.12 src/dnaio/_core.abi3.so
[09:47:43] 👎 : _core.abi3.so has non-ABI3 symbols
           ┏━━━━━━━━━━━━━━━━━━━━━━━━━┓
           ┃ Symbol                  ┃
           ┡━━━━━━━━━━━━━━━━━━━━━━━━━┩
           β”‚ PyUnicode_New           β”‚
           β”‚ Py_CompileStringExFlags β”‚
           β”‚ _Py_FatalErrorFunc      β”‚
           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           πŸ’ _core.abi3.so: 1 extensions scanned; 0 ABI version mismatches and 3 ABI violations found

~~This PR is just meant to document the experiment, so I’ll close it right away.~~

marcelm avatar Nov 12 '24 09:11 marcelm

PyUnicode_GET_LENGTH and PyUnicode_DATA depend on the internal representation of the string object. As such, they can never be stable. This is a bit problematic, because dnaio depends on those macros a lot for SequenceObject operations. If we cannot rely on them, the speed will drop to Python speed (calling len() and having to unwrap the result as an integer).

So it is not just the parsing: a lot of operations on SequenceRecord objects will get slower as well.
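
To make the difference concrete, a rough sketch of the two length paths (the function name and the #ifdef structure are illustrative, not dnaio's actual code):

```c
#include <Python.h>

/* Hypothetical helper; sequence_str is assumed to hold ASCII sequence data,
 * as in a SequenceRecord. */
static Py_ssize_t
sequence_length(PyObject *sequence_str)
{
#ifndef Py_LIMITED_API
    /* Fast path: a macro that reads the str object's length field directly,
     * which is exactly why it can never be part of the stable ABI. */
    return PyUnicode_GET_LENGTH(sequence_str);
#else
    /* Stable-ABI path: a real function call with argument checking,
     * so slightly more expensive in a hot loop. */
    return PyUnicode_GetLength(sequence_str);
#endif
}
```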

rhpvorderman avatar Nov 12 '24 10:11 rhpvorderman

I’m not saying we have to do this but want to document what I found.

There’s also PyUnicode_GetLength, which is in the stable ABI. It’s not a macro, so it’ll be slightly more expensive than PyUnicode_GET_LENGTH, but it’s probably not too bad.

We use PyUnicode_DATA for reading the data in a unicode object and also for creating new ones (with PyUnicode_New).

Victor Stinner proposed PEP 756. The PEP was ultimately withdrawn, but one thing I got out of the discussion is that PyUnicode_AsUTF8AndSize is perhaps an alternative for the case where we just want to read the data. If I understand correctly, the function just returns the pointer to the ASCII data if the string uses the ASCII representation anyway.

To replace usage of PyUnicode_New (needed, for example, in FastqIter.__next__ for the attributes of the SequenceRecord), we would need to switch to PyUnicode_DecodeUTF8 or PyUnicode_DecodeASCII as appropriate. My guess is that the overhead may not be too bad either: we would no longer need to do an explicit memcpy(), and for ASCII data, we could drop our custom check for whether the data is ASCII and let PyUnicode_DecodeASCII do that check.
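
A rough sketch of what those two replacements could look like (the function names are made up for the example):

```c
#include <Python.h>
#include <stdio.h>

/* Reading: for ASCII-only strings, the UTF-8 buffer returned by
 * PyUnicode_AsUTF8AndSize is simply the ASCII data, so no copy is needed. */
static int
write_sequence(PyObject *sequence_str, FILE *out)
{
    Py_ssize_t size;
    const char *data = PyUnicode_AsUTF8AndSize(sequence_str, &size);
    if (data == NULL) {
        return -1;
    }
    fwrite(data, 1, (size_t)size, out);
    return 0;
}

/* Creating: instead of PyUnicode_New + memcpy, decode the raw buffer directly.
 * PyUnicode_DecodeASCII raises UnicodeDecodeError on non-ASCII input, which
 * would subsume the custom ASCII check. */
static PyObject *
name_from_buffer(const char *buffer, Py_ssize_t length)
{
    return PyUnicode_DecodeASCII(buffer, length, NULL);
}
```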

marcelm avatar Nov 12 '24 15:11 marcelm

Interesting suggestions. PyUnicode_DecodeASCII was quite a lot slower because it checks each character individually rather than checking one big block and using memcpy. It also does the ASCII check and the copying simultaneously: https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L4992

PyUnicode_DecodeLatin1 does what we do now and makes a find_max_char call to find the maximum character. So it should be faster than PyUnicode_DecodeASCII (although this needs benchmarking). If we keep our own ASCII check in, we can still guarantee there is no violation of the FASTQ spec. Our custom ASCII check is much faster because it does not expect to find errors and checks larger blocks. (This is a typical case of tuning the algorithm to fit the data, something that the generic API calls in CPython cannot do.)
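
For illustration, a block-wise check in that spirit could look roughly like this (a sketch, not dnaio's actual implementation):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* OR eight bytes at a time, handle the tail byte-wise, and only inspect the
 * high bits once at the end: fast when no errors are expected. */
static bool
buffer_is_ascii(const char *data, size_t length)
{
    uint64_t accumulator = 0;
    size_t i = 0;
    for (; i + sizeof(uint64_t) <= length; i += sizeof(uint64_t)) {
        uint64_t chunk;
        memcpy(&chunk, data + i, sizeof(chunk));  /* avoid unaligned reads */
        accumulator |= chunk;
    }
    for (; i < length; i++) {
        accumulator |= (uint64_t)(unsigned char)data[i];
    }
    /* Any byte with its high bit set makes the data non-ASCII. */
    return (accumulator & UINT64_C(0x8080808080808080)) == 0;
}
```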

I could not find a stable ABI equivalent for PyUnicode_New. PyUnicode_FromKindAndData leads to PyUnicode_From1ByteKind which leads to PyUnicode_DecodeLatin1, which I already researched.

This is indeed interesting to pursue. If raw speed were the goal, this library would need to be written entirely in C anyway (yes, I have a branch somewhere that proves this...), but this library was written in Cython to keep things easier. Limiting the code to the stable ABI is another step in reducing the maintenance burden.

rhpvorderman avatar Nov 13 '24 08:11 rhpvorderman

I don't know how it will work out for Cython, but I tried it in pure C and ran into a few limitations that Cython will also have to work around:

  • When throwing errors, Py_TYPE(obj)->tp_name is not available; PyType_GetName should be used instead, and that is only available from the 3.11 stable ABI onwards (see the sketch after this list).
  • PyBUF_READ and PyBUF_WRITE are only available from the 3.11 stable ABI. PyMemoryView_FromMemory has been available since 3.4, but its last argument (PyBUF_READ or PyBUF_WRITE) can only be filled in from 3.11 onwards.
  • Creating custom deallocators is harder, as accessing the tp_free slot now requires PyType_GetSlot rather than direct access.
  • Only heap types are allowed. Types cannot be referenced directly in code because they are not static; they live somewhere on the heap. This makes calling PyObject_New very hard, as you first have to find where the type lives, which requires fiddling with module state and passing the module as a parameter.
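
A rough sketch of the first and third points (names made up, not actual dnaio code):

```c
#define Py_LIMITED_API 0x030B0000  /* 3.11 stable ABI, per the points above */
#include <Python.h>

/* Error messages: tp_name is not reachable, so go through PyType_GetName (3.11+). */
static void
raise_type_error(PyObject *obj)
{
    PyObject *type_name = PyType_GetName(Py_TYPE(obj));
    if (type_name == NULL) {
        return;
    }
    PyErr_Format(PyExc_TypeError, "expected a SequenceRecord, got %U", type_name);
    Py_DECREF(type_name);
}

/* Deallocators: tp_free of a heap type has to be fetched via PyType_GetSlot
 * instead of being read from the type struct directly. */
static void
example_dealloc(PyObject *self)
{
    PyTypeObject *type = Py_TYPE(self);
    freefunc tp_free = (freefunc)PyType_GetSlot(type, Py_tp_free);
    tp_free(self);
    Py_DECREF((PyObject *)type);  /* instances of heap types own a reference to their type */
}
```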

Cython can probably work with the stable ABI before 3.11, but that will involve a lot of hacking on their side. The simplest things become very hard when these limitations are present. A simple library with a few C functions, sure, that will work, but classes that are custom iterators... I don't know. I will look at it again when the 3.14 release is imminent.

rhpvorderman avatar Nov 20 '24 12:11 rhpvorderman

Ok, I managed to convert sequali to stable ABI version 3.10 (with a minor hack: defining PyBUF_READ and PyBUF_WRITE myself).

PyUnicode_AsUTF8AndSize works pretty well and suits most of our string-reading cases. The ASCII check is also very simple. See this commit for details: https://github.com/rhpvorderman/sequali/commit/9a646946fe63c5bd374be5685476c070d7d9595b

For dnaio it is best to wait for ABI 3.11, as that includes the entire buffer protocol, which is used for the chunking.
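
A sketch of the kind of buffer-protocol usage involved (the function and its purpose are made up; the point is that PyObject_GetBuffer and PyBuffer_Release are only in the limited API from 3.11 onwards):

```c
#include <Python.h>

/* Count newlines in any object exposing the buffer protocol, e.g. a chunk
 * of raw FASTQ bytes. */
static Py_ssize_t
count_newlines(PyObject *chunk)
{
    Py_buffer view;
    if (PyObject_GetBuffer(chunk, &view, PyBUF_SIMPLE) < 0) {
        return -1;
    }
    Py_ssize_t newlines = 0;
    const char *data = view.buf;
    for (Py_ssize_t i = 0; i < view.len; i++) {
        newlines += (data[i] == '\n');
    }
    PyBuffer_Release(&view);
    return newlines;
}
```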

Currently I am looking into how to package all this and found this example project https://github.com/joerick/python-abi3-package-sample.

I found another reason to go to the stable ABI: it requires the use of heap types rather than statically defined types in the code. This makes type checking harder, so a module state is required. That in turn leads to a properly defined module setup with a state and all the initialization required for proper module isolation: https://docs.python.org/3/howto/isolating-extensions.html. Technically this can also be done without the stable ABI, but the stable ABI pretty much makes it the only option. Isolating modules is important in a free-threading Python future, and I am quite sure many bioinformaticians will jump at the opportunity to use dnaio in that context.
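
A rough sketch of what that module-state setup looks like (all names hypothetical; m_traverse/m_clear and most error handling omitted for brevity):

```c
#include <Python.h>

typedef struct {
    PyTypeObject *SequenceRecord_Type;  /* the heap type lives in module state */
} module_state;

static PyType_Slot sequence_record_slots[] = {
    {0, NULL},  /* a real class would define tp_new, tp_dealloc, methods, ... */
};

static PyType_Spec sequence_record_spec = {
    .name = "_core.SequenceRecord",
    .basicsize = sizeof(PyObject),
    .flags = Py_TPFLAGS_DEFAULT,
    .slots = sequence_record_slots,
};

static int
module_exec(PyObject *module)
{
    module_state *state = PyModule_GetState(module);
    state->SequenceRecord_Type = (PyTypeObject *)PyType_FromModuleAndSpec(
        module, &sequence_record_spec, NULL);
    if (state->SequenceRecord_Type == NULL) {
        return -1;
    }
    /* Code that later needs the type (for type checks or allocation) retrieves
     * it from the module state instead of referencing a static symbol. */
    return PyModule_AddObjectRef(module, "SequenceRecord",
                                 (PyObject *)state->SequenceRecord_Type);
}

static PyModuleDef_Slot module_slots[] = {
    {Py_mod_exec, module_exec},
    {0, NULL},
};

static struct PyModuleDef module_def = {
    .m_base = PyModuleDef_HEAD_INIT,
    .m_name = "_core",
    .m_size = sizeof(module_state),
    .m_slots = module_slots,
};

PyMODINIT_FUNC
PyInit__core(void)
{
    return PyModuleDef_Init(&module_def);
}
```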

Sorry for the info dump here, but I figure it would be nice to have something to reference when implementing the stable ABI.

rhpvorderman avatar Nov 27 '24 15:11 rhpvorderman

Right, I am in the process of releasing a bugfix for Sequali, which uses the stable ABI.

And it crashes in the deployment phase, because 3.14 has a free-threaded build that is no longer experimental, and the stable ABI is not supported by free-threaded builds.

For a bioinformatics library, that is simply a non-starter. The free-threading feature is going to be used a lot by bioinformatics programs in the future. Sequali can sort of "get away with it", maybe. I regret doing all this porting work to the stable ABI only to find out that I cannot support free-threaded builds later. What a waste of time.

rhpvorderman avatar Sep 09 '25 11:09 rhpvorderman

Hmm, thinking about it, it could work if I don't put the stable ABI stuff in the source code itself but make it a preprocessor option applied at compile time. But that would just be a lot of hassle.
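
Roughly, the idea would be something like this (DNAIO_BUILD_ABI3 is a made-up flag the build system would pass, e.g. via -DDNAIO_BUILD_ABI3, to produce an abi3 wheel from the same sources):

```c
#ifdef DNAIO_BUILD_ABI3
#define Py_LIMITED_API 0x030B0000
#endif
#include <Python.h>

static Py_ssize_t
record_length(PyObject *sequence_str)
{
#ifdef Py_LIMITED_API
    return PyUnicode_GetLength(sequence_str);   /* abi3 wheel */
#else
    return PyUnicode_GET_LENGTH(sequence_str);  /* per-version wheel, incl. free-threaded */
#endif
}
```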

rhpvorderman avatar Sep 09 '25 11:09 rhpvorderman

A stable ABI for the free-threaded build will likely come with Python 3.15, see PEP 803. The cool thing is that the wheels would then work with both the GIL-enabled Python and the free-threaded one.

marcelm avatar Oct 06 '25 09:10 marcelm