
Identifying bottlenecks

Open rhpvorderman opened this issue 3 years ago • 3 comments

A while ago I watched some videos on optimization and on inspecting generated assembly code. Since dnaio is a relatively straightforward project, I decided to have a go at it. Unfortunately, Cython generates rather unwieldy C code, so I decided to rewrite the core functionality in C. It is in the allinc branch.

It turns out that this makes reading substantially faster, while writing stays at the same speed. So Cython apparently adds some friction when creating an iterator that generates a lot of objects. We actually knew this already, so the rewrite didn't teach me anything new. But it was fun to read records "superfaster" than what we are used to, so I wanted to share that here.

Looking at the C code, I don't see any way to make our algorithms faster. The code is already very close to the machine, and I don't see how we can speed it up. I tried using x86 intrinsics instead of memchr, but GCC's builtin memchr optimizations apparently already do this, so there is no way to create something faster.

rhpvorderman avatar Jun 29 '22 08:06 rhpvorderman

Hi, I have no time to look into this now because I’ll be on vacation soon, but what a great result! So do I understand correctly that most of the improvement comes from avoiding the Cython way of iteration?

Perhaps it’s time to go the C route, let me think about this when I’m back.

marcelm avatar Jun 29 '22 10:06 marcelm

Well, the gains are significant, but only because dnaio is already fast. It goes from 900ms to 700ms for 5 million records. So it is quite significant in percentage terms, but not so much in absolute time. And the C code will be somewhat harder to maintain than the Cython code, so that is a real tradeoff. EDIT: You can run a few benchmarks on your PC and check the code to see if you think the gains are worth it. Happy vacation!

rhpvorderman avatar Jun 29 '22 10:06 rhpvorderman

I have an idea about an architectural change that may improve speed, but it only makes sense if we do the above move to C.

The current flow is that we:

  • Read in the buffer
  • Check the entire buffer for ASCII
  • Parse a record
  • Wait until the next call
  • Parse another record
  • Wait until the next call
  • etc.

Only one record is parsed at a time. What if we:

  • Read in the buffer
  • Check the entire buffer for ASCII
  • Parse all the records in the buffer and put the PyObject pointers in an array
  • Use an incrementing index into the array to return a PyObject pointer on each next call.
  • Restart the process when we reach the end of the array.

That way there is no context switching between the Python code that does something with the records and our C code that parses the FASTQ. Since the same contiguous buffer is parsed all at once, the compiler gets optimization opportunities that are not available when parsing one record at a time. Also, since we are operating on the same buffer continuously, it can stay very high in the cache hierarchy, which could improve speed. I think this has some potential, but it will take some time to code and there is a risk (as always) that it does not matter for speed.

rhpvorderman avatar Jun 30 '22 04:06 rhpvorderman

I did check the above out, and it is not faster. Furthermore, I managed to eliminate the greatest obstacle to fast object creation in Cython: the argument parsing in #99 . It seems that the C extension is still faster, but not by much. (I wouldn't mind switching to C, but that would make certain other things harder, and that has to be taken into consideration.)

rhpvorderman avatar Oct 28 '22 09:10 rhpvorderman