
SequenceRecord array class

Open marcelm opened this issue 2 years ago • 4 comments

The discussion in #76 reminded me of an idea I had a while ago for possibly improving performance when dealing with many SequenceRecord objects: We could implement some kind of SequenceRecordArray class that represents an array of sequence records. It would support iteration (of course) to get the individual SequenceRecord objects. And as #76 suggests, it could also support a fastq_bytes() method that serializes all records to FASTQ en bloc.

This would replace the current functions for reading chunked data. So instead of the read_chunks() function, which returns memoryviews, one would then use a function that returns SequenceRecordArrays. Similar to how pandas.read_table does it, we could even add a chunksize parameter to dnaio.open that would then make it return SequenceRecordArray objects.

I think this would simplify some of the code in Cutadapt that transfers sequence data to and from worker processes. For that, I’d also want such an object to have extra attributes chunk_index and n_records.
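A minimal pure-Python sketch of what such a container might look like. All names here (`SequenceRecordArray`, `chunk_index`, `n_records`, `fastq_bytes`) are the ones proposed in this thread, not an existing dnaio API, and the `Record` namedtuple merely stands in for dnaio's real `SequenceRecord` extension type:

```python
from collections import namedtuple

# Stand-in for dnaio.SequenceRecord (name/sequence/qualities are str in dnaio).
Record = namedtuple("Record", ["name", "sequence", "qualities"])


class SequenceRecordArray:
    """Sketch of the proposed container class (hypothetical API)."""

    def __init__(self, records, chunk_index=0):
        self._records = list(records)
        self.chunk_index = chunk_index   # which chunk of the input this is
        self.n_records = len(self._records)

    def __iter__(self):                  # iteration yields individual records
        return iter(self._records)

    def __getitem__(self, i):            # indexed access
        return self._records[i]

    def fastq_bytes(self):
        """Serialize all records to FASTQ en bloc, as suggested in #76."""
        return "".join(
            f"@{r.name}\n{r.sequence}\n+\n{r.qualities}\n" for r in self._records
        ).encode("ascii")
```

A real implementation would build the records lazily from a shared buffer rather than hold a Python list, but the outward interface could look like this.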

marcelm avatar Apr 25 '22 09:04 marcelm

Wouldn't it be easier to use the threading module and do the alignment with `nogil`? That way the SequenceRecord objects themselves can be distributed at low cost (threads share memory, so there is no serialization overhead). Just pass a list of SequenceRecords (or individual records) to a worker. The alignment can also escape the Python thread and use the designated number of cores. That way there is no need for intermediate structures, and the pipeline is simplified.

rhpvorderman avatar Apr 25 '22 11:04 rhpvorderman

Using threads wouldn’t scale so well because there’s still lots of Python code in each worker that requires the GIL. It would also mean a complete restructuring of the multicore code paths, so in that sense, it’s not easier.

marcelm avatar Apr 25 '22 12:04 marcelm

Sure, you know that project best. So the SequenceRecordArray should be implemented.

I propose the following design:

We create a new struct:

cdef struct SequenceRecordOffsets:
    Py_ssize_t name_offset
    Py_ssize_t name_length
    Py_ssize_t sequence_offset
    Py_ssize_t sequence_length
    Py_ssize_t qualities_offset
    # no qualities_length needed: in valid FASTQ it equals sequence_length

SequenceRecordArray gets a cdef SequenceRecordOffsets *offsets variable-length array sized to n_records. (This requires some cost amortization for the resizing, but that shouldn't be too hard to implement.)

Internally, we use memchr to find all the newlines and fill in these offset structs. We keep the offsets plus the underlying memory block. The iterator then just takes the offsets and creates a SequenceRecord very cheaply. This also allows indexed access into the SequenceRecordArray.
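To make the scheme concrete, here is a pure-Python simulation (a sketch, not the proposed Cython implementation): `bytes.find` plays the role of `memchr`, and each offset tuple mirrors the fields of the `SequenceRecordOffsets` struct above.

```python
def fastq_offsets(buf: bytes):
    """Scan a chunk of FASTQ data once, recording where each field starts.

    Each entry mirrors the proposed SequenceRecordOffsets struct:
    (name_offset, name_length, sequence_offset, sequence_length,
    qualities_offset). qualities_length is implied by sequence_length.
    """
    offsets = []
    pos = 0
    n = len(buf)
    while pos < n:
        name_end = buf.find(b"\n", pos)           # end of "@name" header line
        seq_end = buf.find(b"\n", name_end + 1)   # end of sequence line
        plus_end = buf.find(b"\n", seq_end + 1)   # end of "+" separator line
        qual_end = buf.find(b"\n", plus_end + 1)  # end of qualities line
        if -1 in (name_end, seq_end, plus_end, qual_end):
            raise ValueError("truncated FASTQ record")
        offsets.append((
            pos + 1,                # name_offset (skip the '@')
            name_end - pos - 1,     # name_length
            name_end + 1,           # sequence_offset
            seq_end - name_end - 1, # sequence_length
            plus_end + 1,           # qualities_offset
        ))
        pos = qual_end + 1
    return offsets


def record_at(buf: bytes, off):
    """Materialize one record lazily from the saved offsets."""
    name_off, name_len, seq_off, seq_len, qual_off = off
    return (
        buf[name_off:name_off + name_len],
        buf[seq_off:seq_off + seq_len],
        buf[qual_off:qual_off + seq_len],  # qualities_length == sequence_length
    )
```

The point of the design is visible here: the newline scan happens once per chunk, and `record_at` shows how an iterator or `__getitem__` would turn a stored offset entry into a record on demand.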

The advantage of this method is that we do the newline parsing only once, at the cost of storing 40 bytes per FASTQ record. Since Illumina now produces 150 bp reads, whose records take about 335 bytes including qualities and name, I think that is an acceptable overhead (~12%). I guess this saving on newline parsing is what you had in mind when you proposed the idea?
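The overhead estimate can be checked quickly (assuming a 64-bit platform, where `Py_ssize_t` is 8 bytes, and a ~30-character read name):

```python
struct_bytes = 5 * 8  # five Py_ssize_t fields in SequenceRecordOffsets

# One 150 bp Illumina record: "@" + 30-char name, 150 bases, a "+" line,
# 150 quality values, and four newlines.
record_bytes = (1 + 30 + 1) + (150 + 1) + (1 + 1) + (150 + 1)

print(f"{struct_bytes} bytes / {record_bytes} bytes = "
      f"{struct_bytes / record_bytes:.1%}")
```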

Because this creates very few Python objects, as opposed to the 3 strings and 1 containing SequenceRecord object per FASTQ record that FastqIter creates, this method will read through a FASTQ file substantially faster. So it should not be a bottleneck.

> Similar to how pandas.read_table does it, we could even add a chunksize parameter to dnaio.open that would then make it return SequenceRecordArray objects.

I'd prefer this to be a separate chunk_open method. Overloading the open method would make it more confusing to use and the code less intelligible (dnaio.open would then return very different objects in the two cases).

rhpvorderman avatar Apr 25 '22 12:04 rhpvorderman

To be honest, I just wanted to write the idea down, but it's great that you already came up with a design! Sorry, I have some other things to do at the moment, so I'd suggest just leaving this here as a reminder until I have some time to think about the requirements a bit more. I'm not sure whether there will be any performance benefits to this, since in the end, one SequenceRecord would still need to be created for each record.

marcelm avatar Apr 25 '22 13:04 marcelm

I have implemented a SequenceRecordArray class in sequali. There were performance benefits, but only because the methods in sequali operate on entire SequenceRecordArrays. (As a result, this was massively beneficial for Illumina data and less so for Nanopore data.)

Most of the performance cost in dnaio comes from allocating separate unicode objects for the name, sequence, and qualities. Since Cutadapt operates on these entities, there is not much of a performance benefit.

rhpvorderman avatar Dec 11 '23 10:12 rhpvorderman

Cool, thanks for testing and reporting! I understand, so let’s just close this.

marcelm avatar Dec 11 '23 12:12 marcelm