
Two_headers should be a SequenceRecord attribute

Open rhpvorderman opened this issue 3 years ago • 2 comments

Currently dnaio uses a slightly unusual architecture where the first value yielded by fastq_iter is a boolean, not a SequenceRecord. This boolean determines whether all subsequent FASTQ records are written with two headers. FastqWriter has a rather quirky implementation to determine its write method.

I think this can best be solved by giving each SequenceRecord a boolean attribute. This can be set instantly without branching (no if statement). We can then add a fastq_bytes_as_input method, which will print one or two headers based on the boolean attribute. The fastq_bytes_as_input method can then be used by the FastqWriter class.
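A minimal pure-Python sketch of the idea, not dnaio's actual implementation. `SketchRecord` is a hypothetical stand-in for SequenceRecord, and the `two_headers` attribute and `fastq_bytes_as_input` method are the proposed additions, not existing API:

```python
from dataclasses import dataclass


@dataclass
class SketchRecord:
    """Hypothetical stand-in for SequenceRecord (illustration only)."""

    name: str
    sequence: str
    qualities: str
    # Proposed attribute: did the input repeat the header after '+'?
    two_headers: bool = False

    def fastq_bytes(self, two_headers: bool = False) -> bytes:
        # Repeat the name on the '+' line only when requested.
        second_header = self.name if two_headers else ""
        return (
            f"@{self.name}\n{self.sequence}\n"
            f"+{second_header}\n{self.qualities}\n"
        ).encode("ascii")

    def fastq_bytes_as_input(self) -> bytes:
        # Proposed method: reproduce the header style seen in the input.
        return self.fastq_bytes(self.two_headers)


record = SketchRecord("r1", "ACGT", "IIII", two_headers=True)
as_input = record.fastq_bytes_as_input()
```

With this shape, a writer would only need to call `fastq_bytes_as_input()` on each record instead of choosing a write method up front.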

This will be fairly trivial to implement once the C-code PR is merged.

rhpvorderman avatar Feb 04 '22 21:02 rhpvorderman

I have been thinking a bit about this. FastqWriter could also simply use the boolean flag that is part of fastq_bytes. That would make it a lot simpler.

As for determining two_headers, it might be better to factor this out of FastqIter altogether and instead write a Python method that relies on peek.
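Roughly what such a peek-based helper could look like. This is only a sketch: `detect_two_headers` is a hypothetical name, and it assumes the input is wrapped in an `io.BufferedReader` and that the third line of the first record fits in whatever `peek()` returns:

```python
import io
from typing import Optional


def detect_two_headers(reader: io.BufferedReader) -> Optional[bool]:
    """Hypothetical helper: inspect the first record without consuming it.

    Returns None when the peeked data does not contain a complete third line.
    """
    data = reader.peek()  # does not advance the stream position
    lines = data.split(b"\n")
    # Three newline-terminated lines yield at least four split elements.
    if len(lines) < 4:
        return None
    third = lines[2].rstrip(b"\r")
    if not third.startswith(b"+"):
        raise ValueError("third line of a FASTQ record must start with '+'")
    # Anything after the '+' means the header was repeated.
    return len(third) > 1


reader = io.BufferedReader(io.BytesIO(b"@r1\nACGT\n+r1\nIIII\n"))
result = detect_two_headers(reader)
```

The reader can still be handed to the parser afterwards, since `peek()` leaves the position untouched.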

rhpvorderman avatar Feb 13 '22 08:02 rhpvorderman

I tried factoring the two header system out of FastqIter altogether, but it is impossible to determine the two_header status outside that loop if the file is not seekable (for example when reading from stdin or a pipe) and reads are longer than the size of io.BufferedReader's buffer.
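A small illustration of the limitation, with assumptions: `io.BytesIO` stands in for a pipe here, and the buffer is shrunk to 32 bytes so that a 100 bp read exceeds it. The behaviour shown is CPython's, where `peek()` returns only what is already buffered:

```python
import io

# One record whose sequence is longer than the reader's buffer.
record = b"@read1\n" + b"A" * 100 + b"\n+read1\n" + b"I" * 100 + b"\n"

# Deliberately tiny buffer; the default is much larger, but so are long reads.
reader = io.BufferedReader(io.BytesIO(record), buffer_size=32)

peeked = reader.peek()
complete_lines = peeked.count(b"\n")

# The '+' line is the third line, so three newlines are needed to inspect it.
plus_line_visible = complete_lines >= 3
```

Without seeking back, the only way to reach the `+` line is to consume the data, which is why the check has to stay inside the parsing loop.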

rhpvorderman avatar Feb 15 '22 14:02 rhpvorderman