dnaio Switch single&paired API to single&multiple API.

Currently I am working a lot with UMI data that is stored in a separate FASTQ file, meaning I now have 3 files.

I needed to filter those files on average error rate, so I adapted the fastq-filter program to work with multiple files.

To keep the pipeline simple, I opted for a multiple-file reader. It yields 1-tuples for 1 file, 2-tuples for 2 files, 3-tuples for 3 files, etc. This way I can write the filters to always handle a tuple of SequenceRecord objects and use the same filter in all cases. Similarly, I wrote a multiple-file writer.
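A rough sketch of that idea, not the actual fastq-filter code. `Record`, `read_fastq`, `read_multiple` and `average_error_rate` are hypothetical names, `Record` only stands in for SequenceRecord, and Phred+33 qualities are assumed:

```python
import io
from typing import Iterator, NamedTuple, TextIO, Tuple


class Record(NamedTuple):
    """Hypothetical stand-in for a SequenceRecord."""
    name: str
    sequence: str
    qualities: str


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Parse simple four-line FASTQ records (no line wrapping)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the "+" separator line
        qualities = handle.readline().rstrip("\n")
        yield Record(header[1:], sequence, qualities)


def read_multiple(*handles: TextIO) -> Iterator[Tuple[Record, ...]]:
    """Yield one n-tuple per record position, where n is the number of files."""
    # zip stops at the shortest input; a real reader should instead raise
    # an error when the files are out of sync.
    return zip(*(read_fastq(handle) for handle in handles))


def average_error_rate(records: Tuple[Record, ...]) -> float:
    """Mean error probability over all bases in the tuple (Phred+33 assumed)."""
    probs = [10 ** (-(ord(q) - 33) / 10) for r in records for q in r.qualities]
    return sum(probs) / len(probs) if probs else 0.0


# The same filter works for 1, 2 or 3 files because it only ever sees a tuple.
r1 = io.StringIO("@a\nACGT\n+\nIIII\n@b\nACGT\n+\n!!!!\n")
umi = io.StringIO("@a\nGG\n+\nII\n@b\nTT\n+\n!!\n")
kept = [recs for recs in read_multiple(r1, umi) if average_error_rate(recs) <= 0.05]
```

Here `kept` holds only the tuple for record `a`, because every base in record `b` has quality `!` (error probability 1).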

I am wondering if we should do this in dnaio too. There are now two cases in dnaio:

Single file. Yield one SequenceRecord object.
Paired file. Yield a 2-tuple of SequenceRecord objects.

I propose replacing the latter with a multiple-file reader that reads from n files and yields n-tuples of SequenceRecord objects. The PairedEndReader and PairedEndWriter interfaces can still be maintained: they can simply inherit from the multiple-file classes and provide a backwards-compatible interface. (This shouldn't be too hard, given that it is just the n=2 case of the MultipleReader.)
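Roughly, the inheritance could look like the sketch below. This is only an illustration under assumptions: the classes take already-opened record iterables instead of file paths, the records are plain placeholder objects, and nothing here is dnaio's actual class layout.

```python
from typing import Any, Iterable, Iterator, Tuple


class MultipleReader:
    """Hypothetical n-file reader: zips n record sources into n-tuples."""

    def __init__(self, *sources: Iterable[Any]):
        if not sources:
            raise ValueError("at least one input is required")
        self._sources = sources

    def __iter__(self) -> Iterator[Tuple[Any, ...]]:
        iterators = [iter(source) for source in self._sources]
        sentinel = object()
        while True:
            items = tuple(next(it, sentinel) for it in iterators)
            if all(item is sentinel for item in items):
                return  # every input ended at the same record
            if any(item is sentinel for item in items):
                raise ValueError("inputs have different numbers of records")
            yield items


class PairedEndReader(MultipleReader):
    """Backwards-compatible wrapper: exactly the n=2 case."""

    def __init__(self, file1: Iterable[Any], file2: Iterable[Any]):
        super().__init__(file1, file2)


# Existing paired-end code keeps unpacking 2-tuples as before.
for r1, r2 in PairedEndReader(["r1a", "r1b"], ["r2a", "r2b"]):
    pass
```

The paired-end class adds nothing except a fixed two-argument constructor, so all iteration and synchronization checks live in one place.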

This way I do not have to reinvent the wheel across multiple projects. I also feel this is needed for cutadapt, which needs a sort of auxiliary file option, where the auxiliary file with the UMIs is kept in sync with the FASTQ files that cutadapt outputs. Currently I have to use biopet-fastqsync to sync the UMI FASTQ file afterwards. (This is not the correct place to raise that issue; I only mention it here to show why I think this will be a good move for the future.)

I have already implemented a multiple-file reader in my FASTQ filter project. At first it was written in a generic manner (everything is a list of multiple files), but I discovered that this severely harms the single-end and paired-end cases: https://github.com/LUMC/fastq-filter/pull/16 . I wonder what the best way to implement this in dnaio is. Alternatively, there could be separate 1-tuple, 2-tuple and n-tuple readers that all share the same interface through abstract classes.

Jun 01 '22 08:06 rhpvorderman

Generalizing the paired-end reader to multiple files sounds like a good idea. I think I’d implement this by accepting more than two input files in dnaio.open and then the function would work as before for n=1 and n=2 (so totally backwards compatible for the single end and paired-end cases). Then for n>2, it would return this new MultipleReader (not sure whether that is the best name, though). Is that what you meant?
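To make the return shapes concrete, here is a small sketch. `open_records` is a hypothetical name, it takes record iterables instead of file paths, and it is not the signature of dnaio.open:

```python
from typing import Any, Iterable, Iterator


def open_records(*sources: Iterable[Any]) -> Iterator[Any]:
    """Hypothetical dispatcher that picks the return shape from the input count."""
    if not sources:
        raise ValueError("at least one input is required")
    if len(sources) == 1:
        # n=1: bare records, exactly as the single-end case works today
        return iter(sources[0])
    # n=2: 2-tuples, as the paired-end case yields today
    # n>2: n-tuples, which is what the new reader would return
    # (zip silently stops at the shortest input; a real reader should
    # report files that are out of sync)
    return zip(*sources)


single = list(open_records(["a", "b"]))
paired = list(open_records(["a1"], ["a2"]))
triple = list(open_records(["a1"], ["a2"], ["umi"]))
```

Only the n=1 case yields records that are not wrapped in a tuple, which is what keeps existing single-end callers unchanged.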

This would indeed be a requirement for supporting records with more than two "ends" in Cutadapt.

Jun 03 '22 09:06 marcelm

Yes, that is what I meant. Generalizing dnaio.open indeed seems the best path. MultipleReader is not intended to be the final name, but I am struggling to think of a better one. One issue is that the current "PairedEnd" naming does not really apply to N FASTQ files. "NEndReader" is not going to win any hearts and minds, I am afraid. Oh well, I am sure a better name will pop up at some point.

Jun 03 '22 10:06 rhpvorderman