dnaio Multiple reader/writer api

I added the ability for dnaio.open to open multiple files. I also added the dnaio.open_multiple API which iworks in case you want 1-tuples for 1 file. (I use this sort of code in my fastq filter and deduplication programs)

dnaio.open is backwards compatible. Also I think MultipleFastqWriter is probably going to be faster than PairedFastqWriter, but I will benchmark that later and backport any changes.

The MultipleFastXWriter objects also have a write_iterable call which you can give a SequenceRecord tuple generator. This is much faster than repeatedly calling write calls in a for loop. This can also be backported to the PairedFastqWriter and SingleFastqWriter at a later date.

The API is fully tested. Just need to add some docstrings. Posting this here in advance for code review and to make sure this effort is not duplicated.

Jul 19 '22 14:07 rhpvorderman

on second thought, this extra open_multiple API is going to be confusing and there is some code duplication. Let me simplify this.

Jul 20 '22 04:07 rhpvorderman

Nice, I’ve started a review and have gotten halfway, will submit when I’m finished.

Aug 02 '22 17:08 marcelm

What would be the downside of not only backporting some of this to the PairedEnd... classes, but merging them (at some point in the future, not now) so that there’s only one way to read more than one file?

It appears I forgot the discussion we had in #82.

Aug 05 '22 23:08 marcelm

Done! Sorry for letting this one rest a bit longer. We are changing a clinical pipeline in our institute and that takes up most of my time. Also ironically our sequence center has decided to use new machines/software where instead of providing an extra UMI sequence file, they can put it in the header. So this PR suddenly got less relevant. On the other hand, we do have 3 hardcoded multiple times in dnaio's code because it is a thing.

Documenting the *files parameter was hard until I decided to also deprecate file1. Now it is very straightforward. I just needed to add an elif tree to ensure backwards compatibility but at least the interface should be clear going forward. (If you agree, that is.)

Aug 12 '22 14:08 rhpvorderman

Sorry about the very long delay. Are you still interested in merging this?

Yes. Aren't you? :wink: https://github.com/marcelm/cutadapt/issues/491#issuecomment-1274754051 I have two applications (fastq-filter and fastqdedup) where I run into this issue with more than 2 FASTQ files per readgroup. I still think it is nice to have dnaio solve this for me rather than adding custom code for every project.

I think your most recent change of deprecating the file1 and file2 keyword arguments and instead just having a *files positional argument is a very nice improvement.

It is really weird. Sometimes programming for the general case is an excercise in pedantry which leads to byzantine and complex code, and in other cases it makes the API simpler. Bizarre. Is there a computer science theorem that tackles this problem? (I am a biologist by training).

I have committed al the nitpicks and simplified _open_paired. I also fixed an annoying mypy issue related to sequence_class.

Oct 24 '22 09:10 rhpvorderman

Sorry about the very long delay. Are you still interested in merging this?

Yes. Aren't you? wink marcelm/cutadapt#491 (comment) I have two applications (fastq-filter and fastqdedup) where I run into this issue with more than 2 FASTQ files per readgroup. I still think it is nice to have dnaio solve this for me rather than adding custom code for every project.

Absolutely, I’m just not sure I’ll be able to work on that issue in the foreseeable future. But this PR is the prerequisite for doing it at all.

Bizarre. Is there a computer science theorem that tackles this problem? (I am a biologist by training).

I thought I heard something like this in a computer science context, but googling now only gives me some results in mathematics. But this appears to be a thing, yes.

I have committed al the nitpicks and simplified _open_paired. I also fixed an annoying mypy issue related to sequence_class.

I realized now that we need to make sure that the documentation covers the new classes as well. Do you want to do this here or in a separate PR? I can also help.

Oct 25 '22 08:10 marcelm

I thought I heard something like this in a computer science context, but googling now only gives me some results in mathematics. But this appears to be a thing, yes.

Interesting. Thanks for the link.

I realized now that we need to make sure that the documentation covers the new classes as well. Do you want to do this here or in a separate PR? I can also help.

I have added it just now. What should I do with MultipleFileReader? It is an interface obviously, but also, there is only one class that uses the interface. So I now have it documented as a plain class. That seemed to be the least convoluted way to solve this.

Oct 25 '22 09:10 rhpvorderman

What should I do with MultipleFileReader? It is an interface obviously, but also, there is only one class that uses the interface. So I now have it documented as a plain class. That seemed to be the least convoluted way to solve this.

I don’t have an opinion at the moment, let’s try it this way.

I think this is ready, let’s merge it!

Oct 26 '22 07:10 marcelm

I changed a setting on RTD to let it build the documentation on PRs. And I invited you as a co-maintainer.

Oct 26 '22 08:10 marcelm

I changed a setting on RTD to let it build the documentation on PRs. And I invited you as a co-maintainer.

Nice, that should keep the docstrings tidy. I accepted the invitation, thanks!

Oct 26 '22 09:10 rhpvorderman