compresslevel and open_threads should be arguments in the dnaio.open interface
Hi, I have been working on the benchmarking and I am quite intrigued by fastp. The project is very similar to dnaio in terms of it API. Paired reading, writing, a focus on speed, using zlib-compatible libraries to speed up (de)compression etc.
Anyway, I noticed that the fastq.gz reading benchmark between dnaio and fastp is not entirely fair, as the benchmark program always uses a single thread and dnaio gets around the GIL by launching an igzip process. This made me realize that in al my programs I do this:
import functools
import xopen
DEFAULT_OPENER = functools.partial(xopen, threads=0, compresslevel=1))
# Use all dnaio.open calls with DEFAULT_OPENER
This is quite an ugly way to interact with an API.
I propose that we add open_threads=0 and compresslevel=1 to the dnaio.open interface. open_threads should be 0, as any polite program will only go multithreaded when explicitly asked (unless it has parallel in its name). This will use python-isal by default, which is more efficient than piping through igzip (in terms of total cpu time, not wall clock time). We can add a nice docstring to the parameter that explains an external program will be opened when threads > 0.
Compresslevel should be set at 1. This is the most efficient compression level in terms of compute time for a given final size. Tools producing FASTQ almost always produce intermediate files (they are mapped afterwards), so the most common use case will be one where throughput is most important. An advantage of having a compresslevel parameter in dnaio.open is that it is very easy for users to change the parameter if they feel compression is more important than throughput. I know you have argued differently for cutadapt, but IIRC that was primarily motivated by backwards-compatibility concerns, where changing the behaviour of the program drastically was undesirable (which I agree with). Dnaio is not as well established, and in fact, still in its 0.x days.
I would have proposed this change with a PR, but since #87 also messes with the open function I leave this for after that is merged (or rejected).