fastq::Reader: from_maybe_gzip_path: New instantiation.
Hi,
One thing I've not seen in biological rust-land is a kseq.h type code, where you can provide some method a file path (which might be FIFO like stdin) and it works out for you if it is fastq or fasta, and whether it is gzip or not.
This PR is an attempt at the "gzip or not" part for the fastq reader. It is unfinished (see TODOs in code) because insufficient thought has been given to the architecture here e.g. should a enum be returned, or a dyn fastq::Reader? I figure you might have some ideas already so I didn't want to go too far down this path without hearing them first.
Tests pass, but the code is not benchmarked. It isn't clear to me whether it is OK that the gzip reader is using the right size of buffer. Multithreading the gzip decompression might also speed things up.
Let me know what you think. Thanks, ben
Thanks for your interest in this library and the pull request!
To be honest, I avoided this kind of functionality up to now, because there is several possibilities to implement this. I preferred letting users choose how to do it on their own. Still, I'm open to add a simple 'magic' function that suits most use cases (if possible), and possibly also checks for FASTQ vs. FASTA. Or, at least to make it as easy as possible to implement this functionality given specific needs. A few comments/thoughts:
- I think your approach is a good start, and supplying the
buf_redux::BufReadertoflate2::bufread::GzDecoderis a good idea. GZIP decoding is probably always (?) fastest if filesystem I/O is buffered. - With STDIN, I think the situation is more complex.
std::io::Stdindoes not give access to the buffer, we need to calllock().std:io::StdinLockimplementsBufRead, and a call tofill_buf()would give access to the buffer. However,StdinLockborrows fromStdin, and therefore, I think we could only return an object with static lifetime. WrappingStdinin aBufReadermight have a negative impact on performance. - Your approach of checking the '@' byte is fine if you are sure it's FASTQ. An even better way would probably be to check for the presence of a GZIP header as in this function of fastq-rs. In this project (for which I originally created seq_io), I used the file path only to determine the format. Then, I supply
Box<dyn std::io::Read>(which is either aFileor STDIN, optionally wrapped in a decoder for a compression format) to the FASTQ / FASTA parsers. - For a magic function that checks for compression and recognizes the sequence format, I would prefer a trait-based approach, because this should supposedly also integrate with
RecordSetand functions in theparallelmodule. I suspect, an enum based approach is maybe fine in a simple use case as in your code, but with more options it could quickly become complicated. I don't actually know, how these approaches would compare in a benchmark, the result may also depend on record size (sequence length). My guess is that differences are negligible, since most of the time is spent with sequence parsing. - If following a trait-based approach,
seq_io::parallel::Readertrait could be replaced by a more general trait implemented by FASTA and FASTQ readers, possibly namedseq_io::Readanalogous tostd::ionaming, orseq_io::Reader. It would need some associated types (RecordandRecordSet). In addition, there could be aRecordtrait, which is implemented by FASTA / FASTQ records. This would allow abstracting over sequence types, and a function likeRecord::qual()could returnOption<&[u8]>, depending on whether there is quality information or not. Supporting multi-line FASTA is more complicated. Here I actually used an enum approach for this problem, not sure if that is the best/most elegant solution. - Decoding a GZIP file in a single background thread is for instance possible with another library I wrote. It's not 'real' multithreaded decoding, but at least it made decoding faster in most cases. Still, I guess it would be best, to default to a single-threaded solution in a general-purpose function.
- Is it enough to support just GZIP, or should BZIP2 also be supported by such a function?
- Such functionality could be feature-gated to avoid unnecessary dependencies...
Given the difficulties with STDIN, there could eventually be two functions, one that takes a file path (as in your code), and another, which takes a BufRead such as StdinLock, so the user would have to take care of the specifics. Both functions would return io::Result<Box<dyn seq_io::Read>>.
I hope that this makes sense.
Currently, I'm working on some code refactoring, but after that I'm happy to do something on this topic. And contributions are always welcome.
I rewrote the whole library and added support for "FASTX" with two different approaches. GZIP recognition should be easy to implement now (for files, not stdin). I'm not sure yet whether it should be in the crate because there may be several ways of implementing it. Comments welcome :)