seq_io icon indicating copy to clipboard operation
seq_io copied to clipboard

fastq::Reader: from_maybe_gzip_path: New instantiation.

Open wwood opened this issue 6 years ago • 2 comments

Hi,

One thing I've not seen in biological rust-land is a kseq.h type code, where you can provide some method a file path (which might be FIFO like stdin) and it works out for you if it is fastq or fasta, and whether it is gzip or not.

This PR is an attempt at the "gzip or not" part for the fastq reader. It is unfinished (see TODOs in code) because insufficient thought has been given to the architecture here e.g. should a enum be returned, or a dyn fastq::Reader? I figure you might have some ideas already so I didn't want to go too far down this path without hearing them first.

Tests pass, but the code is not benchmarked. It isn't clear to me whether it is OK that the gzip reader is using the right size of buffer. Multithreading the gzip decompression might also speed things up.

Let me know what you think. Thanks, ben

wwood avatar Oct 18 '19 00:10 wwood

Thanks for your interest in this library and the pull request!

To be honest, I avoided this kind of functionality up to now, because there is several possibilities to implement this. I preferred letting users choose how to do it on their own. Still, I'm open to add a simple 'magic' function that suits most use cases (if possible), and possibly also checks for FASTQ vs. FASTA. Or, at least to make it as easy as possible to implement this functionality given specific needs. A few comments/thoughts:

  • I think your approach is a good start, and supplying the buf_redux::BufReader to flate2::bufread::GzDecoder is a good idea. GZIP decoding is probably always (?) fastest if filesystem I/O is buffered.
  • With STDIN, I think the situation is more complex. std::io::Stdin does not give access to the buffer, we need to call lock(). std:io::StdinLock implements BufRead, and a call to fill_buf() would give access to the buffer. However, StdinLock borrows from Stdin, and therefore, I think we could only return an object with static lifetime. Wrapping Stdin in a BufReader might have a negative impact on performance.
  • Your approach of checking the '@' byte is fine if you are sure it's FASTQ. An even better way would probably be to check for the presence of a GZIP header as in this function of fastq-rs. In this project (for which I originally created seq_io), I used the file path only to determine the format. Then, I supply Box<dyn std::io::Read> (which is either a File or STDIN, optionally wrapped in a decoder for a compression format) to the FASTQ / FASTA parsers.
  • For a magic function that checks for compression and recognizes the sequence format, I would prefer a trait-based approach, because this should supposedly also integrate with RecordSet and functions in the parallel module. I suspect, an enum based approach is maybe fine in a simple use case as in your code, but with more options it could quickly become complicated. I don't actually know, how these approaches would compare in a benchmark, the result may also depend on record size (sequence length). My guess is that differences are negligible, since most of the time is spent with sequence parsing.
  • If following a trait-based approach, seq_io::parallel::Reader trait could be replaced by a more general trait implemented by FASTA and FASTQ readers, possibly named seq_io::Read analogous to std::io naming, or seq_io::Reader. It would need some associated types (Record and RecordSet). In addition, there could be a Record trait, which is implemented by FASTA / FASTQ records. This would allow abstracting over sequence types, and a function like Record::qual() could return Option<&[u8]>, depending on whether there is quality information or not. Supporting multi-line FASTA is more complicated. Here I actually used an enum approach for this problem, not sure if that is the best/most elegant solution.
  • Decoding a GZIP file in a single background thread is for instance possible with another library I wrote. It's not 'real' multithreaded decoding, but at least it made decoding faster in most cases. Still, I guess it would be best, to default to a single-threaded solution in a general-purpose function.
  • Is it enough to support just GZIP, or should BZIP2 also be supported by such a function?
  • Such functionality could be feature-gated to avoid unnecessary dependencies...

Given the difficulties with STDIN, there could eventually be two functions, one that takes a file path (as in your code), and another, which takes a BufRead such as StdinLock, so the user would have to take care of the specifics. Both functions would return io::Result<Box<dyn seq_io::Read>>.

I hope that this makes sense.

Currently, I'm working on some code refactoring, but after that I'm happy to do something on this topic. And contributions are always welcome.

markschl avatar Oct 23 '19 12:10 markschl

I rewrote the whole library and added support for "FASTX" with two different approaches. GZIP recognition should be easy to implement now (for files, not stdin). I'm not sure yet whether it should be in the crate because there may be several ways of implementing it. Comments welcome :)

markschl avatar Oct 19 '20 11:10 markschl