seq_io icon indicating copy to clipboard operation
seq_io copied to clipboard

alternative implementation of parallel

Open noamteyssier opened this issue 1 year ago • 3 comments

Heyo,

Love this crate and have used it in many of my own.

I recently have been working on an alternative parallel implementation to the one provided in the crate that follows a stateful map reduce pattern. Currently a separate crate called seq_io_parallel.

I've found it to be very ergonomic in my own work but is fairly different than the closure based implementation found in this crate.

It currently just extends the types of this crate, and honestly is a little hacky because of that, but if you think it useful or feel like we could implement this into seq_io I'd be happy to work on it.

Cheers, Noam

noamteyssier avatar Jan 09 '25 18:01 noamteyssier

Thanks a lot for pointing me at your implementation, it looks very interesting. In fact, I have also been working on a redesign of the parallel API with at least less closures and paired-end processing, but not quite finished and not quite sure if I went in the right direction. I'm open for a completely different implementation if it is better. I still need some time to look at your implementation, but I definitely like the closure-less design. In case you are interested, I could push my code in its current state to a separate branch.

Regarding the integration in seq_io: It could be seen as a disadvantage that seq_io currently has extra dependencies just introduced by the parallel processing code. For this reason, I have been thinking whether I should hide the parallel code behind a feature flag. Or, an extra crate could actually be a good solution, in which case I could remove the parallel implementation from seq_io and make sure the two crates work well together.

markschl avatar Jan 14 '25 07:01 markschl

Yeah I'd be interested in taking a look at what you have as well.

I'm not sure what would be best honestly.

Keeping it behind a feature flag has advantages because it would allow for a single import solution, but practically it would force a single parallel implementation onto the user.

Keeping separate crates adds some overhead to maintenance. I think for a stable extension crate they would need to re-export seq_io. But it is more flexible and if somebody else comes up with an even better parallel implementation (which I'm sure someone will) then they can easily extend it.

While this approach is more democratic and separates concerns, I think a practical problem is that crates.io has single namespace for crates so there could be a bunch of extensions with slightly varying names like seq_io_parallel, seq_io_parallel_closures, yet_another_seq_io_parallel.

There are pros and cons to both approach, but personally I am leaning towards a feature flag implementation.

noamteyssier avatar Jan 14 '25 17:01 noamteyssier

I finally had another look, although I still didn't have the opportunity to test and benchmark it by myself. But the approach looks elegant and it seems like a nice addition.

Regarding the implementation in the seq_io parallel module, at the time I had very specific requirements in mind for this sequence handling tool, where records are read and written to output in a "streaming" manner, so I needed to immediately access the processing results in the main thread along with the record sets themselves (not a map-reduce approach). I'm sure that there might be smarter ways, and I'm happy about suggestions, there as well.

Regarding paired-end read processing in seq_io_parallel, I think that this implementation could run into problems with record sets from the R1 and R2 files not having the same number of records. This will cause records to go out of sync. Of course, in many cases, the header and sequence lengths are equal, and thus one might be lucky that the problem does not occur. What would be needed is a version of read_record_set that allows to specify a fixed number of records. Thanks to the most recent code changes, it might be possible to add such a method (e.g. read_record_set_exact), but I could take some time to get it working.

markschl avatar Jan 26 '25 21:01 markschl