
Split libraries should be parallelized

Open alk224 opened this issue 10 years ago • 11 comments

I hacked together a parallel version of split_libraries_fastq in a couple of hours. It runs much, much faster than the native QIIME command. This can be one of the rate-limiting steps in getting QC'd data to clients, so making it faster would be useful to more people than just me.

See my code here: https://github.com/alk224/FirstRepository

alk224 avatar Dec 10 '14 20:12 alk224

Thanks for sharing the code @alk224. Agree - we're planning a complete re-write of this for QIIME 2, and it will likely be parallelized at that time. I'll assign this to the QIIME 2 milestone.

gregcaporaso avatar Dec 10 '14 23:12 gregcaporaso

@alk224, just to note, I believe the use of split assumes the input data are single-line FASTA/FASTQ, i.e., that the sequence data are not spread over multiple lines. I may be wrong, though.

wasade avatar Dec 10 '14 23:12 wasade

Yes, split is totally inappropriate here, but I don't really know a better tool. I think awk is the appropriate command to use, since it would also enable proper naming of the output files. The way I used split, it breaks the file after a fixed number of lines, so the script will crash if you choose the wrong number of cores and end up with a FASTQ record broken mid-entry somewhere (or in several places).
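
A minimal sketch of the record-boundary splitting being discussed (a hypothetical helper in Python for clarity, not the script from this thread): chunks are cut only between 4-line FASTQ records, so no record can be broken mid-entry the way a naive line-count split can break it.

```python
def split_fastq_lines(lines, n_chunks):
    """Split a list of FASTQ lines into at most n_chunks lists of whole records."""
    assert len(lines) % 4 == 0, "truncated FASTQ: records must be 4 lines each"
    # Group the flat line list into 4-line records first...
    records = [lines[i:i + 4] for i in range(0, len(lines), 4)]
    per = -(-len(records) // n_chunks)  # ceiling division: records per chunk
    # ...then cut between records only, and flatten each chunk back to lines.
    return [
        [line for rec in records[i:i + per] for line in rec]
        for i in range(0, len(records), per)
    ]
```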

alk224 avatar Dec 10 '14 23:12 alk224

Thanks @alk224, agree this will be useful to address in QIIME 2!

jairideout avatar Dec 11 '14 04:12 jairideout

:+1: This is very much needed.

It just took me 30 minutes to join and 45 minutes to split a single MiSeq run. I shudder to think what a HiSeq run takes.

colinbrislawn avatar Feb 23 '15 23:02 colinbrislawn

@colinbrislawn I may have a solution in the near future using a dependency called fasta-splitter (http://kirill-kryukov.com/study/tools/fasta-splitter/), but I still need to add a step that modifies the sequence IDs in each split_libraries output. Otherwise you end up with multiple sequences with identical IDs (one duplicate set for every separate core you utilize).
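
The relabeling step described here could look roughly like this (illustrative Python with a hypothetical `renumber` helper; it assumes split_libraries-style IDs that end in an underscore plus a running count, which each chunk restarts from zero):

```python
def renumber(seq_ids, offset):
    """Renumber split_libraries-style IDs ("SampleA_0", "SampleA_1", ...)
    with a per-chunk offset so IDs stay unique after chunks are recombined."""
    out = []
    for i, sid in enumerate(seq_ids):
        sample = sid.rsplit("_", 1)[0]  # drop the old per-chunk running count
        out.append(f"{sample}_{offset + i}")
    return out
```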

alk224 avatar Feb 24 '15 00:02 alk224

I forgot I deleted that old repository. That script now lives here. I will probably rewrite the whole thing soon so it is less idiotic.

https://github.com/alk224/akutils-dev/blob/master/dev-split_libraries_parallel.sh

alk224 avatar Feb 24 '15 00:02 alk224

+1 for a faster split_libraries and split_libraries_fastq.

I wanted some extra speed and ended up writing something similar to @alk224, based on the idea of splitting big FASTQ files, batch-processing the chunks in parallel, and recombining the results. You can check it out here under parallel_split_libraries_fastq. I am able to take a MiSeq run's split_libraries_fastq job from hours/overnight to under 10 minutes using a machine with 30 cores.
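
The split / process-in-parallel / recombine pattern described above can be sketched like this (illustrative names, not the actual parallel_split_libraries_fastq code; the worker here is a trivial stand-in for demultiplexing one chunk):

```python
from multiprocessing import Pool

def process_chunk(records):
    # Stand-in for running split_libraries_fastq on one chunk of records.
    return [r.upper() for r in records]

def parallel_process(records, n_workers=4):
    per = -(-len(records) // n_workers)  # ceiling division: records per chunk
    chunks = [records[i:i + per] for i in range(0, len(records), per)]
    with Pool(min(n_workers, len(chunks))) as pool:
        results = pool.map(process_chunk, chunks)  # map preserves chunk order
    return [r for chunk in results for r in chunk]  # recombine in order
```

Because `Pool.map` returns results in input order, the recombined output is identical to a serial run over the same records.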

As we still have a little bit of 454 data in the lab, I have also taken a stab at making split_libraries.py parallel. My work on that is in this branch, where I tried to refactor the main work function, check_seq. I represent each sequence as a dictionary of values and have lots of small functions act on those dictionaries, thereby divorcing the sequence/barcode/testing info from the loop. Quality test? Put the result in the dict. Barcode test? Put it in the dict. I can then pass the values around, have lots of functions act on them, and tally up the results later. That style lets you hit the stream of sequence dictionaries with multiprocessing and fan your work out.
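
An illustrative sketch of that dict-per-sequence style (hypothetical field names, not the code in the branch; each check annotates the record instead of branching inside the loop):

```python
def check_quality(rec, min_qual=20):
    # Annotate rather than filter: record the verdict in the dict.
    rec["qual_ok"] = min(rec["qual"]) >= min_qual
    return rec

def check_barcode(rec, valid_barcodes):
    rec["barcode_ok"] = rec["barcode"] in valid_barcodes
    return rec

def demultiplex(records, valid_barcodes):
    # Each record is a plain dict, so the checks are independent of the
    # loop: they can be fanned out with multiprocessing and the verdicts
    # tallied afterwards.
    for rec in records:
        check_quality(rec)
        check_barcode(rec, valid_barcodes)
    return [r for r in records if r["qual_ok"] and r["barcode_ok"]]
```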

So I got a decent speed-up on a test set using extra cores: 16 minutes (1 core) down to 10 minutes (3 cores), but the bottleneck is still the fact that you need to load and look up quality information for each sequence. That is tricky to parallelize for the following reason: there are no global variables across multiprocessing Processes. You either have to give each process its own copy of the reference data, which is fine when the data are small but not when it is a multi-GB dictionary, or you need some sort of external database that each Process can query independently. At that point it wasn't worth my time to implement the quality dictionary as a database, but I bet it could be done and made very fast.
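
One standard workaround for the per-task copy problem is to hand the reference data to each worker once via a Pool initializer: each worker receives the data a single time at startup instead of once per task, and on Unix the fork start method shares the parent's pages copy-on-write. A sketch with hypothetical names:

```python
from multiprocessing import Pool

_qual = None  # per-worker global, set once by the initializer

def _init_worker(qual):
    global _qual
    _qual = qual

def _lookup(seq_id):
    # Workers read the quality dict without it being shipped with each task.
    return _qual[seq_id]

def parallel_quality_lookup(seq_ids, qual, n_workers=2):
    with Pool(n_workers, initializer=_init_worker, initargs=(qual,)) as pool:
        return pool.map(_lookup, seq_ids)
```

This only helps with read-only reference data; anything the workers must mutate still needs an external store or post-hoc merging.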

In short, I think adopting a functional style could allow big speedups in both scripts, and I would be happy to work on this a little more if it is useful or could lead to inclusion. If not, I have my workarounds/speedups and it was a good lesson in Python parallelization.

thanks, zach cp

zachcp avatar Mar 19 '15 22:03 zachcp

@zachcp excellent, and thank you! I tried rewriting my old version a few weeks ago and it ran perfectly on test data but kept messing up on real data (I was running the old split_libraries_fastq.py simultaneously to check that I had identical outputs). I had a similar speed-up. One thing I did manage to do was collate all the information from the per-chunk log files into a coherent split_libraries_log.txt, which I did with simple bash tools. I'm still pretty bad at Python, but maybe it would be useful to combine the Python and bash components of our scripts (if anyone bilingual actually has time for this). The current version of my script is here: https://github.com/alk224/akutils/blob/master/split_libraries_parallel.sh.

alk224 avatar Mar 19 '15 22:03 alk224

I am not sure whether this applies to all of split_libraries' functionality, but I was able to reproduce its primer-trimming feature by incorporating a memoized adaptation of a Levenshtein search. This reduced the time on a ~3 GB merged data set from ~5 hr to ~30 min. https://github.com/billyfournier/remove_primers_levenshtien/blob/master/extract_primer_lev.py Again, this only deals with the primer-removal functionality, but maybe someone will find it useful.
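
In the same spirit (a hedged re-sketch, not the linked script's actual code), a memoized Levenshtein primer search can be built with functools.lru_cache, so repeated read-window/primer comparisons across millions of reads are computed only once:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def lev(a, b):
    """Levenshtein distance, memoized so repeated subproblems are cached
    across all reads in the data set."""
    if not a:
        return len(b)
    if not b:
        return len(a)
    cost = 0 if a[0] == b[0] else 1
    return min(lev(a[1:], b) + 1,          # deletion
               lev(a, b[1:]) + 1,          # insertion
               lev(a[1:], b[1:]) + cost)   # substitution or match

def trim_primer(read, primer, max_dist=1):
    """Return the read with the primer (and anything before it) removed,
    or the read unchanged if no window matches within max_dist."""
    for start in range(len(read) - len(primer) + 1):
        if lev(read[start:start + len(primer)], primer) <= max_dist:
            return read[start + len(primer):]
    return read
```

The cache pays off because reads sharing a primer-containing prefix hit identical subproblems; a plain dynamic-programming Levenshtein without memoization recomputes them for every read.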

I attempted using pools and threads with no success.

billyfournier avatar May 24 '16 19:05 billyfournier

For those using USEARCH/VSEARCH for quality filtering, VSEARCH now supports --relabel. Combine this with GNU parallel and you can append QIIME-compatible labels to many files at very high speed.

The usefulness of this depends on which quality-filtering paradigm is adopted for QIIME 2. If 'maximum expected error' filtering is chosen over 'sliding window', this would be a great option.

colinbrislawn avatar Jul 12 '16 21:07 colinbrislawn