KrakenTools

extract_kraken_reads to support multithreading?

Open Kirk3gaard opened this issue 4 years ago • 5 comments

Hi

Thanks for a great tool. Great to be able to process the output files of kraken without knowing all of the TAXIDs and how they are linked together.

Have you considered making extract_kraken_reads support multi-threading to run faster on computers with multiple CPUs? It seems that it uses only one CPU.

I naively split my FASTQ files and ran them through with parallel, but loading the database once for each parallel process quickly maxed out the 1 TB of RAM in the computer I was using.

Best regards Rasmus

Kirk3gaard, Aug 21 '20 11:08

This is definitely something that I need to figure out. Originally, all of the scripts were quickly written in Python, which isn't as conducive to multithreading as C/C++, but I will look and see if there is an easy solution.

jenniferlu717, Aug 26 '20 18:08

Hi,

I stumbled across the same issue, and maybe you can just build a lightweight solution. As the costly part is extracting the reads that match the precomputed list of IDs, I slightly modified the script to simply store the IDs in a .txt file (or two files for paired-end data), which I then use as input for seqtk subseq or seqkit grep to perform the read extraction.

As I use this in a workflow, it is fine for me to do it in two separate steps. But you could also create a subprocess that calls one of these tools and make that tool a requirement for the user environment (maybe an optional one).
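A minimal Python sketch of the two-step idea described above, not the modified script itself. It assumes the standard tab-separated Kraken per-read output (status, read ID, taxid, length, LCA mapping) with plain numeric taxids, and that seqkit is on the PATH. The function names and the default `read_ids.txt` file name are hypothetical.

```python
import subprocess


def collect_read_ids(kraken_lines, taxids):
    """Return IDs of classified reads whose assigned taxid is in `taxids`.

    Assumes tab-separated Kraken output: status (C/U), read ID, taxid,
    length, LCA mapping, with a plain numeric taxid in column 3.
    """
    wanted = {str(t) for t in taxids}
    read_ids = []
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == "C" and fields[2] in wanted:
            read_ids.append(fields[1])
    return read_ids


def seqkit_grep_command(id_file, fastq_in, fastq_out, threads=4):
    """Build the seqkit call that pulls the listed read IDs from a FASTQ file."""
    return ["seqkit", "grep", "-f", id_file, "-j", str(threads),
            "-o", fastq_out, fastq_in]


def extract_reads(kraken_output, taxids, fastq_in, fastq_out,
                  id_file="read_ids.txt"):
    """Step 1: write the matching read IDs. Step 2: let seqkit extract them."""
    with open(kraken_output) as handle:
        read_ids = collect_read_ids(handle, taxids)
    with open(id_file, "w") as out:
        out.writelines(f"{rid}\n" for rid in read_ids)
    subprocess.run(seqkit_grep_command(id_file, fastq_in, fastq_out), check=True)
```

Because the ID file is written once, the same file can be reused for a second seqkit call on the mate file of a paired-end run, and the Kraken output is only parsed a single time.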

Just an idea...

Best, Sandro

andreott, Aug 11 '22 13:08

Hi @andreott , Could you possibly share your solution with me? Kind regards, Morten

MortenEneberg, Sep 28 '22 10:09

There's a better tool (imo), seqkit, that performs a similar function. Create a file containing the taxon IDs you want, e.g. taxons.txt:

464095
12058
12059
138948

You can extract the matching reads using:

seqkit grep -r -f taxons.txt --threads 4 classified_1.fastq.gz -o taxons.fastq.gz

More details: https://bioinf.shenwei.me/seqkit/usage/#grep
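One caveat with bare numeric patterns: used as a regular expression, 12058 also matches a header carrying taxid 120581. A hedged Python sketch of one way around this is to generate anchored patterns for the pattern file. It assumes the reads come from Kraken 2's classified output, where each header has `kraken:taxid|<id>` appended after the read ID, so seqkit would also need `-n` (match by full name) for the patterns to see that part of the header. The helper names are hypothetical.

```python
import re


def taxid_patterns(taxids):
    """Return one regex per taxid, anchored so 12058 cannot match 120581."""
    return [rf"kraken:taxid\|{int(t)}(\s|$)" for t in taxids]


def header_matches(header, patterns):
    """Check a FASTQ header against the patterns (for local sanity checks)."""
    return any(re.search(p, header) for p in patterns)


def write_pattern_file(taxids, path="taxon_patterns.txt"):
    """Write the anchored patterns, one per line, for use with seqkit grep -f."""
    with open(path, "w") as out:
        out.write("\n".join(taxid_patterns(taxids)) + "\n")
```

The resulting file would then replace taxons.txt in a call such as `seqkit grep -n -r -f taxon_patterns.txt ...`.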

Getting the taxon list can be automated with TaxonKit https://bioinf.shenwei.me/taxonkit/download/ (same author as seqkit)

ammaraziz, Dec 28 '22 11:12

I'm a fan of seqkit, but unfortunately it is insufficient if you want to include sequences from the parent clade levels, which should be included in the tool features (please fix it!)
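To illustrate what including parent levels involves, here is a small Python sketch that expands a set of taxon IDs with all of their ancestors before a pattern or ID list is written. It assumes a child-to-parent mapping is already available (for example parsed from a taxonomy dump or a Kraken report). The function names and the toy IDs in the usage note are made up for illustration.

```python
def parent_lineage(taxid, parents):
    """Return `taxid` followed by its ancestors, walking up to the root.

    `parents` maps each taxid to its parent taxid. The walk stops at a
    node with no known parent, a self-parented root, or a repeated node.
    """
    lineage = [taxid]
    current = taxid
    while True:
        parent = parents.get(current)
        if parent is None or parent == current or parent in lineage:
            break
        lineage.append(parent)
        current = parent
    return lineage


def expand_with_parents(taxids, parents):
    """Return the input taxids plus every ancestor of each of them."""
    expanded = set()
    for taxid in taxids:
        expanded.update(parent_lineage(taxid, parents))
    return expanded
```

With a toy tree `{4: 3, 3: 2, 2: 1, 1: 1}`, `expand_with_parents([4], ...)` yields the IDs 1 through 4, which could then be fed into the taxon list used for extraction.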

chiaramazzoni, Apr 04 '23 16:04