samtools icon indicating copy to clipboard operation
samtools copied to clipboard

mpileup multithread

Open sethbrin opened this issue 10 years ago • 13 comments

How to parallel mpileup as sort or merge?

sethbrin avatar Oct 26 '15 03:10 sethbrin

Hi - currently mpileup is not multi-threaded. What people usually do is to break the genome into regions and run them in parallel. We will add this request as a possible future enhancement.

Thomas

tk2 avatar Nov 09 '15 13:11 tk2

Hi, this would be a great enhancement I'd highly appreciate this feature!

oschwengers avatar May 10 '16 19:05 oschwengers

Has anyone even profiled it to work out where the time is spent? Is it decoding, consensus/snp calling, or encoding the output (eg bcf+bgzf could take some time).

jkbonfield avatar May 10 '16 19:05 jkbonfield

1 year later, wondering if it is now implemented or not.

Thanks!

crazyhottommy avatar Nov 09 '16 02:11 crazyhottommy

Alas not. We did some profiling, but the algorithm is complex and rather hard to multi-thread in the current state. It's a wish-list item basically unless anyone else wants to tackle it and submit a PR. Some of the components that mpileup uses are now multi-threaded (specifically decoding the input files), but this is only a small fraction of the total time. We also sped up the algorithm a little bit, but it's still single-threaded and too slow.

As Thomas points out, the usual way of working around this is simply to run multiple mpileups on different regions and merge the results.

jkbonfield avatar Nov 09 '16 09:11 jkbonfield

Hi! Any news?

serge2016 avatar Aug 29 '19 04:08 serge2016

Sadly no. It's something badly in need of work, but it's a complex task unless it's parallelised as above by dividing up into regions (and even then it's not trivial because neighbouring reads have an effect). That's probably the easiest way forward and how things like the new SAM threaded parsing works. It's simply not something we've had time to do yet.

jkbonfield avatar Aug 29 '19 08:08 jkbonfield

GNU Parallel is a great alternative to xargs for this kind of thing, and saves you having to cleanup the temporary files.

This command will parallelize over chromosomes/contigs with one simultaneous job per core, writing all results to my.pileup:

parallel --colsep '\t' samtools mpileup -b my_bams.fofn -r {1} :::: genome.fa.fai > my.pileup

Where my_bams.fofn is a file of BAM files, and genome.fa.fai is the output of samtools faidx or alternately a newline separated list of chromosomes.

If fewer threads are desired, they can be specified with the --jobs flag, e.g. --jobs 16 runs 16 simultaneous jobs. Adding the --keep-order flag will output results in the same order they occur in genome.fa.fai at the cost of load-balancing.

evanrrees avatar Sep 05 '19 00:09 evanrrees