minimap2 Optimize run time on multiple threads

Open nguyendm79 opened this issue 1 year ago • 6 comments

Dear Dr. Li,

I am trying to achieve the shortest possible run time with the resources available on our HPC (a 128-thread CPU with 128 GB RAM). Our dataset consists of ~400 bp DNA sequence reads (around 50K-100K reads), which we align to GRCh37. After several tries, I settled on the following command (the pre-built index was created beforehand with -k10):

minimap2 -ax map-ont -k10 -O1 -E2 -t13 -2 pre-built.mmi input-fasta>output-sam

13 threads seem to give the best run time, around 4 minutes. Increasing the number of threads, even to 128, doesn't lower the run time. However, building the .mmi index was much faster with 128 threads than with 13. According to htop, memory consumption barely exceeds 12 GB.

So, I tried running minimap2 with the same command on 2 datasets concurrently, and the run time for each doubled compared to running them separately, even though CPU and memory usage were far below the available resources.

Could you please give me some insight into what might be the bottleneck in the pipeline, and some recommendations for using more of the available resources to lower the run time on these datasets?

Thank you so much Dr. Li,

Duc

Feb 25 '23 04:02 nguyendm79

These are classic signs that you are I/O bound. That is, scaling beyond 13 threads isn't helping because minimap2 cannot consume the input data fast enough to make use of the extra threads. This is exacerbated if the output is written to the same disk from which the input reads are being read, since the reads and writes then compete for the same device. This also explains the behavior you see when mapping 2 datasets at the same time (assuming that the input reads and/or output alignment files for the 2 datasets are on the same disk).

In general, when the problem solved by a program (like alignment in minimap2) is embarrassingly parallel, I/O often becomes the eventual scalability bottleneck. In fact, Ben Langmead wrote an entire paper on this topic several years ago that you can find here.
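One rough way to check this, offered as an illustrative sketch rather than a definitive diagnostic, is to compare CPU time against wall-clock time for a run. The helper name `cpu_per_wall` is hypothetical, and the sketch assumes `time -p` is available (as a bash keyword or a system utility). A value far below the thread count suggests the threads are mostly waiting, whether on I/O or on some serial step.

```shell
# Hypothetical helper: run a command line and print (user + sys) / real,
# i.e. roughly how many cores were kept busy on average.
cpu_per_wall() {
    # $1 is a full command line given as one string, so it may contain
    # its own redirections (e.g. "> output-sam").
    # The command's remaining stdout/stderr is discarded inside sh -c,
    # so that only the timing report reaches awk.
    { time -p sh -c "{ $1 ; } >/dev/null 2>&1"; } 2>&1 |
        awk '$1 == "real" { r = $2 }
             $1 == "user" { u = $2 }
             $1 == "sys"  { s = $2 }
             END { if (r > 0) printf "%.1f\n", (u + s) / r; else print "0.0" }'
}

# Example usage with the command from this thread (not executed here):
# cpu_per_wall 'minimap2 -ax map-ont -k10 -O1 -E2 -t13 -2 pre-built.mmi input-fasta > output-sam'
```

With `-t13`, a result near 13 would mean the threads are busy; a result of 2 or 3 would point to waiting.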

Best, Rob

Feb 25 '23 04:02 rob-p

Dear Rob,

Thank you so much for your prompt feedback and the reference. I originally suspected it could be an I/O issue, so that really helped. We use an SSD in our HPC, so I'm actually surprised that it's maxed out in this case.

Do you have any recommendations on how to improve the run time in this case, possibly through hardware changes or anything else from your experience? We are very much open to any suggestions.

Thank you so much,

Duc

Feb 25 '23 04:02 nguyendm79

Hi Duc,

There are several things you might do to try and speed things up. One is to put the output on a different disk than the input. This way, the total I/O load is split between the two disks: reads come from one and writes go to the other.

Another thing you might try to do is to make sure that both your output and input are compressed. That is, use e.g. a gzip compressed fastq file as input and pipe the output SAM file to samtools to convert it to BAM (with multiple threads) when writing the output. Since you have many threads, it makes sense to trade compute at this stage to minimize I/O and improve overall throughput.
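The two suggestions above can be combined in a single pipeline. What follows is only a sketch under stated assumptions: the function name `align_to_bam`, the `DRY_RUN` switch, the file paths, and the samtools thread count are placeholders, and the minimap2 options are simply those from the original command.

```shell
# Hypothetical wrapper: gzip-compressed reads in, BAM out, with the output
# path free to point at a different physical disk than the input.
# Note: paths containing spaces are not handled in this sketch.
align_to_bam() {
    index=$1
    reads=$2          # e.g. a gzip-compressed FASTA/FASTQ file
    out_bam=$3        # ideally on a different disk than $reads
    threads=${4:-13}
    # samtools view: -@ extra compression threads, -b BAM output, -o output file
    cmd="minimap2 -ax map-ont -k10 -O1 -E2 -t$threads -2 $index $reads | samtools view -@ 4 -b -o $out_bam -"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$cmd"      # only show what would be run
    else
        sh -c "$cmd"
    fi
}

# Example usage (placeholder paths, not executed here):
# align_to_bam pre-built.mmi /disk1/reads.fa.gz /disk2/output.bam 13
```

Setting `DRY_RUN=1` prints the pipeline without running it, which is handy for checking the paths first.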

The above solutions don't require any extra hardware (as long as you already have separate physical disks). If you are looking to purchase new hardware to speed things up, you might consider buying even faster disks (e.g. NVMe). However, I'd do a bit of analysis to make sure that the alignment throughput is really a bottleneck before I went and spent money on new expensive disks to reduce the time and allow broader parallelization.

Best, Rob

Feb 25 '23 17:02 rob-p

Dear Rob,

Thank you so much for your time and suggestions. These are very insightful indeed. I will try these suggestions first to see if I can further optimize the current performance before exploring options for new or additional hardware.

Bests,

Duc

Feb 26 '23 13:02 nguyendm79

What Rob said is all true. On top of that, increasing the batch size with -K will help on large datasets. Your input file is small, though, so you may not see a large improvement. In general, minimap2 is not optimized for many threads.
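A back-of-the-envelope check illustrates why -K matters little for this input. The sketch assumes the documented default mini-batch size of 500M bases and uses the read counts quoted in this thread; the function name `mini_batches` is hypothetical.

```shell
# Hypothetical helper: estimate how many mini-batches an input needs.
mini_batches() {
    reads=$1
    read_len=$2
    batch=${3:-500000000}   # assumed default for -K (500M bases)
    total=$((reads * read_len))
    # Ceiling division: number of batches needed to cover all bases
    echo $(( (total + batch - 1) / batch ))
}

# 100K reads of ~400 bp is about 40M bases, well under one default batch
mini_batches 100000 400
```

Since the whole input already fits in a single mini-batch, raising -K leaves nothing more to load at once.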

@rob-p

Another thing you might try to do is to make sure that both your output and input are compressed.

When many threads are specified, zlib decompression on a single thread may become the bottleneck. Using pigz and piping an uncompressed stream to minimap2 might be faster.
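A minimal sketch of that idea follows, assuming pigz is installed and that minimap2 reads the query from standard input when given `-` as the file name. The function name `pigz_align`, the `DRY_RUN` switch, and the file names are placeholders; the minimap2 options are those from the original command.

```shell
# Hypothetical wrapper: decompress in a separate pigz process and stream
# the uncompressed reads to minimap2 via stdin.
# Note: paths containing spaces are not handled in this sketch.
pigz_align() {
    index=$1
    reads_gz=$2
    out_sam=$3
    threads=${4:-13}
    # pigz: -d decompress, -c write to stdout
    cmd="pigz -dc $reads_gz | minimap2 -ax map-ont -k10 -O1 -E2 -t$threads -2 $index - > $out_sam"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$cmd"      # only show what would be run
    else
        sh -c "$cmd"
    fi
}

# Example usage (placeholder paths, not executed here):
# pigz_align pre-built.mmi reads.fa.gz output.sam 13
```

Whether this beats reading the gzip file directly is worth timing on the actual data.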

Feb 27 '23 16:02 lh3

Dear Dr. Li,

Thank you so much for your input. I tried increasing -K before, and indeed there wasn't much improvement compared to the default setting. Regarding compression, I previously ran on uncompressed fasta. After the suggestion, I experimented a bit with a compressed file and noticed a longer run time, which is to be expected. I will experiment a bit more and will consider adding a couple of SSDs to spread the I/O, to see if the run time can be further optimized.

Thank you for your time and guidance.

Bests, Duc

Feb 28 '23 06:02 nguyendm79
