Multithread support for bwa index

Hi all,

Current databases are becoming increasingly large. Recently I found myself indexing a large FASTA file, which took over 200 CPU hours (single thread).

Searching for multithreaded support for bwa index, I landed on a 5-year-old mailing-list thread that mentions the existence of some sort of patch. I couldn't find any reference to this patch, though.

Regardless, is there any ongoing or planned work to make bwa index parallelizable in some form?

Jan 15 '17 18:01 unode

No, there is no pull request on multi-threaded indexing. Implementing one may take quite some time but might not dramatically improve the performance, especially when you try to build the index within limited space.

Generally, to build a large index, you may consider using a large block size (option "-b"). This option defaults to 10,000,000. You may increase it to 100,000,000 or even larger, depending on your input. This may save you some time.

Jan 16 '17 03:01 lh3
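
As a minimal sketch of lh3's suggestion above, the following Python helper assembles a `bwa index` command line with a larger `-b`. The helper name `bwa_index_command` and the file name `ref.fa` are hypothetical; only the `-b` option and its 10,000,000 default come from this thread.

```python
import shlex

DEFAULT_BLOCK_SIZE = 10_000_000  # default -b, as stated in this thread


def bwa_index_command(fasta_path, block_size=DEFAULT_BLOCK_SIZE):
    """Return the argv list for `bwa index` with an explicit -b value."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return ["bwa", "index", "-b", str(block_size), fasta_path]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(bwa_index_command("ref.fa", 100_000_000)))
```

The list can be passed to `subprocess.run` if you want to launch the indexing from a script; how much time a given value saves will depend on your input.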

@lh3 Thanks, increasing -b does seem to improve speed considerably.

However I don't quite understand the impact of changing this option. At least during indexing, I don't see any significant memory increase even with values as large as 10,000,000,000.

What's the trade-off? Or, put differently, why isn't the default value larger?

Jan 17 '17 11:01 unode

-b specifies how many bases to process in a batch. The memory used by one batch is 8*{-b}. If you have a "reference genome" larger than 200Gb, you won't observe an obvious memory increase with -b set to 10G. For a 3Gb human genome, setting -b to 10G will make the peak RAM 8 times as high during the BWT construction phase.

Jan 17 '17 14:01 lh3
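
The 8*{-b} relationship above can be turned into a back-of-the-envelope estimate. This sketch assumes the figure is in bytes, which the thread does not state explicitly; `batch_memory_bytes` and `human_readable` are hypothetical helpers, not part of bwa.

```python
def batch_memory_bytes(block_size):
    """Estimate per-batch memory as 8 * block_size (unit assumed to be bytes)."""
    return 8 * block_size


def human_readable(num_bytes):
    """Format a byte count using decimal units (1 GB = 10**9 bytes)."""
    for unit in ("B", "kB", "MB", "GB", "TB"):
        if num_bytes < 1000:
            return f"{num_bytes:.0f} {unit}"
        num_bytes /= 1000
    return f"{num_bytes:.0f} PB"


if __name__ == "__main__":
    # Compare the default -b with the larger values mentioned in the thread.
    for b in (10_000_000, 100_000_000, 10_000_000_000):
        print(f"-b {b}: ~{human_readable(batch_memory_bytes(b))} per batch")
```

Under that assumption, the default corresponds to roughly 80 MB per batch and -b 10G to roughly 80 GB, which only covers the batch buffer and not the rest of the index construction.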

So if I understand correctly, the ideal -b value is around # of bases / 8. Wouldn't it be possible to have this value adjusted automatically? From what I gather, there's a first pass that packs the FASTA file. Is the -b value already used at this stage? If not, could this stage be used to calculate the ideal -b value?

On the other hand, if finding the ideal -b during the "pack" phase is impractical, would it be reasonable to have:

- -b set to "auto" by default.
- If -b is set to "auto", perform a full file scan to calculate the ideal -b.
- If -b is set to anything but "auto", skip the full file scan and use the given value.

Jan 17 '17 14:01 unode
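
The "auto" behaviour proposed above could look roughly like the sketch below. It is not bwa code: `count_bases` and `resolve_block_size` are hypothetical names, the bases / 8 rule is the untested heuristic from this comment, and never going below the default is an extra assumption made here.

```python
import io

DEFAULT_BLOCK_SIZE = 10_000_000  # default -b, as stated in this thread


def count_bases(fasta_handle):
    """Full file scan: count sequence characters, skipping '>' header lines."""
    return sum(len(line.strip()) for line in fasta_handle
               if not line.startswith(">"))


def resolve_block_size(option, fasta_handle):
    """Return the -b value to use for the given option string."""
    if option != "auto":
        # Explicit value: skip the scan entirely.
        return int(option)
    # Heuristic from the thread (bases / 8); the floor at the default
    # is an assumption of this sketch.
    return max(DEFAULT_BLOCK_SIZE, count_bases(fasta_handle) // 8)


if __name__ == "__main__":
    fasta = io.StringIO(">seq1\nACGT\nACGT\n>seq2\nGG\n")
    print(count_bases(fasta))                      # 10
    print(resolve_block_size("500000000", fasta))  # 500000000
```

An explicit value never touches the file, so the cost of the scan is only paid when "auto" is requested.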

-b is only used when bwa generates "ref.fa.bwt". At that step, bwa index already knows the total length of the reference. -b was added when I wanted to index nt. I have only done that once, so I did not bother to explore the optimal -b in general. Yes, it should be possible to adjust -b automatically, but before that I need to do some experiments to see how speed is affected by -b. Thanks for the suggestion anyway.

Jan 18 '17 01:01 lh3

From the tests I've been running, changing the -b value from the default of 10,000,000 to 500,000,000 to index a ~90Gb fasta file made the entire process roughly 6 times faster. I'm now also giving it a try with a value of 20,000,000,000 computed by dividing the value of textLength by 8. If this scales well, I expect a gain of at least 8 times.

Jan 18 '17 02:01 unode

Thanks for the data. 6 times is a lot, much larger than my initial guess. I will consider automatically adjusting -b in a future version of bwa.

Jan 19 '17 20:01 lh3

Hello! Any news in this thread?

Feb 11 '20 05:02 serge2016

> Thanks for the data. 6 times is a lot, much larger than my initial guess. I will consider automatically adjusting -b in a future version of bwa.

Hi, I hope everyone in this thread is OK. I am working with large FASTA files and I am wondering whether this feature is implemented in the current version. Or will it be implemented any time soon? Or should I continue optimising it myself? Best wishes

Apr 10 '20 14:04 emrahkirdok

Hi! I am also curious to know if anything changed since this thread was started. Cheers

Mar 11 '22 18:03 jorondo1

Hi! I would be very happy to see any news in this thread! I am just dreaming about the threads option! It would be great! Cheers!

Oct 23 '23 14:10 Stack7
