bwa
Multithread support for bwa index
Hi all,
Current databases are becoming increasingly large. Recently I indexed a large FASTA file, and it took over 200 CPU hours (single thread).
While searching for multithreaded support for bwa index, I landed on a 5-year-old mailing-list thread that mentions the existence of some sort of patch. I couldn't find any reference to this patch, though.
Regardless, is there any ongoing or planned work to make bwa index parallelizable in some form?
No, there is no pull request on multi-threaded indexing. Implementing one may take quite some time but might not dramatically improve the performance, especially when you try to build the index within limited space.
Generally, to build a large index, consider using a larger block size (option "-b"). This option defaults to 10,000,000. You may increase it to 100,000,000 or even larger, depending on your input. This may save you some time.
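As a rough illustration of the suggestion above, here is a minimal Python sketch that assembles a bwa index command line with an explicit -b value. The helper name build_index_command and the file name ref.fa are placeholders chosen for this example; only the -b option and its default of 10,000,000 come from the thread.

```python
import shlex


def build_index_command(fasta_path, block_size=10_000_000):
    """Return the argv list for `bwa index` with an explicit -b block size.

    build_index_command is a hypothetical helper, not part of bwa.
    The default mirrors the -b default quoted in the thread.
    """
    if block_size <= 0:
        raise ValueError("block_size must be a positive number of bases")
    return ["bwa", "index", "-b", str(block_size), fasta_path]


# Build (but do not run) the command with a 10x larger block size.
cmd = build_index_command("ref.fa", block_size=100_000_000)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a machine where bwa is installed.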
@lh3 Thanks, increasing -b does seem to improve speed considerably.
However I don't quite understand the impact of changing this option. At least during indexing, I don't see any significant memory increase even with values as large as 10,000,000,000.
What's the trade-off? Put differently, why isn't the default value larger?
-b specifies how many bases to process in a batch. The memory used by one batch is 8*{-b}. If you have a "reference genome" larger than 200Gb, you won't observe an obvious memory increase with -b set to 10G. For a 3Gb human genome, setting -b to 10G will make the peak RAM 8 times as high at the BWT construction phase.
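To make the 8*{-b} rule concrete, here is a small sketch that tabulates the per-batch memory for a few -b values. It assumes the result of 8*{-b} is in bytes, which the thread does not state explicitly, and it models only the per-batch buffer, not the total peak RAM of bwa index. The function name batch_memory_bytes is made up for this example.

```python
def batch_memory_bytes(block_size):
    """Per-batch memory for a given -b value, using the 8 * b rule.

    Assumption: the unit is bytes. This is only the batch buffer,
    not the overall peak memory of the indexing run.
    """
    return 8 * block_size


# Compare the default with the larger values discussed in the thread.
for b in (10_000_000, 100_000_000, 10_000_000_000):
    gigabytes = batch_memory_bytes(b) / 1e9
    print(f"-b {b:>14,} -> {gigabytes:.2f} GB per batch")
```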
So if I understand correctly, the ideal -b value is around # of bases / 8.
Wouldn't it be possible to have this value adjusted automatically?
From what I gather, there's a first pass that packs the FASTA file. Is the -b value already used at this stage? If not, could this stage be used to calculate the ideal -b value?
On the other hand, if finding the ideal -b during the "pack" phase is impractical, would it be reasonable to have:
- -b set to "auto" by default
- if -b is set to "auto", perform a full file scan to calculate the ideal -b
- if -b is set to anything but "auto", skip the full file scan and use the given value
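The proposal above could look roughly like the following Python sketch. Everything here is hypothetical: bwa has no "auto" value for -b, the helper names are invented, and the "# of bases / 8" rule is only the guess made earlier in this thread, not a confirmed optimum.

```python
def count_bases(fasta_path):
    """Full file scan: count sequence characters, skipping '>' header lines."""
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def resolve_block_size(option, fasta_path):
    """Resolve a -b value from either "auto" or an explicit number.

    Hypothetical behaviour, not implemented in bwa:
    - "auto": scan the file and use (number of bases) // 8, the
      unverified heuristic suggested in the thread.
    - anything else: skip the scan and use the given value.
    """
    if option == "auto":
        return max(1, count_bases(fasta_path) // 8)
    return int(option)
```

For example, resolve_block_size("auto", "ref.fa") scans ref.fa once, while resolve_block_size("500000000", "ref.fa") returns the given value without opening the file.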
-b is only used when bwa generates "ref.fa.bwt". At that step, bwa index already knows the total length of the reference. -b was added when I wanted to index nt. I have only done that once, so I did not bother to explore the optimal -b in general. Yes, it should be possible to adjust -b automatically, but before that I need to run some experiments to see how speed is affected by -b. Thanks for the suggestion anyway.
From the tests I've been running, changing the -b value from the default of 10,000,000 to 500,000,000 to index a ~90Gb FASTA file made the entire process roughly 6 times faster.
I'm now also giving it a try with a value of 20,000,000,000, computed by dividing the value of textLength by 8. If this scales well, I expect a gain of at least 8 times.
Thanks for the data. 6 times is a lot, much larger than my initial guess. I will consider automatically adjusting -b in a future version of bwa.
Hello! Any news on this thread?
Hi, I hope everyone in this thread is OK. I am working with large FASTA files and I am wondering whether this feature is implemented in the current version. If not, will it be implemented any time soon, or should I continue optimising it myself? Best wishes
Hi! I am also curious to know if anything changed since this thread was started. Cheers
Hi! I would be very happy to see any news in this thread! I am just dreaming about the threads option! It would be great! Cheers!