strobealign accuracy compare to bwa for competitive mapping

accuracy compare to bwa for competitive mapping

Open jianshu93 opened this issue 2 years ago • 4 comments

Hello Kristoffer,

There is a situation where reads are mapped to very close references (e.g., reference sequences are 90% or more similar to each other). For example, a bacterial community (mixed species population), where multiple species population genomes can be sequenced. We want to quantity the abundance of each species in the community. So a competitive reads mapping will be necessary because there are cases where some regions of the genomes of difference bacterial species are very conservative, like 16S gene et.al. So the mapper will have a similar target sequences when mapping to a mix species reference, like 2 target are 98% similar to each other, reads mapped to those 2 targets are very similar, like one is 99% and the other is 98.8% et.al. For those cases, strobealign should be able to select the highest identity target as the only mapped reference while drop the less similar target (98.8%) so that we can accurately quantity the abundance of each species population genome in the mixed community. I did a comparison of StobeAlign with bwa: coverm_TAD80.StobeAlign.txt coverm_TAD80.bwa-mem2.txt

To do this we mapped reads to all the genomes obtained from this sample, pool all individual genomes into one file but label it. After done with mapping, extract records belong to the same genome and calculate the average coverage depth across the entire genome (some position may be low in coverage while others may be high in coverage). Those were done using the CoverM software genome command. Actually those genomes are binned from a large collection of contigs assembled from short reads of the sample according to the coverage of each contig and the kmer composition. So each genome represents a small collection of contigs that are very similar in coverage depth and kmer composition (see attached pdf figure). T4Aeroil_binning_manual_2.pdf.zip

I think this is a very good way to evaluate the accuracy of StrobeAlign compare to bwa et.al. As you can see in the txt tile, the overall coverage depth (Trimmed mean) for each genome are very similar for bwa and StrobeAlign. But MetaBAT.019_rename is high than MaxBin.051_rename for StrobeAlign but not the same for bwa. Even the difference is not that large but I am still wondering how much this will change the final conclusion we have when compare coverage of multiple species genomes in the same sample. I have already share the reads and reference with you last time (split reference by id, you will have the 15 genomes you see in the txt file). Does this worth some investigation? StrobeAlign is faster than bwa for sure.

Many,

Thanks,

Jianshu

Dec 28 '21 21:12 jianshu93

I think this would be a great evaluation. I will get back to you in a couple of days about this when I get into the details. I want to finish the next release first.

Dec 28 '21 22:12 ksahlin

Hi Jianshu,

How many reverence sequences do you have? Is it common to have more than 65k conigs in your data/metagenomics in general?

I've been optimizing accuracy, memory, and speed further in the last couple of days and I now use a 2byte integer for ref_id, which means it can now only store up to 65,536 references. I can always revert the implementation to use 4byte integers again.

Also, about the proposed evaluation, I saw that I only got contigs for a subset of the reads (contig-120). I'm not familiar with the tools you use such as CoverM. But analysis seems good.

Best, K

Jan 12 '22 16:01 ksahlin

Hello Kristoffer,

Yes, there can be so many reference sequences. When we do sequencing of a lot of cells, like so many species in fresh water samples, coverage will always be far away from enough, so a large fraction of cells are not fully covered by reads, so cannot be assembled, this is where we see gaps, thus fragmented genomes. For human genomes, there is alway the same cell, a lot of copies of that cell, so coverage can be very high, the entire genome can be assembled then without gap. 65,536 seems enough, but in some rare cases like very diverse samples, where you have a lot of bacterial species, this will not be enough. I am wondering how bwa/minimap2 deal with this situation, I can successfully use whatever number of number sequences with them. minimap2 sse2neon fold could be very helpful for you to add support for StrobeAlign (minimap2 ran perfectly on ARM right now). CoverM is very simple, filter reads mapped using NM tag, like high than 95% identity, then count all the filtered reads mapped to one genome (the same label sequences before _ in reference) by each position, divided by all positions (genome size), this is the coverage depth of that genome. We can compare this value for different bacterial species then. Thanks,

Jianshu

Jan 13 '22 04:01 jianshu93

Ok, thanks for your reply. I will later today push a version 0.3 which can still cope with billions of reference sequences (2^32), and then solve the above problem in the next version through some better bit packing. The next version will get 24 bits for references (2^(24) > 16M references) and 8 bits for the strobe offset. I hope there is not more than 16M contigs.

Jan 13 '22 09:01 ksahlin

As you can see in the txt tile, the overall coverage depth (Trimmed mean) for each genome are very similar for bwa and StrobeAlign. But MetaBAT.019_rename is high than MaxBin.051_rename for StrobeAlign but not the same for bwa. Even the difference is not that large but I am still wondering how much this will change the final conclusion we have when compare coverage of multiple species genomes in the same sample.

It is possible that the single-case coverage difference you observed between BWA-MEM and strobealign in your previous plot could have been resolved in the new version of strobealign (v0.8.0). We changed an internal representation so that reads mapping to perfect repeats are more randomly distributed. This could improve coverage estimates (if your coverage estimator does not only use reads with a unique best alignment).

(v0.8.0 also comes with many other improvements in memory usage, speed, increased mapping rate for very short reads 50-75nt, and various minor bugfixes).

Feb 04 '23 11:02 ksahlin

Thanks for the response. This is very helpful. I can see that strobealign is using Wave front for base accurate alignment，this is not used by bwa and bowtie2. I think it would be even accurate for mapping to very similar references，like 99.9% similar references.

Thanks！

Jianshu

Feb 04 '23 15:02 jianshu93

Release 0.8.0 is not using WFA but next release will (..if it works at least as well as current method and provides a speedup).

Feb 05 '23 07:02 ksahlin

Just an update that this is not yet fixed. It is our top priority and we will be working on it next.

Also, this is not an accuracy issue, but (most likely) a uniformly-at-random read assignment issue: in case reads have multiple best map locations, strobealign needs to be better at assigning the reads uniformly among the locations.

Nov 08 '23 19:11 ksahlin

Although the issue is more a question on evaluation, the discrepancy presented in the issue is most likely fixed in commit 6bbc5b7 due to better placements of repetitive reads.

If not fixed, it could be something with accuracy but that can be discussed in a separate and more targeted issue.

Nov 22 '23 16:11 ksahlin

Hello @ksahlin,

I will test again soon using the newest commit. If it is very close to bwa/bowtie2, then I will suggest the entire metagenomics community to StrobeAlign.

Thanks,

Jianshu

Nov 23 '23 03:11 jianshu93

Sounds great. I remember in your dataset there was some(/many?) reads at length 50nt, I don't think strobealign's accuracy will be better than BWA/Bowtie2 for those read lengths for a while. But great if you can rerun the analysis to see if it changes with the latest commit!

Nov 23 '23 10:11 ksahlin

strobealign strobealign copied to clipboard

accuracy compare to bwa for competitive mapping

strobealign
strobealign copied to clipboard