strobealign
Use ksw2 to align soft-clipped ends from ungapped alignment
This should fix some unexpected alignments as reported in #357.
When extending a seed, we currently compute an ungapped alignment allowing soft-clipping and report it unchanged if the Hamming distance (counting each soft-clipped base as a mismatch) is low enough.
This is problematic because there might exist a gapped alignment of the soft-clipped ends that results in an overall higher score.
This PR changes the behavior so that after the ungapped alignment has been computed, ksw2 is used to align each end that has been soft-clipped.
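For illustration, the acceptance criterion might look like the following minimal sketch (a made-up helper, not strobealign's actual code):

```cpp
#include <cstddef>
#include <string>

// Hamming distance of an ungapped alignment, with every soft-clipped
// base counted as a mismatch (hypothetical helper, not strobealign's
// actual code). [query_start, query_end) is the non-clipped part of
// the query; ref_segment is the reference it is aligned to.
size_t hamming_distance_with_clips(const std::string& query,
                                   const std::string& ref_segment,
                                   size_t query_start, size_t query_end) {
    size_t dist = query_start + (query.size() - query_end);  // clipped bases
    for (size_t i = query_start; i < query_end; ++i) {
        if (query[i] != ref_segment[i - query_start]) {
            ++dist;
        }
    }
    return dist;
}
```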
- The first commits, which update the ksw2 source and add an `Aligner::ksw_extend` function, were taken from the abandoned `wfa2` branch.
- The ksw2 functions have very little documentation, so some experimentation was needed to understand how to call them correctly. I needed to make some fixes on top of the code in the `wfa2` branch. For example, some ksw2 example code uses a score of 0 when comparing wildcard characters to regular nucleotides. This may make sense, but for compatibility with SSW, I changed it to score the same as a normal mismatch (see the sketch after this list).
- The actual behavior change is in 85030cfa81ffebfd21dadc339b4dff72176d91c0. I changed two reads in the phiX test dataset so that they would be soft-clipped with the old code but now result in a non-soft-clipped alignment with a single deletion, similar to the problematic read in #357.
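To make the wildcard-scoring point above concrete, here is a hedged sketch of building a 5×5 {A,C,G,T,N} matrix with N scored as a normal mismatch, and of calling ksw2's documented `ksw_extz` entry point in extension mode. The score values and the `extend_clip` wrapper are illustrative assumptions, not the PR's actual `Aligner::ksw_extend`:

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>
#include "ksw2.h"  // vendored ksw2, as in this PR

// 5x5 scoring matrix over {A,C,G,T,N}. ksw2's example code scores any
// comparison involving N as 0; here N is scored like a normal mismatch
// for compatibility with SSW, as described above.
static void fill_matrix(int8_t mat[25], int8_t match, int8_t mismatch) {
    for (int i = 0; i < 5; ++i)
        for (int j = 0; j < 5; ++j)
            mat[i * 5 + j] = (i == j && i != 4) ? match : (int8_t)-mismatch;
}

// ksw2 expects sequences encoded as 0..4 (A,C,G,T,N).
static uint8_t encode(char c) {
    switch (c) {
        case 'A': return 0; case 'C': return 1;
        case 'G': return 2; case 'T': return 3;
        default:  return 4;
    }
}

// Gapped extension of a soft-clipped end, anchored at its left edge.
static int extend_clip(const std::string& query, const std::string& target) {
    int8_t mat[25];
    fill_matrix(mat, 2, 8);  // illustrative values, not necessarily strobealign's
    std::vector<uint8_t> q, t;
    for (char c : query) q.push_back(encode(c));
    for (char c : target) t.push_back(encode(c));

    ksw_extz_t ez;
    memset(&ez, 0, sizeof ez);
    // KSW_EZ_EXTZ_ONLY: extension mode, anchored at the (0,0) corner.
    ksw_extz(nullptr, (int)q.size(), q.data(), (int)t.size(), t.data(),
             5, mat, 12, 1, -1 /*no band*/, -1 /*no z-drop*/,
             KSW_EZ_EXTZ_ONLY, &ez);
    int best = ez.max;  // best extension score; ez.cigar holds the traceback
    free(ez.cigar);
    return best;
}
```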
Accuracy changes minimally (+0.006); the mapping rate is unaffected. Runtime is hard to measure; it may be 1–2% slower.
The addition of ksw_extend also enables addressing #377, but this is left for a separate PR.
Closes #357
CC @tprodanov
Excellent! Since it has been a while since I last ran a benchmark, I decided to run f4a8dd2 (labeled `ksw2`) against v0.12.0 (commit 6fd4c5d).
- Minimal difference in runtime.
- The improved accuracy on 50nt reads likely comes from the parameter optimisation in a previous PR.
Attachments: accuracy_plot.pdf, time_plot.pdf, memory_plot.pdf
I have also started a benchmark on biological data with SNP/indel calling, which may be more relevant. Will report back here.
Results from the variant analysis are below.
- BIO datasets, SNV: the latest versions (v0.12.0 and ksw2) approach BWA-MEM's values (higher recall and lower precision than previous strobealign versions).
- BIO datasets, INDEL: the ksw2 version has higher recall but lower precision than all previous versions. Importantly, it approaches or exceeds BWA-MEM's values.
- The results for the simulated datasets (SIM) look similar to or slightly worse than previous versions, but again move towards BWA-MEM's values.
I have a hard time drawing a firm conclusion, but overall I think it looks good. I assume you verified that it also has the intended behaviour for the bug that was reported.
I approve a merge.
Adding the time and memory usage here as well, between v0.7.1, v0.12.0 and this branch, for reference. Runtime looks mostly unchanged between v0.12.0 and this branch, except for SIM150.
It's only one run per experiment, but the node I ran on was free of other interfering processes.
I had to type this out from the screenshot, so I may have made a mistake, but are you sure this PR is better than 0.12.0?
metric (%) | BWA-MEM | strobealign 0.12 | extend-seed | diff |
---|---|---|---|---|
recall SIM150 SNV | 97.9 | 97.6 | 97.6 | 0 |
recall SIM150 INDEL | 62.2 | 67.3 | 67.0 | -0.3 |
recall SIM250 SNV | 98.9 | 98.6 | 98.6 | 0 |
recall SIM250 INDEL | 68.4 | 68.5 | 68.5 | 0 |
recall BIO150 SNV | 98.9 | 98.4 | 98.4 | 0 |
recall BIO150 INDEL | 90.1 | 89.7 | 90.2 | +0.5 |
recall BIO250 SNV | 97.7 | 97.5 | 97.5 | 0 |
recall BIO250 INDEL | 84.1 | 83.0 | 84.1 | +1.1 |
precision SIM150 SNV | 99.6 | 99.4 | 99.4 | 0 |
precision SIM150 INDEL | 72.6 | 72.7 | 72.2 | -0.5 |
precision SIM250 SNV | 99.5 | 99.3 | 99.3 | 0 |
precision SIM250 INDEL | 72.8 | 73.0 | 72.8 | -0.2 |
precision BIO150 SNV | 80.5 | 82.5 | 82.5 | 0 |
precision BIO150 INDEL | 62.5 | 63.0 | 62.0 | -1.0 |
precision BIO250 SNV | 76.7 | 77.3 | 77.1 | -0.2 |
precision BIO250 INDEL | 67.3 | 66.9 | 65.3 | -1.6 |
Perhaps you are right. I may have based this conclusion on:
- I don't remember the exact reason, but I don't trust the INDEL analysis 100%, based on what we discovered in an earlier analysis (something about how overlaps and TPs are computed).
- We move towards BWA-MEM's higher-recall, lower-precision values.
Those are very vague motivations. I guess the best way is to check that the alignment score is always better with the new version. Did you perform such a check on the simulated data?
I wanted to look at differing alignments for the analysis I did above, but I didn't find a quick way to do a `diff`-style comparison between the versions (because of the large sorted BAM files). Would you have time to do some sanity checks of differing alignments, or better, check the alignment scores?
For the analysis I did, the BAM files for the four datasets are here: `/proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_v0120_cmake/*.bam` and `/proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_ksw2_cmake/*.bam`. For the two simulated datasets ({SIM150,SIM250}), the true alignments are at `/proj/snic2022-6-31/nobackup/data/HG004/*.sam`, which enables computing score accuracy.
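For the score comparison, one possible starting point is a small htslib program that collects the `AS` tag per read name from one BAM and compares it against the other. A minimal sketch (an assumed approach, untested on these files; paired-end data would additionally need the read1/read2 flag in the key):

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <htslib/sam.h>

// Collect the AS (alignment score) tag per primary alignment, keyed by read name.
static std::unordered_map<std::string, int64_t> read_scores(const char* path) {
    std::unordered_map<std::string, int64_t> scores;
    samFile* in = sam_open(path, "r");
    sam_hdr_t* hdr = sam_hdr_read(in);
    bam1_t* b = bam_init1();
    while (sam_read1(in, hdr, b) >= 0) {
        if (b->core.flag & (BAM_FSECONDARY | BAM_FSUPPLEMENTARY | BAM_FUNMAP))
            continue;  // primary, mapped records only
        if (uint8_t* aux = bam_aux_get(b, "AS"))
            scores[bam_get_qname(b)] = bam_aux2i(aux);
    }
    bam_destroy1(b);
    sam_hdr_destroy(hdr);
    sam_close(in);
    return scores;
}

int main(int argc, char** argv) {
    // usage: ./compare_as old.bam new.bam
    auto old_scores = read_scores(argv[1]);
    auto new_scores = read_scores(argv[2]);
    for (const auto& [name, s_new] : new_scores) {
        auto it = old_scores.find(name);
        if (it != old_scores.end() && s_new < it->second)  // report score regressions
            std::cout << name << " old=" << it->second << " new=" << s_new << '\n';
    }
}
```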
To answer a couple of questions:
> I assume you verified that it also has the intended behaviour for the bug that was reported.
Current strobealign reports `143=7S` for the alignment in the bug report. This PR makes it report `143=1I6=`, while BWA-MEM reports `140=1I9=`. These two have the same score, of course (both consist of 149 matched bases and a single 1 bp insertion), but the difference is that the latter is left-aligned. It would be nice to also report left-aligned alignments, but let's consider that a separate issue (it should be possible to postprocess the alignment appropriately).
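Such postprocessing could be fairly simple when the bases flanking the indel are matches, as in this read. A hypothetical sketch:

```cpp
#include <cstddef>
#include <string>

// Shift an insertion of `len` bases starting at query offset `pos` as
// far left as possible (hypothetical helper). Assumes the flanking
// bases are '=' matches, so the base before the insertion equals the
// reference base there; the insertion may move left while the base
// sliding out of it equals the base sliding in, i.e. while it sits in
// a repeat.
size_t left_align_insertion(const std::string& query, size_t pos, size_t len) {
    while (pos > 0 && query[pos - 1] == query[pos + len - 1]) {
        --pos;
    }
    return pos;  // leftmost equivalent start; rebuild the CIGAR from this
}
```

Applied to the read above, the 1 bp insertion at offset 143 would slide to offset 140 if the bases it passes form a repeat, turning `143=1I6=` into `140=1I9=` with an unchanged score.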
> Runtime looks mostly unchanged between v0.12.0 and this branch, except for SIM150.
Good, let’s hope that the slowdown on SIM150 is just an outlier. Otherwise we’ll just have to accept the tradeoff and find a speedup some other way.
> I guess the best way is to check that the alignment score is always better with the new version. Did you perform such a check on the simulated data?
Yes, here’s the full table:
Comparing score-based accuracy between:
- 7b3b260: Merge pull request #379 from ksahlin/optimize-kslu2
- 85030cf: If ungapped alignment is softclipped, ksw_extend soft-clipped ends
dataset | 7b3b260 | 85030cf | difference |
---|---|---|---|
ecoli50-150-se | 99.3834 | 99.4162 | +0.0328 |
drosophila-50-se | 96.3365 | 96.3364 | -0.0001 |
drosophila-75-se | 99.3890 | 99.3884 | -0.0006 |
drosophila-100-se | 99.3757 | 99.3761 | +0.0004 |
drosophila-150-se | 99.6204 | 99.6253 | +0.0049 |
drosophila-200-se | 99.6517 | 99.6578 | +0.0061 |
drosophila-300-se | 99.6218 | 99.6322 | +0.0104 |
drosophila-500-se | 99.4878 | 99.4992 | +0.0114 |
maize-50-se | 91.4195 | 91.4190 | -0.0005 |
maize-75-se | 94.3089 | 94.3101 | +0.0012 |
maize-100-se | 95.6757 | 95.6729 | -0.0028 |
maize-150-se | 97.2738 | 97.2835 | +0.0097 |
maize-200-se | 98.0256 | 98.0373 | +0.0117 |
maize-300-se | 98.4677 | 98.4814 | +0.0137 |
maize-500-se | 98.8132 | 98.8325 | +0.0193 |
CHM13-50-se | 93.1704 | 93.1708 | +0.0004 |
CHM13-75-se | 96.8539 | 96.8518 | -0.0021 |
CHM13-100-se | 97.8192 | 97.8209 | +0.0017 |
CHM13-150-se | 98.5407 | 98.5445 | +0.0038 |
CHM13-200-se | 98.6710 | 98.6782 | +0.0072 |
CHM13-300-se | 98.5879 | 98.5947 | +0.0068 |
CHM13-500-se | 98.5663 | 98.5738 | +0.0075 |
rye-50-se | 88.6941 | 88.6898 | -0.0043 |
rye-75-se | 91.8481 | 91.8486 | +0.0005 |
rye-100-se | 94.0778 | 94.0791 | +0.0013 |
rye-150-se | 96.3543 | 96.3630 | +0.0087 |
rye-200-se | 97.3639 | 97.3727 | +0.0088 |
rye-300-se | 98.0025 | 98.0182 | +0.0157 |
rye-500-se | 98.7715 | 98.7956 | +0.0241 |
ecoli50-150-pe | 99.5890 | 99.6164 | +0.0273 |
drosophila-50-pe | 99.5883 | 99.5885 | +0.0002 |
drosophila-75-pe | 99.7696 | 99.7697 | +0.0002 |
drosophila-100-pe | 99.7669 | 99.7676 | +0.0007 |
drosophila-150-pe | 99.7475 | 99.7512 | +0.0037 |
drosophila-200-pe | 99.7119 | 99.7160 | +0.0041 |
drosophila-300-pe | 99.6593 | 99.6668 | +0.0076 |
drosophila-500-pe | 99.5292 | 99.5388 | +0.0096 |
maize-50-pe | 96.5466 | 96.5461 | -0.0004 |
maize-75-pe | 97.9274 | 97.9272 | -0.0001 |
maize-100-pe | 98.6033 | 98.6026 | -0.0008 |
maize-150-pe | 99.1774 | 99.1813 | +0.0039 |
maize-200-pe | 99.3377 | 99.3441 | +0.0064 |
maize-300-pe | 99.4756 | 99.4829 | +0.0073 |
maize-500-pe | 99.3684 | 99.3814 | +0.0131 |
CHM13-50-pe | 98.2363 | 98.2363 | +0.0000 |
CHM13-75-pe | 98.7770 | 98.7772 | +0.0002 |
CHM13-100-pe | 99.0379 | 99.0382 | +0.0004 |
CHM13-150-pe | 99.2155 | 99.2174 | +0.0019 |
CHM13-200-pe | 99.1885 | 99.1927 | +0.0041 |
CHM13-300-pe | 99.1850 | 99.1903 | +0.0054 |
CHM13-500-pe | 99.0001 | 99.0100 | +0.0099 |
rye-50-pe | 94.9689 | 94.9691 | +0.0002 |
rye-75-pe | 96.8398 | 96.8385 | -0.0012 |
rye-100-pe | 97.9214 | 97.9195 | -0.0020 |
rye-150-pe | 98.8289 | 98.8330 | +0.0041 |
rye-200-pe | 99.1062 | 99.1108 | +0.0045 |
rye-300-pe | 99.4078 | 99.4156 | +0.0079 |
rye-500-pe | 99.3743 | 99.3908 | +0.0165 |
Average difference se: +0.0068
Average difference pe: +0.0046
I had convinced myself that it is ok that the score gets worse in some cases, but now I’m not so sure anymore. I’ll have to look into this.
> Would you have time to do some sanity checks of differing alignments, or better, check the alignment scores?
I don’t have much time left before going on vacation. I will not be able to look at your files, but may have time for some other checks.
> These two have the same score, of course, but the difference is that the latter is left-aligned.
I see. Not consistently left/right-aligning indels could possibly explain the lower precision. Nice! I think we even saw this in a previous evaluation using WFA2, IIRC.
> I had convinced myself that it is ok that the score gets worse in some cases, but now I’m not so sure anymore. I’ll have to look into this.
I had expected that we would never end up with a lower score. But I realize I don't fully understand the consequences of the new approach. If we did a 'full' (semi-global) realignment before, compared to only extending the soft-clipped ends in this branch, maybe the sum of the scores of the aligned 'pieces' is lower than the semi-global score?
Related: when you extract the soft-clipped ends to extend, I see you only take the soft-clipped part:

```cpp
const std::string query_right = query.substr(info.query_end);
```

Perhaps there is a benefit to taking an overlapping part of the NAM, such as `k` or `k/2`? I understand that the logic of the surrounding code and the CIGAR might become more complicated, but do you think this could be a reason for the change in score?
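For concreteness, the suggested overlap might look something like this (a hypothetical sketch; the real change would also need to splice the resulting CIGARs back together):

```cpp
#include <algorithm>
#include <string>

// Instead of only the clipped bases, also include the last
// `overlap_len` (e.g. k or k/2) bases of the anchored region in the
// extension query, so ksw2 can move an indel across the original clip
// boundary. Hypothetical sketch of the suggestion above.
std::string right_extension_query(const std::string& query,
                                  size_t query_end, size_t overlap_len) {
    size_t overlap = std::min(query_end, overlap_len);
    return query.substr(query_end - overlap);
}
```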